After Wide&Deep, many models kept the two-network combination structure, and DeepFM is one of them. DeepFM's improvement over Wide&Deep is that it replaces the Wide part with an FM, strengthening the shallow network's ability to model feature interactions. In fact, since FM itself consists of a first-order part and a second-order part, DeepFM effectively combines three structures at once: the original Wide part, a second-order feature-interaction part, and the Deep part, which further increases the model's expressive power.
(Figure: overall DeepFM architecture; image DeepFM.jpg)
To learn feature interactions of multiple orders, DeepFM consists of an FM part and a Deep part that share the same input. The model's prediction is

$$\hat{y} = \mathrm{sigmoid}\left(y_{FM} + y_{DNN}\right)$$

where $\hat{y} \in (0, 1)$ is the predicted CTR, $y_{FM}$ is the output of the FM part, and $y_{DNN}$ is the output of the Deep part.
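As a minimal sketch of this combination in TF 1.x (`y_fm` and `y_dnn` are illustrative placeholders standing in for the two parts' per-example outputs):

```python
import tensorflow as tf

# Illustrative placeholders for the two sub-networks' per-example outputs.
y_fm = tf.placeholder(tf.float32, shape=[None], name='y_fm')    # FM-part output
y_dnn = tf.placeholder(tf.float32, shape=[None], name='y_dnn')  # Deep-part output

# DeepFM adds the two outputs and squashes the sum into a CTR estimate in (0, 1).
y_hat = tf.sigmoid(y_fm + y_dnn)
```

In the full listing below, $y_{FM}$ is itself split into a linear (first-order) term and a bi-interaction (second-order) term, and a global bias is added before the sigmoid.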

FM Part

(Figure: FM architecture; image FM.png)
FM consists of a first-order part and a second-order part; its output is the sum of an Addition unit and a set of Inner Product units:

$$y_{FM} = \langle w, x \rangle + \sum_{j_1=1}^{d}\sum_{j_2=j_1+1}^{d} \langle V_{j_1}, V_{j_2} \rangle \, x_{j_1} x_{j_2}$$
where $w \in \mathbb{R}^d$ ($d$ is the dimensionality of the features after one-hot encoding) and $V_i \in \mathbb{R}^k$ ($k$ is the chosen Dense Embedding dimension). The Addition unit ($\langle w, x \rangle$) captures the first-order features, and the Inner Product units capture the second-order interactions.
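The pairwise sum of inner products above can be computed in $O(kd)$ rather than $O(kd^2)$ using the standard reformulation $\tfrac{1}{2}\sum_{f}\big[(\sum_i v_{i,f}\,x_i)^2 - \sum_i v_{i,f}^2\,x_i^2\big]$. Below is a minimal sketch for dense inputs; the tensors `x`, `w`, and `V` are illustrative and not part of the original code:

```python
import tensorflow as tf

def fm_part(x, w, V):
    """First- plus second-order FM output for a dense input.

    x: [batch_size, d]  one-hot / multi-hot features
    w: [d, 1]           first-order weights
    V: [d, k]           latent (embedding) vectors
    """
    # Addition unit: <w, x>, shape [batch_size, 1]
    first_order = tf.matmul(x, w)

    # Second order via the sum-square minus square-sum trick.
    xv = tf.matmul(x, V)                                      # sum_i v_i x_i, [batch_size, k]
    sum_then_square = tf.square(xv)                           # (sum_i v_i x_i)^2
    square_then_sum = tf.matmul(tf.square(x), tf.square(V))   # sum_i (v_i x_i)^2
    second_order = 0.5 * tf.reduce_sum(sum_then_square - square_then_sum,
                                       axis=1, keepdims=True)  # [batch_size, 1]

    return first_order + second_order                          # y_FM, [batch_size, 1]
```

The `output_logits_from_bi_interaction` function in the listing below uses the same sum-square minus square-sum trick on sparse inputs, keeping the $k$-dimensional bi-interaction vector (following the NFM formulation it cites) before projecting it to a logit.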

Deep Part

(Figure: Deep part architecture; image DNN.png)
The Deep part learns high-order feature interactions. Its input is a continuous dense vector: the raw, extremely high-dimensional, sparse data, which is organized into fields (gender, location, age, etc.), is mapped by the Dense Embedding layer into low-dimensional dense vectors.

Dense Embedding

(Figure: Dense Embedding layer; image embedding.png)
As shown in the figure above, the output of the embedding layer is:

$$a^{(0)} = [e_1, e_2, \ldots, e_m]$$

where $e_i$ is the embedding vector of the $i$-th field and $m$ is the number of fields (for example, with only the three fields age, gender, and location, $m = 3$).
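A minimal sketch of this embedding step followed by the Deep part's MLP; the field names, vocabulary sizes, and layer widths here are illustrative assumptions, not taken from the original code:

```python
import tensorflow as tf

def deep_part(field_ids, vocab_sizes, embed_dim=8, hidden_units=(128, 64)):
    """Map each field's sparse id to a dense embedding, concatenate, then run an MLP.

    field_ids:   dict of field name -> int32 tensor [batch_size] of category ids
    vocab_sizes: dict of field name -> vocabulary size
    """
    embeddings = []
    for field, ids in field_ids.items():
        table = tf.get_variable('{}_embed'.format(field),
                                shape=[vocab_sizes[field], embed_dim],
                                initializer=tf.glorot_normal_initializer())
        embeddings.append(tf.nn.embedding_lookup(table, ids))  # e_i: [batch_size, embed_dim]

    # a^(0) = [e_1, e_2, ..., e_m], shape [batch_size, m * embed_dim]
    x = tf.concat(embeddings, axis=1)

    for units in hidden_units:
        x = tf.layers.dense(x, units=units, activation=tf.nn.relu)
    return tf.layers.dense(x, units=1, activation=None)  # y_DNN logit, [batch_size, 1]
```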

Code

```python
# deepFM.py
import tensorflow as tf
from tensorflow.python.ops import embedding_ops
from tensorflow.python.layers import normalization


class EmbeddingTable:
    def __init__(self):
        self._weights = {}

    def add_weights(self, vocab_name, vocab_size, embed_dim):
        """
        :param vocab_name: each vocabulary owns two weight matrices, one used by the linear
                           (first-order) part, the other by the non-linear (second- and
                           higher-order interaction) part
        :param vocab_size: total size of the vocabulary
        :param embed_dim: the second-order weight matrix has shape=[vocab_size, embed_dim];
                          the resulting embedding both feeds the first layer of the DNN and
                          serves as the latent vector for the FM second-order interactions
        """
        linear_weight = tf.get_variable(name='{}_linear_weight'.format(vocab_name),
                                        shape=[vocab_size, 1],
                                        initializer=tf.glorot_normal_initializer(),
                                        dtype=tf.float32)

        # the second-order (FM) and higher-order (DNN) interactions share this embedding matrix
        embed_weight = tf.get_variable(name='{}_embed_weight'.format(vocab_name),
                                       shape=[vocab_size, embed_dim],
                                       initializer=tf.glorot_normal_initializer(),
                                       dtype=tf.float32)

        self._weights[vocab_name] = (linear_weight, embed_weight)

    def get_linear_weights(self, vocab_name):
        return self._weights[vocab_name][0]

    def get_embed_weights(self, vocab_name):
        return self._weights[vocab_name][1]


def build_embedding_table(params):
    embed_dim = params['embed_dim']  # all fields must share one embedding size
    embedding_table = EmbeddingTable()
    for vocab_name, vocab_size in params['vocab_sizes'].items():
        embedding_table.add_weights(vocab_name=vocab_name, vocab_size=vocab_size, embed_dim=embed_dim)
    return embedding_table


def output_logits_from_linear(features, embedding_table, params):
    field2vocab_mapping = params['field_vocab_mapping']
    combiner = params.get('multi_embed_combiner', 'sum')

    fields_outputs = []
    # each field holds a series of <tag:value> pairs; every tag has its own bias (to be learned).
    # The biases of all tags are combined, weighted by their values, into this field's bias.
    for fieldname, vocabname in field2vocab_mapping.items():
        sp_ids = features[fieldname + "_ids"]
        sp_values = features[fieldname + "_values"]

        linear_weights = embedding_table.get_linear_weights(vocab_name=vocabname)

        # weights: [vocab_size, 1]
        # sp_ids: [batch_size, max_tags_per_example]
        # sp_weights: [batch_size, max_tags_per_example]
        # output: [batch_size, 1]
        output = embedding_ops.safe_embedding_lookup_sparse(linear_weights, sp_ids, sp_values,
                                                            combiner=combiner,
                                                            name='{}_linear_output'.format(fieldname))
        fields_outputs.append(output)

    # because different fields may share one vocabulary's linear weights, simply summing the
    # per-field outputs would lose a lot of information. Instead, concatenate them: each field's
    # output is only [batch_size, 1], so the concatenation costs almost nothing.
    # whole_linear_output: [batch_size, total_fields]
    whole_linear_output = tf.concat(fields_outputs, axis=1)
    tf.logging.info("linear output, shape={}".format(whole_linear_output.shape))

    # project to the final logits (binary classification, also [batch_size, 1]);
    # no activation here, especially not ReLU
    return tf.layers.dense(whole_linear_output, units=1, use_bias=True, activation=None)


def output_logits_from_bi_interaction(features, embedding_table, params):
    field2vocab_mapping = params['field_vocab_mapping']

    # the formula in the paper calls for 'sum'; 'mean' and 'sqrtn' were also tried and both were
    # much worse than 'sum'. Whether that is specific to the criteo data or whether 'sum' is
    # required in theory is unclear, so an interface to specify another combiner is kept.
    combiner = params.get('multi_embed_combiner', 'sum')

    # see equation (4) of "Neural Factorization Machines for Sparse Predictive Analytics"
    fields_embeddings = []
    fields_squared_embeddings = []

    for fieldname, vocabname in field2vocab_mapping.items():
        sp_ids = features[fieldname + "_ids"]
        sp_values = features[fieldname + "_values"]

        # --------- embedding
        embed_weights = embedding_table.get_embed_weights(vocabname)
        # embedding: [batch_size, embed_dim]
        embedding = embedding_ops.safe_embedding_lookup_sparse(embed_weights, sp_ids, sp_values,
                                                               combiner=combiner,
                                                               name='{}_embedding'.format(fieldname))
        fields_embeddings.append(embedding)

        # --------- square of embedding
        squared_emb_weights = tf.square(embed_weights)
        squared_sp_values = tf.SparseTensor(indices=sp_values.indices,
                                            values=tf.square(sp_values.values),
                                            dense_shape=sp_values.dense_shape)
        # squared_embedding: [batch_size, embed_dim]
        squared_embedding = embedding_ops.safe_embedding_lookup_sparse(squared_emb_weights, sp_ids, squared_sp_values,
                                                                       combiner=combiner,
                                                                       name='{}_squared_embedding'.format(fieldname))
        fields_squared_embeddings.append(squared_embedding)

    # calculate bi-interaction
    sum_embedding_then_square = tf.square(tf.add_n(fields_embeddings))  # [batch_size, embed_dim]
    square_embedding_then_sum = tf.add_n(fields_squared_embeddings)     # [batch_size, embed_dim]
    bi_interaction = 0.5 * (sum_embedding_then_square - square_embedding_then_sum)  # [batch_size, embed_dim]
    tf.logging.info("bi-interaction, shape={}".format(bi_interaction.shape))

    # calculate logits
    logits = tf.layers.dense(bi_interaction, units=1, use_bias=True, activation=None)

    # because FM and the DNN share the embeddings, return each field's embedding along with
    # the logits, which makes it convenient to build the DNN on top of them
    return logits, fields_embeddings


def output_logits_from_dnn(fields_embeddings, params, is_training):
    dropout_rate = params['dropout_rate']
    do_batch_norm = params['batch_norm']

    X = tf.concat(fields_embeddings, axis=1)
    tf.logging.info("initial input to DNN, shape={}".format(X.shape))

    for idx, n_units in enumerate(params['hidden_units'], start=1):
        X = tf.layers.dense(X, units=n_units, activation=tf.nn.relu)
        tf.logging.info("layer[{}] output shape={}".format(idx, X.shape))

        X = tf.layers.dropout(inputs=X, rate=dropout_rate, training=is_training)
        if is_training:
            tf.logging.info("layer[{}] dropout {}".format(idx, dropout_rate))

        if do_batch_norm:
            # the BatchNormalization call and its arguments are copied from the
            # DNNLinearCombinedClassifier source code
            batch_norm_layer = normalization.BatchNormalization(momentum=0.999, trainable=True,
                                                                name='batchnorm_{}'.format(idx))
            X = batch_norm_layer(X, training=is_training)

            if is_training:
                tf.logging.info("layer[{}] batch-normalize".format(idx))

    # connect to final logits, [batch_size, 1]
    return tf.layers.dense(X, units=1, use_bias=True, activation=None)


def model_fn(features, labels, mode, params):
    for featname, featvalues in features.items():
        if not isinstance(featvalues, tf.SparseTensor):
            raise TypeError("feature[{}] isn't SparseTensor".format(featname))

    # ============= build the graph
    embedding_table = build_embedding_table(params)

    linear_logits = output_logits_from_linear(features, embedding_table, params)
    bi_interact_logits, fields_embeddings = output_logits_from_bi_interaction(features, embedding_table, params)
    dnn_logits = output_logits_from_dnn(fields_embeddings, params, (mode == tf.estimator.ModeKeys.TRAIN))

    general_bias = tf.get_variable(name='general_bias', shape=[1], initializer=tf.constant_initializer(0.0))

    logits = linear_logits + bi_interact_logits + dnn_logits
    logits = tf.nn.bias_add(logits, general_bias)  # bias_add for convenient broadcasting

    # reshape [batch_size, 1] to [batch_size], to match the shape of 'labels'
    logits = tf.reshape(logits, shape=[-1])
    probabilities = tf.sigmoid(logits)

    # ============= predict spec
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions={'uids': tf.sparse_tensor_to_dense(features['uids']),
                         'probabilities': probabilities})

    # ============= evaluate spec
    # no regularization is added here, following DNNLinearCombinedClassifier:
    # L1/L2 regularization is applied through the optimizer, e.g.
    # tf.train.ProximalAdagradOptimizer(learning_rate=0.1,
    #                                   l1_regularization_strength=0.001,
    #                                   l2_regularization_strength=0.001)
    # STUPID TENSORFLOW CANNOT AUTO-CAST THE LABELS FOR ME
    loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=labels))

    eval_metric_ops = {'auc': tf.metrics.auc(labels, probabilities)}

    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(
            mode=mode,
            loss=loss,
            eval_metric_ops=eval_metric_ops)

    # ============= train spec
    assert mode == tf.estimator.ModeKeys.TRAIN
    train_op = params['optimizer'].minimize(loss, global_step=tf.train.get_global_step())

    return tf.estimator.EstimatorSpec(mode,
                                      loss=loss,
                                      train_op=train_op,
                                      eval_metric_ops=eval_metric_ops)
```
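The listing only defines `model_fn`. A hypothetical sketch of wiring it into `tf.estimator.Estimator` might look like the following; every field name, vocabulary size, and hyper-parameter value here is a made-up placeholder, not taken from the original source:

```python
# Hypothetical driver; all names and values below are illustrative.
import tensorflow as tf
from deepFM import model_fn

params = {
    'embed_dim': 8,
    'vocab_sizes': {'user_vocab': 10000, 'item_vocab': 50000},
    'field_vocab_mapping': {'user': 'user_vocab', 'item': 'item_vocab'},
    'multi_embed_combiner': 'sum',
    'dropout_rate': 0.3,
    'batch_norm': False,
    'hidden_units': [128, 64],
    # L1/L2 regularization is applied through the optimizer, as noted in model_fn
    'optimizer': tf.train.ProximalAdagradOptimizer(learning_rate=0.1,
                                                   l1_regularization_strength=0.001,
                                                   l2_regularization_strength=0.001),
}

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir='./deepfm_model', params=params)

# The input_fn must yield, for every field, SparseTensors named '<field>_ids' and
# '<field>_values', plus float labels (sigmoid_cross_entropy_with_logits does not cast them).
# estimator.train(input_fn=train_input_fn, steps=10000)
```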

Source