- References
Self-Attention with Relative Position Representations (Shaw et al., 2018): https://arxiv.org/pdf/1803.02155.pdf
- Formulas
Output of ordinary self-attention, where $x_i$ are the inputs, $W^Q$, $W^K$, $W^V$ are the query, key and value projections, and $n$ is the sequence length:

$$e_{ij} = \frac{(x_i W^Q)(x_j W^K)^\top}{\sqrt{d_z}}, \qquad \alpha_{ij} = \frac{\exp e_{ij}}{\sum_{k=1}^{n} \exp e_{ik}}, \qquad z_i = \sum_{j=1}^{n} \alpha_{ij}\,(x_j W^V)$$
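For concreteness, a minimal TF sketch of this unmodified attention; the function name and the [batch_size, heads, length, depth] layout are assumptions chosen to match the code later in this note:

```python
import tensorflow as tf


def scaled_dot_product_attention(q, k, v):
  """Plain self-attention with no position terms (sketch for contrast).

  q, k, v: [batch_size, heads, length, depth].
  """
  depth = q.get_shape().as_list()[-1]
  # e_ij = (q_i . k_j) / sqrt(d_z)
  logits = tf.matmul(q, k, transpose_b=True) / depth ** 0.5
  # alpha_ij = softmax over j
  weights = tf.nn.softmax(logits)
  # z_i = sum_j alpha_ij * v_j
  return tf.matmul(weights, v)
```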
Introduce two vectors that depend only on the relative position $j-i$, $a_{ij}^K$ and $a_{ij}^V$, and add them to the keys and values respectively:

$$e_{ij} = \frac{(x_i W^Q)(x_j W^K + a_{ij}^K)^\top}{\sqrt{d_z}}, \qquad z_i = \sum_{j=1}^{n} \alpha_{ij}\,(x_j W^V + a_{ij}^V)$$
Assume that once the distance between two elements of the sequence exceeds $k$, the exact relative position carries no useful information, so distances are clipped to $[-k, k]$. Define $w^K = (w^K_{-k}, \dots, w^K_{k})$ and $w^V = (w^V_{-k}, \dots, w^V_{k})$ as trainable vectors, looked up by clipped distance:

$$a_{ij}^K = w^K_{\mathrm{clip}(j-i,\,k)}, \qquad a_{ij}^V = w^V_{\mathrm{clip}(j-i,\,k)}, \qquad \mathrm{clip}(x, k) = \max(-k, \min(k, x))$$
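A minimal sketch (not the paper's reference code) of how the $2k+1$ trainable vectors can be turned into the full $a_{ij}$ tensor via clipping and a gather; the helper name `relative_position_embeddings` is hypothetical, and `tf.get_variable` matches the TF1 style of the code below:

```python
import tensorflow as tf


def relative_position_embeddings(length, depth, max_relative_position, name):
  """Builds a [length, length, depth] tensor whose (i, j) entry is w_{clip(j - i, k)}."""
  k = max_relative_position
  # distance[i, j] = j - i, shape [length, length].
  range_vec = tf.range(length)
  distance = range_vec[tf.newaxis, :] - range_vec[:, tf.newaxis]
  # clip(j - i, k) = max(-k, min(k, j - i)), shifted into [0, 2k] for indexing.
  indices = tf.clip_by_value(distance, -k, k) + k
  # 2k + 1 trainable vectors w_{-k}, ..., w_{k}.
  table = tf.get_variable(name, shape=[2 * k + 1, depth])
  return tf.gather(table, indices)
```

In practice the same helper would be called twice, once for the key relations $a^K$ and once for the value relations $a^V$.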
The $e_{ij}$ formula can be implemented efficiently by splitting it into two terms, each a batched matrix multiplication:

$$e_{ij} = \frac{x_i W^Q (x_j W^K)^\top + x_i W^Q (a_{ij}^K)^\top}{\sqrt{d_z}}$$

The first term is the usual query-key product; the second multiplies the queries with the relative-position embeddings, which is what the function below batches efficiently.
- Code
```python
import tensorflow as tf


def _relative_attention_inner(x, y, z, transpose):
  """Relative position-aware dot-product attention inner calculation.

  This batches matrix multiply calculations to avoid unnecessary broadcasting.

  Args:
    x: Tensor with shape [batch_size, heads, length or 1, length or depth].
    y: Tensor with shape [batch_size, heads, length or 1, depth].
    z: Tensor with shape [length or 1, length, depth].
    transpose: Whether to transpose inner matrices of y and z. Should be true
      if last dimension of x is depth, not length.

  Returns:
    A Tensor with shape [batch_size, heads, length, length or depth].
  """
  batch_size = tf.shape(x)[0]
  heads = x.get_shape().as_list()[1]
  length = tf.shape(x)[2]

  # xy_matmul is [batch_size, heads, length or 1, length or depth]
  xy_matmul = tf.matmul(x, y, transpose_b=transpose)
  # x_t is [length or 1, batch_size, heads, length or depth]
  x_t = tf.transpose(x, [2, 0, 1, 3])
  # x_t_r is [length or 1, batch_size * heads, length or depth]
  x_t_r = tf.reshape(x_t, [length, heads * batch_size, -1])
  # x_tz_matmul is [length or 1, batch_size * heads, length or depth]
  x_tz_matmul = tf.matmul(x_t_r, z, transpose_b=transpose)
  # x_tz_matmul_r is [length or 1, batch_size, heads, length or depth]
  x_tz_matmul_r = tf.reshape(x_tz_matmul, [length, batch_size, heads, -1])
  # x_tz_matmul_r_t is [batch_size, heads, length or 1, length or depth]
  x_tz_matmul_r_t = tf.transpose(x_tz_matmul_r, [1, 2, 0, 3])
  return xy_matmul + x_tz_matmul_r_t
```
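A sketch of how the function above can be wired into full relative-position attention, assuming `q`, `k`, `v` are already projected to [batch_size, heads, length, depth] and `relations_keys` / `relations_values` come from a table like the `relative_position_embeddings` sketch earlier; the wrapper name and the placement of the $1/\sqrt{d_z}$ scaling are illustrative, not the reference implementation:

```python
def relative_dot_product_attention(q, k, v, relations_keys, relations_values):
  """Relative position-aware attention built on _relative_attention_inner (sketch).

  q, k, v:                          [batch_size, heads, length, depth]
  relations_keys, relations_values: [length, length, depth], i.e. a^K and a^V.
  """
  depth = q.get_shape().as_list()[-1]
  # e_ij = (q_i k_j^T + q_i (a^K_ij)^T) / sqrt(d_z); x ends in depth, so transpose=True.
  logits = _relative_attention_inner(q, k, relations_keys, True)
  weights = tf.nn.softmax(logits / depth ** 0.5)
  # z_i = sum_j alpha_ij (v_j + a^V_ij); x ends in length, so transpose=False.
  return _relative_attention_inner(weights, v, relations_values, False)
```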
