Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. (This describes the previous architectures.) We propose a new simple network architecture (the authors propose a simple architecture), the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely (it relies purely on attention, with no recurrence and no convolutions). Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train (higher parallelism, less training time). Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
When I read this on my own, I actually could not latch onto "We propose a new simple network architecture". The main reason, from my perspective, is that the whole paper reads as flat rather than layered, so I cannot tell which sentences are the important ones. Putting myself in the authors' shoes: if they wrote a paper, they presumably have some innovation or a strong experimental result. And if they are putting forward their own idea, then what the previous best approach was becomes very important. So before introducing their own work, the authors would naturally first introduce other people's work or research results.
1. Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
Recurrent Neural Networks = RNN. Recurrent neural networks, long short-term memory, and gated recurrent neural networks were the best options for sequence (time-series) models as of 2017, when the paper was written. (The way the authors present this makes it easy for me to lose track of what is going on: they list three terms first, before I have even realized that these three things are methods. That may also be because I have only just started learning machine learning. The main problem is that at this point I did not know that RNN, LSTM, and GRU are methods; "approaches" does appear later, but reading it this way still costs me a lot of mental effort.)
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t.
This describes how an RNN works. A major characteristic of an RNN is that it cannot run in parallel: the computation walks through the sequence from left to right, one step at a time, and only after the earlier steps have produced the hidden state h_t can the next state be computed. The current state is determined by the current input together with the history of inputs, which is exactly what lets an RNN handle sequential information effectively. Because the computation is sequential, an RNN is hard to parallelize. And because information is passed along step by step, if the sequence is long, information from very early time steps may get lost later on.
This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
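To make the sequential constraint concrete, here is a minimal NumPy sketch of the recurrence described above (my own illustration, not code from the paper; the weight names W_h, W_x and the tanh nonlinearity are assumptions): every h_t needs h_{t-1} first, so the time steps of one sequence cannot be computed in parallel.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b):
    """Run a vanilla RNN over one sequence.

    xs:  (seq_len, d_in) input sequence
    W_h: (d_h, d_h) hidden-to-hidden weights
    W_x: (d_h, d_in) input-to-hidden weights
    b:   (d_h,) bias
    """
    h = np.zeros(W_h.shape[0])          # h_0
    hs = []
    # This loop is the point: step t needs h from step t-1,
    # so the positions within one sequence cannot be processed in parallel.
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)   # h_t = f(h_{t-1}, x_t)
        hs.append(h)
    return np.stack(hs)                 # (seq_len, d_h)

# Example: a length-10 sequence of 4-dimensional inputs, hidden size 8.
rng = np.random.default_rng(0)
xs = rng.normal(size=(10, 4))
H = rnn_forward(xs,
                rng.normal(size=(8, 8)) * 0.1,
                rng.normal(size=(8, 4)) * 0.1,
                np.zeros(8))
print(H.shape)  # (10, 8)
```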
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
This is about attention being applied on top of RNNs. Attention was mainly used to pass the encoder's output to the decoder.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Here the authors propose the Transformer model, which no longer uses an RNN; the innovation is that the model can be computed in parallel.
2. Background
This is the related-work section.
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.
This first paragraph is about replacing recurrent networks with convolutional neural networks. But it is relatively hard for a CNN to model long sequences, because each convolution only looks at a small window, say 3x3, so many convolution layers have to be stacked before two pixels that are far apart can be combined. With an attention mechanism, a single layer sees the entire sequence. One nice property of convolutions, though, is that they can have multiple output channels, so the authors use Multi-Head Attention to mimic the multi-channel effect of convolutions.
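As a rough illustration of this point, here is a minimal NumPy sketch of multi-head self-attention (my own code, not the paper's; the shapes and the name num_heads are assumptions). Every position scores every other position within a single layer, and the separate heads play a role loosely analogous to a convolution's multiple output channels.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model). The W_* are (d_model, d_model) projection matrices."""
    n, d_model = X.shape
    d_k = d_model // num_heads

    def split(M):  # (n, d_model) -> (num_heads, n, d_k)
        return M.reshape(n, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    # scores: (num_heads, n, n). Each position attends to every other position
    # directly, so the path between any two positions is a single layer.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    heads = softmax(scores) @ V                        # (num_heads, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o                                # (n, d_model)

# Toy usage: 6 positions, d_model = 16, 4 heads.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))
W = [rng.normal(size=(16, 16)) * 0.1 for _ in range(4)]
out = multi_head_self_attention(X, *W, num_heads=4)
print(out.shape)  # (6, 16)
```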
Self-attention, sometimes called intra-attention is an attention mechanism relating different positions
of a single sequence in order to compute a representation of the sequence. Self-attention has been
used successfully in a variety of tasks including reading comprehension, abstractive summarization,
textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence
aligned recurrence and have been shown to perform well on simple-language question answering and
language modeling tasks [34].
To the best of our knowledge, however, the Transformer is the first transduction model relying
entirely on self-attention to compute representations of its input and output without using sequence
aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate
self-attention and discuss its advantages over models such as [17, 18] and [9].
The Transformer is the first model whose encoder-decoder architecture relies entirely on self-attention.
3. Model Architecture
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35].
Here, the encoder maps an input sequence of symbol representations (x_1, …, x_n) to a sequence of continuous representations z = (z_1, …, z_n).
If we have a sentence with n words, then x_t stands for the t-th word. The encoder turns this sequence of length n into a sequence z that is also of length n, except that each z_t in the output sequence is a vector.
Given z, the decoder then generates an output sequence (y_1, …, y_m) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.
As for the decoder: it takes the encoder's output and converts it into a sequence of length m, where n and m can differ. The encoder gets to see the whole sequence at once, but the decoder has to emit its m outputs one at a time. An auto-regressive model is used here: the model's own outputs also serve as its inputs, i.e. the outputs from past time steps become inputs at the current time step.
Among today's sequence models, the encoder-decoder architecture is the one that works relatively well.
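A minimal sketch of what auto-regressive generation looks like at inference time (my own illustration; decode_step, BOS_ID, EOS_ID and MAX_LEN are hypothetical names, not from the paper): the decoder is invoked once per output token, and each newly generated token is appended to the inputs for the next step.

```python
# Hypothetical greedy auto-regressive decoding loop.
# decode_step(z, ys) stands in for the decoder: given the encoder output z
# and the tokens generated so far, it returns scores over the vocabulary
# for the next position.

BOS_ID, EOS_ID = 1, 2   # assumed special token ids
MAX_LEN = 50

def greedy_decode(decode_step, z):
    ys = [BOS_ID]                       # start with a beginning-of-sequence token
    for _ in range(MAX_LEN):
        scores = decode_step(z, ys)     # scores for the token at position len(ys)
        next_token = max(range(len(scores)), key=lambda i: scores[i])
        ys.append(next_token)           # past outputs become inputs for the next step
        if next_token == EOS_ID:
            break
    return ys[1:]                       # drop the BOS token

# Toy usage with a dummy "decoder" that always prefers EOS (vocabulary of size 3):
dummy = lambda z, ys: [0.0, 0.1, 1.0]
print(greedy_decode(dummy, z=None))     # -> [2]
```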
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,
respectively.
3.1 Encoder and Decoder Stacks
Encoder:
The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
The authors call a module like this (the encoder block shown in the left half of Figure 1) a layer. The model contains 6 such layers, and each layer has two sub-layers. The first sub-layer is the multi-head self-attention mechanism (self-attention has already come up many times, but it is not explained here; the authors only explain it later), and the second sub-layer is an MLP. A residual connection is applied around each sub-layer, followed by layer normalization. Because the residual connection would otherwise require a projection, the authors set both the input and output dimensions to 512. A nice consequence of this design is that the model has relatively few knobs: tuning essentially comes down to two hyperparameters, N and d_model. In short, the encoder is built from multi-head attention, an MLP, residual connections, and LayerNorm.
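To make the LayerNorm(x + Sublayer(x)) pattern concrete, here is a minimal NumPy sketch of one encoder layer and the stack of N = 6 layers (my own illustration, not the paper's code; the sub-layers are passed in as plain functions, and the learned LayerNorm scale/shift parameters are omitted for brevity):

```python
import numpy as np

D_MODEL = 512   # the paper fixes all sub-layer outputs to d_model = 512

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attn, feed_forward):
    """One encoder layer: two sub-layers, each wrapped as LayerNorm(x + Sublayer(x)).

    x:            (n, D_MODEL) input representations
    self_attn:    function (n, D_MODEL) -> (n, D_MODEL), the multi-head attention sub-layer
    feed_forward: function (n, D_MODEL) -> (n, D_MODEL), the position-wise MLP sub-layer
    """
    x = layer_norm(x + self_attn(x))       # sub-layer 1 + residual + LayerNorm
    x = layer_norm(x + feed_forward(x))    # sub-layer 2 + residual + LayerNorm
    return x

def encoder(x, self_attns, feed_forwards):
    """Stack of N = 6 identical layers (here: two lists of 6 sub-layer functions)."""
    for attn, ff in zip(self_attns, feed_forwards):
        x = encoder_layer(x, attn, ff)
    return x

# Toy check with identity sub-layers, just to exercise the shapes:
x = np.random.default_rng(0).normal(size=(10, D_MODEL))
out = encoder(x, [lambda v: v] * 6, [lambda v: v] * 6)
print(out.shape)  # (10, 512)
```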
Decoder:
The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
To make the decoder auto-regressive, its Multi-Head Attention uses a mask. The effect of the mask is that output b1 depends only on a1, while output b2 depends only on a1 and a2.
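Here is a minimal NumPy sketch of such a causal mask (my own illustration, not the paper's implementation): scores for future positions are set to -inf before the softmax, so their attention weights become exactly zero and position i can only look at positions up to i.

```python
import numpy as np

def causal_mask(n):
    """(n, n) boolean mask: entry [i, j] is True when position i may attend to j (j <= i)."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention_weights(scores):
    """Apply the causal mask to raw attention scores (n, n), then softmax each row."""
    n = scores.shape[-1]
    scores = np.where(causal_mask(n), scores, -np.inf)   # block attention to future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

# With 4 positions, row i has non-zero weights only on columns 0..i:
weights = masked_attention_weights(np.zeros((4, 4)))
print(np.round(weights, 2))
# [[1.   0.   0.   0.  ]
#  [0.5  0.5  0.   0.  ]
#  [0.33 0.33 0.33 0.  ]
#  [0.25 0.25 0.25 0.25]]
```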
7. Conclusion
In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.
We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours.
The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.
The authors introduce the Transformer, the first sequence transduction model that uses only attention. All of the recurrent layers used previously are replaced with multi-headed self-attention. "For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers": on machine translation tasks, the Transformer trains faster than the other architectures, and it also achieves better results in practice. The authors are very excited about purely attention-based models and want to apply the Transformer to other tasks. And of course, nowadays Transformers are also widely used in image processing.