Attention
An attention mechanism lets the model focus on the relevant parts of the input sequence as needed
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
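As a quick illustration (my own sketch, not taken from the linked posts), here is minimal NumPy code for scaled dot-product attention, the core operation these visualizations walk through; the function name and toy shapes are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax -> attention weights
    return weights @ V                                # weighted sum of the values

# Toy example: 3 query positions attending over 4 key/value positions, d_k = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 8)
```

Each output row is a convex combination of the value vectors, with weights determined by how well the corresponding query matches each key; this is the "focus on relevant parts" behavior described above.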
Transformer
http://jalammar.github.io/illustrated-transformer/
Loss functions
Cross-entropy https://colah.github.io/posts/2015-09-Visual-Information/
Relative entropy (KL divergence) https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained
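A small numeric sketch (made-up numbers, not from the linked posts) tying the two together: relative entropy is exactly cross-entropy minus the entropy of the true distribution, D_KL(p || q) = H(p, q) - H(p).

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # "true" distribution
q = np.array([0.7, 0.2, 0.1])     # model's predicted distribution

entropy       = -np.sum(p * np.log2(p))       # H(p)
cross_entropy = -np.sum(p * np.log2(q))       # H(p, q)
kl_divergence =  np.sum(p * np.log2(p / q))   # D_KL(p || q)

# KL divergence is the gap between cross-entropy and entropy.
assert np.isclose(kl_divergence, cross_entropy - entropy)
print(entropy, cross_entropy, kl_divergence)  # 1.5, ~1.668, ~0.168 bits
```

This is why minimizing cross-entropy against a fixed target distribution is equivalent to minimizing the KL divergence: H(p) is a constant, so only the gap term depends on the model.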
Follow-up works to Attention Is All You Need:
- Depthwise Separable Convolutions for Neural Machine Translation
- One Model To Learn Them All
- Discrete Autoencoders for Sequence Models
- Generating Wikipedia by Summarizing Long Sequences
- Image Transformer
- Training Tips for the Transformer Model
- Self-Attention with Relative Position Representations
- Fast Decoding in Sequence Models using Discrete Latent Variables
- Adafactor: Adaptive Learning Rates with Sublinear Memory Cost