- Topic: Transformer for Time-Series Forecasting
- Paper: Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting
- Link: https://arxiv.org/pdf/2012.07436.pdf
- Presenter: 唐共勇
1. Summary
Required. It is recommended to check the grammar with Grammarly and to follow the writing style of the paper's introduction. The summary should cover:
- What problem does this paper solve?
- What method does the author use to solve it (no need for fine detail)?
- Which concepts do you need to study further to deepen your understanding of this paper?
Long sequence time-series forecasting (LSTF) is a harder problem than short-term forecasting: its difficulty lies in predicting far into the future. Predicting tomorrow is not hard, but predicting the next month is a very different task; in other words, the longer the prediction horizon, the harder the problem. To tackle LSTF, this paper applies the Transformer model. However, the vanilla Transformer has several limitations in this setting, and to address them the authors propose Informer. Informer has three notable features. First, it adopts a ProbSparse self-attention mechanism, which reduces the original O(L^2) complexity to O(L log L). Second, it uses a self-attention distilling mechanism to shorten the input sequence length of each successive encoder layer, which removes the memory bottleneck of stacking layers on long inputs and lets the model scale well to long sequences. Third, it uses a generative-style decoder that predicts the entire output sequence in a single forward step rather than step by step, reducing inference from O(N) decoding steps to O(1). Overall, the improvements Informer makes to the Transformer perform well on long sequence time-series forecasting.
2. Thoughts on the Paper
Write down your own thoughts on the paper, e.g. strengths and weaknesses, and your takeaways.
- Strengths:
  - The proposed ProbSparse self-attention and self-attention distilling mechanisms address several limitations of applying the Transformer to LSTF.
  - The generative-style decoder breaks the limitation of traditional step-by-step decoding and produces the whole output sequence in one step.
  - Experiments verify that, compared with LSTM, Transformer-like models can achieve strong results on LSTF.
- Weaknesses:
  - ProbSparse attention
    - To reduce the time complexity of self-attention, the authors introduce ProbSparse attention, which achieves O(L log L) complexity instead of the traditional O(L^2). Vanilla self-attention has the property that only a few key-value pairs dominate the attention scores, meaning most of the computed dot products are essentially worthless. ProbSparse attention lets each key attend only to the dominant queries rather than to all queries, so the model performs the expensive computation for only a small fraction of the queries. In particular, the ProbSparse mechanism exposes a factor that controls how aggressively the attention computation is reduced.
    - The main computation flow is as follows:
      - For each query, randomly sample a subset of keys; the default number is 5 * ln(L).
      - Compute a sparsity score for each query.
      - Select the N queries with the highest sparsity scores; the default N is 5 * ln(L).
      - Compute dot products only between these N queries and all keys to obtain their attention outputs.
      - The remaining L - N queries are not computed; their output is simply the mean of the self-attention layer's input values (mean(V)), so that every ProbSparse self-attention layer keeps an input and output sequence length of L.
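The flow above can be sketched in NumPy. This is a simplified, hypothetical implementation, not the authors' code: the function name, array shapes, and the max-minus-mean sparsity score are assumptions for illustration.

```python
import numpy as np

def probsparse_attention(Q, K, V, factor=5):
    """Simplified sketch of ProbSparse self-attention (illustrative only).

    Q, K, V: (L, d) arrays. Only the top-u "active" queries receive full
    attention; the remaining lazy queries output mean(V), so the output
    sequence length stays L.
    """
    L, d = Q.shape
    u = min(L, max(1, int(factor * np.log(L))))   # N = 5 * ln(L) by default
    n_sample = u                                  # sampled keys per query

    # 1) sample a subset of keys per query and compute sampled scores
    idx = np.random.randint(0, L, size=(L, n_sample))
    sampled = np.einsum('ld,lsd->ls', Q, K[idx])  # (L, n_sample)

    # 2) sparsity score: max minus mean of each query's sampled scores
    M = sampled.max(axis=1) - sampled.mean(axis=1)

    # 3) keep the top-u queries with the highest sparsity scores
    top = np.argsort(M)[-u:]

    # 5) lazy queries default to mean(V); 4) top-u get full softmax attention
    out = np.tile(V.mean(axis=0), (L, 1))
    scores = Q[top] @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    out[top] = w @ V
    return out
```

Only u = O(ln L) queries incur the O(L) dot-product cost, giving the claimed O(L log L) total.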
  - Self-attention distilling in the encoder
    - Unlike the vanilla Transformer, ProbSparse self-attention leaves some redundant information, so the encoder applies a convolution + pooling operation between layers to remove this redundancy.
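The distilling step between encoder layers can be sketched as a 1-D convolution followed by an activation and max-pooling that roughly halves the sequence length. This is a minimal NumPy sketch under assumed shapes and hyperparameters (kernel 3, stride-2 pooling, ELU), not the official implementation.

```python
import numpy as np

def distill(x, w, b):
    """Sketch of an encoder distilling step (illustrative, assumed shapes):
    Conv1d (kernel 3, same padding) -> ELU -> max-pool (kernel 3, stride 2),
    roughly halving the sequence length between encoder layers.

    x: (L, d_in) layer output, w: (d_out, d_in, 3) conv kernel, b: (d_out,)
    """
    L = x.shape[0]
    xp = np.pad(x, ((1, 1), (0, 0)))                   # same padding in time
    # cross-correlation along time: out[t] = sum_k x[t-1+k] . w[:, :, k]
    conv = np.stack([(xp[t:t + 3, :, None] * w.transpose(2, 1, 0)).sum((0, 1)) + b
                     for t in range(L)])               # (L, d_out)
    act = np.where(conv > 0, conv, np.exp(conv) - 1)   # ELU activation
    # max-pool: kernel 3, stride 2, padding 1 -> ceil(L / 2) outputs
    ap = np.pad(act, ((1, 1), (0, 0)), constant_values=-np.inf)
    return np.stack([ap[s:s + 3].max(axis=0) for s in range(0, L, 2)])
```

Halving the length at each layer is what keeps memory bounded when stacking encoder layers on long inputs.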