Attention based Multi-Modal New Product Sales Time-series Forecasting

1. Summary
2. Thoughts on the paper
3. Other

Topic: New Product Forecast, Multi-Modal
Paper title: Attention based Multi-Modal New Product Sales Time-series Forecasting
Paper link: https://dl.acm.org/doi/pdf/10.1145/3394486.3403362
1. Summary
This paper addresses sales forecasting for new products. The dataset covers fashion apparel and is not public. Forecasts are made at weekly granularity, and the task is to predict a new product's sales over the next 14 weeks. The paper uses an Encoder-Decoder approach: the product's own features (image and product attributes) and temporal features are first encoded, and an RNN (GRU) then decodes them to obtain the sales volume at each time step. Depending on the input features and the encoder, three methods are proposed: (1) Image RNN, (2) Multi-modal RNN, and (3) Explainable Cross Attention RNN. Image RNN takes only the product's image and temporal features as input. The latter two also add the product's attributes, giving a multi-modal embedding built from both the image embedding and the text embedding. Compared with Multi-modal RNN, Explainable Cross Attention RNN adds Temporal Self-Attention and Cross Attention: Temporal Self-Attention weights the temporal features, while Cross Attention weights the image embedding, text embedding and temporal embedding. To deepen my understanding of this paper, I should study Attention in more depth and read some papers on multi-modal embeddings.
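2. Thoughts on the paper
The first novelty of this paper is its use of multi-modal representations: it uses both image embeddings (product images) and text embeddings (product attributes), and it adds external factors (such as holidays and discounts), so the input data is fairly rich. The second novelty is an Encoder-Decoder model for image-based new product time-series forecasting; this is the first attempt to use such a model for this task. In addition, the models make heavy use of the Attention mechanism, which improves accuracy and makes the models more interpretable (wherever Attention is applied, the weight assigned to each part can be inspected). So that the traditional KNN approach of finding similar products is not handicapped by a lack of suitable similar products, the dataset used in the paper covers 10,290 products across 45 categories, which is rich enough. Even in this setting, the models proposed in the paper (except Standard-Image RNN among the Image RNNs) still beat the KNN methods.
One experiment in the paper uses the various methods to predict sales for the 12 weeks after launch, and after week 10 all models perform very poorly. This may be because forecasts degrade as the horizon lengthens, or because none of the models accounts for seasonality, so that a change of season causes abnormal fluctuations in sales. Adding seasonal factors to the input features might therefore alleviate this.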
3. Other
Problem definition

where:
I: the product's image
x: the product's attributes
y(t): the product's sales

In addition, external factors also need to be considered:

Finally:
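As a sketch of the setup described above, the task can be written as follows. The symbols F (the forecasting model), e(t) (the external factors at week t, such as holidays and discounts) and T (the forecast horizon in weeks) are assumed notation, not taken from these notes.

```latex
% Assumed notation: F = forecasting model, e(t) = external factors, T = horizon.
% Product features only:
\hat{y}(t) = F\big(I,\, x\big), \qquad t = 1, \dots, T
% With external factors added:
\hat{y}(t) = F\big(I,\, x,\, e(t)\big), \qquad t = 1, \dots, T
```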

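Baseline
KNN is used to find similar products, and the new product's sales are computed as a similarity-weighted combination of the similar products' sales.
1. Attributes that can be computed on directly: Attribute KNN

2. Attributes that cannot be computed on directly, such as product descriptions and images: Embedding KNN

The paper uses the following three baselines:
1. Category KNN (uses the product category)
2. Color + Category KNN (uses the product color and product category)
3. Embedding KNN (uses the product image)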
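The similarity-weighted KNN idea can be illustrated with a minimal sketch. The cosine similarity, the function names and the toy catalog below are assumptions made for illustration; the notes do not say which similarity measure or weighting the paper actually uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_forecast(new_vec, catalog, k=2):
    """Similarity-weighted average of the k most similar products' sales curves.

    catalog: list of (feature_vector, weekly_sales) pairs for existing products.
    """
    scored = [(cosine_similarity(new_vec, vec), sales) for vec, sales in catalog]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(sim for sim, _ in top)
    horizon = len(top[0][1])
    if total <= 0.0:
        # No usable similarity signal: fall back to an unweighted average.
        return [sum(sales[t] for _, sales in top) / len(top) for t in range(horizon)]
    return [sum(sim * sales[t] for sim, sales in top) / total for t in range(horizon)]

# Toy catalog (made-up numbers): feature vector and three weeks of sales.
catalog = [
    ([1.0, 0.0], [10.0, 20.0, 30.0]),
    ([0.0, 1.0], [100.0, 100.0, 100.0]),
    ([1.0, 0.0], [30.0, 40.0, 50.0]),
]
forecast = knn_forecast([1.0, 0.0], catalog, k=2)
print(forecast)
```

For Attribute KNN the vectors would be encoded attributes, and for Embedding KNN they would be image embeddings; the weighting step is the same in both cases.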
Methods used in the paper
The three models below share the same Decoder, which is an RNN.
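To make the shared decoder concrete, here is a minimal GRU decoder sketch in plain Python. It assumes the decoder receives one encoder output vector per forecast week and maps the hidden state to a sales value with a linear readout; the weights are random and untrained, biases are omitted, and all names and sizes are hypothetical rather than the paper's configuration.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class GRUCell:
    """Plain GRU cell: update gate z, reset gate r, candidate state n."""

    def __init__(self, input_size, hidden_size, rng):
        # One input-to-hidden and one hidden-to-hidden matrix per gate.
        self.W = {g: random_matrix(hidden_size, input_size, rng) for g in "zrn"}
        self.U = {g: random_matrix(hidden_size, hidden_size, rng) for g in "zrn"}

    def _pre(self, gate, x, h):
        return [a + b for a, b in zip(matvec(self.W[gate], x), matvec(self.U[gate], h))]

    def step(self, x, h):
        z = [sigmoid(v) for v in self._pre("z", x, h)]
        r = [sigmoid(v) for v in self._pre("r", x, h)]
        reset_h = [r_i * h_i for r_i, h_i in zip(r, h)]
        n = [math.tanh(v) for v in self._pre("n", x, reset_h)]
        # Blend the previous state with the candidate state.
        return [(1.0 - z_i) * n_i + z_i * h_i for z_i, n_i, h_i in zip(z, n, h)]

def decode(encoder_outputs, cell, readout, hidden_size):
    """Unroll the GRU over the encoder outputs, emitting one value per week."""
    h = [0.0] * hidden_size
    forecast = []
    for x in encoder_outputs:
        h = cell.step(x, h)
        forecast.append(sum(w * h_i for w, h_i in zip(readout, h)))
    return forecast

rng = random.Random(0)
cell = GRUCell(input_size=4, hidden_size=3, rng=rng)
readout = [rng.uniform(-0.5, 0.5) for _ in range(3)]
# Three weeks of made-up encoder outputs, each of dimension 4.
encoder_outputs = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.5, 0.1, 0.2], [0.3, 0.3, 0.0, 0.1]]
forecast = decode(encoder_outputs, cell, readout, hidden_size=3)
print(forecast)
```

Because the decoder only sees encoder output vectors, the three models can differ purely in how those vectors are built.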
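Image RNN
The paper proposes two Image RNN models with the same inputs: the product's image and temporal features.
(1) Attended-Image RNN
The attention mechanism is first applied to the image embedding; the attended image embedding is then concatenated with the temporal embedding to form the Encoder output.

(2) Standard-Image RNN
The image embedding and temporal embedding are concatenated directly to form the Encoder output.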
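The difference between the two encoders can be sketched as follows. This assumes the image embedding is a set of region vectors, that attention is a simple dot product against a query vector (for example a decoder state), and that the standard variant mean-pools the regions before concatenation. These are illustrative assumptions; the notes do not specify the paper's exact attention form.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(regions, query):
    """Dot-product attention: weight each region by its match with the query."""
    scores = [sum(r_i * q_i for r_i, q_i in zip(region, query)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    pooled = [sum(w * region[d] for w, region in zip(weights, regions)) for d in range(dim)]
    return pooled, weights

def mean_pool(regions):
    """Unweighted average of the region vectors."""
    dim = len(regions[0])
    return [sum(region[d] for region in regions) / len(regions) for d in range(dim)]

def encode_attended(regions, query, temporal):
    """Attended-Image style: attention-pooled image embedding ++ temporal embedding."""
    pooled, _ = attend(regions, query)
    return pooled + temporal

def encode_standard(regions, temporal):
    """Standard-Image style: plain image embedding ++ temporal embedding."""
    return mean_pool(regions) + temporal

# Made-up toy inputs: three image regions, a query and a temporal embedding.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
temporal = [0.5, 0.25, 0.0]
attended = encode_attended(regions, query, temporal)
standard = encode_standard(regions, temporal)
print(attended)
print(standard)
```

In the attended variant, regions that match the query dominate the pooled vector, which is one way noisy parts of the image embedding could be down-weighted.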
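Multi-modal RNN
This model uses the product's image, attributes and temporal features. The image embedding and text embedding are first combined into a multi-modal representation, which is then joined with the temporal embedding (by concatenation or by direct addition) to form the Encoder output.
(1) Concat Multi-modal RNN
The multi-modal representation of the image embedding and text embedding is concatenated with the temporal embedding.

How the two representations are combined:

(2) Residual Multi-modal RNN
The multi-modal representation of the image embedding and text embedding is added to the temporal embedding (element-wise addition).

How the two representations are combined: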
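As a rough illustration of the two options, the sketch below assumes the multi-modal representation and the temporal embedding are already available as plain vectors; the function names and values are hypothetical and this is not the paper's exact formulation.

```python
def concat_fusion(multimodal, temporal):
    """Concat variant: stack the two vectors, so the output dimension is the sum."""
    return list(multimodal) + list(temporal)

def residual_fusion(multimodal, temporal):
    """Residual variant: element-wise addition, so both vectors must share a dimension."""
    if len(multimodal) != len(temporal):
        raise ValueError("residual fusion needs vectors of equal length")
    return [m + t for m, t in zip(multimodal, temporal)]

# Made-up toy vectors.
multimodal = [1.0, 2.0, 3.0]
temporal = [0.5, 0.5, 0.5]
concat_out = concat_fusion(multimodal, temporal)
residual_out = residual_fusion(multimodal, temporal)
print(concat_out)
print(residual_out)
```

Concatenation doubles the Encoder output dimension in this toy case, while addition keeps it unchanged.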

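Cross-Attention RNN
The differences from Multi-modal RNN are:
(1) Temporal Self-Attention: self-attention is applied to the temporal embedding;
(2) Cross Attention: cross attention is applied to the temporal embedding, image embedding and text embedding, assigning a different weight to each of the three.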
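The two attention steps can be sketched in a simplified form. The assumptions here are: temporal self-attention is scaled dot-product attention without learned projections, each temporal feature's importance is read off as the average attention it receives, and cross attention scores each modality against a hypothetical decoder state. The feature names and numbers are made up for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def temporal_self_attention(features):
    """Self-attention over temporal feature embeddings (no learned projections).

    Returns the mean-pooled output and the average attention each feature receives.
    """
    names = list(features)
    vecs = [features[n] for n in names]
    dim = len(vecs[0])
    scale = math.sqrt(dim)
    # Row i holds how much feature i attends to every feature j.
    attn = [softmax([dot(q, k) / scale for k in vecs]) for q in vecs]
    outputs = [[sum(a * v[d] for a, v in zip(row, vecs)) for d in range(dim)] for row in attn]
    pooled = [sum(o[d] for o in outputs) / len(outputs) for d in range(dim)]
    received = {n: sum(row[j] for row in attn) / len(attn) for j, n in enumerate(names)}
    return pooled, received

def cross_attention(modalities, query):
    """Weight each modality embedding by its match with the query."""
    names = list(modalities)
    weights = softmax([dot(modalities[n], query) for n in names])
    dim = len(query)
    context = [sum(w * modalities[n][d] for w, n in zip(weights, names)) for d in range(dim)]
    return context, dict(zip(names, weights))

temporal_features = {
    "holiday": [2.0, 0.0],
    "discount": [0.5, 0.5],
    "week_of_year": [0.0, 1.0],
}
temporal_emb, temporal_weights = temporal_self_attention(temporal_features)

decoder_state = [1.0, 0.0]  # hypothetical query vector
modalities = {"image": [0.2, 0.1], "text": [0.1, 0.3], "temporal": temporal_emb}
context, modality_weights = cross_attention(modalities, decoder_state)
print(temporal_weights)
print(modality_weights)
```

The returned weight dictionaries are what makes this kind of model inspectable: they show which temporal feature and which modality dominated a given prediction.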

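Experimental results
Comparison of the baselines, Standard-Image RNN and Attended-Image RNN:

Average performance over all products:

The image embedding may contain noise that the RNN does not filter out properly. This hurts Standard-Image RNN badly, to the point that it performs worse than Embedding KNN. Attended-Image RNN uses the attention mechanism and performs best.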

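Comparison of Attended-Image RNN, Concat Multi-modal RNN, Residual Multi-modal RNN and Cross-Attention RNN:

Average performance over all products:

Attended-Image RNN performs worst, possibly because a multi-modal representation works better than a single-modality one. The Encoder output vector of Concat Multi-modal RNN has a relatively high dimension, which may cause convergence problems, so it performs worse than Residual Multi-modal RNN and only slightly better than Attended-Image RNN. Residual Multi-modal RNN and Cross-Attention RNN perform about equally well.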

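Interpretability experiment for Cross-Attention RNN:
Q: Why did the product's sales peak in week 4?
A: Because week 4 contained 2 holiday days, which had a large effect on sales.
Accordingly, in Temporal Self-Attention the Holiday feature receives a large weight, and in Cross Attention the temporal features receive a large weight.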

Experiment with a longer forecast horizon:
As the forecast horizon lengthens, every method's accuracy degrades, possibly because of seasonal factors.
