
Understanding the paper through its abstract

For the past ten years, CNN has reigned supreme in the world of computer vision, but recently, Transformer is on the rise. However, the quadratic computational cost of self-attention has become a severe problem of practice.

This points out the high computational cost of the self-attention structure.

There has been much research on architectures without CNN and self-attention in this context. In particular, MLP-Mixer is a simple idea designed using MLPs and hit an accuracy comparable to the Vision Transformer.

This leads to the core of the paper: the MLP architecture.

However, the only inductive bias in this architecture is the embedding of tokens.

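In the MLP architecture, the only place where an inductive bias is introduced is the token embedding step. As I see it, inductive bias is brought up here mainly to motivate adding more of it to the original pure-MLP architecture, so that it trains better on vision tasks. I suspect this paper will once again borrow ideas from convolutional architectures.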

Thus, there is still a possibility to build a non-convolutional inductive bias into the architecture itself, and we built in an inductive bias using two simple ideas.

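The emphasis here is that, although an inductive bias is introduced, it is not introduced through a convolutional structure. That leaves constraining the computation itself as the only way to do it.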

  1. A way is to divide the token-mixing block vertically and horizontally.
  2. Another way is to make spatial correlations denser among some channels of token-mixing.

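    Here again is the idea of splitting the computation along the vertical and horizontal directions. Similar ideas have already appeared in many other methods.

    The second point is not very intuitive yet. It looks like a modification of the channel MLP?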

With this approach, we were able to improve the accuracy of the MLP-Mixer while reducing its parameters and computational complexity.

After all, thanks to the divide-and-conquer strategy, the fully connected layer that used to mix all positions at once is replaced by cascaded processing along specific axes. Roughly speaking, this takes the parameter count from the order of $(hw)^2$ down to the order of $h^2 + w^2$.
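
To get a feel for the size of this reduction, here is a small worked example. It is only a sketch under assumptions that are not from the paper: a single weight matrix per mixing step, no biases, no expansion factor, no channel rafts, and an illustrative 14 x 14 patch grid ($h = w = 14$).

```latex
% Assumptions (illustrative, not taken from the paper):
% one weight matrix per mixing step, biases and expansion ignored, h = w = 14.
\begin{align*}
\text{joint mixing over all } hw \text{ tokens:} \quad
  & (hw)^2 = (14 \cdot 14)^2 = 38416 \\
\text{vertical mixing followed by horizontal mixing:} \quad
  & h^2 + w^2 = 14^2 + 14^2 = 392
\end{align*}
```

Under these assumptions the per-axis cascade needs about two orders of magnitude fewer weights than mixing all tokens jointly.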

Compared to other MLP-based models, the proposed model, named RaftMLP has a good balance of computational complexity, the number of parameters, and actual memory usage. In addition, our work indicates that MLP-based models have the potential to replace CNNs by adopting inductive bias. The source code in PyTorch version is available at https://github.com/okojoalg/raft-mlp.

Main content

image.png
As the figure shows, this can in fact still be regarded as an adjustment of the spatial MLP.

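The original structure, which interleaves spatial and channel MLPs, is changed into a cascade of three blocks: vertical, horizontal, and channel. In this way the authors hope to introduce a meaningful inductive bias of 2D images along the vertical and horizontal directions, implicitly assuming that a horizontally or vertically aligned sequence of patches has correlations similar to those of other horizontally or vertically aligned sequences. In addition, before entering the vertical-mixing and horizontal-mixing blocks, some channels are concatenated, and these are shared by the two blocks. This is done because the authors assume that geometric relations exist among certain channels (the paper later calls a group of merged channels a Channel Raft, and the assumption is that channels at a specific interval have such a relation).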
image.png
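How the index layout changes in the Vertical-Mixing Block: (rh * rw * sr, h, w) -> (sr, rh * h, rw * w) <=> (rw * sr * w, rh * h). Because the channel and horizontal dimensions are shared here, the two forms are equivalent, and the figure draws the form on the left of the equivalence sign. The Horizontal-Mixing Block is analogous.
For the Raft-Token-Mixing Block built from the horizontal and vertical blocks, the code example given by the authors matches the right-hand side of the equivalence above. The code also shows that normalization is unaffected by the channel grouping: it is applied directly over the channels of the feature in its original layout.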

    import torch.nn as nn
    import torch.nn.functional as F
    from einops import rearrange


    class RaftTokenMixingBlock(nn.Module):
        # b: size of mini-batch, h: height, w: width (in patches),
        # c: channels, r: size of raft (channels merged per raft), o: c // r,
        # e: expansion factor
        def __init__(self, h, w, c, r, e):
            super().__init__()
            self.h, self.w, self.r = h, w, r
            self.lnv = nn.LayerNorm(c)
            self.lnh = nn.LayerNorm(c)
            self.fcv1 = nn.Linear(r * h, r * h * e)
            self.fcv2 = nn.Linear(r * h * e, r * h)
            self.fch1 = nn.Linear(r * w, r * w * e)
            self.fch2 = nn.Linear(r * w * e, r * w)

        def forward(self, x):
            """x: tensor of shape (b, h * w, c)."""
            h, w, r = self.h, self.w, self.r
            # Vertical-Mixing Block
            y = self.lnv(x)
            y = rearrange(y, 'b (h w) (r o) -> b (o w) (r h)', h=h, w=w, r=r)
            y = self.fcv1(y)
            y = F.gelu(y)
            y = self.fcv2(y)
            y = rearrange(y, 'b (o w) (r h) -> b (h w) (r o)', h=h, w=w, r=r)
            x = x + y  # residual connection around the vertical block
            # Horizontal-Mixing Block
            y = self.lnh(x)
            y = rearrange(y, 'b (h w) (r o) -> b (o h) (r w)', h=h, w=w, r=r)
            y = self.fch1(y)
            y = F.gelu(y)
            y = self.fch2(y)
            y = rearrange(y, 'b (o h) (r w) -> b (h w) (r o)', h=h, w=w, r=r)
            return x + y  # residual connection around the horizontal block
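
The pattern `'b (h w) (r o) -> b (o w) (r h)'` used in the vertical block can be hard to picture, so here is a pure-Python sketch that applies the same index mapping to a single sample stored as nested lists. The function name `vertical_rearrange` and the tiny sizes are made up for illustration, and the sketch assumes the usual einops convention that the first name in a group such as `(h w)` is the outer (slower) index.

```python
def vertical_rearrange(x, h, w, r):
    """Pure-Python analogue of 'b (h w) (r o) -> b (o w) (r h)' for one sample.

    x is a list of h*w tokens, each a list of c = r*o channel values.
    The result has o*w rows, each holding r*h values.
    """
    c = len(x[0])
    assert len(x) == h * w and c % r == 0
    o = c // r
    out = []
    for oi in range(o):              # (o w): o is the outer row index
        for wi in range(w):
            row = []
            for ri in range(r):      # (r h): r is the outer column index
                for hi in range(h):
                    token = hi * w + wi      # (h w): h is the outer index
                    channel = ri * o + oi    # (r o): r is the outer index
                    row.append(x[token][channel])
            out.append(row)
    return out


# Illustrative sizes: a 2 x 3 patch grid, 4 channels, raft size 2.
h, w, r, o = 2, 3, 2, 2
# Label every entry with its (token, channel) position so it can be traced.
x = [[(t, ch) for ch in range(r * o)] for t in range(h * w)]
rows = vertical_rearrange(x, h, w, r)
print(len(rows), len(rows[0]))  # 6 4
print(rows[0])                  # [(0, 0), (3, 0), (0, 2), (3, 2)]
```

The first row collects one column of patches (tokens 0 and 3) across channels 0 and 2. In this layout a raft is made of channels spaced `o = c // r` apart, and the linear layers that follow mix exactly these `r * h` values.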

For the proposed structure, choosing a suitable raft size $r$ lets the raft-token-mixing block have fewer parameters ([formula image missing]) and fewer MACs (multiply-accumulate operations) ([formula image missing]) than the original token-mixing block. Here it is assumed that [formula image missing], and that the token-mixing block uses the same expansion factor $e$.
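
A rough way to check this kind of claim is to count weights and MACs directly from the `nn.Linear` shapes in the code example above. The sketch below does that under simplifying assumptions that are mine rather than the paper's: biases, LayerNorm and the activation are ignored, and one multiply-accumulate is counted per weight per row the layer is applied to. The function names and the sample sizes (`h = w = 14`, `c = 768`, `r = 2`, `e = 2`) are purely illustrative.

```python
def mixer_token_mixing_cost(h, w, c, e):
    """Weights and MACs of a plain token-mixing MLP over all h*w tokens."""
    n = h * w
    weights = 2 * e * n * n      # two Linear layers: n -> e*n -> n
    macs = c * weights           # the MLP is applied once per channel
    return weights, macs


def raft_token_mixing_cost(h, w, c, r, e):
    """Weights and MACs of the vertical plus horizontal raft mixing MLPs."""
    assert c % r == 0, "c must be divisible by the raft size r"
    o = c // r
    v_weights = 2 * e * (r * h) ** 2   # r*h -> e*r*h -> r*h
    h_weights = 2 * e * (r * w) ** 2   # r*w -> e*r*w -> r*w
    v_macs = o * w * v_weights         # applied to the o*w rows of (o w) (r h)
    h_macs = o * h * h_weights         # applied to the o*h rows of (o h) (r w)
    return v_weights + h_weights, v_macs + h_macs


mixer_weights, mixer_macs = mixer_token_mixing_cost(h=14, w=14, c=768, e=2)
raft_weights, raft_macs = raft_token_mixing_cost(h=14, w=14, c=768, r=2, e=2)
print("weights:", mixer_weights, "vs", raft_weights)
print("MACs:   ", mixer_macs, "vs", raft_macs)
```

With this counting and `h = w`, the raft block has fewer weights when `4 * e * r**2 * h**2 < 2 * e * h**4`, that is `r < h / sqrt(2)`, and fewer MACs when `r < h / 2`. These thresholds follow from the simplified counting above and are not quoted from the paper.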

Experimental results

image.png
In these results, because of how the models are configured, RaftMLP-12 is mainly compared with Mixer-B/16 and ViT-B/16, while RaftMLP-36 is mainly compared with ResMLP-36.
Although RaftMLP-36 has almost the same parameters and number of FLOPs as ResMLP-36, it is not more accurate than ResMLP-36. However, since RaftMLP and ResMLP have different detailed architectures other than the raft-token-mixing block, the effect of the raft-token-mixing block cannot be directly compared, unlike the comparison with MLP-Mixer. Nevertheless, we can see that raft-token-mixing is working even though the layers are deeper than RaftMLP-12.
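As for this last comparison of the 36-layer models, I do not quite see what the authors are trying to say. Is the concern that raft-token-mixing might stop working once there are more layers?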

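Some extensions and ideas

  • The token-mixing block can be extended to the 3D case to replace 3D convolutions, which would make it usable for video.
  • This paper introduces only horizontal and vertical spatial inductive biases, plus some constraints on correlations between channels. The authors also mention that other inductive biases could be tried, for example parallel invariance (I do not fully understand this one) and hierarchy.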

    Links

  • Paper: https://arxiv.org/abs/2108.04384

  • Code: https://github.com/okojoalg/raft-mlp