Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Abstract

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.

Challenges faced: challenges in adapting the Transformer from language to vision arise from differences between the two domains:

  • large variations in the scale of visual entities
  • high resolution of pixels in images compared to words in text.

To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows.

Measures taken: the shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.

Results

  • Image classification: 87.3 top-1 accuracy on ImageNet-1K
  • Object detection: 58.7 box AP and 51.1 mask AP on COCO test-dev
    • +2.7 box AP over the previous state of the art on COCO
    • +2.6 mask AP over the previous state of the art on COCO
  • Semantic segmentation: 53.5 mIoU on ADE20K val
    • +3.2 mIoU over the previous state of the art on ADE20K

1. Introduction

Modern computer vision tasks are dominated by the CNN family of networks; the widespread adoption of CNNs began with AlexNet's revolutionary performance on ImageNet. As CNN architectures have grown deeper, wider, and more elegantly structured, they have become ever more powerful. With CNNs serving as the backbone for most major vision tasks, these architectures have lifted the entire field.
On the other hand, the popularity and huge success of the Transformer in NLP has motivated many researchers to study how to bring the Transformer into computer vision tasks. Two notable differences between the domains are:

  • visual elements can vary substantially in scale, a problem that receives attention in tasks such as object detection;
  • much higher resolution of pixels in images compared to words in passages of text.

2. Related Work

CNNs and their variants

AlexNet, VGG, GoogLeNet, ResNet, DenseNet, HRNet, and EfficientNet.
Although CNNs and their variants remain the most important backbones for computer vision tasks, we emphasize that Transformer-based unified architectures that can serve both natural language processing and computer vision are a promising backbone.

Using self-attention and Transformers to complement CNNs

These modules provide the capability to encode distant dependencies or heterogeneous interactions.

  • self-attention layers complement backbones
  • self-attention layers complement head networks

The encoder-decoder design in the Transformer has also been applied to object detection.

Transformer-based vision backbones

Vision Transformer (ViT) and its follow-ups

  • ViT achieves promising results on image classification, but it is not well suited as a general-purpose vision backbone because of
    • its low-resolution feature maps
    • the quadratic increase in complexity with image size.

3. Method

3.1. Overall Architecture

  • Each patch is treated as a token.
  • Several Transformer blocks with modified self-attention (Swin Transformer blocks) are applied to these patch tokens.

[Figure: The architecture of a Swin Transformer (Swin-T)]

[Figure: Two successive Swin Transformer blocks (notation presented with Eq. (3)). W-MSA and SW-MSA are multi-head self-attention modules with regular and shifted windowing configurations, respectively.]


Stage 1

Several Transformer blocks with modified self-attention computation (Swin Transformer blocks) are applied to these patch tokens.
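
Before Stage 1, the input image is split into non-overlapping 4 × 4 patches and each raw patch (4 × 4 × 3 = 48 values) is linearly projected to dimension C. A minimal sketch of this patch partition plus linear embedding, implemented as a strided convolution (the class name `PatchEmbed` and the exact details differ from the mmdet code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patch partition + linear embedding as one strided conv (kernel == stride)."""

    def __init__(self, in_channels=3, embed_dims=96, patch_size=4):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dims,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dims)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, C, H/4, W/4)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)     # (B, H/4 * W/4, C) patch tokens
        return self.norm(x), (H, W)

# A 224x224 image becomes 56*56 = 3136 tokens of dimension 96.
tokens, hw = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape, hw)                      # torch.Size([1, 3136, 96]) (56, 56)
```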

Stage 2

To produce a hierarchical representation, the number of tokens is reduced by patch merging layers as the network gets deeper. The first patch merging layer concatenates the features of each group of 2 × 2 neighboring patches and applies a linear layer on the 4C-dimensional concatenated features, reducing the number of tokens by a factor of 4 (2× downsampling of resolution) while setting the output dimension to 2C.
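
A short sketch of such a patch merging layer (concatenate each 2 × 2 group of neighboring patch features, then project 4C → 2C); it is meant only to illustrate the idea and is not the exact mmdet implementation:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge each 2x2 group of neighboring patches: C -> 4C -> 2C, halving H and W."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                      # x: (B, H*W, C)
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)
        return self.reduction(self.norm(x))          # (B, H/2 * W/2, 2C)
```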

Swin Transformer block

Swin Transformer is built by replacing the standard multi-head self-attention (MSA) module in a Transformer block by a module based on shifted windows (described in Section 3.2), with other layers kept the same. As illustrated in Figure 3(b), a Swin Transformer block consists of a shifted window based MSA module, followed by a 2-layer MLP with GELU nonlinearity in between. A LayerNorm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each module.
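
For reference, Eq. (3) of the paper (referred to in the figure caption above) writes two consecutive blocks, W-MSA followed by SW-MSA, as:

$$
\begin{aligned}
\hat{\mathbf{z}}^{l} &= \text{W-MSA}\big(\mathrm{LN}(\mathbf{z}^{l-1})\big) + \mathbf{z}^{l-1},\\
\mathbf{z}^{l} &= \mathrm{MLP}\big(\mathrm{LN}(\hat{\mathbf{z}}^{l})\big) + \hat{\mathbf{z}}^{l},\\
\hat{\mathbf{z}}^{l+1} &= \text{SW-MSA}\big(\mathrm{LN}(\mathbf{z}^{l})\big) + \mathbf{z}^{l},\\
\mathbf{z}^{l+1} &= \mathrm{MLP}\big(\mathrm{LN}(\hat{\mathbf{z}}^{l+1})\big) + \hat{\mathbf{z}}^{l+1},
\end{aligned}
$$

where $\hat{\mathbf{z}}^{l}$ and $\mathbf{z}^{l}$ denote the output features of the (S)W-MSA module and the MLP module of block $l$, respectively.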

3.2. Shifted Window based Self-Attention


Self-attention in non-overlapped windows
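
The paper contrasts the cost of global MSA with window-based MSA for a feature map of $h \times w$ patch tokens of dimension $C$ and windows of $M \times M$ patches (Eqs. (1)-(2)):

$$
\Omega(\text{MSA}) = 4hwC^{2} + 2(hw)^{2}C, \qquad
\Omega(\text{W-MSA}) = 4hwC^{2} + 2M^{2}hwC,
$$

so global attention is quadratic in the number of tokens $hw$, while window attention is linear when $M$ is fixed (M = 7 by default).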

Shifted window partitioning in successive blocks

Efficient batch computation for shifted configuration
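
A rough sketch of the shifted-window trick (the helper name `window_partition` and the shapes are illustrative, not the exact mmdet code): the feature map is cyclically shifted with `torch.roll` by (⌊M/2⌋, ⌊M/2⌋), then partitioned with the same regular window routine, so the number and size of windows stay unchanged; an attention mask then prevents tokens that were not spatial neighbors before the roll from attending to each other.

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows,
    returning (num_windows * B, M, M, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

# Hypothetical shapes: a Swin-T stage-1 feature map with window size 7.
x = torch.randn(2, 56, 56, 96)
M = 7

# Regular partition for W-MSA.
windows = window_partition(x, M)                      # (2*8*8, 7, 7, 96)

# Cyclic shift for SW-MSA, then reuse the same partition ("efficient batch
# computation"); the masking step is omitted here for brevity.
shifted = torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))
shifted_windows = window_partition(shifted, M)        # same shape as above
print(windows.shape, shifted_windows.shape)
```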

Relative position bias
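
In each window, the attention includes a learned relative position bias (Eq. (4) of the paper):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(QK^{T}/\sqrt{d} + B\right)V,
$$

where $Q, K, V \in \mathbb{R}^{M^{2} \times d}$ and the bias $B \in \mathbb{R}^{M^{2} \times M^{2}}$ is gathered from a smaller learned table $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1)}$ according to the relative position of each token pair within the window.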

3.3. Architecture Variants
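
For reference, the four variants reported in the paper can be summarized as a small lookup; the key names below follow the mmdet argument names used later in these notes rather than the paper's notation:

```python
# Variant hyper-parameters from the paper: channel dimension of the first
# stage, number of blocks per stage, and attention heads per stage.
swin_variants = {
    'Swin-T': dict(embed_dims=96,  depths=(2, 2, 6, 2),  num_heads=(3, 6, 12, 24)),
    'Swin-S': dict(embed_dims=96,  depths=(2, 2, 18, 2), num_heads=(3, 6, 12, 24)),
    'Swin-B': dict(embed_dims=128, depths=(2, 2, 18, 2), num_heads=(4, 8, 16, 32)),
    'Swin-L': dict(embed_dims=192, depths=(2, 2, 18, 2), num_heads=(6, 12, 24, 48)),
}
```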

Walkthrough of the Swin source code in mmdet

The SwinTransformer class

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `pretrain_img_size` | int or tuple[int] | The size of the input image when pretraining. | 224 |
| `in_channels` | int | Number of input channels. | 3 |
| `embed_dims` | int | The feature dimension. | 96 |
| `patch_size` | int or tuple[int] | Patch size. | 4 |
| `window_size` | int | Window size. | 7 |
| `mlp_ratio` | int | Ratio of MLP hidden dim to embedding dim. | 4 |
| `depths` | tuple[int] | Depth of each Swin Transformer stage. | (2, 2, 6, 2) |
| `num_heads` | tuple[int] | Parallel attention heads of each Swin Transformer stage. | (3, 6, 12, 24) |
| `strides` | tuple[int] | The patch merging or patch embedding stride of each Swin Transformer stage (in Swin, the kernel size is set equal to the stride). | (4, 2, 2, 2) |
| `out_indices` | tuple[int] | Output from which stages. | (0, 1, 2, 3) |
| `qkv_bias` | bool, optional | If True, add a learnable bias to query, key, value. | True |
| `qk_scale` | float or None, optional | Override the default qk scale of `head_dim ** -0.5` if set. | None |
| `patch_norm` | bool | Whether to add a norm layer for patch embedding and patch merging. | True |
| `drop_rate` | float | Dropout rate. | 0. |
| `attn_drop_rate` | float | Attention dropout rate. | 0. |
| `drop_path_rate` | float | Stochastic depth rate. | 0.1 |
| `use_abs_pos_embed` | bool | If True, add an absolute position embedding to the patch embedding. | False |
| `act_cfg` | dict | Config dict for the activation layer. | dict(type='GELU') |
| `norm_cfg` | dict | Config dict for the normalization layer at the output of the backbone. | dict(type='LN') |
| `with_cp` | bool, optional | Use checkpointing or not. Using checkpointing saves some memory while slowing down training. | False |
| `pretrained` | str, optional | Pretrained model path. | None |
| `convert_weights` | bool | Whether the pretrained model is from the original repo; if so, some keys may need converting for compatibility. | False |
| `frozen_stages` | int | Stages to be frozen (stop grad and set to eval mode); -1 means not freezing any parameters. | -1 |
| `init_cfg` | dict, optional | The config for initialization. | None |
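
A sketch of how these arguments typically appear as the backbone section of an mmdet detector config; the `drop_path_rate` value, the FPN neck, and the checkpoint path are illustrative choices, not defaults taken from the table above:

```python
# Partial mmdet config: Swin-T backbone feeding an FPN neck.
model = dict(
    backbone=dict(
        type='SwinTransformer',
        embed_dims=96,
        depths=(2, 2, 6, 2),
        num_heads=(3, 6, 12, 24),
        window_size=7,
        mlp_ratio=4,
        qkv_bias=True,
        qk_scale=None,
        drop_rate=0.,
        attn_drop_rate=0.,
        drop_path_rate=0.2,
        patch_norm=True,
        out_indices=(0, 1, 2, 3),
        with_cp=False,
        convert_weights=True,
        # Placeholder path: point this at the pretrained Swin-T weights.
        init_cfg=dict(type='Pretrained', checkpoint='<path-to-swin-t-weights>')),
    neck=dict(
        type='FPN',
        in_channels=[96, 192, 384, 768],  # matches the per-stage channel dims
        out_channels=256,
        num_outs=5))
```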

Channel-dimension changes in Swin Transformer

[Figure 5: channel dimensions across the stages]
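
A quick way to see the channel/resolution progression is to run the backbone on a dummy input; this assumes the mmdet `SwinTransformer` backbone is importable as below (mmdet 2.x layout):

```python
import torch
from mmdet.models.backbones import SwinTransformer

# Swin-T settings: C = 96 and depths (2, 2, 6, 2), outputting all four stages.
backbone = SwinTransformer(embed_dims=96, depths=(2, 2, 6, 2),
                           num_heads=(3, 6, 12, 24), out_indices=(0, 1, 2, 3))
outs = backbone(torch.randn(1, 3, 224, 224))
for i, feat in enumerate(outs):
    print(f'stage {i}:', tuple(feat.shape))
# stage 0: (1, 96, 56, 56)    C,  H/4,  W/4
# stage 1: (1, 192, 28, 28)   2C, H/8,  W/8
# stage 2: (1, 384, 14, 14)   4C, H/16, W/16
# stage 3: (1, 768, 7, 7)     8C, H/32, W/32
```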

References

https://jalammar.github.io/illustrated-transformer/
Video: 【沈向洋带你读论文】 Swin Transformer, Marr Prize paper (ICCV 2021 Best Paper)

(YouTube)Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (paper illustrated)