Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Abstract
Presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
Challenges: difficulties in adapting Transformer from language to vision arise from differences between the two domains:
- large variations in the scale of visual entities
- high resolution of pixels in images compared to words in text.
To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows.
How the challenges are addressed: the shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. The hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.
Results
- Image classification: 87.3 top-1 accuracy on ImageNet-1K
- Object detection: 58.7 box AP and 51.1 mask AP on COCO test-dev
    - +2.7 box AP over the previous state of the art on COCO
    - +2.6 mask AP over the previous state of the art on COCO
- Semantic segmentation: 53.5 mIoU on ADE20K val
    - +3.2 mIoU over the previous state of the art on ADE20K
1. Introduction
Modern computer vision tasks have long been dominated by CNN-family networks; the widespread adoption of CNNs began with AlexNet's revolutionary performance on ImageNet. Since then, CNN architectures have grown ever more powerful through greater depth, greater width, and more sophisticated designs. With CNNs serving as the backbone of nearly every major vision task, these architectural advances have lifted the entire field.
On the other hand, the enormous success of Transformers in NLP has led many researchers to explore bringing Transformers into computer vision. Two differences between the domains make this non-trivial:
- visual elements can vary substantially in scale, a problem that receives attention in tasks such as object detection.
- much higher resolution of pixels in images compared to words in passages of text.
2. Related Work
CNN and its variants
AlexNet, VGG, GoogLeNet, ResNet, DenseNet, HRNet, and EfficientNet
Although CNNs and their variants remain the most important backbones for computer vision tasks, we highlight the potential of Transformer-based unified architectures that can be shared between natural language processing and computer vision.
Self-attention and Transformers to complement CNNs
Some works augment standard CNN architectures with self-attention layers or Transformers, providing the capability to encode distant dependencies or heterogeneous interactions.
- self-attention layers complement backbones
- self-attention layers complement head networks
The encoder-decoder design in Transformer has also been applied to object detection.
Transformer-based vision backbones
Vision Transformer (ViT) and its follow-ups
- ViT achieves impressive results on image classification, but it is not well suited to serve as a general-purpose vision backbone:
- low-resolution feature maps
- the quadratic increase in complexity with image size.
3. Method
3.1. Overall Architecture
- Each patch is treated as a token.
- Several Transformer blocks with modified self-attention computation (Swin Transformer blocks) are applied to these patch tokens.
Figure 3: (a) the architecture of a Swin Transformer (Swin-T); (b) two successive Swin Transformer blocks (notation presented with Eq. (3)). W-MSA and SW-MSA are multi-head self attention modules with regular and shifted windowing configurations, respectively.
Stage 1
Several Transformer blocks with modified self-attention computation (Swin Transformer blocks) are applied on patch tokens.
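A minimal sketch of the preceding patch partition + linear embedding step, assuming the common strided-convolution implementation and the patch_size=4 / embed_dims=96 defaults listed in the mmdet table below (class and names here are illustrative, not mmdet's exact code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into 4x4 patches and linearly embed each patch.

    A 4x4 convolution with stride 4 is equivalent to flattening each
    non-overlapping 4x4x3 patch and applying a shared linear layer.
    """
    def __init__(self, in_channels=3, embed_dims=96, patch_size=4):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dims,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dims)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/4, W/4)
        x = x.flatten(2).transpose(1, 2)       # (B, H/4*W/4, C) -- one token per patch
        return self.norm(x)
```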
Stage 2
To produce a hierarchical representation, the number of tokens is reduced by patch merging layers as the network gets deeper. The first patch merging layer concatenates the features of each group of 2×2 neighboring patches and applies a linear layer on the 4C-dimensional concatenated features; this reduces the number of tokens by a factor of 4 (2× downsampling of resolution) and sets the output dimension to 2C.
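A sketch of a patch merging layer consistent with the description above: each 2×2 group of neighboring tokens is concatenated into a 4C-dimensional feature, normalized, and projected to 2C (names are illustrative, not mmdet's exact module):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Downsample by 2x: concatenate each 2x2 group of tokens (C -> 4C),
    normalize, then reduce to 2C with a linear layer."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                # x: (B, H*W, C)
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]               # top-left of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]               # bottom-left
        x2 = x[:, 0::2, 1::2, :]               # top-right
        x3 = x[:, 1::2, 1::2, :]               # bottom-right
        x = torch.cat([x0, x1, x2, x3], -1)    # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)               # (B, H/2*W/2, 4C)
        return self.reduction(self.norm(x))    # (B, H/2*W/2, 2C)
```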
Swin Transformer block
Swin Transformer is built by replacing the standard multi-head self attention (MSA) module in a Transformer block by a module based on shifted windows (described in Section 3.2), with other layers kept the same. As illustrated in Figure 3(b), a Swin Transformer block consists of a shifted window based MSA module, followed by a 2-layer MLP with GELU nonlinearity in between. A LayerNorm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each module.
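The data flow of two consecutive blocks (a W-MSA block followed by an SW-MSA block), written as a small pseudo-forward that mirrors Eq. (3) of the paper; the module names here are placeholders:

```python
# Pseudo-forward of two consecutive Swin blocks (Eq. (3) in the paper).
# w_msa / sw_msa are window / shifted-window attention modules, ln* are
# LayerNorms, mlp* are 2-layer MLPs with GELU; names are illustrative.
def swin_block_pair(z, w_msa, sw_msa, ln1, ln2, ln3, ln4, mlp1, mlp2):
    z = w_msa(ln1(z)) + z      # W-MSA block: regular window partition
    z = mlp1(ln2(z)) + z
    z = sw_msa(ln3(z)) + z     # SW-MSA block: windows shifted by (M/2, M/2)
    z = mlp2(ln4(z)) + z
    return z
```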
3.2. Shifted Window based Self-Attention
Self-attention in non-overlapped windows
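The paper computes self-attention within non-overlapping M×M windows: for an h×w map of C-dimensional tokens, global MSA costs 4hwC^2 + 2(hw)^2·C while window-based MSA costs 4hwC^2 + 2M^2·hwC, i.e. linear in hw for a fixed M. A sketch of the window partition/reverse helpers (a common implementation, not necessarily mmdet's exact code):

```python
def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping MxM windows.
    Self-attention is then computed independently inside each window, so the
    cost is linear in H*W (for fixed M) instead of quadratic."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)   # (num_windows*B, M, M, C)

def window_reverse(windows, M, H, W):
    """Inverse of window_partition: stitch windows back into (B, H, W, C)."""
    B = windows.shape[0] // ((H // M) * (W // M))
    x = windows.view(B, H // M, W // M, M, M, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)
```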
Shifted window partitioning in successive blocks
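Consecutive blocks alternate between the regular partition and one displaced by (⌊M/2⌋, ⌊M/2⌋). A sketch of the cyclic shift that implements this, assuming a torch.roll-based implementation as in the original repo:

```python
import torch

def cyclic_shift(x, window_size, reverse=False):
    """Cyclically shift a (B, H, W, C) feature map by half a window.

    In the SW-MSA block the map is shifted by (-M//2, -M//2) before window
    partitioning, so the new windows straddle the previous window boundaries;
    after attention the shift is undone (reverse=True)."""
    s = window_size // 2
    shift = (s, s) if reverse else (-s, -s)
    return torch.roll(x, shifts=shift, dims=(1, 2))
```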
Efficient batch computation for shifted configuration
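With the cyclic shift, some windows mix patches that are not adjacent in the original image, so a mask restricts attention to patches from the same sub-window and all windows can still be processed in one batched attention call. A sketch of how such a mask can be built (the -100 fill value follows the original implementation):

```python
import torch

def sw_msa_attention_mask(H, W, window_size, shift_size):
    """Attention mask used together with the cyclic shift.

    Each contiguous sub-region of the shifted map gets an integer label;
    query/key pairs with different labels receive -100, so softmax pushes
    their attention weights to ~0."""
    img_mask = torch.zeros((1, H, W, 1))
    region = (slice(0, -window_size),
              slice(-window_size, -shift_size),
              slice(-shift_size, None))
    cnt = 0
    for h in region:
        for w in region:
            img_mask[:, h, w, :] = cnt
            cnt += 1
    # partition the label map into windows (same layout as window_partition above)
    mask_windows = (img_mask.view(1, H // window_size, window_size,
                                  W // window_size, window_size, 1)
                    .permute(0, 1, 3, 2, 4, 5)
                    .reshape(-1, window_size * window_size))
    attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
    return attn_mask.masked_fill(attn_mask != 0, -100.0)   # (num_windows, M*M, M*M)
```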
Relative position bias
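A sketch of the learnable relative position bias (one M^2 × M^2 matrix per head, added to the attention logits QK^T/sqrt(d)), parameterized by a (2M−1)×(2M−1) table as in the paper; initialization and naming here are simplified:

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Relative position bias B added inside each window's attention.

    Relative coordinates along each axis lie in [-(M-1), M-1], so a table of
    (2M-1)^2 values per head suffices; the index tensor maps each (query, key)
    pair in an MxM window to its entry in the table."""
    def __init__(self, window_size, num_heads):
        super().__init__()
        M = window_size
        self.bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))
        coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M),
                                            indexing='ij'))
        coords = coords.flatten(1)                          # (2, M*M)
        rel = coords[:, :, None] - coords[:, None, :]       # (2, M*M, M*M)
        rel = rel.permute(1, 2, 0) + (M - 1)                # shift to start from 0
        index = rel[:, :, 0] * (2 * M - 1) + rel[:, :, 1]   # (M*M, M*M)
        self.register_buffer('index', index)

    def forward(self):
        M2 = self.index.shape[0]
        bias = self.bias_table[self.index.view(-1)].view(M2, M2, -1)
        return bias.permute(2, 0, 1)                        # (num_heads, M*M, M*M)
```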
3.3. Architecture Variants
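For reference, the variants defined in the paper differ only in the base channel dimension C and the per-stage depths (window size M = 7 and head dimension 32 throughout); a compact summary:

```python
# Swin variants from Section 3.3 of the paper: stage-1 channel dim,
# number of blocks per stage, and attention heads per stage.
SWIN_VARIANTS = {
    'Swin-T': dict(embed_dims=96,  depths=(2, 2, 6, 2),  num_heads=(3, 6, 12, 24)),
    'Swin-S': dict(embed_dims=96,  depths=(2, 2, 18, 2), num_heads=(3, 6, 12, 24)),
    'Swin-B': dict(embed_dims=128, depths=(2, 2, 18, 2), num_heads=(4, 8, 16, 32)),
    'Swin-L': dict(embed_dims=192, depths=(2, 2, 18, 2), num_heads=(6, 12, 24, 48)),
}
```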
Swin source code walkthrough in mmdet
The SwinTransformer class
| Key parameter | Description | Default |
| --- | --- | --- |
| pretrain_img_size (int \| tuple[int]) | The size of the input image when pretraining. | 224 |
| in_channels (int) | The number of input channels. | 3 |
| embed_dims (int) | The feature dimension. | 96 |
| patch_size (int \| tuple[int]) | Patch size. | 4 |
| window_size (int) | Window size. | 7 |
| mlp_ratio (int) | Ratio of MLP hidden dim to embedding dim. | 4 |
| depths (tuple[int]) | Depths of each Swin Transformer stage. | (2, 2, 6, 2) |
| num_heads (tuple[int]) | Parallel attention heads of each Swin Transformer stage. | (3, 6, 12, 24) |
| strides (tuple[int]) | The patch merging or patch embedding stride of each Swin Transformer stage (in Swin, the kernel size equals the stride). | (4, 2, 2, 2) |
| out_indices (tuple[int]) | Output from which stages. | (0, 1, 2, 3) |
| qkv_bias (bool, optional) | If True, add a learnable bias to query, key, value. | True |
| qk_scale (float \| None, optional) | Override the default qk scale of head_dim ** -0.5 if set. | None |
| patch_norm (bool) | Whether to add a norm layer for patch embedding and patch merging. | True |
| drop_rate (float) | Dropout rate. | 0 |
| attn_drop_rate (float) | Attention dropout rate. | 0 |
| drop_path_rate (float) | Stochastic depth rate. | 0.1 |
| use_abs_pos_embed (bool) | If True, add absolute position embedding to the patch embedding. | False |
| act_cfg (dict) | Config dict for the activation layer. | dict(type='GELU') |
| norm_cfg (dict) | Config dict for the normalization layer at the output of the backbone. | dict(type='LN') |
| with_cp (bool, optional) | Use checkpointing or not. Using checkpointing saves some memory while slowing down training. | False |
| pretrained (str, optional) | Model pretrained path. | None |
| convert_weights (bool) | Whether the pre-trained model is from the original repo; some keys may need to be converted for compatibility. | False |
| frozen_stages (int) | Stages to be frozen (stop grad and set eval mode); -1 means not freezing any parameters. | -1 |
| init_cfg (dict, optional) | The config for initialization. | None |
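A sketch of how these parameters typically appear in an mmdet detector config for a Swin-T backbone (values mirror the defaults above; the pre-trained checkpoint path is left as a placeholder):

```python
# Backbone section of an mmdet config using the SwinTransformer class above
# (Swin-T settings; the rest of the detector config is omitted).
backbone = dict(
    type='SwinTransformer',
    embed_dims=96,
    depths=(2, 2, 6, 2),
    num_heads=(3, 6, 12, 24),
    window_size=7,
    mlp_ratio=4,
    qkv_bias=True,
    drop_rate=0.,
    attn_drop_rate=0.,
    drop_path_rate=0.1,
    patch_norm=True,
    out_indices=(0, 1, 2, 3),
    with_cp=False,
    convert_weights=True,
    init_cfg=dict(type='Pretrained',
                  checkpoint='...'))  # path/URL of the pre-trained Swin weights
```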
Channel dimension changes in Swin Transformer
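To make the channel/resolution progression concrete, a minimal shape check, assuming an mmdet 2.x environment where SwinTransformer is importable from mmdet.models.backbones (Swin-T defaults):

```python
import torch
from mmdet.models.backbones import SwinTransformer

# Swin-T defaults: patch_size=4, embed_dims=96, strides=(4, 2, 2, 2),
# so channels double while resolution halves at each stage.
model = SwinTransformer()
model.eval()
with torch.no_grad():
    outs = model(torch.randn(1, 3, 224, 224))
for i, out in enumerate(outs):
    print(f'stage {i}: {tuple(out.shape)}')
# Expected (per the paper's hierarchy):
# stage 0: (1, 96, 56, 56)
# stage 1: (1, 192, 28, 28)
# stage 2: (1, 384, 14, 14)
# stage 3: (1, 768, 7, 7)
```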
References
https://jalammar.github.io/illustrated-transformer/
(Video) "Shen Xiangyang (Harry Shum) walks you through the paper": Swin Transformer, Marr Prize paper (ICCV 2021 Best Paper)
(YouTube)Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (paper illustrated)