- S2-MLP V1&V2: Spatial-Shift MLP Architecture for Vision
- FishNet: A Versatile Backbone for Image, Region, and Pixel Level Prediction
- Attention Augmented Convolutional Networks
- Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
- LambdaNetworks: Modeling Long-Range Interactions Without Attention
- Involution: Inverting the Inherence of Convolution for Visual Recognition
- ConvMLP: Hierarchical Convolutional MLPs for Vision
- Sparse-MLP: A Fully-MLP Architecture with Conditional Computation
- Hire-MLP: Vision MLP via Hierarchical Rearrangement
- RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?
- CycleMLP: A MLP-like Architecture for Dense Prediction
- Vision Transformers with Hierarchical Attention
- X-volution: On the Unification of Convolution and Self-attention
- On the Integration of Self-Attention and Convolution
- DynaMixer: A Vision MLP Architecture with Dynamic Mixing
- ELSA: Enhanced Local Self-Attention for Vision Transformer
- Container: Context Aggregation Network
- Neighborhood Attention Transformer
- EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers
- TRT-ViT: TensorRT-oriented Vision Transformer
- Fast Vision Transformers with HiLo Attention
- ActiveMLP: An MLP-like Architecture with Active Token Mixer