- papers
  - [swin] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows: Microsoft; multi-level features; window-based attention
  - [swin V2] Swin Transformer V2: Scaling Up Capacity and Resolution: the field is brutally competitive, V2 landed within the same year
  - [PVT 2021] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions: SenseTime; also a pyramid structure; introduces a reduction ratio to cut the computation cost
  - [Twins 2021] Twins: Revisiting the Design of Spatial Attention in Vision Transformers: Meituan
  - [MiT 2021] SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers: a semantic segmentation paper that proposes a family of Mix Transformer encoders (MiT), based on PVT; a reduction ratio is applied to K so the computation cost grows roughly linearly
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- my Swin V1 paper note is at https://amberzzzz.github.io/2021/01/18/transformers/; a quick look back:
  - input embedding: absolute PE replaced with relative PE
  - basic stage
  - basic swin block: W-MSA & SW-MSA
  - patch merging
  - classification head
a quick look back at the MSA layer (a minimal code sketch follows this list):
- scaled dot-product attention
  - dot: compute the similarity between each query and all keys, $QK^T$
  - scaled: normalize the dot products, with $\sqrt{d_k}$ as the denominator, $\frac{QK^T}{\sqrt{d_k}}$
  - weighted sum
    - softmax: compute each query's weights over all keys, $softmax(\frac{QK^T}{\sqrt{d_k}})$
    - sum: compute the weighted sum of all values for each query, used as its global representation, $softmax(\frac{QK^T}{\sqrt{d_k}})V$
- multi-head attention
  - linear project: project the input Q/K/V into h sub-QKV pairs of dimension D/h each
  - attention in parallel: compute the global attention of each sub-Q over its sub-KV pair, for all heads in parallel
  - concat: concatenate the outputs of all heads, i.e. the global attention of every query
  - linear project: adds representational capacity
- masking
  - used on the decoder inputs
  - each decoder query only attends to the keys that appear before it: the similarity values for later positions are set to -inf, so their weights after softmax are essentially 0
- 3 types of regularization:
  - residual dropout: dropout added on every residual path
  - PE dropout: dropout added after the input PE embedding
  - dropout after sums: dropout added after the sums in each residual block
  - $P_{drop}=0.1$
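The bullets above map almost line-for-line onto code; here is a minimal PyTorch sketch of scaled dot-product attention with an optional mask (names and shapes are illustrative, not from any official implementation):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    # dot + scale: similarity of every query with all keys, divided by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # masking: positions set to -inf get a weight of ~0 after softmax
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # softmax: each query's weights over all keys
    weights = scores.softmax(dim=-1)
    # sum: weighted sum over the values = each query's global representation
    return weights @ v

# toy usage: 1 batch, 2 heads, 4 tokens, 8 dims
q = k = v = torch.randn(1, 2, 4, 8)
out = scaled_dot_product_attention(q, k, v)   # (1, 2, 4, 8)
```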
training details
pretraining (a hedged schedule sketch follows this list):
- AdamW: weight decay = 0.01
- learning rate: linear decay, 5-epoch warm-up, initial = 0.001
- batch size: 4096
- epochs: 60
- an increasing degree of stochastic depth: 0.2, 0.3, 0.5 for Swin-T, Swin-S, and Swin-B
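A minimal sketch of that pre-training schedule in PyTorch, assuming per-epoch LR updates (the official code likely steps per iteration; the model here is a placeholder, not a Swin backbone):

```python
import torch

# placeholder model standing in for a Swin backbone
model = torch.nn.Linear(768, 1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

warmup_epochs, total_epochs = 5, 60

def lr_lambda(epoch):
    # linear warm-up over the first 5 epochs
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    # then linear decay from the initial lr towards 0 at epoch 60
    return max(0.0, (total_epochs - epoch) / (total_epochs - warmup_epochs))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```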
finetuning:
- on larger resolutions
- batch size: 1024
- epochs: 30
- a constant learning rate of 1e-5
- a weight decay of 1e-8
- stochastic depth ratio set to 0.1
weight transfer
- different resolution: Swin is window-based, so changing the resolution does not directly affect the weights
- different window size: the relative position bias has to be interpolated (bi-cubic) to the target window size, see the sketch below
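A minimal sketch of that bi-cubic transfer, assuming the bias table is stored as ((2M-1)^2, num_heads) as in the public Swin code; the helper name is mine:

```python
import torch
import torch.nn.functional as F

def resize_rel_pos_bias(table, old_window, new_window):
    # table: ((2*old_window-1)**2, num_heads), one bias value per relative offset per head
    num_heads = table.shape[1]
    s_old, s_new = 2 * old_window - 1, 2 * new_window - 1
    # treat the table as a tiny (heads x s_old x s_old) image
    grid = table.permute(1, 0).reshape(1, num_heads, s_old, s_old)
    grid = F.interpolate(grid, size=(s_new, s_new), mode="bicubic", align_corners=False)
    # back to ((2*new_window-1)**2, num_heads)
    return grid.reshape(num_heads, s_new * s_new).permute(1, 0)

# e.g. transfer a 7x7-window table to a 12x12 window
table_7 = torch.randn((2 * 7 - 1) ** 2, 4)
table_12 = resize_rel_pos_bias(table_7, 7, 12)   # (529, 4)
```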
Swin Transformer V2: Scaling Up Capacity and Resolution
Motivation
- scaling up:
  - capacity and resolution
  - generally applicable to vision models
- facing issues
  - instability
  - effective transfer of low-resolution pre-trained models to high-resolution downstream tasks
  - GPU memory consumption
- we present techniques
  - a post normalization technique: for instability
  - a scaled cosine attention approach: for instability
  - a log-spaced continuous position bias: for transfer
  - implementation details that lead to significant GPU memory savings
Arguments
- that bigger models help has been thoroughly demonstrated in NLP: BERT/GPT are all pretrained huge models plus downstream few-shot finetuning
- scaling up in CV lags somewhat behind, and the existing large models are only applied to image classification tasks
instability
- the main cause of instability in large models is that the values on the residual paths are added directly to the main branch, so the amplitudes accumulate
- post-norm is proposed: move the LN to after the residual unit, which bounds the amplitude
- scaled cosine attention is proposed to replace the original dot-product attention; the benefit is that the cosine is independent of the amplitude
- see the figure: the triangle markers are post-norm, the circles are pre-norm
transfer
- log-spaced continuous position bias (Log-CPB) is proposed
- previously this was handled by bi-cubic interpolation of the position bias maps
- see the figure: the first row is Swin V1's interpolation, which drops significantly when transferred to other window sizes; of the two CPB rows below, one maintains accuracy and the other improves it
GPU usage
- ZeRO optimizer
- activation checkpointing
- a novel implementation of sequential self-attention computation
our model
- 3 billion params
- 1536x1536 resolution
- Nvidia A100-40G GPUs
- with less finetuning data it achieves better performance on downstream tasks
Method
overview
normalization configuration
- common language Transformers and the vanilla ViT both place the norm layer first (pre-norm)
- so Swin V1 inherited this setting
- but Swin V2 rearranges it
relative position bias
- key component in V1
- no PE is used; instead a bias term is introduced inside MSA: $Attention(Q,K,V)=Softmax(QK^T/\sqrt d + B)V$
- with $M^2$ patches in a window, $B \in R^{M^2 \times M^2}$; the relative position along each axis lies in $[-M+1, M-1]$, so there is a bias matrix $\hat B \in R^{(2M-1)\times(2M-1)}$ from which $B$ is taken; in the source code $\hat B$ is initialized with truncated_normal and $B$ is gathered from $\hat B$ (see the index sketch below)
- when the window size changes, the bias matrix is transformed with bi-cubic interpolation
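A sketch of how $B$ is gathered from $\hat B$ inside one window (single head, index logic only); it mirrors the public Swin implementation, but the variable names here are mine:

```python
import torch

M = 7                                              # window size -> M*M patches per window
table = torch.zeros((2 * M - 1) ** 2)              # \hat{B}, e.g. truncated-normal initialized

# 2-D coordinates of the M*M patches inside one window
coords = torch.stack(torch.meshgrid(
    torch.arange(M), torch.arange(M), indexing="ij")).flatten(1)       # (2, M*M)
rel = coords[:, :, None] - coords[:, None, :]      # (2, M*M, M*M), each axis in [-M+1, M-1]
rel = rel.permute(1, 2, 0).contiguous()            # (M*M, M*M, 2)
rel[:, :, 0] += M - 1                              # shift both axes to [0, 2M-2]
rel[:, :, 1] += M - 1
index = rel[:, :, 0] * (2 * M - 1) + rel[:, :, 1]  # flatten the 2-D offset into a 1-D index
B = table[index.view(-1)].view(M * M, M * M)       # the B added to QK^T/sqrt(d)
```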
Scaling Up Model Capacity
under the pre-normalization setting
- the output activation values of each residual block are directly merged back into the main branch
- so the amplitude on the main branch grows larger and larger in deeper layers
- which leads to training instability
Post normalization
- simply swap the order of MSA/MLP and layerNorm, as sketched below
- when training the largest model, a layerNorm is also introduced on the main branch, one norm unit every 6 Transformer blocks
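A sketch of the reordering only, where `attn` and `mlp` are placeholders for the (S)W-MSA and MLP sub-layers (the extra main-branch norm every 6 blocks is omitted):

```python
import torch.nn as nn

class ResPostNormBlock(nn.Module):
    """V2 ordering: LayerNorm is applied to the sub-layer output before the
    residual add, instead of to its input as in the V1 pre-norm block."""

    def __init__(self, dim, attn, mlp):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.norm1(self.attn(x))   # V1 (pre-norm) would be: x + attn(norm1(x))
        x = x + self.norm2(self.mlp(x))    # V1 (pre-norm) would be: x + mlp(norm2(x))
        return x
```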
Scaled cosine attention
- the original similarity term is Q.dot(K)
- but under post-norm, the learnt attention maps are easily dominated by a few pixel pairs
- so it is replaced with cosine: $Sim(q_i,k_j)=\cos(q_i,k_j)/\tau + B_{ij}$
- $\tau$ is a learnable scalar
- constrained to be larger than 0.01
- not shared across heads or layers (see the sketch below)
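A hedged sketch following the formula above; the official implementation parameterizes $\tau$ somewhat differently (via a clamped log scale), so treat the names and shapes here as illustrative:

```python
import torch
import torch.nn.functional as F

def scaled_cosine_attention(q, k, v, bias, tau):
    # q, k, v: (batch, heads, tokens, dim); bias: (heads, tokens, tokens)
    # cosine similarity = dot product of L2-normalized q and k, so it ignores amplitude
    sim = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
    # tau is learnable, per head and per layer, and kept larger than 0.01
    sim = sim / tau.clamp(min=0.01) + bias         # tau: (heads, 1, 1)
    return sim.softmax(dim=-1) @ v
```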
Scaling Up Window Resolution
Continuous relative position bias
- a small network generates the relative bias, taking relative coordinates as input
- a 2-layer MLP with a ReLU in between
Log-spaced coordinates
after the coordinate values are compressed into log space, the range that interpolation/extrapolation has to cover is much smaller (see the sketch below)
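A minimal sketch of Log-CPB combining the two pieces above; the paper's extra normalization constants are omitted and the class/function names are mine:

```python
import torch
import torch.nn as nn

def log_spaced_coords(M):
    # all relative offsets (-M+1 .. M-1) along each axis
    r = torch.arange(-(M - 1), M, dtype=torch.float32)
    dy, dx = torch.meshgrid(r, r, indexing="ij")
    coords = torch.stack([dy, dx], dim=-1)                  # (2M-1, 2M-1, 2)
    # sign(x) * log(1 + |x|): compresses large offsets, so extrapolating to a
    # bigger window covers a much smaller range than in linear space
    return torch.sign(coords) * torch.log1p(coords.abs())

class LogCPB(nn.Module):
    def __init__(self, num_heads, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, num_heads))

    def forward(self, M):
        # bias over all (2M-1)x(2M-1) relative offsets, one value per head
        return self.mlp(log_spaced_coords(M))               # (2M-1, 2M-1, num_heads)
```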
Implementation to save GPU memory
- Zero-Redundancy Optimizer (ZeRO): in ordinary data parallelism the optimizer states are replicated on every GPU, which is extremely unfriendly to large models; the fix is to divide and distribute them across multiple GPUs
- Activation check-pointing: not elaborated; at high resolution the stored feature maps also take a lot of memory, and checkpointing trades this memory for recomputation, slowing training by up to 30%
- Sequential self-attention computation: compute self-attention sequentially instead of in batch; I don't fully understand this low-level matrix-efficiency optimization
Joining with a self-supervised approach
- large models need to be driven by massive amounts of data
- one approach is enlarging the ImageNet dataset to 5x its size using noisy labels
- they also additionally employ a self-supervised learning approach to better exploit this data: "SimMIM: A Simple Framework for Masked Image Modeling", which I haven't read yet