Facebook AI Research (FAIR), 2022, https://github.com/facebookresearch/ConvNeXt
inductive biases
- Convolution has strong inductive biases, i.e., strong man-made assumptions such as local kernels and shared weights: only spatially neighboring elements are related, and the same kernel extracts features at every position, since visual edge/corner features are independent of spatial location
- In contrast, the transformer architecture has no such hand-crafted priors; attention and the optimization objective are global, so convergence is slower
A ConvNet for the 2020s
Motivation
- reexamine the design spaces and test the limits of what a pure ConvNet can achieve
- accuracy
- achieving 87.8% ImageNet top-1 acc
- outperforming Swin Transformers on COCO detection and ADE20K segmentation
Arguments
- conv
- a sliding-window strategy is intrinsic to visual processing
- built-in inductive biases: convolution's inductive biases are locality and spatial invariance
- locality: spatially nearby grid elements are related while distant ones are not; translation equivariance is a desirable property
- spatial invariance: shared weights, inherently efficient
- ViT
- apart from the initial patchify layer (which is itself a convolution), the architecture introduces no image-specific inductive bias
- the main issue with global attention is the computational cost, which grows quadratically with input size
- this makes the architecture work well for classification, but limits its use in task settings that need high-resolution inputs and hierarchical features
- Hierarchical Transformers
- a hybrid approach: reintroduces the idea of local attention
- applicable to a wide range of vision tasks
- reveals the importance of convolution/locality
- this paper brings back convolutions
- propose a family of pure ConvNets called ConvNeXt
- a Roadmap: from ResNet to ConvNeXt
Method
From ResNet to ConvNeXt
- ResNet-50 / Swin-T: FLOPs around 4.5e9
- ResNet-200 / Swin-B: FLOPs around 15e9
First, train the original ResNet with the transformer-style training recipe to get the baseline, then progressively modernize the architecture:
- macro design
- ResNeXt
- inverted bottleneck
- large kernel size
- various layer-wise micro designs
Training Techniques
- 300 epochs
- AdamW
- aug: Mixup, CutMix, RandAugment, Random Erasing
- reg: Stochastic Depth, Label Smoothing
This alone raises ResNet-50's accuracy from 76.1% to 78.8% (a recipe sketch follows).
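A minimal sketch of applying this transformer-style recipe to the ResNet-50 baseline via the timm library; the concrete values here (learning rate, weight decay, mixup/cutmix alphas, the RandAugment policy string) are illustrative assumptions, not the paper's exact configuration:

```python
import timm
import torch
from timm.data import Mixup, create_transform

# ResNet-50 baseline (stochastic depth and EMA, also part of the recipe, are omitted here)
model = timm.create_model("resnet50", num_classes=1000)

# Heavy augmentation: RandAugment + Random Erasing (policy string and re_prob are assumptions)
train_transform = create_transform(
    input_size=224,
    is_training=True,
    auto_augment="rand-m9-mstd0.5-inc1",
    re_prob=0.25,
)

# Mixup / CutMix producing soft targets, with Label Smoothing folded in
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, label_smoothing=0.1, num_classes=1000)

# AdamW instead of SGD, trained for 300 epochs (cosine schedule not shown)
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
```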
Macro Design
- The macro structure is multi-stage, with a different resolution per stage; the relevant design choices are
- stage compute ratio
- stem cell
- Swin's stage ratio is 1:1:3:1 (1:1:9:1 for larger models), so ResNet-50's block counts of (3, 4, 6, 3) are adjusted to (3, 3, 9, 3); acc improves from 78.8% to 79.4%
- Replace the stem with a more aggressive patchify layer: a non-overlapping 4x4 conv with stride 4; acc improves from 79.4% to 79.5% (both changes are sketched below)
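A minimal PyTorch sketch of the two macro changes, assuming the eventual ConvNeXt-T width of 96 for the stem output (the width change itself comes in the ResNeXt-ify step below):

```python
import torch
import torch.nn as nn

# stage compute ratio: ResNet-50's (3, 4, 6, 3) blocks re-balanced to (3, 3, 9, 3),
# mirroring Swin-T's 1:1:3:1 stage ratio
depths = (3, 3, 9, 3)

# "patchify" stem: non-overlapping 4x4 conv with stride 4, replacing ResNet's 7x7/s2 conv + max-pool
stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
print(stem(x).shape)  # torch.Size([1, 96, 56, 56]) -- same 4x downsampling as ResNet's stem
```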
ResNeXt-ify
- Use grouped convolution for a better FLOPs/accuracy trade-off
- The model-capacity loss caused by grouping is compensated by increasing the network width
- Use depthwise convolution and raise the width from 64 to 96
- groups=channels
- similar to the weighted-sum operation in self-attention: information is mixed only in the spatial dimension, on a per-channel basis
- acc rises to 80.5%, with FLOPs increasing to 5.3G (sketch below)
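A sketch of the depthwise convolution introduced here, with assumed shapes: setting groups equal to the channel count gives each channel its own kernel, so information is mixed only spatially, per channel:

```python
import torch
import torch.nn as nn

dim = 96  # width increased from ResNet's 64 to Swin-T's 96
# depthwise convolution: groups == channels, one 3x3 filter per channel
dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

x = torch.randn(1, dim, 56, 56)
print(dwconv(x).shape)  # torch.Size([1, 96, 56, 56])

# parameter count: 96 * 3 * 3 weights (+ 96 biases) vs. 96 * 96 * 3 * 3 for a dense 3x3 conv
print(sum(p.numel() for p in dwconv.parameters()))  # 960
```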
Inverted Bottleneck
In the FFN of a transformer block, the hidden layer is 4x wider than the input
MobileNetV2 & EfficientNet also use a similar structure: wide in the middle, narrow at both ends
whereas the original ResNe(X)t block is a bottleneck (narrow in the middle, wide at both ends) to save computation
- reduces overall FLOPs: the 1x1 conv on the shortcut of the downsampling blocks now runs at the narrower width and costs less
- accuracy improves slightly: 80.5% to 80.6%; the gain is larger in the ResNet-200/Swin-B regime, 81.9% to 82.6% (a sketch follows)
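A sketch of the inverted bottleneck at this point of the roadmap, using the stage-1 width of 96 (layer grouping and names are mine, not the official module):

```python
import torch.nn as nn

dim = 96
inverted_bottleneck = nn.Sequential(
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                                 # expand: 96 -> 384
    nn.Conv2d(4 * dim, 4 * dim, kernel_size=3, padding=1, groups=4 * dim),  # depthwise 3x3 at 384
    nn.Conv2d(4 * dim, dim, kernel_size=1),                                 # project: 384 -> 96
)
# the residual connection around this block now stays at 96 channels,
# so the 1x1 shortcut conv in downsampling blocks also gets cheaper
```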
Large Kernel Sizes
- First, move the depthwise conv layer up, analogous to the MSA-before-FFN ordering in a transformer block
- this reduces FLOPs, and accuracy temporarily drops to 79.9%
- Then enlarge the kernel size, trying [3, 5, 7, 9, 11]; accuracy saturates at 7x7 (see the sketch after this list)
- acc:from 79.9% (3×3) to 80.6% (7×7)
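A sketch of the two changes in this step, with the same assumed widths: the depthwise conv moves to the top of the block, so it runs at the narrow 96-d width rather than the expanded 384-d one, and its kernel grows to 7x7:

```python
import torch.nn as nn

dim = 96
block = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # 7x7 depthwise, moved up (like MSA before the FFN)
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # "FFN" expansion
    nn.Conv2d(4 * dim, dim, kernel_size=1),                     # "FFN" projection
)
```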
Micro Design: layer-level explorations
- Replacing ReLU with GELU: the original Transformer paper also used ReLU, but more recent advanced transformers use GELU extensively; experiments show the substitution works, though accuracy is unchanged
- Fewer activation functions: a transformer block contains the QKV projections, an output projection, and two fc layers in the FFN, yet only the FFN's hidden layer is followed by a GELU, whereas the original ResNet puts a ReLU after every conv; changing the block so that the only activation sits between the two 1x1 convs (the counterparts of the FFN's linear layers) raises acc to 81.3%, nearly matching Swin
- Fewer normalization layers: keep a single norm before the 1x1 convs, one fewer than a transformer block (experiments show that adding an extra norm at the block entry brings no gain); acc improves to 81.4%, already surpassing Swin
- Substituting BN with LN: for ConvNets, BN speeds up convergence and suppresses overfitting, and directly swapping LN into the vanilla ResNet hurts accuracy, but in the progressively modernized block the substitution slightly helps, reaching 81.5%
- Separate downsampling layers: following Swin, stride 2 is no longer embedded in a residual block's conv; instead a standalone 2x2 conv with stride 2 sits between stages. Adding a norm layer wherever the resolution changes stabilizes training: an LN before each downsampling layer, after the stem, and after the final global average pooling. Acc improves to 82.0%, significantly exceeding Swin (a sketch of the final block and downsampling layer follows)
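Putting the micro-design changes together, the sketch below shows the resulting block (a single LN, a single GELU, a residual connection) plus the separate downsampling layer. The official repo implements the 1x1 convs as nn.Linear on channels-last tensors; this channels-first version is an equivalent illustration rather than the reference code, and LayerNorm2d is a small helper of mine:

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """LayerNorm over the channel dim of an NCHW tensor (helper, not the official class)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

class ConvNeXtBlockSketch(nn.Module):
    def __init__(self, dim: int = 96):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = LayerNorm2d(dim)                    # the block's only norm (LN, not BN)
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)
        self.act = nn.GELU()                            # the block's only activation
        self.pwconv2 = nn.Conv2d(4 * dim, dim, kernel_size=1)

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x)
        x = self.norm(x)
        x = self.pwconv2(self.act(self.pwconv1(x)))
        return shortcut + x                             # residual connection

def downsample_layer(in_dim: int, out_dim: int) -> nn.Sequential:
    # separate downsampling between stages: LN + non-overlapping 2x2 conv with stride 2
    return nn.Sequential(LayerNorm2d(in_dim), nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2))

x = torch.randn(1, 96, 56, 56)
print(ConvNeXtBlockSketch(96)(x).shape)    # torch.Size([1, 96, 56, 56])
print(downsample_layer(96, 192)(x).shape)  # torch.Size([1, 192, 28, 28])
```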
Overall structural params
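For reference, the roadmap ends at the ConvNeXt-T configuration: stage depths (3, 3, 9, 3), widths (96, 192, 384, 768), the 4x4/stride-4 stem, LN + 2x2/stride-2 downsampling between stages, and a GAP -> LN -> Linear head. A quick way to inspect the assembled model is through timm (assuming a timm version that ships ConvNeXt):

```python
import timm
import torch

model = timm.create_model("convnext_tiny", pretrained=False, num_classes=1000)
print(sum(p.numel() for p in model.parameters()) / 1e6)  # roughly 28-29M parameters for ConvNeXt-T

x = torch.randn(1, 3, 224, 224)
print(model(x).shape)  # torch.Size([1, 1000])
```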