RepVGG

RepVGG: Making VGG-style ConvNets Great Again

  1. Motivation

    • plain ConvNets
      • simple and efficient, but deliver poor performance
    • proposes a CNN architecture, RepVGG
      • can be decoupled into a training-time and an inference-time architecture
      • via a structural re-parameterization technique
      • the inference-time architecture has a VGG-like plain body
    • faster
      • runs 83% faster than ResNet-50 or 101% faster than ResNet-101
    • accuracy-speed trade-off
      • reaches over 80% top-1 accuracy
      • outperforms ResNets by a large margin
    • verified on classification & semantic segmentation tasks
  2. Arguments

    • well-designed CNN architectures

      • Inception, ResNet, DenseNet, NAS models
      • deliver higher accuracy
      • drawbacks
        • multi-branch designs: slow down inference and reduce memory utilization, unfriendly to highly parallelized devices
        • some components (depthwise conv & channel shuffle) increase memory access cost
      • MAC (memory access cost) constitutes a large portion of the runtime of groupwise convolution: in my group-conv implementation, computation along the cardinality dimension is not parallelized
      • FLOPs do not precisely reflect actual speed: some architectures appear to have fewer FLOPs than old-fashioned VGG/ResNet, yet are not actually faster
    • multi-branch

      • a multi-branch model usually outperforms a plain model
      • because branching makes the model an implicit ensemble of numerous shallower models
      • so that it avoids gradient vanishing
      • the benefits are all for training
      • the drawbacks are undesired at inference
    • the proposed RepVGG

      • advantages
        • plain architecture: no branches
        • composed only of 3x3 conv & ReLU
        • no heavy traces of manual design
      • at training time, it uses identity & 1x1 conv branches
      • at inference time

        • identity can be viewed as a degraded 1x1 conv
        • a 1x1 conv can be viewed as a degraded 3x3 conv (see the kernel sketch after this list)
        • so all the conv-BN branches can be merged into a single 3x3 conv
        • the inference-time model contains only conv and ReLU: no max pooling!!
        • fewer memory units: the inputs to a multi-branch block occupy memory until the branch computations finish, whereas a plain structure's memory is immediately released
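
A quick numerical check of the "degraded conv" claim (a minimal PyTorch sketch; the helper names are mine, not from the paper): a 1x1 kernel is a 3x3 kernel that is zero everywhere except the center, and an identity mapping is a 3x3 conv whose center entries form an identity matrix over channels.

```python
import torch
import torch.nn.functional as F

def pad_1x1_to_3x3(k1):
    # (out_c, in_c, 1, 1) -> (out_c, in_c, 3, 3), non-zero only at the center
    return F.pad(k1, [1, 1, 1, 1])

def identity_as_3x3(channels):
    # identity == 3x3 conv whose kernel is 1 at the center for channel i -> i
    k = torch.zeros(channels, channels, 3, 3)
    for i in range(channels):
        k[i, i, 1, 1] = 1.0
    return k

x = torch.randn(1, 4, 8, 8)
k1 = torch.randn(4, 4, 1, 1)
# 1x1 conv == center-only 3x3 conv (padding 1 keeps the spatial size)
assert torch.allclose(F.conv2d(x, k1), F.conv2d(x, pad_1x1_to_3x3(k1), padding=1), atol=1e-5)
# identity == 3x3 conv with the identity kernel
assert torch.allclose(x, F.conv2d(x, identity_as_3x3(4), padding=1), atol=1e-5)
```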

  3. Method

    • training-time

      • ResNet-like block
        • identity + 1x1 conv + 3x3 conv multi-branches
        • uses BN in each branch
        • with n blocks, the model can be interpreted as an ensemble of $3^n$ models
        • the stride-2 blocks presumably have no identity path?? (the spatial shape changes, so identity cannot be added)
      • simply stack several such blocks to construct the training model (sketched below)
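
A minimal sketch of the training-time block as described above (PyTorch; `RepVGGBlock` is an illustrative name, not the official implementation): three parallel branches with a BN each, summed before a single ReLU, with the identity branch present only when stride is 1 and the channel counts match.

```python
import torch.nn as nn

class RepVGGBlock(nn.Module):
    """Training-time block: 3x3 conv-BN + 1x1 conv-BN + optional BN-only identity."""
    def __init__(self, in_c, out_c, stride=1):
        super().__init__()
        self.conv3 = nn.Sequential(
            nn.Conv2d(in_c, out_c, 3, stride, padding=1, bias=False),
            nn.BatchNorm2d(out_c))
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_c, out_c, 1, stride, padding=0, bias=False),
            nn.BatchNorm2d(out_c))
        # the identity branch only exists when shapes match,
        # hence no identity path in the stride-2 blocks
        self.identity = nn.BatchNorm2d(out_c) if stride == 1 and in_c == out_c else None
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv3(x) + self.conv1(x)
        if self.identity is not None:
            out = out + self.identity(x)
        return self.relu(out)
```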
    • inference-time

      • re-param

        • at inference time, BN is also a linear operation, so it can be folded into the preceding conv
        • both 1x1-like branches can be converted into center-only 3x3 kernels: the real 1x1 conv with learned weights, the identity with identity weights
        • requires every branch to have the same stride, and the padding pixels must be aligned (the fusion is sketched after this list)
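
A sketch of the fusion itself, assuming the block layout sketched above and a model in eval mode (BN uses running statistics): fold each BN into its conv as $W' = W \cdot \gamma/\sigma$, $b' = \beta - \mu \cdot \gamma/\sigma$ with $\sigma = \sqrt{\mathrm{var} + \epsilon}$, pad the 1x1 kernel to 3x3, express the identity as a 3x3 identity kernel, then sum all kernels and biases into one 3x3 conv.

```python
import torch
import torch.nn.functional as F

def fuse_conv_bn(conv_w, bn):
    # BN(W*x) = gamma * (W*x - mu) / sigma + beta  ==  W'*x + b'
    std = (bn.running_var + bn.eps).sqrt()
    w = conv_w * (bn.weight / std).reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * bn.weight / std
    return w, b

@torch.no_grad()
def reparam(block):
    """Collapse the sketched RepVGGBlock into one 3x3 kernel + bias."""
    w3, b3 = fuse_conv_bn(block.conv3[0].weight, block.conv3[1])
    w1, b1 = fuse_conv_bn(block.conv1[0].weight, block.conv1[1])
    w = w3 + F.pad(w1, [1, 1, 1, 1])    # center-pad the 1x1 kernel to 3x3
    b = b3 + b1
    if block.identity is not None:
        id_w = torch.zeros_like(w3)
        for i in range(w3.shape[0]):
            id_w[i, i, 1, 1] = 1.0      # identity expressed as a 3x3 kernel
        wi, bi = fuse_conv_bn(id_w, block.identity)
        w, b = w + wi, b + bi
    return w, b  # parameters of the single equivalent 3x3 conv
```

With the block's original stride `s`, `F.conv2d(x, w, b, stride=s, padding=1)` should reproduce the block's pre-ReLU output.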

      • architectural specification

        • variety: depth and width
        • does not use max pooling: the body has only one kind of operator, 3x3 conv + ReLU
        • head: GAP + fc / task-specific
        • 5 stages
          • the first stage processes high-resolution input, with stride 2
          • the fifth stage shall have more channels, so only one layer is used there, to save parameters
          • the second-to-last stage gets the most layers, balancing params against computation
        • RepVGG-A: [1, 2, 4, 14, 1], to compete against lightweight and middleweight models
        • RepVGG-B: deeper in stages 2, 3, 4, i.e. [1, 4, 6, 16, 1], to compete against high-performance ones

        • basic widths: [64, 128, 256, 512]

          • width multipliers a & b
          • a scales the first 4 stages, b the last one
          • [64a, 128a, 256a, 512b]
          • the first stage's width may only shrink, never grow, since the high resolution makes it computationally expensive: min(64, 64a)

        • to further reduce params & computation (see the config sketch after this list)

          • groupwise 3x3 conv
          • on interleaved layers: starting from the 3rd layer, i.e. the 3rd, 5th, …
          • number of groups: 1, 2, or 4, set globally
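
A config-style sketch of the specification above (plain Python; the names are mine, and the 5-width reading of [64a, 128a, 256a, 512b] with the first-stage cap is my reconstruction):

```python
def repvgg_widths(a, b):
    # stage widths: [min(64, 64a), 64a, 128a, 256a, 512b]
    return [min(64, int(64 * a)), int(64 * a), int(128 * a), int(256 * a), int(512 * b)]

# per-stage block counts, stages 1..5
DEPTHS_A = [1, 2, 4, 14, 1]   # RepVGG-A
DEPTHS_B = [1, 4, 6, 16, 1]   # RepVGG-B

# e.g. the paper's RepVGG-A0 uses a=0.75, b=2.5
print(repvgg_widths(0.75, 2.5))  # [48, 48, 96, 192, 1280]
```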
  4. Experiments

    • the effect of the branches

    • structural variants

      • remove BN from the identity path
      • move all the BNs to after the addition
      • add a ReLU to each path

    • benchmarking against other models on ImageNet classification

      • simple augmentation

      • strong: AutoAugment, label smoothing, and mixup