overview
papers
[resnet] ResNet: Deep Residual Learning for Image Recognition
[resnext] ResNeXt: Aggregated Residual Transformations for Deep Neural Networks
[resnest] ResNeSt: Split-Attention Networks
[revisiting resnets] Revisiting ResNets: Improved Training and Scaling Strategies
ResNeXt: Aggregated Residual Transformations for Deep Neural Networks
Motivation
- new network architecture
- new building blocks with the same topology
- propose cardinality
- increasing cardinality is able to improve classification accuracy
- is more effective than going deeper or wider
- classification task
Arguments
VGG & ResNets:
- stacking building blocks of the same topology
- deeper
- reduces the free choices of hyper-parameters
Inception models
- split-transform-merge strategy
- split: 1x1 conv splitting the input into a few lower-dimensional embeddings
- transform: a set of specialized filters
- merge: concat
- approaches the representational power of large and dense layers, but at a considerably lower computational complexity
- modules are customized stage-by-stage
our architecture
- repeats building blocks of the same topology (the VGG/ResNet strategy) while exploiting the split-transform-merge strategy
Key elements
- Multi-branch convolutional blocks
- Grouped convolutions: channels are split into aligned groups, with sparse (within-group) connections
- Compressing convolutional networks
- Ensembling
Method
architecture
- a template module
- if producing spatial maps of the same size, the blocks share the same hyper-parameters (width and filter sizes)
- when the spatial map is downsampled by a factor of 2, the width of the blocks is multiplied by a factor of 2
- grouped convolutions: the widths of the first 1x1 conv and the 3x3 conv are split according to the cardinality C (see the sketch below)
equivalent blocks: the aggregated-transformation form, the early-concatenation form, and the grouped-convolution form of the block are equivalent
Model Capacity: cardinality C and bottleneck width d are traded off so that complexity (parameters and FLOPs) is roughly preserved
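A minimal sketch of the grouped-convolution form of a ResNeXt bottleneck in PyTorch (the class name and defaults are illustrative assumptions, not the authors' code): the 3x3 conv uses `groups=cardinality`, and the bottleneck width is `cardinality * d`.

```python
import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    def __init__(self, in_ch, out_ch, cardinality=32, bottleneck_d=4, stride=1):
        super().__init__()
        width = cardinality * bottleneck_d                        # e.g. 32 * 4 = 128
        self.conv1 = nn.Conv2d(in_ch, width, 1, bias=False)       # split (1x1)
        self.bn1 = nn.BatchNorm2d(width)
        self.conv2 = nn.Conv2d(width, width, 3, stride=stride,    # transform: grouped 3x3
                               padding=1, groups=cardinality, bias=False)
        self.bn2 = nn.BatchNorm2d(width)
        self.conv3 = nn.Conv2d(width, out_ch, 1, bias=False)      # merge (1x1)
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:                        # projection shortcut
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + self.shortcut(x))
```

For example, `ResNeXtBottleneck(256, 256, cardinality=32, bottleneck_d=4)` corresponds to the 32x4d setting.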
ResNeSt: Split-Attention Networks
Motivation
- propose a modular Split-Attention block
- enables attention across feature-map groups
- preserve the overall ResNet structure for downstream applications such as object detection and semantic segmentation
- demonstrates improvements on detection & segmentation tasks
Arguments
- ResNet
- simple and modular design
- limited receptive-field size and lack of cross-channel interaction
- image classification networks have focused more on group or depth-wise convolution
- do not transfer well to other tasks
- isolated representations cannot capture cross-channel relationships
- a versatile backbone
- improving performance across multiple tasks at the same time
- a network with cross-channel representations is desirable
- a Split-Attention block
- divides the feature-map into several groups (along the channel dimension)
- finer-grained subgroups or splits
- weighted combination
- feature-map attention mechanism: NiN's 1x1 conv
- multi-path: GoogLeNet
- channel-attention mechanism: SE-Net
Structurally: at the global level the block follows ResNeXt, introducing cardinality and grouped convolutions; at the local level each cardinal group is further split into several branches and, following SK-Net, the branches are fused with split-attention; the cardinal groups are then concatenated (rather than added as in ResNeXt), a 1x1 conv adjusts the dimensionality, and the identity path is added
- the overall macro-structure remains ResNet
Method
Split-Attention block
enables feature-map attention across different feature-map groups
within a block: the number of feature-map groups is controlled by the cardinality
within a cardinal group: a new radix hyperparameter R indicates the number of splits
split-attention
- the inputs of the multiple in-group branches (splits) come in (the computation is sketched below)
- fusion: element-wise summation across the splits
- channel-wise global contextual information: global average pooling
- dimensionality reduction: Dense-BN-ReLU
- one Dense layer per split (the attention weight function): learns the importance weight of each split
- channel-wise soft attention: softmax over the dense outputs of all splits
- weighting: each original split input is multiplied by its attention weights
- sum: the weighted splits are added together
r=1: degenerates to an SE-style block (the softmax reduces to a sigmoid gate)
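A minimal sketch of the split-attention computation within a single cardinal group, assuming radix R splits of C channels each (this `SplitAttention` class and its defaults are assumptions, not the official ResNeSt implementation, which computes all cardinal groups jointly with grouped layers):

```python
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    def __init__(self, channels, radix=2, reduction=4):
        super().__init__()
        self.radix = radix
        inter = max(channels // reduction, 32)
        self.fc1 = nn.Conv2d(channels, inter, 1)          # dimensionality reduction
        self.bn = nn.BatchNorm2d(inter)
        self.fc2 = nn.Conv2d(inter, channels * radix, 1)  # per-split attention logits

    def forward(self, splits):                  # list of R tensors, each (B, C, H, W)
        B, C = splits[0].shape[:2]
        gap = sum(splits)                       # fusion: element-wise summation
        gap = F.adaptive_avg_pool2d(gap, 1)     # global average pooling -> (B, C, 1, 1)
        att = self.fc2(F.relu(self.bn(self.fc1(gap))))     # (B, C*R, 1, 1)
        att = att.view(B, self.radix, C, 1, 1)
        att = F.softmax(att, dim=1)             # channel-wise soft attention over splits
        return sum(att[:, r] * splits[r] for r in range(self.radix))  # weighted sum
```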
shortcut connection
- identity when the input and output shapes match; for strided blocks a transformation (a strided convolution, or convolution combined with pooling) is applied to the identity path
- the outputs of the cardinal groups are concatenated along the channel dimension before the final 1x1 conv and the shortcut addition
average pooling downsampling
- for dense prediction tasks, it becomes essential to preserve spatial information
- prior work tends to use a strided 3x3 conv
- we use an average pooling layer with a 3x3 kernel
- a 2x2 average pooling is applied to the strided shortcut connection before the 1x1 conv (both tweaks are sketched below)
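A minimal PyTorch sketch of these two downsampling tweaks (the helper names are assumptions): average pooling takes over the spatial downsampling in both the main path and the shortcut.

```python
import torch.nn as nn

def strided_main_path(in_ch, out_ch):
    # 3x3 average pooling (stride 2) followed by a non-strided 3x3 conv,
    # instead of a single strided 3x3 convolution.
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=3, stride=2, padding=1),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

def strided_shortcut(in_ch, out_ch):
    # 2x2 average pooling applied to the strided shortcut before the 1x1 conv.
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=2, stride=2),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch))
```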
Revisiting ResNets: Improved Training and Scaling Strategies
Motivation
disentangle the three aspects
- model architecture
- training methodology
- scaling strategies
improve ResNets to SOTA
Arguments
- recipes for climbing the ImageNet leaderboard:
- Architecture
- hand-designed: AlexNet, VGG, ResNet, Inception, ResNeXt
- NAS-based: NasNet-A, AmoebaNet-A, EfficientNet
- Training and Regularization Methods
- regularization methods
- dropout, label smoothing, stochastic depth, dropblock, data augmentation
- significantly improve generalization when training more epochs
- training
- learning rate schedules
- regularization methods
- Scaling Strategies
- model dimensions: width, depth, resolution
- the balanced compound scaling proposed by EfficientNet is shown in this paper to be sub-optimal for both ResNet and EfficientNet
- Additional Training Data
- pretraining on larger dataset
- semi-supervised
- Architecture
- the performance of a vision model depends on:
- architecture: the focus of most research
- training methods and scaling strategies: less publicized but critical
- unfair comparisons: new architectures using modern training methods are compared directly against older networks trained with dated methods
- we focus on the impact of training methods and scaling strategies
- training methods:
- We survey the modern training and regularization techniques
- finding: slightly lowering weight decay helps when other regularization methods are introduced
- scaling strategies:
- We offer new perspectives and practical advice on scaling
- add depth in regimes where overfitting can occur; otherwise widen first
- grow the image resolution more slowly than prior works
- as the accuracy curves show, these scaling strategies are orthogonal to lightweight architectural changes and their gains are additive
- ResNet-RS:
- a family of re-scaled ResNets
- improving only the training and scaling strategies already yields large accuracy gains
- combining minor architectural changes gives further gains
Method
architecture
- ResNet with the ResNet-D modifications plus Squeeze-and-Excitation blocks (the "minor architectural changes")
training methods
- match the EfficientNet setup
- train for 350 epochs
- use cosine learning rate instead of exponential decay
- RandAugment instead of AutoAugment
- use Momentum optimizer instead of RMSProp
- regularization
- weight decay
- label smoothing
- dropout
- stochastic depth
data augmentation
- we use RandAugment
- EfficientNet uses AutoAugment, which slightly outperforms RandAugment
hyperparameters:
- match the EfficientNet setup (a PyTorch-style sketch of the recipe follows)
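A minimal sketch of this training recipe using PyTorch/torchvision equivalents (the paper trains with a different codebase; the backbone, learning rate, weight decay, and label-smoothing values below are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

model = torchvision.models.resnet50()                       # stand-in backbone
epochs = 350                                                 # long training regime
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,      # Momentum instead of RMSProp
                            momentum=0.9, weight_decay=4e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)  # cosine, not exponential
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)         # label smoothing

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(),                                # instead of AutoAugment
    transforms.ToTensor(),
])
# dropout and stochastic depth live inside the model and are omitted here
```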
improved training methods
improved scaling strategies
search space
- width multiplier:[0.25, 0.5, 1.0, 1.5, 2.0]
- depth:[26, 50, 101, 200, 300, 350, 400]
- resolution:[128, 160, 224, 320, 448]
- increase regularization as model size increases
- observe the 10/100/350-epoch training regimes (the grid is written out below)
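The search space above written out as a plain grid, as a sketch (the dictionary keys and the enumeration are assumptions, not the paper's code):

```python
from itertools import product

search_space = {
    "width_multiplier": [0.25, 0.5, 1.0, 1.5, 2.0],
    "depth":            [26, 50, 101, 200, 300, 350, 400],
    "resolution":       [128, 160, 224, 320, 448],
    "epochs":           [10, 100, 350],   # the three training regimes observed
}

# Enumerate every candidate configuration; in the paper, regularization is
# increased as model size grows (not encoded in this simple grid).
configs = [dict(zip(search_space.keys(), values))
           for values in product(*search_space.values())]
print(len(configs))   # 5 * 7 * 5 * 3 = 525 combinations
```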
we find that the best scaling strategy depends on the training regime
strategy 1: scale depth
- depth scaling outperforms width scaling in longer epoch regimes
- width scaling is preferable in shorter epoch regimes
- width scaling can cause overfitting and sometimes hurts performance
- depth scaling also introduces fewer parameters than width scaling
strategy 2: slow image resolution scaling (increase the resolution more gradually than prior works)
Experiments