resnets

overview

  1. papers

    [resnet] ResNet: Deep Residual Learning for Image Recognition

    [resnext] ResNeXt: Aggregated Residual Transformations for Deep Neural Networks

    [resnest] ResNeSt: Split-Attention Networks

    [revisiting resnets] Revisiting ResNets: Improved Training and Scaling Strategies

ResNeXt: Aggregated Residual Transformations for Deep Neural Networks

  1. Motivation

    • new network architecture
    • new building blocks with the same topology
    • propose cardinality
      • increasing cardinality is able to improve classification accuracy
      • is more effective than going deeper or wider
    • classification task
  2. Arguments

    • VGG & ResNets:

      • stacking building blocks of the same topology
      • deeper
      • reduces the free choices of hyper-parameters
    • Inception models

      • split-transform-merge strategy
      • split: 1x1 convs splitting the input into a few lower-dimensional embeddings
      • transform: a set of specialized filters
      • merge: concatenation

      • approach the representational power of large and dense layers, but at a considerably lower computational complexity

      • modules are customized stage-by-stage
    • our architecture

      • adopts VGG/ResNets’ repeating layers
      • adopts Inception's split-transform-merge strategy
      • aggregated by summation
      • cardinality: the size of the set of transformations (the number of split paths)

        • more 1x1 conv computation than the ResNet bottleneck
        • less 3x3 conv computation
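
      A quick parameter count makes this trade-off concrete. A minimal sketch (my own check, using the conv2-stage bottleneck sizes quoted in the paper; biases and BN parameters ignored):

        # Per-block parameters for a bottleneck on a 256-d input/output.
        # d = per-path bottleneck width, C = cardinality.
        def bottleneck_params(C, d, io=256):
            one_by_one = 2 * C * io * d          # the two 1x1 convs (reduce + expand)
            three_by_three = C * 3 * 3 * d * d   # the (grouped) 3x3 conv
            return one_by_one + three_by_three

        print(bottleneck_params(C=1, d=64))   # ResNet bottleneck   -> 69632 (~70k)
        print(bottleneck_params(C=32, d=4))   # ResNeXt 32x4d block -> 70144 (~70k)

      For 32x4d the two 1x1 convs cost 65536 parameters versus ResNet's 32768, while the grouped 3x3 conv costs 4608 versus 36864, so the totals stay roughly equal.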

  3. Key elements

    • Multi-branch convolutional blocks
    • Grouped convolutions: channel alignment, sparse connections
    • Compressing convolutional networks
    • Ensembling
  4. Method

    • architecture

      • a template module
      • if producing spatial maps of the same size, the blocks share the same hyper-parameters (width and filter sizes)
      • when the spatial map is downsampled by a factor of 2, the width of the blocks is multiplied by a factor of 2

      • grouped convolutions: the widths of the first 1x1 conv and the 3x3 conv are split into C groups

      • equivalent blocks

        • BN after each conv
        • ReLU after each BN, except after the last BN of the block
        • ReLU after the addition

      • Model Capacity

        • improves accuracy while maintaining model complexity and parameter count

        • adjust the bottleneck width d and the corresponding C to keep the capacity fixed; with C = 1 the block degenerates to the ResNet bottleneck (see the sketch below)
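
      A minimal PyTorch sketch of the block in its equivalent grouped-convolution form (form (c) in the paper); the class name and the 32x4d widths are my own illustrative choices:

        import torch
        import torch.nn as nn

        class ResNeXtBottleneck(nn.Module):
            """1x1 reduce -> grouped 3x3 -> 1x1 expand, plus the identity path."""
            def __init__(self, in_ch=256, cardinality=32, width_per_group=4, out_ch=256):
                super().__init__()
                mid = cardinality * width_per_group   # 32 * 4 = 128; C = 1, d = 64 recovers the ResNet block
                self.branch = nn.Sequential(
                    nn.Conv2d(in_ch, mid, 1, bias=False),
                    nn.BatchNorm2d(mid), nn.ReLU(inplace=True),       # BN after each conv, ReLU after each BN...
                    nn.Conv2d(mid, mid, 3, padding=1, groups=cardinality, bias=False),
                    nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                    nn.Conv2d(mid, out_ch, 1, bias=False),
                    nn.BatchNorm2d(out_ch),                           # ...except the last BN of the block
                )
                self.relu = nn.ReLU(inplace=True)

            def forward(self, x):
                return self.relu(x + self.branch(x))                  # ReLU after the addition

        x = torch.randn(2, 256, 56, 56)
        print(ResNeXtBottleneck()(x).shape)                           # torch.Size([2, 256, 56, 56])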

ResNeSt: Split-Attention Networks

  1. Motivation

    • propose a modular Split-Attention block
    • enables attention across feature-map groups
    • preserve the overall ResNet structure for downstream applications such as object detection and semantic segmentation
    • demonstrates improvements on detection & segmentation tasks
  2. Arguments

    • ResNet
      • simple and modular design
      • limited receptive-field size and lack of cross-channel interaction
    • image classification networks have focused more on group or depth-wise convolution
      • do not transfer well to other tasks
      • isolated representations cannot capture cross-channel relationships
    • a versatile backbone
      • improving performance across multiple tasks at the same time
      • a network with cross-channel representations is desirable
      • a Split-Attention block
        • divides the feature-map into several groups (along the channel dimension)
        • finer-grained subgroups or splits
        • weighted combination
    • feature-map attention mechanism: NiN's 1x1 conv
    • Multi-path: GoogLeNet
    • channel-attention mechanism: SE-Net
    • structurally: at the global level it follows ResNeXt (cardinality and grouped convolutions); at the local level each cardinal group is further split, and the splits are fused with split-attention as in SK-Net; the cardinal groups are then concatenated (rather than summed as in ResNeXt), a 1x1 conv adjusts the dimension, and the identity path is added (sketched in the method section below)

  3. Method

    • Split-Attention block

      • enables feature-map attention across different feature-map groups

      • within a block: the number of cardinal groups is controlled by the cardinality K

      • within a cardinal group:introduce a new radix hyperparameter R indicating the number of splits

      • split-attention

        • the R in-group branches (splits) feed into the attention module
        • fusion: element-wise summation across the splits
        • channel-wise global contextual information: global average pooling
        • dimensionality reduction: Dense-BN-ReLU
        • per-split Dense layers (the attention weight function): learn the importance weight of each split
        • channel-wise soft attention: softmax over the Dense outputs of all splits
        • weighting: multiply each original split input by its attention weights
        • sum: add up the weighted splits
        • R = 1: degenerates to an SE block

      • shortcut connection

        • for strided blocks, a strided convolution or combined convolution-with-pooling is applied to the identity path
      • concat: the outputs of the cardinal groups are concatenated along the channel dimension

      • average pooling downsampling

        • for dense prediction tasks it is essential to preserve spatial information
        • prior work tends to use a strided 3x3 conv
        • we use an average pooling layer with a 3x3 kernel instead
        • a 2x2 average pooling is applied to the strided shortcut connection, before the 1x1 conv
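
      A minimal PyTorch sketch of the split-attention operation inside one cardinal group; the reduction factor and the names are my own choices, not the reference implementation:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class SplitAttention(nn.Module):
            """Split-attention over the R splits of one cardinal group (radix R)."""
            def __init__(self, channels, radix=2, reduction=4):
                super().__init__()
                self.radix = radix
                inter = max(channels // reduction, 32)
                self.fc1 = nn.Sequential(                            # dimensionality reduction: Dense-BN-ReLU
                    nn.Conv2d(channels, inter, 1),
                    nn.BatchNorm2d(inter), nn.ReLU(inplace=True))
                self.fc2 = nn.Conv2d(inter, channels * radix, 1)     # per-split attention logits

            def forward(self, splits):             # splits: list of R tensors of shape (B, C, H, W)
                B, C = splits[0].shape[:2]
                fused = sum(splits)                                  # fusion: element-wise summation
                gap = F.adaptive_avg_pool2d(fused, 1)                # channel-wise global context
                attn = self.fc2(self.fc1(gap)).view(B, self.radix, C)
                attn = F.softmax(attn, dim=1) if self.radix > 1 else torch.sigmoid(attn)  # R = 1: SE-style
                attn = attn.view(B, self.radix, C, 1, 1)
                return sum(attn[:, r] * splits[r] for r in range(self.radix))  # weighted sum of the splits

        splits = [torch.randn(2, 64, 56, 56) for _ in range(2)]
        print(SplitAttention(64, radix=2)(splits).shape)             # torch.Size([2, 64, 56, 56])

      In the full block, the outputs of the K cardinal groups are concatenated along the channel dimension, passed through a 1x1 conv, and added to the identity path.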

Revisiting ResNets: Improved Training and Scaling Strategies

  1. Motivation

    • disentangle the three aspects

      • model architecture
      • training methodology
      • scaling strategies
    • improve ResNets to SOTA

      • design a family of ResNet architectures, ResNet-RS
      • use improved training and scaling strategies
      • and combine minor architecture changes
      • beats EfficientNet on ImageNet
      • beats EfficientNet-NoisyStudent under semi-supervised training

  2. Arguments

    • common ways to climb the ImageNet leaderboard
      • Architecture
        • hand-designed: AlexNet, VGG, ResNet, Inception, ResNeXt
        • NAS-based: NASNet-A, AmoebaNet-A, EfficientNet
      • Training and Regularization Methods
        • regularization methods
          • dropout, label smoothing, stochastic depth, DropBlock, data augmentation
          • significantly improve generalization when training more epochs
        • training
          • learning rate schedules
      • Scaling Strategies
        • model dimension:width,depth,resolution
        • the balanced compound scaling proposed by EfficientNet is shown here to be sub-optimal for both ResNet and EfficientNet
      • Additional Training Data
        • pretraining on larger dataset
        • semi-supervised
    • the performance of a vision model
      • architecture: where most research focuses
      • training methods and scaling strategies: less publicized but critical
      • unfair: new architectures using modern training methods are compared directly against older networks trained with dated methods
    • we focus on the impact of training methods and scaling strategies
      • training methods:
        • We survey the modern training and regularization techniques
        • we find that slightly lowering weight decay helps once other regularization methods are introduced
      • scaling strategies:
        • We offer new perspectives and practical advice on scaling
        • add depth when overfitting is likely; otherwise widen first
        • grow the resolution more slowly than prior works
        • the accuracy curves show that our scaling strategies are orthogonal to, and additive with, lightweight architecture changes
    • re-scaled ResNets, ResNet-RS
      • improved training & scaling strategies alone already give a large accuracy gain
      • combining minor architectural changes gives a further gain
  3. Method

    • architecture

      • use ResNet with two widely used architecture changes
      • ResNet-D (sketched below)
        • replace the stem 7x7 conv with three 3x3 convs
        • remove the stem max pooling; the first 3x3 conv of each stage takes over the stride-2 downsampling
        • swap the strides of the first two convs on the residual path (downsample in the 3x3 conv)
        • replace the stride-2 1x1 conv on the identity path with a 2x2 stride-2 average pooling followed by a 1x1 conv
      • SE in bottleneck

        • use se-ratio of 0.25
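
      A minimal PyTorch sketch of the ResNet-D pieces and the SE block above; the helper names and the 32/32/64 stem widths are my own assumptions:

        import torch
        import torch.nn as nn

        def conv_bn_relu(cin, cout, k, stride=1):
            return nn.Sequential(
                nn.Conv2d(cin, cout, k, stride=stride, padding=k // 2, bias=False),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

        # ResNet-D stem: the 7x7 conv becomes three 3x3 convs and the max pooling is removed
        # (the first 3x3 conv of the next stage takes over the stride-2 downsampling).
        stem = nn.Sequential(
            conv_bn_relu(3, 32, 3, stride=2),
            conv_bn_relu(32, 32, 3),
            conv_bn_relu(32, 64, 3))

        # ResNet-D identity path when downsampling: the stride-2 1x1 conv is replaced by
        # a 2x2 stride-2 average pooling followed by a non-strided 1x1 conv.
        def downsample_shortcut(cin, cout):
            return nn.Sequential(
                nn.AvgPool2d(2, stride=2),
                nn.Conv2d(cin, cout, 1, bias=False),
                nn.BatchNorm2d(cout))

        # Squeeze-and-Excitation with ratio 0.25, inserted in each bottleneck.
        class SE(nn.Module):
            def __init__(self, channels, ratio=0.25):
                super().__init__()
                hidden = int(channels * ratio)
                self.gate = nn.Sequential(
                    nn.AdaptiveAvgPool2d(1),
                    nn.Conv2d(channels, hidden, 1), nn.ReLU(inplace=True),
                    nn.Conv2d(hidden, channels, 1), nn.Sigmoid())

            def forward(self, x):
                return x * self.gate(x)

        x = torch.randn(2, 3, 224, 224)
        print(stem(x).shape)                                # torch.Size([2, 64, 112, 112])
        print(SE(64)(stem(x)).shape)                        # torch.Size([2, 64, 112, 112])
        print(downsample_shortcut(64, 256)(stem(x)).shape)  # torch.Size([2, 256, 56, 56])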

    • training methods

      • match the EfficientNet setup
        • train for 350 epochs
        • use cosine learning rate instead of exponential decay
        • RandAugment instead of AutoAugment
        • use Momentum optimizer instead of RMSProp
      • regularization
        • weight decay
        • label smoothing
        • dropout
        • stochastic depth
      • data augmentation

        • we use RandAugment
        • EfficientNet uses AutoAugment, which slightly outperforms RandAugment

      • hyperparameters:

        • dropout rate
        • increase regularization as the model size increases, to limit overfitting
        • label smoothing = 0.1
        • weight decay = 4e-5
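
      A rough sketch of this setup in PyTorch (recent versions, where CrossEntropyLoss has a label_smoothing argument); the momentum value and the placeholder model are my own assumptions, and the dropout / stochastic-depth rates are left symbolic because they scale with model size:

        import torch
        import torch.nn as nn

        EPOCHS = 350
        config = dict(
            lr_schedule="cosine decay (instead of exponential decay)",
            augmentation="RandAugment (instead of AutoAugment)",
            label_smoothing=0.1,
            weight_decay=4e-5,
            dropout_rate="scaled with model size",
            stochastic_depth_rate="scaled with model size",
        )

        model = nn.Linear(8, 8)   # placeholder model
        criterion = nn.CrossEntropyLoss(label_smoothing=config["label_smoothing"])
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                                    weight_decay=config["weight_decay"])
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
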
    • improved training methods

      • additive study

        • overall the improvements are additive
        • increasing the number of training epochs only helps once regularization methods are added; otherwise it overfits
        • dropout hurts unless weight decay is lowered at the same time

      • weight decay

        • with few or no extra regularization methods: a larger weight decay (1e-4) prevents overfitting
        • with many or strong regularization methods: moderately lowering weight decay (to 4e-5) improves accuracy
    • improved scaling strategies

      • search space

        • width multiplier:[0.25, 0.5, 1.0, 1.5, 2.0]
        • depth:[26, 50, 101, 200, 300, 350, 400]
        • resolution:[128, 160, 224, 320, 448]
        • increase regularization as the model size increases
        • observe the 10 / 100 / 350 epoch regimes
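
      Enumerating that grid directly (my own sketch, just to show its size):

        import itertools

        widths        = [0.25, 0.5, 1.0, 1.5, 2.0]
        depths        = [26, 50, 101, 200, 300, 350, 400]
        resolutions   = [128, 160, 224, 320, 448]
        epoch_regimes = [10, 100, 350]

        grid = list(itertools.product(widths, depths, resolutions))
        print(len(grid))                       # 175 scale configurations per epoch regime
        print(len(grid) * len(epoch_regimes))  # 525 (configuration, regime) pairs
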
      • we find that the best scaling strategy depends on the training regime

      • strategy 1: scale depth

        • Depth scaling outperforms width scaling for longer epoch regimes
        • width scaling is preferable for shorter epoch regimes
        • width scaling can lead to overfitting and sometimes hurts performance
        • depth scaling also adds fewer parameters than width scaling
      • strategy 2: slow resolution scaling

        • EfficientNets / ResNeSt scale up to very large image resolutions
        • our experiments show that this is unnecessary

  4. Experiments