overview
papers
[resnet] ResNet: Deep Residual Learning for Image Recognition
[resnext] ResNeXt: Aggregated Residual Transformations for Deep Neural Networks
[resnest] ResNeSt: Split-Attention Networks
[revisiting resnets] Revisiting ResNets: Improved Training and Scaling Strategies
ResNeXt: Aggregated Residual Transformations for Deep Neural Networks
Motivation
- new network architecture
- new building blocks with the same topology
- propose cardinality
- increasing cardinality is able to improve classification accuracy
- is more effective than going deeper or wider
- classification task
Arguments
VGG & ResNets:
- stacking building blocks of the same topology
- deeper
- reduces the free choices of hyper-parameters
Inception models
- split-transform-merge strategy
- split: 1x1 conv splitting the input into a few lower-dimensional embeddings
- transform: a set of specialized filters
- merge: concat
- approaches the representational power of large and dense layers, but at a considerably lower computational complexity
- modules are customized stage-by-stage
our architecture
- repeats building blocks of the same topology (the VGG/ResNet strategy) while exploiting the split-transform-merge strategy
Key elements
- Multi-branch convolutional blocks
- Grouped convolutions: channels are split into aligned groups, with sparse (within-group) connections
- Compressing convolutional networks
- Ensembling
Method
architecture
- a template module
- if producing spatial maps of the same size, the blocks share the same hyper-parameters (width and filter sizes)
- when the spatial map is downsampled by a factor of 2, the width of the blocks is multiplied by a factor of 2
- grouped convolutions: the widths of the first 1x1 conv and the 3x3 conv are split according to the cardinality C (see the sketch below)
equivalent blocks: the aggregated-transformation form, the early-concatenation form, and the grouped-convolution form of the block are equivalent
Model Capacity: cardinality C and bottleneck width d are traded off so that complexity (parameters and FLOPs) is roughly preserved
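A minimal sketch of the grouped-convolution form of a ResNeXt bottleneck in PyTorch (the class name and defaults are illustrative assumptions, not the authors' code): the 3x3 conv uses `groups=cardinality`, and the bottleneck width is `cardinality * d`.

```python
import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    def __init__(self, in_ch, out_ch, cardinality=32, bottleneck_d=4, stride=1):
        super().__init__()
        width = cardinality * bottleneck_d                        # e.g. 32 * 4 = 128
        self.conv1 = nn.Conv2d(in_ch, width, 1, bias=False)       # split (1x1)
        self.bn1 = nn.BatchNorm2d(width)
        self.conv2 = nn.Conv2d(width, width, 3, stride=stride,    # transform: grouped 3x3
                               padding=1, groups=cardinality, bias=False)
        self.bn2 = nn.BatchNorm2d(width)
        self.conv3 = nn.Conv2d(width, out_ch, 1, bias=False)      # merge (1x1)
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:                        # projection shortcut
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + self.shortcut(x))
```

For example, `ResNeXtBottleneck(256, 256, cardinality=32, bottleneck_d=4)` corresponds to the 32x4d setting.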
ResNeSt: Split-Attention Networks
Motivation
- propose a modular Split-Attention block
- enables attention across feature-map groups
- preserve the overall ResNet structure for downstream applications such as object detection and semantic segmentation
- demonstrates improvements on detection & segmentation tasks
Arguments
- ResNet
- simple and modular design
- limited receptive-field size and lack of cross-channel interaction
- image classification networks have focused more on group or depth-wise convolution
- do not transfer well to other tasks
- isolated representations cannot capture cross-channel relationships
- a versatile backbone
- improving performance across multiple tasks at the same time
- a network with cross-channel representations is desirable
- a Split-Attention block
- divides the feature-map into several groups (along the channel dimension)
- finer-grained subgroups or splits
- weighted combination
- feature-map attention mechanism: NiN's 1x1 conv
- multi-path: GoogLeNet
- channel-attention mechanism: SE-Net
Structurally: at the global level the block follows ResNeXt, introducing cardinality and grouped convolutions; at the local level each cardinal group is further split into several branches and, following SK-Net, the branches are fused with split-attention; the cardinal groups are then concatenated (rather than added as in ResNeXt), a 1x1 conv adjusts the dimensionality, and the identity path is added
- the overall macro-structure remains ResNet
Method
Split-Attention block
enables feature-map attention across different feature-map groups
within a block: the number of feature-map groups is controlled by the cardinality
within a cardinal group: a new radix hyperparameter R indicates the number of splits
split-attention
- the inputs of the multiple in-group branches (splits) come in (the computation is sketched below)
- fusion: element-wise summation across the splits
- channel-wise global contextual information: global average pooling
- dimensionality reduction: Dense-BN-ReLU
- one Dense layer per split (the attention weight function): learns the importance weight of each split
- channel-wise soft attention: softmax over the dense outputs of all splits
- weighting: each original split input is multiplied by its attention weights
- sum: the weighted splits are added together
r=1: degenerates to an SE-style block (the softmax reduces to a sigmoid gate)
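A minimal sketch of the split-attention computation within a single cardinal group, assuming radix R splits of C channels each (this `SplitAttention` class and its defaults are assumptions, not the official ResNeSt implementation, which computes all cardinal groups jointly with grouped layers):

```python
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    def __init__(self, channels, radix=2, reduction=4):
        super().__init__()
        self.radix = radix
        inter = max(channels // reduction, 32)
        self.fc1 = nn.Conv2d(channels, inter, 1)          # dimensionality reduction
        self.bn = nn.BatchNorm2d(inter)
        self.fc2 = nn.Conv2d(inter, channels * radix, 1)  # per-split attention logits

    def forward(self, splits):                  # list of R tensors, each (B, C, H, W)
        B, C = splits[0].shape[:2]
        gap = sum(splits)                       # fusion: element-wise summation
        gap = F.adaptive_avg_pool2d(gap, 1)     # global average pooling -> (B, C, 1, 1)
        att = self.fc2(F.relu(self.bn(self.fc1(gap))))     # (B, C*R, 1, 1)
        att = att.view(B, self.radix, C, 1, 1)
        att = F.softmax(att, dim=1)             # channel-wise soft attention over splits
        return sum(att[:, r] * splits[r] for r in range(self.radix))  # weighted sum
```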
shortcut connection
- identity when the input and output shapes match; for strided blocks a transformation (a strided convolution, or convolution combined with pooling) is applied to the identity path
- the outputs of the cardinal groups are concatenated along the channel dimension before the final 1x1 conv and the shortcut addition
average pooling downsampling
- for dense prediction tasks, it becomes essential to preserve spatial information
- prior work tends to use a strided 3x3 conv
- we use an average pooling layer with a 3x3 kernel
- a 2x2 average pooling is applied to the strided shortcut connection before the 1x1 conv (both tweaks are sketched below)
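A minimal PyTorch sketch of these two downsampling tweaks (the helper names are assumptions): average pooling takes over the spatial downsampling in both the main path and the shortcut.

```python
import torch.nn as nn

def strided_main_path(in_ch, out_ch):
    # 3x3 average pooling (stride 2) followed by a non-strided 3x3 conv,
    # instead of a single strided 3x3 convolution.
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=3, stride=2, padding=1),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

def strided_shortcut(in_ch, out_ch):
    # 2x2 average pooling applied to the strided shortcut before the 1x1 conv.
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=2, stride=2),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch))
```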
Revisiting ResNets: Improved Training and Scaling Strategies
Motivation
disentangle the three aspects
- model architecture
- training methodology
- scaling strategies
improve ResNets to SOTA
Arguments
- recipes for climbing the ImageNet leaderboard:
- Architecture
- hand-designed: AlexNet, VGG, ResNet, Inception, ResNeXt
- NAS-based: NasNet-A, AmoebaNet-A, EfficientNet
- Training and Regularization Methods
- regularization methods
- dropout, label smoothing, stochastic depth, dropblock, data augmentation
- significantly improve generalization when training more epochs
- training
- learning rate schedules
- regularization methods
- Scaling Strategies
- model dimensions: width, depth, resolution
- the balanced compound scaling proposed by EfficientNet is shown in this paper to be sub-optimal for both ResNet and EfficientNet
- Additional Training Data
- pretraining on larger dataset
- semi-supervised
- Architecture
- the performance of a vision model depends on:
- architecture: the focus of most research
- training methods and scaling strategies: less publicized but critical
- unfair comparisons: new architectures using modern training methods are compared directly against older networks trained with dated methods
- we focus on the impact of training methods and scaling strategies
- training methods:
- We survey the modern training and regularization techniques
- finding: slightly lowering weight decay helps when other regularization methods are introduced
- scaling strategies:
- We offer new perspectives and practical advice on scaling
- add depth in regimes where overfitting can occur; otherwise widen first
- grow the image resolution more slowly than prior works
- as the accuracy curves show, these scaling strategies are orthogonal to lightweight architectural changes and their gains are additive
- ResNet-RS:
- a family of re-scaled ResNets
- improving only the training and scaling strategies already yields large accuracy gains
- combining minor architectural changes gives further gains
Method
architecture
- ResNet with the ResNet-D modifications plus Squeeze-and-Excitation blocks (the "minor architectural changes")
training methods
- match the EfficientNet setup
- train for 350 epochs
- use cosine learning rate instead of exponential decay
- RandAugment instead of AutoAugment
- use Momentum optimizer instead of RMSProp
- regularization
- weight decay
- label smoothing
- dropout
- stochastic depth
data augmentation
- we use RandAugment
- EfficientNet uses AutoAugment, which slightly outperforms RandAugment
hyperparameters:
- match the EfficientNet setup (a PyTorch-style sketch of the recipe follows)
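A minimal sketch of this training recipe using PyTorch/torchvision equivalents (the paper trains with a different codebase; the backbone, learning rate, weight decay, and label-smoothing values below are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

model = torchvision.models.resnet50()                       # stand-in backbone
epochs = 350                                                 # long training regime
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,      # Momentum instead of RMSProp
                            momentum=0.9, weight_decay=4e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)  # cosine, not exponential
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)         # label smoothing

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(),                                # instead of AutoAugment
    transforms.ToTensor(),
])
# dropout and stochastic depth live inside the model and are omitted here
```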
improved training methods
improved scaling strategies
search space
- width multiplier:[0.25, 0.5, 1.0, 1.5, 2.0]
- depth:[26, 50, 101, 200, 300, 350, 400]
- resolution:[128, 160, 224, 320, 448]
- increase regularization as model size increases
- observe the 10/100/350-epoch training regimes (the grid is written out below)
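The search space above written out as a plain grid, as a sketch (the dictionary keys and the enumeration are assumptions, not the paper's code):

```python
from itertools import product

search_space = {
    "width_multiplier": [0.25, 0.5, 1.0, 1.5, 2.0],
    "depth":            [26, 50, 101, 200, 300, 350, 400],
    "resolution":       [128, 160, 224, 320, 448],
    "epochs":           [10, 100, 350],   # the three training regimes observed
}

# Enumerate every candidate configuration; in the paper, regularization is
# increased as model size grows (not encoded in this simple grid).
configs = [dict(zip(search_space.keys(), values))
           for values in product(*search_space.values())]
print(len(configs))   # 5 * 7 * 5 * 3 = 525 combinations
```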
we find that the best scaling strategy depends on the training regime
strategy 1: scale depth
- depth scaling outperforms width scaling in longer epoch regimes
- width scaling is preferable in shorter epoch regimes
- width scaling can cause overfitting and sometimes hurts performance
- depth scaling also introduces fewer parameters than width scaling
strategy 2: slow image resolution scaling (increase the resolution more gradually than prior works)
Experiments