SE block

Overview

Feature extraction is the core capability of a CNN, and the SE block acts as a recalibration mechanism for the features a CNN extracts.

According to receptive-field theory, the feature maps are dominated by the central region of the input; the image features of an object sitting near the edge (e.g., a wine bottle) are very likely to be discarded by the pooling layers. Adding an SE block adjusts the feature maps, boosting the weight of the bottle's features and improving its chance of being recognized.

  1. [SE-Net] Squeeze-and-Excitation Networks
  2. [SC-SE] Concurrent Spatial and Channel ‘Squeeze & Excitation’ in Fully Convolutional Networks
  3. [CMPE-SE] Competitive Inner-Imaging Squeeze and Excitation for Residual Network

SENet: Squeeze-and-Excitation Networks

  1. Motivation

    • prior research has investigated the spatial component to achieve more powerful representations
    • we focus on the channel relationship instead
    • SE-block:adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels
    • enhancing the representational power
    • in a computationally efficient manner
  2. Key arguments

    • stronger network:
      • deeper
      • NiN-like blocks
    • cross-channel correlations in prior work
      • mapped as new combinations of features through 1x1 conv
      • concentrated on the objective of reducing model and computational complexity
    • In contrast, we found this mechanism
      • can ease the learning process
      • and significantly enhance the representational power of the network
    • Attention
      • Attention can be interpreted as a means of biasing the allocation of available computational resources towards the most informative components
        • Some works provide interesting studies into the combined use of spatial and channel attention
  3. Method

    • SE-block

      • The channel relationships modelled by convolution are inherently implicit and local

      • we would like to provide it with access to global information

      • squeeze:using global average pooling

      • excitation:nonlinear & non-mutually-exclusive using sigmoid

        • bottleneck:a dimensionality-reduction layer $W_1$ with reduction ratio $r$ and ReLU and a dimensionality-increasing layer $W_2$

        • $s = F_{ex}(z,W) = \sigma (W_2 \delta(W_1 z))$

      • integration

        • insert after the non-linearity following each convolution
        • inception:take the transformation $F_{tr}$ to be an entire Inception module
        • residual:take the transformation $F_{tr}$ to be the non-identity branch of a residual module
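
A minimal PyTorch sketch of the block described above; the module name and the use of nn.Linear for $W_1$/$W_2$ are my own choices, the paper only fixes the GAP -> FC -> ReLU -> FC -> sigmoid structure (biases dropped per the FC ablation below):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: GAP -> FC (C/r) -> ReLU -> FC (C) -> sigmoid -> rescale."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction, bias=False)  # W1: reduce by ratio r
        self.fc2 = nn.Linear(channels // reduction, channels, bias=False)  # W2: restore to C channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                                 # squeeze: global average pooling -> (b, c)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # excitation: s = sigma(W2 * delta(W1 z))
        return x * s.view(b, c, 1, 1)                          # recalibrate channel-wise
```

In a residual network this would sit on the non-identity branch, i.e. `out = identity + se(residual)`, matching the integration note above.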

    • model and computational complexity

      • ResNet-50 vs. SE-ResNet-50: a 0.26% relative increase in GFLOPs, while approaching ResNet-101's accuracy
      • the additional parameters result solely from the two FC layers, among which the final stage FC claims the majority due to being performed across the greatest number of channels
      • the costly final stage of SE blocks could be removed at only a small cost in performance
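      • as quantified in the paper: with reduction ratio $r$, $S$ stages, $N_s$ repeated blocks and $C_s$ output channels in stage $s$, the extra parameters come to $\frac{2}{r} \sum_{s=1}^{S} N_s \cdot C_s^2$ (about 2.5 million, a ~10% increase, for SE-ResNet-50 with $r=16$)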
    • ablations

      • FC
        • removing the biases of the FC layers in the excitation facilitates the modelling of channel dependencies
      • reduction ratio
        • performance is robust to a range of reduction ratios
        • In practice, using an identical ratio throughout a network may not be optimal due to the distinct roles performed by different layers
      • squeeze
        • global average pooling vs. global max pooling:average pooling slightly better
      • excitation
        • Sigmoid vs. ReLU vs. tanh:
          • tanh:slightly worse
          • ReLU:dramatically worse
      • stages
        • each stage brings benefits
        • combining them is even better
      • integration strategy
        • fairly robust to their location, provided that they are applied prior to branch aggregation
        • inside the residual unit:fewer channels, fewer parameters, comparable accuracy
    • primitive understanding
      • squeeze
        • the use of global information has a significant influence on the model performance
      • excitation
        • the distribution across different classes is very similar at the earlier layers (general features)
        • the value of each channel becomes much more class-specific at greater depth
        • SE_5_2 exhibits an interesting tendency towards a saturated state in which most of the activations are close to one
        • in SE_5_3, a similar pattern emerges over different classes, up to a modest change in scale
        • suggesting that SE_5_2 and SE_5_3 are less important than previous blocks in providing recalibration to the network (thus can be removed)
  4. APPENDIX

    • The SOTA model on ImageNet is SENet-154, with a top-1 err of 18.68; it is the point marked on the line chart in the EfficientNet paper
      • SE-ResNeXt-152(64x4d)
        • input=(224,224): top-1 err 18.68
        • input=320/299: top-1 err 17.28
      • further differences
        • the number of channels of the first 1x1 conv in each bottleneck building block is halved
        • the first 7x7 conv in the stem is replaced with three consecutive 3x3 convs
        • the stride-2 1x1 convs are replaced with stride-2 3x3 convs
        • a dropout layer is added before the FC classifier
        • label smoothing
        • the BN parameters are frozen for the last few training epochs to keep the training and testing statistics consistent (see the sketch below)
        • 64 GPUs, batch size = 2048 (32 per GPU)
        • initial lr=1.0
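
A sketch of the BN-freezing trick from the list above, assuming a PyTorch training loop; the helper name is mine and the paper does not spell out the exact mechanism:

```python
import torch.nn as nn

def freeze_batchnorm(model: nn.Module) -> None:
    """Freeze every BatchNorm layer: stop updating running stats and gamma/beta.

    Intended to be applied at the start of the last few training epochs so that
    the statistics used during training match those used at test time.
    Note: a later model.train() call flips BN layers back to train mode, so
    re-apply this after every such call.
    """
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()                        # use running mean/var instead of batch statistics
            for p in m.parameters():
                p.requires_grad_(False)     # stop gradient updates to the affine parameters
```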

SC-SE: Concurrent Spatial and Channel ‘Squeeze & Excitation’ in Fully Convolutional Networks

  1. Motivation

    • SE-Net was proposed mainly for classification; this work targets image segmentation tasks
    • three variants of SE modules
      • squeezing spatially and exciting channel-wise (cSE)
      • squeezing channel-wise and exciting spatially (sSE)
      • concurrent spatial and channel squeeze & excitation (scSE)
    • integrated into three different state-of-the-art F-CNNs (DenseNet, SD-Net, U-Net)
  2. Key arguments

    • F-CNNs have become the tool of choice for many image segmentation tasks
    • core: convolutions that capture local spatial patterns along all input channels jointly
    • the SE block factors out the spatial dependency by global average pooling to learn a channel-specific descriptor (later referred to as cSE / channel-SE)
    • while for image segmentation, we hypothesize that the pixel-wise spatial information is more informative
    • thus we propose sSE(spatial SE) and scSE(spatial and channel SE)
    • can be seamlessly integrated by placing after every encoder and decoder block

  3. Method

    • cSE
      • GAP:embeds the global spatial information into a vector
      • FC-ReLU-FC-Sigmoid:adaptively learns the importance
      • recalibrate
    • sSE
      • 1x1 conv:generating a projection tensor representing the linearly combined representation for all channels C for a spatial location (i,j)
      • Sigmoid:rescale
      • recalibrate
    • scSE
      • by element-wise addition
      • encourages the network to learn more meaningful feature maps, relevant both spatially and channel-wise
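
A minimal PyTorch sketch of the three variants just described; the class names and the reduction default are my own choices, the paper only fixes the operations:

```python
import torch
import torch.nn as nn

class ChannelSE(nn.Module):
    """cSE: squeeze spatially (GAP), excite channel-wise."""
    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                                 # global spatial information -> (b, c)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # per-channel importance
        return x * s.view(b, c, 1, 1)

class SpatialSE(nn.Module):
    """sSE: squeeze channel-wise (1x1 conv), excite spatially."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, 1, kernel_size=1)      # linear combination of all C channels per (i, j)

    def forward(self, x):
        q = torch.sigmoid(self.proj(x))                        # (b, 1, h, w) spatial gate
        return x * q

class SpatialChannelSE(nn.Module):
    """scSE: combine both recalibrations by element-wise addition."""
    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        self.cse = ChannelSE(channels, reduction)
        self.sse = SpatialSE(channels)

    def forward(self, x):
        return self.cse(x) + self.sse(x)
```
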
  4. Experiments

    • F-CNN architectures:

      • 4 encoder blocks, one bottleneck layer, 4 decoder blocks and a classification layer
      • class imbalance: cross-entropy loss weighted via median frequency balancing (see the sketch at the end of this section)
    • Dice comparison: scSE > sSE > cSE > vanilla

    • for small-region classes, segmentation with cSE can be worse than vanilla: such regions might get overlooked when only the channels are excited

    • Qualitative analysis:

      • in some under-segmented regions, the inclusion of scSE improves the result
      • in some over-segmented regions, scSE rectifies the result
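
A small sketch of the median-frequency-balanced weighting referenced in the experiments; this is one common simplified form (per-class pixel frequency over the whole training set), and the function name is mine:

```python
import numpy as np

def median_frequency_weights(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Per-class weights w_c = median(freq) / freq_c for a weighted cross-entropy loss.

    freq_c is the pixel frequency of class c over the training labels, so rare
    (small-region) classes receive larger weights.
    """
    counts = np.bincount(labels.ravel(), minlength=num_classes).astype(np.float64)
    freq = counts / counts.sum()
    freq = np.where(freq > 0, freq, np.nan)        # ignore classes absent from the labels
    weights = np.nanmedian(freq) / freq
    return np.nan_to_num(weights, nan=0.0)         # absent classes get weight 0
```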

CMPE-SE: Competitive Inner-Imaging Squeeze and Excitation for Residual Network

  1. Motivation
    • for residual network
    • the residual architecture has been shown to be diverse and redundant
    • model the competition between residual and identity mappings
    • let the identity flow control how the residual feature maps complement it
  2. Key arguments

    • analyses of ResNet show that, as depth increases, the residual network exhibits a certain amount of redundancy
    • the CMPE-SE mechanism makes the residual mappings tend to provide a more efficient supplement to the identity mappings

  3. Method

    • Three variants are proposed:

    • Variant 1:

      • GAP each branch (identity and residual) into a vector, reduce each with an FC layer by the reduction ratio, concatenate, then map back to the channel dimension (see the rough sketch after this list)
      • Implicitly, we can believe that the winning of the identity channels in this competition results in less weights of the residual channels
    • Variant 2:

      • two schemes

        • 2x1 convs: average the elements at corresponding (top/bottom) positions of the two stacked vectors
        • 1x1 convs: average over all elements, then flatten

    • Variant 3:

      • stack the channel-wise vectors from both branches, reshape them into a matrix, apply a 3x3 conv, then flatten
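
For completeness, a rough sketch of the first variant as I read it; every name and the exact FC shapes here are my guesses, not the paper's code:

```python
import torch
import torch.nn as nn

class CMPESEVariant1(nn.Module):
    """Squeeze identity and residual branches separately, let them compete through
    a joint FC stage, and use the result to reweight the residual channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc_res = nn.Linear(channels, channels // reduction)        # reduce residual descriptor
        self.fc_id = nn.Linear(channels, channels // reduction)         # reduce identity descriptor
        self.fc_out = nn.Linear(2 * (channels // reduction), channels)  # concat -> back to C channels

    def forward(self, residual: torch.Tensor, identity: torch.Tensor) -> torch.Tensor:
        z_res = residual.mean(dim=(2, 3))            # GAP the residual branch
        z_id = identity.mean(dim=(2, 3))             # GAP the identity branch
        h = torch.cat([torch.relu(self.fc_res(z_res)),
                       torch.relu(self.fc_id(z_id))], dim=1)
        s = torch.sigmoid(self.fc_out(h))            # channel weights for the residual mapping
        b, c = s.shape
        return identity + residual * s.view(b, c, 1, 1)
```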

Feels like a stretch; I won't waste time analyzing it further.