SE block

Overview

Feature extraction is the core capability of a CNN, and the SE block acts as a recalibration mechanism for the features a CNN extracts.

According to receptive-field theory, the feature maps are dominated by the central region of the input; the image features of an object sitting near the edge (e.g., a wine bottle) are very likely to be discarded by the pooling layers. Adding an SE block adjusts the feature maps, boosting the weight of the bottle's features and improving its chance of being recognized.

  1. [SE-Net] Squeeze-and-Excitation Networks
  2. [SC-SE] Concurrent Spatial and Channel ‘Squeeze & Excitation’ in Fully Convolutional Networks
  3. [CMPE-SE] Competitive Inner-Imaging Squeeze and Excitation for Residual Network

SENet: Squeeze-and-Excitation Networks

  1. Motivation

    • prior research has investigated the spatial component to achieve more powerful representations
    • we focus on the channel relationship instead
    • SE-block:adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels
    • enhancing the representational power
    • in a computationally efficient manner
  2. Key arguments

    • stronger network:
      • deeper
      • NiN-like blocks
    • cross-channel correlations in prior work
      • mapped as new combinations of features through 1x1 conv
      • concentrated on the objective of reducing model and computational complexity
    • In contrast, we found this mechanism
      • can ease the learning process
      • and significantly enhance the representational power of the network
    • Attention
      • Attention can be interpreted as a means of biasing the allocation of available computational resources towards the most informative components
        • Some works provide interesting studies into the combined use of spatial and channel attention
  3. Method

    • SE-block

      • The channel relationships modelled by convolution are inherently implicit and local

      • we would like to provide it with access to global information

      • squeeze:using global average pooling

      • excitation:nonlinear & non-mutually-exclusive using sigmoid

        • bottleneck:a dimensionality-reduction layer $W_1$ with reduction ratio $r$ and ReLU and a dimensionality-increasing layer $W_2$

        • $s = F_{ex}(z,W) = \sigma (W_2 \delta(W_1 z))$

      • integration

        • insert after the non-linearity following each convolution
        • inception:take the transformation $F_{tr}$ to be an entire Inception module
        • residual:take the transformation $F_{tr}$ to be the non-identity branch of a residual module
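
A minimal PyTorch sketch of the block described above; the module name and the use of nn.Linear for $W_1$/$W_2$ are my own choices, the paper only fixes the GAP -> FC -> ReLU -> FC -> sigmoid structure (biases dropped per the FC ablation below):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: GAP -> FC (C/r) -> ReLU -> FC (C) -> sigmoid -> rescale."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction, bias=False)  # W1: reduce by ratio r
        self.fc2 = nn.Linear(channels // reduction, channels, bias=False)  # W2: restore to C channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                                 # squeeze: global average pooling -> (b, c)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # excitation: s = sigma(W2 * delta(W1 z))
        return x * s.view(b, c, 1, 1)                          # recalibrate channel-wise
```

In a residual network this would sit on the non-identity branch, i.e. `out = identity + se(residual)`, matching the integration note above.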

    • model and computational complexity

      • ResNet-50 vs. SE-ResNet-50: a 0.26% relative increase in GFLOPs, while approaching ResNet-101's accuracy
      • the additional parameters result solely from the two FC layers, among which the final stage FC claims the majority due to being performed across the greatest number of channels
      • the costly final stage of SE blocks could be removed at only a small cost in performance
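      • as quantified in the paper: with reduction ratio $r$, $S$ stages, $N_s$ repeated blocks and $C_s$ output channels in stage $s$, the extra parameters come to $\frac{2}{r} \sum_{s=1}^{S} N_s \cdot C_s^2$ (about 2.5 million, a ~10% increase, for SE-ResNet-50 with $r=16$)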
    • ablations

      • FC
        • removing the biases of the FC layers in the excitation facilitates the modelling of channel dependencies
      • reduction ratio
        • performance is robust to a range of reduction ratios
        • In practice, using an identical ratio throughout a network may not be optimal due to the distinct roles performed by different layers
      • squeeze
        • global average pooling vs. global max pooling:average pooling slightly better
      • excitation
        • Sigmoid vs. ReLU vs. tanh:
          • tanh:slightly worse
          • ReLU:dramatically worse
      • stages
        • each stage brings benefits
        • combining them is even better
      • integration strategy
        • fairly robust to their location, provided that they are applied prior to branch aggregation
        • inside the residual unit:fewer channels, fewer parameters, comparable accuracy
    • primitive understanding
      • squeeze
        • the use of global information has a significant influence on the model performance
      • excitation
        • the distribution across different classes is very similar at the earlier layers (general features)
        • the value of each channel becomes much more class-specific at greater depth
        • SE_5_2 exhibits an interesting tendency towards a saturated state in which most of the activations are close to one
        • in SE_5_3, a similar pattern emerges over different classes, up to a modest change in scale
        • suggesting that SE_5_2 and SE_5_3 are less important than previous blocks in providing recalibration to the network (thus can be removed)
  4. APPENDIX

    • The SOTA model on ImageNet is SENet-154, with a top-1 err of 18.68; it is the point marked on the line chart in the EfficientNet paper
      • SE-ResNeXt-152(64x4d)
        • input=(224,224): top-1 err 18.68
        • input=320/299: top-1 err 17.28
      • further differences
        • the number of channels of the first 1x1 conv in each bottleneck building block is halved
        • the first 7x7 conv in the stem is replaced with three consecutive 3x3 convs
        • the stride-2 1x1 convs are replaced with stride-2 3x3 convs
        • a dropout layer is added before the FC classifier
        • label smoothing
        • the BN parameters are frozen for the last few training epochs to keep the training and testing statistics consistent (see the sketch below)
        • 64 GPUs, batch size = 2048 (32 per GPU)
        • initial lr=1.0
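
A sketch of the BN-freezing trick from the list above, assuming a PyTorch training loop; the helper name is mine and the paper does not spell out the exact mechanism:

```python
import torch.nn as nn

def freeze_batchnorm(model: nn.Module) -> None:
    """Freeze every BatchNorm layer: stop updating running stats and gamma/beta.

    Intended to be applied at the start of the last few training epochs so that
    the statistics used during training match those used at test time.
    Note: a later model.train() call flips BN layers back to train mode, so
    re-apply this after every such call.
    """
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()                        # use running mean/var instead of batch statistics
            for p in m.parameters():
                p.requires_grad_(False)     # stop gradient updates to the affine parameters
```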

SC-SE: Concurrent Spatial and Channel ‘Squeeze & Excitation’ in Fully Convolutional Networks

  1. Motivation

    • SE-Net was proposed mainly for classification; this work targets image segmentation tasks
    • three variants of SE modules
      • squeezing spatially and exciting channel-wise (cSE)
      • squeezing channel-wise and exciting spatially (sSE)
      • concurrent spatial and channel squeeze & excitation (scSE)
    • integrated into three different state-of-the-art F-CNNs (DenseNet, SD-Net, U-Net)
  2. Key arguments

    • F-CNNs have become the tool of choice for many image segmentation tasks
    • core: convolutions that capture local spatial patterns along all input channels jointly
    • the SE block factors out the spatial dependency by global average pooling to learn a channel-specific descriptor (later referred to as cSE / channel-SE)
    • while for image segmentation, we hypothesize that the pixel-wise spatial information is more informative
    • thus we propose sSE(spatial SE) and scSE(spatial and channel SE)
    • can be seamlessly integrated by placing after every encoder and decoder block

  3. Method

    • cSE
      • GAP:embeds the global spatial information into a vector
      • FC-ReLU-FC-Sigmoid:adaptively learns the importance
      • recalibrate
    • sSE
      • 1x1 conv:generating a projection tensor representing the linearly combined representation for all channels C for a spatial location (i,j)
      • Sigmoid:rescale
      • recalibrate
    • scSE
      • by element-wise addition
      • encourages the network to learn more meaningful feature maps, relevant both spatially and channel-wise
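
A minimal PyTorch sketch of the three variants just described; the class names and the reduction default are my own choices, the paper only fixes the operations:

```python
import torch
import torch.nn as nn

class ChannelSE(nn.Module):
    """cSE: squeeze spatially (GAP), excite channel-wise."""
    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                                 # global spatial information -> (b, c)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # per-channel importance
        return x * s.view(b, c, 1, 1)

class SpatialSE(nn.Module):
    """sSE: squeeze channel-wise (1x1 conv), excite spatially."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, 1, kernel_size=1)      # linear combination of all C channels per (i, j)

    def forward(self, x):
        q = torch.sigmoid(self.proj(x))                        # (b, 1, h, w) spatial gate
        return x * q

class SpatialChannelSE(nn.Module):
    """scSE: combine both recalibrations by element-wise addition."""
    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        self.cse = ChannelSE(channels, reduction)
        self.sse = SpatialSE(channels)

    def forward(self, x):
        return self.cse(x) + self.sse(x)
```
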
  4. Experiments

    • F-CNN architectures:

      • 4 encoder blocks, one bottleneck layer, 4 decoder blocks and a classification layer
      • class imbalance: cross-entropy loss weighted via median frequency balancing (see the sketch at the end of this section)
    • Dice comparison: scSE > sSE > cSE > vanilla

    • for small-region classes, segmentation with cSE can be worse than vanilla: such regions might get overlooked when only the channels are excited

    • Qualitative analysis:

      • in some under-segmented regions, the inclusion of scSE improves the result
      • in some over-segmented regions, scSE rectifies the result
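
A small sketch of the median-frequency-balanced weighting referenced in the experiments; this is one common simplified form (per-class pixel frequency over the whole training set), and the function name is mine:

```python
import numpy as np

def median_frequency_weights(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Per-class weights w_c = median(freq) / freq_c for a weighted cross-entropy loss.

    freq_c is the pixel frequency of class c over the training labels, so rare
    (small-region) classes receive larger weights.
    """
    counts = np.bincount(labels.ravel(), minlength=num_classes).astype(np.float64)
    freq = counts / counts.sum()
    freq = np.where(freq > 0, freq, np.nan)        # ignore classes absent from the labels
    weights = np.nanmedian(freq) / freq
    return np.nan_to_num(weights, nan=0.0)         # absent classes get weight 0
```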

CMPE-SE: Competitive Inner-Imaging Squeeze and Excitation for Residual Network

  1. Motivation
    • for residual network
    • the residual architecture has been shown to be diverse and redundant
    • model the competition between residual and identity mappings
    • let the identity flow control how the residual feature maps complement it
  2. Key arguments

    • analyses of ResNet show that, as depth increases, the residual network exhibits a certain amount of redundancy
    • the CMPE-SE mechanism makes the residual mappings tend to provide a more efficient supplement to the identity mappings

  3. Method

    • Three variants are proposed:

    • Variant 1:

      • GAP each branch (identity and residual) into a vector, reduce each with an FC layer by the reduction ratio, concatenate, then map back to the channel dimension (see the rough sketch after this list)
      • Implicitly, we can believe that the winning of the identity channels in this competition results in less weights of the residual channels
    • Variant 2:

      • two schemes

        • 2x1 convs: average the elements at corresponding (top/bottom) positions of the two stacked vectors
        • 1x1 convs: average over all elements, then flatten

    • Variant 3:

      • stack the channel-wise vectors from both branches, reshape them into a matrix, apply a 3x3 conv, then flatten
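
For completeness, a rough sketch of the first variant as I read it; every name and the exact FC shapes here are my guesses, not the paper's code:

```python
import torch
import torch.nn as nn

class CMPESEVariant1(nn.Module):
    """Squeeze identity and residual branches separately, let them compete through
    a joint FC stage, and use the result to reweight the residual channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc_res = nn.Linear(channels, channels // reduction)        # reduce residual descriptor
        self.fc_id = nn.Linear(channels, channels // reduction)         # reduce identity descriptor
        self.fc_out = nn.Linear(2 * (channels // reduction), channels)  # concat -> back to C channels

    def forward(self, residual: torch.Tensor, identity: torch.Tensor) -> torch.Tensor:
        z_res = residual.mean(dim=(2, 3))            # GAP the residual branch
        z_id = identity.mean(dim=(2, 3))             # GAP the identity branch
        h = torch.cat([torch.relu(self.fc_res(z_res)),
                       torch.relu(self.fc_id(z_id))], dim=1)
        s = torch.sigmoid(self.fc_out(h))            # channel weights for the residual mapping
        b, c = s.shape
        return identity + residual * s.view(b, c, 1, 1)
```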

Feels like a stretch; I won't waste time analyzing it further.