Attention series

0. Overview

  1. Attention approaches fall into two kinds

    • Learning a weight distribution
      • Weight part of the input (hard attention) / weight all of it (soft attention)
      • Weighting on the original image / weighting on feature maps
      • Weighting over the spatial dimension / channel dimension / temporal dimension / mixed domains
      • CAM family, SE-block family: various weighting schemes that learn weights; non-local modules acting on a particular dimension
    • Task decomposition
      • Design different network structures (or branches) that focus on different sub-tasks
      • Redistribute the network's learning capacity, lowering the difficulty of the original task and making the network easier to train
      • STN, deformable conv: add explicit modules responsible for learning deformations / receptive-field changes; local modules, applied per pixel
    • local / non-local
      • The output of a local module is pixel-specific
      • The output of a non-local module is computed jointly over the whole input
  2. Weight-based attention

    • The attention mechanism is usually implemented as an extra neural network attached to the original network
    • The whole model remains end-to-end, so the attention module can be trained jointly with the original model
    • For soft attention, the attention module is differentiable with respect to its input, so the whole model can still be optimized with gradient methods
    • Hard attention instead selects a discrete subset of its input, so the system is no longer differentiable with respect to the input (see the minimal sketch right after this list)
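A minimal PyTorch sketch of this distinction (the tensor shapes and the tiny linear scorer are illustrative assumptions, not taken from any paper): a softmax-weighted sum lets gradients flow back through the attention weights, while a discrete argmax selection does not.

```python
import torch
import torch.nn as nn

feats = torch.randn(2, 8, 16, requires_grad=True)   # (batch, steps, dim)
scorer = nn.Linear(16, 1)                            # tiny attention scorer

# Soft attention: softmax weights over steps, then a weighted sum -> differentiable.
weights = torch.softmax(scorer(feats).squeeze(-1), dim=1)      # (batch, steps)
soft_out = (weights.unsqueeze(-1) * feats).sum(dim=1)          # (batch, dim)
soft_out.sum().backward()
print(feats.grad.abs().sum() > 0)          # tensor(True): gradients reach the input

# Hard attention: a discrete argmax selection cuts the gradient path to the scorer.
scorer.zero_grad(set_to_none=True)
feats2 = feats.detach().requires_grad_()
idx = scorer(feats2).squeeze(-1).argmax(dim=1)                 # (batch,)
hard_out = feats2[torch.arange(2), idx]                        # pick one step per sample
hard_out.sum().backward()
print(scorer.weight.grad)                  # None: no gradient flows through argmax
```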
  3. papers

    • [STN] Spatial Transformer Networks

    • [deformable conv] Deformable Convolutional Networks

    • [CBAM] CBAM: Convolutional Block Attention Module

    • [SE-Net] Squeeze-and-Excitation Networks

    • [SE-block variants] SC-SE (for segmentation), CMPE-SE (complicated and of little practical benefit)

    • [SK-Net] Selective Kernel Networks: an attention module, but the main improvement targets the receptive field; a grab bag of tricks

    • [GC-Net] GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond

CBAM: Convolutional Block Attention Module

  1. Motivation
    • attention module
    • lightweight and general
    • improvements in classification and detection
  2. Arguments

    • deeper: can obtain richer representation
    • increased width:can outperform an extremely deep network
    • cardinality:results in stronger representation power than depth and width
    • attention:improves the representation of interests
      • humans exploit a sequence of partial glimpses and selectively focus on salient parts
      • Residual Attention Network:computes 3d attention map
      • we decompose the process that learns channel attention and spatial attention separately
      • SE-block:use global average-pooled features
      • we suggest to use max-pooled features as well
  3. Method

    • sequentially infers a 1D channel attention map and a 2D spatial attention map (a PyTorch sketch of the full module appears at the end of this section)
    • broadcast and element-wise multiplication

    • Channel attention module

      • focuses on ‘what’ is meaningful
      • squeeze the spatial dimension
      • use both average-pooled and max-pooled features simultaneously
      • both descriptors are then forwarded to a shared MLP to reduce dimension
      • 【QUESTION】Is the MLP purely linear? The figure omits it, but the paper's text notes a ReLU follows the first (reduction) layer W0
      • then use element-wise summation
      • sigmoid function
    • Spatial attention module
      • focuses on ‘where’
      • apply average-pooling and max-pooling along the channel axis and concatenate
      • 7x7 conv
      • sigmoid function
    • Arrangement of attention modules

      • in a parallel or sequential manner
      • we found sequential better than parallel
      • we found channel-first order slightly better than the spatial-first

    • integration

      • apply CBAM on the convolution outputs in each block
      • in residual path
      • before the add operation
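A minimal PyTorch sketch of the module described above, assuming reduction ratio r=16 and a 7x7 spatial kernel as in the paper; class and variable names are mine, and this is an illustrative re-implementation rather than the authors' reference code.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # shared MLP: fc (reduce) -> ReLU -> fc (restore)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                 # GAP descriptor -> MLP
        mx = self.mlp(x.amax(dim=(2, 3)))                  # GMP descriptor -> MLP
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)    # 1D channel attention map


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                  # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)                   # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # 2D spatial map


class CBAM(nn.Module):
    """Channel-first, sequential arrangement; applied on the residual path
    before the addition when used inside a ResNet block."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)      # broadcast element-wise multiplication
        x = x * self.sa(x)
        return x


if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)
    print(CBAM(64)(feat).shape)        # torch.Size([2, 64, 32, 32])
```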

  4. Experiments

    • Ablation studies

      • Channel attention: both pooling paths help; using them together works best
      • Spatial attention: squeezing directly with a 1x1 conv also works, but avg+max pooling is better; a 7x7 conv is slightly better than 3x3
      • Arrangement: as noted above, better than SE's single spatial squeeze; channel-first is slightly better than spatial-first; sequential beats parallel
    • Classification results:outperform baselines and SE

    • Network Visualization
      • cover the target object regions better
      • the target class scores also increase accordingly
    • Object Detection results
      • apply to detectors:right before every classifier
      • apply to backbone

SK-Net: Selective Kernel Networks

  1. Motivation

    • The receptive field of biological neurons changes with the stimulus
    • propose a selective kernel unit
      • adaptively adjust the RF
      • multiple branches with different kernel sizes
      • guided fusion
      • A grab bag of tricks: multi-branch & multi-kernel, group conv, dilated conv, attention mechanism
    • SKNet
      • by stacking multiple SK units
      • Validated on classification tasks
  2. Arguments

    • multi-scale aggregation
      • Already present in the Inception block
      • but linear aggregation approach may be insufficient
    • multi-branch network
      • Two-branch: represented by ResNet, mainly to make networks easier to train
      • Multi-branch: represented by Inception, mainly to obtain multifarious features
    • grouped/depthwise/dilated conv
      • Grouped conv: reduces computation and improves accuracy
      • Depthwise conv: reduces computation at the cost of accuracy
      • Dilated conv: enlarges the RF with fewer parameters than a dense large kernel
    • attention mechanism
      • The weighting family:
        • SENet & CBAM: compared with them, SKNet additionally adapts the RF
      • The dynamic-convolution family:
        • STN is hard to train, and once trained its transformation is fixed
        • Deformable conv can vary the transformation dynamically at inference time, but lacks multi-scale and nonlinear aggregation
    • thus we propose SK convolution
      • Multiple kernels: the large-size conv kernels are implemented with dilated conv
      • nonlinear aggregation
      • computationally lightweight
      • can be successfully embedded into small models
      • workflow
        • split
        • fuse
        • select
      • main difference from inception
        • less customized
        • adaptive selection instead of equally addition
  3. Method

    • selective kernel convolution (a PyTorch sketch appears at the end of this section)

      • split

        • multi-branch with different kernel size
        • grouped/depthwise conv + BN + ReLU
        • the 5x5 kernel can be further replaced with a 3x3 dilated conv (dilation 2)
      • fuse

        • to learn the control of information flow from different branches
        • element-wise summation
        • global average pooling
        • fc-BN-ReLU: reduces the dimension to C/r, with a floor of 32
      • select

        • channel-wise weighting matrices A, B, ...: for each channel the weights across branches sum to 1

        • fc-softmax

        • With two branches, one weight matrix A suffices; B is redundant because it can be recovered as 1 - A

        • reweighting

    • network

      • Start from ResNeXt
      • Repeated SK units: similar to a bottleneck block
        • 1x1 conv
        • SK conv
        • 1x1 conv
        • hyperparams
          • number of branches M=2
          • group number G=32:cardinality of each path
          • reduction ratio r=16: the dim-reduction parameter in the fuse operator
      • Embedding into lightweight architectures

        • MobileNet/shuffleNet
        • Replace their 3x3 depthwise convolutions with SK conv
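A minimal PyTorch sketch of the selective kernel convolution described above: split (a 3x3 branch and a dilated 3x3 branch standing in for 5x5, both grouped convs), fuse (sum, GAP, fc-BN-ReLU), and select (softmax across branches per channel, then reweighting). Hyperparameters M=2, G=32, r=16, L=32 follow the paper; the class and variable names are mine.

```python
import torch
import torch.nn as nn


class SKConv(nn.Module):
    def __init__(self, channels, M=2, G=32, r=16, L=32):
        super().__init__()
        d = max(channels // r, L)                 # reduced dimension, floored at L
        self.M = M
        # split: branches share a 3x3 kernel but enlarge the RF via dilation
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1 + i,
                          dilation=1 + i, groups=G, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for i in range(M)
        ])
        # fuse: dimension reduction applied to the pooled summary
        self.fc = nn.Sequential(
            nn.Conv2d(channels, d, 1, bias=False),
            nn.BatchNorm2d(d),
            nn.ReLU(inplace=True),
        )
        # select: one score vector per branch, softmax taken across branches
        self.attn = nn.Conv2d(d, channels * M, 1)

    def forward(self, x):
        b, c, _, _ = x.shape
        feats = torch.stack([branch(x) for branch in self.branches], dim=1)  # (b, M, c, h, w)
        u = feats.sum(dim=1)                              # element-wise summation
        s = u.mean(dim=(2, 3), keepdim=True)              # global average pooling
        z = self.fc(s)                                    # (b, d, 1, 1)
        a = self.attn(z).view(b, self.M, c, 1, 1)         # branch-wise channel scores
        a = torch.softmax(a, dim=1)                       # weights sum to 1 per channel
        return (feats * a).sum(dim=1)                     # selective reweighting


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    print(SKConv(64)(x).shape)         # torch.Size([2, 64, 32, 32])
```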

  4. Experiments

    • Better accuracy than state-of-the-art ResNet, DenseNet, and ResNeXt models of comparable complexity

GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond

  1. Motivation

    • Non-Local Network (NLNet)
      • capture long-range dependencies
      • obtain query-specific global context
      • but we found global contexts are almost the same for different query positions
    • we propose
      • query-independent formulation
      • similar structure to SE-Net
      • aims at global context modeling
  2. Arguments

    • Capturing long-range dependency

      • mainly by deeply stacking conv layers: inefficient

      • non-local network

        • via self-attention mechanism
        • computes the pairwise relations between the query position and all positions, then aggregates
        • But the attention maps obtained from different query positions are almost identical

      • we simplify the non-local block

        • query-independent
        • maintain acc & save computation
    • our proposed GC-block

      • unifies both the NL block and the SE block
      • three steps
        • global context modeling: aggregates the features of all positions into a global context feature
        • feature transform module:capture channel-wise interdependency
        • fusion module:merge into the original features
    • Gains across multiple tasks

      • But all comparisons are against ResNet-50
  3. revisit NLNet

    • non-local block

      • $f(x_i, x_j)$:

        • encodes the relationship between position i & j
        • Instantiations include Gaussian, Embedded Gaussian, Dot product, and Concat
        • different instantiations achieve comparable performance
      • $C(x)$:norm factor

      • $z_i = x_i + W_z \sum_{j=1}^{N_p} \frac{f(x_i, x_j)}{C(x)} (W_v \cdot x_j)$: aggregates a query-specific global feature onto $x_i$

      • widely-used Embedded Gaussian: $f(x_i, x_j) = \exp(\langle W_q x_i, W_k x_j \rangle)$, normalized by $C(x)$ into a softmax over positions (written out in full after this section)

      • Integration:

        • Mask R-CNN with FPN and Res50
        • only add one non-local block right before the last residual block of res4
      • observations & inspirations

        • distances among the input features show that inputs at different positions are well discriminated
        • outputs & attention maps are almost the same:global context after training is actually independent of query position
        • inspirations
          • simplify the Non-local block
          • no need of query-specific
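Putting the pieces above together (and anticipating the simplification discussed in the method section below), the Embedded Gaussian NL block and the simplified SNL form with $W_v$ moved outside the pooling can be written, following the paper's notation, as:

$$z_i = x_i + W_z \sum_{j=1}^{N_p} \frac{\exp(\langle W_q x_i,\, W_k x_j \rangle)}{\sum_{m=1}^{N_p} \exp(\langle W_q x_i,\, W_k x_m \rangle)} \,(W_v \cdot x_j)$$

$$z_i = x_i + W_v \sum_{j=1}^{N_p} \frac{\exp(W_k x_j)}{\sum_{m=1}^{N_p} \exp(W_k x_m)} \, x_j$$

Dropping the query term $W_q x_i$ makes the weights identical for every query position, and pulling $W_v$ outside the sum lets it act on a single pooled vector instead of all $H \times W$ positions.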
  4. Method

    • Simplified form of the NL block: SNL

      • Compute one common global feature and share it with every position in the feature map

      • Further simplification: move the 1x1 conv $W_v$ outside the weighted sum; FLOPs drop sharply because it now acts on a 1x1 pooled feature instead of HxW positions

      • the SNL block achieves comparable performance to the NL block with significantly lower FLOPs

    • global context modeling

      • SNL can be abstracted into three parts:
        • global attention pooling: obtain attention weights via $W_k$ & softmax, then perform weighted global pooling
        • feature transform:1x1 conv
        • feature aggregation:broadcast element-wise add
      • The SE block can be decomposed into the same abstraction

        • global attention pooling: plain global average pooling
        • feature transform: the squeeze-and-excite fc-ReLU-fc-sigmoid
        • feature aggregation:broadcast element-wise multiplication

    • Global Context Block (a PyTorch sketch appears at the end of this section)

      • integrate the benefits of both
        • SNL global attention pooling:effective modeling on long-range dependency
        • SE bottleneck transform: light computation (saves parameters and FLOPs whenever the ratio is greater than 2)
      • Additionally, a LayerNorm is inserted in the bottleneck transform, after the reduction layer
        • ease optimization
        • benefit generalization
      • fusion:add
      • Integration:
        • GC-ResNet50
        • add the GC block to all layers (c3+c4+c5) of ResNet-50 with ratio 16
    • relationship to SE-block

      • First, the fusion method reflects different goals
        • SE rescales the channels based on the global information, using it indirectly
        • GC uses it directly, adding the long-range dependency onto every position
      • Second, the normalization layer
        • ease optimization
      • Third, the global attention pooling
        • SE's GAP is a special case (uniform weights)
        • learned weighting factors prove superior
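A minimal PyTorch sketch of the GC block described above: global attention pooling via a 1x1 conv ($W_k$) with a softmax over all H*W positions, an SE-style bottleneck transform with LayerNorm, and broadcast addition back onto every position. The class name and the default ratio are assumptions for illustration, not the authors' reference code.

```python
import torch
import torch.nn as nn


class GCBlock(nn.Module):
    def __init__(self, channels, ratio=16):
        super().__init__()
        self.mask = nn.Conv2d(channels, 1, kernel_size=1)     # W_k: per-position logits
        mid = channels // ratio
        self.transform = nn.Sequential(                        # SE-style bottleneck
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.LayerNorm([mid, 1, 1]),                          # eases optimization
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # global attention pooling: softmax over all H*W positions, then pool
        weights = torch.softmax(self.mask(x).view(b, 1, h * w), dim=-1)    # (b, 1, hw)
        context = torch.bmm(x.view(b, c, h * w), weights.transpose(1, 2))  # (b, c, 1)
        context = context.view(b, c, 1, 1)
        # feature transform + fusion: broadcast element-wise addition
        return x + self.transform(context)


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    print(GCBlock(64)(x).shape)        # torch.Size([2, 64, 32, 32])
```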