
Hausdorff Distance

发表于 (Published) 2020-05-14

Reducing the Hausdorff Distance in Medical Image Segmentation with Convolutional Neural Networks

  1. Motivation

    • novel loss function to reduce HD directly
    • propose three methods
    • 2D & 3D; ultrasound, MR, and CT
    • lead to approximately 18 − 45% reduction in HD without degrading other segmentation performance criteria
  2. Key points

    • HD is one of the most informative and useful criteria because it is an indicator of the largest segmentation error
    • current segmentation algorithms rarely aim at minimizing or reducing HD directly
      • HD is determined solely by the largest error instead of the overall segmentation performance
      • HD's sensitivity to noise and outliers → use a modified version
      • the optimization difficulty
    • thus we propose an “HD- inspired” loss function
  3. Methods

    • notation

      • probability: $q$
      • binary masks: $\bar p$, $\bar q$
      • boundaries: $\delta p$, $\delta q$
      • one-sided HD: $hd(\bar p, \bar q)$, $hd(\bar q, \bar p)$

    • based on distance transforms

      • distance map $d_p$: defined as the unsigned distance to the boundary $\delta p$

        The distance map assigns to every point the minimum distance from that point to the target set (here, the boundary $\delta p$).

      • HD based on DT:

        • finally have:
      • modified loss version of HD:

        • penalizes whole error regions instead of only the single farthest point
        • $\alpha$ determines how strongly we penalize larger errors
        • uses the probability map instead of the thresholded mask
        • uses $(p-q)^2$ instead of $|p-q|$
      • correlations

        • $HD_{DT}$:Pearson correlation coefficient above 0.99
        • $Loss_{DT}$:Pearson correlation coefficient above 0.93

      • drawback

        • high computational cost especially in 3D

        • $q$ changes during training, so $d_q$ must be recomputed, while $d_p$ stays fixed

        • modified one-sided HD (OS) that avoids recomputing $d_q$ at every step
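
      A minimal NumPy/SciPy sketch of the distance-transform-based estimate discussed above; the loss form $(p-q)^2\,(d_p^\alpha + d_q^\alpha)$ follows my reading of the bullets rather than the paper's exact equation:

      import numpy as np
      from scipy.ndimage import binary_erosion, distance_transform_edt

      def boundary(mask):
          # boundary pixels = mask minus its one-pixel erosion
          mask = mask.astype(bool)
          return mask & ~binary_erosion(mask)

      def hausdorff_dt(p_mask, q_mask):
          # symmetric HD between the boundaries delta_p and delta_q
          # (assumes both masks contain at least one foreground pixel)
          bd_p, bd_q = boundary(p_mask), boundary(q_mask)
          d_p = distance_transform_edt(~bd_p)   # unsigned distance map to delta_p
          d_q = distance_transform_edt(~bd_q)   # unsigned distance map to delta_q
          return max(d_p[bd_q].max(), d_q[bd_p].max())

      def dt_loss(p_mask, q_prob, alpha=2.0):
          # soft error (p - q)^2 weighted by both distance maps;
          # alpha controls how strongly large errors are penalized
          d_p = distance_transform_edt(~boundary(p_mask))
          d_q = distance_transform_edt(~boundary(q_prob > 0.5))
          return np.mean((p_mask.astype(float) - q_prob) ** 2 * (d_p ** alpha + d_q ** alpha))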

    • HD using Morphological Operations

      • morphological erosion:

        Erosion is defined as follows: for every foreground pixel of the binary map, center the structuring element B on it; the pixel stays foreground after erosion only if B fits entirely inside the original foreground.

      • HD based on erosion:

        • $HD_{ER}$ is a lower bound of the true value
        • can be computed more efficiently using convolutional operations
      • modified loss version:

        • k successive erosions
        • cross-shaped kernel whose elements sum to one followed by a soft thresholding at 0.50
      • correlations

        • $HD_{ER}$:Pearson correlation coefficient above 0.91
        • $Loss_{ER}$:Pearson correlation coefficient above 0.83
    • HD using circular-shaped convolutional kernel

      • circular-shaped kernel

      • HD based on circular-shaped kernel:

        • $\bar p^C$: the complement of $\bar p$
        • $f_h$: hard thresholding that sets all values below 1 to zero
      • modified loss version:

        • soft thresholding
      • correlations

        • $HD_{CV}$:Pearson correlation coefficient above 0.99
        • $Loss_{CV}$:Pearson correlation coefficient above 0.88
      • computation:

        • kernel size
          • $HD_{ER}$ is computed using small fixed convolutional kernels (of size 3)
          • $Loss_{CV}$ requires applying filters of increasing size (we use a maximum kernel radius of 18 pixels in 2D and 9 voxels in 3D)
        • steps
          • choose R based on the expected range of segmentation errors
          • set R = {3, 6, ..., 18} for 2D images and R = {3, 6, 9} for 3D
    • training

      • standard U-Net
      • augment the HD-based loss term with a DSC loss term for more stable training
      • re-weight the two loss terms after every epoch


SE block

发表于 (Published) 2020-04-30

Overview

Feature extraction is the core capability of a CNN, and the SE block acts as a recalibration mechanism for the features a CNN samples.

By receptive-field reasoning, the feature maps are dominated by the central region of the sample, so the features of, say, a wine bottle sitting near the image border are very likely to be discarded by the pooling layers. Adding an SE block can adjust the feature maps, increase the weight of the bottle's features, and raise its recognition probability.

  1. [SE-Net] Squeeze-and-Excitation Networks
  2. [SC-SE] Concurrent Spatial and Channel ‘Squeeze & Excitation’ in Fully Convolutional Networks
  3. [CMPE-SE] Competitive Inner-Imaging Squeeze and Excitation for Residual Network

SENet: Squeeze-and-Excitation Networks

  1. Motivation

    • prior research has investigated the spatial component to achieve more powerful representations
    • we focus on the channel relationship instead
    • SE-block:adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels
    • enhancing the representational power
    • in a computationally efficient manner
  2. Key points

    • stronger network:
      • deeper
      • NiN-like blocks
    • cross-channel correlations in prior work
      • mapped as new combinations of features through 1x1 conv
      • concentrated on the objective of reducing model and computational complexity
    • In contrast, we found this mechanism
      • can ease the learning process
      • and significantly enhance the representational power of the network
    • Attention
      • Attention can be interpreted as a means of biasing the allocation of available computational resources towards the most informative components
        • Some works provide interesting studies into the combined use of spatial and channel attention
  3. Methods

    • SE-block

      • The channel relationships modelled by convolution are inherently implicit and local

      • we would like to provide it with access to global information

      • squeeze:using global average pooling

      • excitation:nonlinear & non-mutually-exclusive using sigmoid

        • bottleneck:a dimensionality-reduction layer $W_1$ with reduction ratio $r$ and ReLU and a dimensionality-increasing layer $W_2$

        • $s = F_{ex}(z,W) = \sigma (W_2 \delta(W_1 z))$

      • integration

        • insert after the non-linearity following each convolution
        • inception:take the transformation $F_{tr}$ to be an entire Inception module
        • residual:take the transformation $F_{tr}$ to be the non-identity branch of a residual module
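
      A minimal tf.keras sketch of the squeeze-and-excitation recalibration described above (reduction ratio r; sizes and names are illustrative):

      import tensorflow as tf
      from tensorflow.keras import layers

      def se_block(x, r=16):
          c = x.shape[-1]
          z = layers.GlobalAveragePooling2D()(x)           # squeeze: global spatial information
          z = layers.Dense(c // r, activation="relu")(z)   # W1: dimensionality reduction + ReLU
          z = layers.Dense(c, activation="sigmoid")(z)     # W2: non-mutually-exclusive gating
          s = layers.Reshape((1, 1, c))(z)
          return x * s                                     # channel-wise recalibration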

    • model and computational complexity

      • ResNet-50 vs. SE-ResNet-50: a 0.26% relative increase in GFLOPs while approaching ResNet-101's accuracy
      • the additional parameters result solely from the two FC layers, among which the final stage FC claims the majority due to being performed across the greatest number of channels
      • the costly final stage of SE blocks could be removed at only a small cost in performance
    • ablations

      • FC
        • removing the biases of the FC layers in the excitation facilitates the modelling of channel dependencies
      • reduction ratio
        • performance is robust to a range of reduction ratios
        • In practice, using an identical ratio throughout a network may not be optimal due to the distinct roles performed by different layers
      • squeeze
        • global average pooling vs. global max pooling:average pooling slightly better
      • excitation
        • Sigmoid vs. ReLU vs. tanh:
          • tanh:slightly worse
          • ReLU:dramatically worse
      • stages
        • each stage brings benefits
        • combining them is even better
      • integration strategy
        • fairly robust to their location, provided that they are applied prior to branch aggregation
        • inside the residual unit:fewer channels, fewer parameters, comparable accuracy
    • primitive understanding
      • squeeze
        • the use of global information has a significant influence on the model performance
      • excitation
        • the distribution across different classes is very similar at the earlier layers (general features)
        • the value of each channel becomes much more class-specific at greater depth
        • SE_5_2 exhibits an interesting tendency towards a saturated state in which most of the activations are close to one
        • SE_5_3 exhibits a similar pattern across different classes, up to a modest change in scale
        • suggesting that SE_5_2 and SE_5_3 are less important than previous blocks in providing recalibration to the network (thus can be removed)
  4. APPENDIX

    • The ImageNet SOTA model is SENet-154, with a top-1 error of 18.68; it is the point marked on the accuracy curve in the EfficientNet paper
      • SE-ResNeXt-152 (64x4d)
        • input = (224, 224): top-1 error 18.68
        • input = 320/299: top-1 error 17.28
      • further differences
        • the number of channels of the first 1x1 conv in each bottleneck building block is halved
        • the first 7x7 conv in the stem is replaced by three consecutive 3x3 convs
        • the 1x1 stride-2 conv is replaced by a 3x3 stride-2 conv
        • a dropout layer is added before the FC layer
        • label smoothing
        • the BN parameters are frozen for the last few training epochs so that training and testing statistics stay consistent
        • 64 GPUs, batch size = 2048 (32 per GPU)
        • initial lr = 1.0

SC-SE: Concurrent Spatial and Channel ‘Squeeze & Excitation’ in Fully Convolutional Networks

  1. Motivation

    • SE-Net was proposed mainly for classification; here the target is the image segmentation task
    • three variants of SE modules
      • squeezing spatially and exciting channel-wise (cSE)
      • squeezing channel-wise and exciting spatially (sSE)
      • concurrent spatial and channel squeeze & excitation (scSE)
    • integrate within three different state-of-the-art F-CNNs (DenseNet, SD-Net, U-Net)
  2. Key points

    • F-CNNs have become the tool of choice for many image segmentation tasks
    • core: convolutions that capture local spatial patterns along all input channels jointly
    • the SE block factors out the spatial dependency by global average pooling to learn a channel-specific descriptor (later referred to as cSE / channel SE)
    • while for image segmentation, we hypothesize that the pixel-wise spatial information is more informative
    • thus we propose sSE (spatial SE) and scSE (spatial and channel SE)
    • can be seamlessly integrated by placing after every encoder and decoder block

  3. Methods

    • cSE
      • GAP:embeds the global spatial information into a vector
      • FC-ReLU-FC-Sigmoid:adaptively learns the importance
      • recalibrate
    • sSE
      • 1x1 conv:generating a projection tensor representing the linearly combined representation for all channels C for a spatial location (i,j)
      • Sigmoid:rescale
      • recalibrate
    • scSE
      • by element-wise addition
      • encourages the network to learn feature maps that are meaningful both spatially and channel-wise
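
      A minimal tf.keras sketch of the three variants described above (illustrative layer sizes):

      import tensorflow as tf
      from tensorflow.keras import layers

      def cse(x, r=16):
          # channel SE: squeeze spatially (GAP), excite channel-wise
          c = x.shape[-1]
          z = layers.GlobalAveragePooling2D()(x)
          z = layers.Dense(c // r, activation="relu")(z)
          z = layers.Dense(c, activation="sigmoid")(z)
          return x * layers.Reshape((1, 1, c))(z)

      def sse(x):
          # spatial SE: squeeze channel-wise (1x1 conv), excite per spatial location
          return x * layers.Conv2D(1, 1, activation="sigmoid")(x)

      def scse(x, r=16):
          # concurrent spatial and channel SE, combined by element-wise addition
          return layers.Add()([cse(x, r), sse(x)])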
  4. Experiments

    • F-CNN architectures:

      • 4 encoder blocks, one bottleneck layer, 4 decoder blocks and a classification layer
      • class imbalance: cross-entropy weighted with median frequency balancing
    • Dice comparison: scSE > sSE > cSE > vanilla

    • for small-region classes, cSE can be worse than vanilla: such structures might get overlooked when only the channels are excited

    • qualitative analysis:

      • in some under-segmented areas, scSE improves the result
      • in some over-segmented areas, scSE rectifies the result

Competitive Inner-Imaging Squeeze and Excitation for Residual Network

  1. Motivation
    • for residual network
    • the residual architecture has been proved to be diverse and redundant
    • model the competition between residual and identity mappings
    • make the identity flow to control the complement of the residual feature maps
  2. Key points

    • For analysis of ResNet, with the increase in depth, the residual network exhibits a certain amount of redundancy
    • with the CMPE-SE mechanism, it makes residual mappings tend to provide more efficient supplementary for identity mappings

  3. Methods

    • Three variants are proposed:

    • Variant 1:

      • the identity and residual branches are each GAP-ed into a vector, each reduced by an FC layer with a ratio, then concatenated and mapped back to the channel dimension
      • Implicitly, we can believe that the winning of the identity channels in this competition results in lower weights for the residual channels
    • Variant 2:

      • two schemes

        • 2x1 convs: average the elements at corresponding positions of the two stacked vectors
        • 1x1 convs: average over all elements, then flatten

    • Variant 3:

      • the channel-wise vectors from the two branches are stacked, reshaped into a matrix, passed through a 3x3 conv, then flattened

Not very convincing; I won't spend more time analyzing it.

cv2&numpy&tobeadded

发表于 (Published) 2020-04-19
  1. Matrix multiplication
    • np.dot(A, B): true matrix multiplication
    • np.multiply(A, B) and NumPy's overloaded *: element-wise product (corresponding elements multiplied)
    • OpenCV's A.dot(B) and its overloaded *: true matrix multiplication
    • OpenCV's A.mul(B): element-wise product (corresponding elements multiplied)
  1. Image rotation

    Implemented with the affine matrix from cv2.getRotationMatrix2D and the affine transform function cv2.warpAffine

      cv2.getRotationMatrix2D(center, angle, scale): returns a 2x3 transform matrix

    • center: rotation center

    • angle: rotation angle in degrees, positive values rotate counter-clockwise
    • scale: scaling factor

      cv2.warpAffine(src, M, dsize[, dst[, flags[, borderMode[, borderValue]]]]): returns the transformed image

    • src: input image

    • M: transform matrix

    • dsize: size of the output image (cropped starting from the image origin)

    • flags: interpolation method

    • borderMode: border pixel mode

    • borderValue: border fill value, 0 by default

      import math

      import cv2


      def rotate_img(angle, img, interpolation=cv2.INTER_LINEAR, points=()):
          # angle is in radians; img is a single-channel image of shape (h, w)
          h, w = img.shape
          rotataMat = cv2.getRotationMatrix2D((w / 2, h / 2), math.degrees(angle), 1)
          # rotate_img1: output keeps the original size, anything rotated outside is cut off
          rotate_img1 = cv2.warpAffine(img, rotataMat, dsize=(w, h), flags=interpolation,
                                       borderMode=cv2.BORDER_CONSTANT, borderValue=0)
          # rotate_img2: output is enlarged to keep the rotated-out parts; the translation
          # terms are adjusted so the rotation center stays at the image center
          new_h = int(w * math.fabs(math.sin(angle)) + h * math.fabs(math.cos(angle)))
          new_w = int(h * math.fabs(math.sin(angle)) + w * math.fabs(math.cos(angle)))
          rotataMat[0, 2] += (new_w - w) / 2
          rotataMat[1, 2] += (new_h - h) / 2
          rotate_img2 = cv2.warpAffine(img, rotataMat, dsize=(new_w, new_h), flags=interpolation,
                                       borderMode=cv2.BORDER_CONSTANT, borderValue=0)

          # transform the coordinate points with the same affine matrix
          rotated_points = []
          for point in points:
              new_point = rotataMat.dot([[point[0]], [point[1]], [1]])
              rotated_points.append((int(new_point[0]), int(new_point[1])))

          return rotate_img2, rotated_points

      Usage tips:

    • if the translation terms of the affine matrix are not modified, the position of the coordinate origin does not change

    • the output image specified by dsize is cropped starting from the origin

    • coordinate points are transformed by the formula below:
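
      This is the standard 2x3 affine mapping, i.e. the same product computed by rotataMat.dot in the code above:

      $$\begin{bmatrix} x' \\ y' \end{bmatrix} = M \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad M = \begin{bmatrix} m_{11} & m_{12} & t_x \\ m_{21} & m_{22} & t_y \end{bmatrix}$$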

  1. np.meshgrid(*xi,**kwargs)

    This function is surprisingly tricky. Its purpose is to return coordinate matrices from coordinate vectors, i.e. make N-D coordinate arrays for vectorized evaluations of N-D scalar/vector fields over N-D grids, given one-dimensional coordinate arrays x1, x2, ..., xn. But try it and you will find:

    x = np.arange(0,10,1)
    y = np.arange(0,20,1)
    z = np.arange(0,30,1)
    x, y, z= np.meshgrid(x, y, z)
    print(x.shape) # (20, 10, 30)

    The x and y axes come out swapped, because of the indexing optional argument:

    indexing : {‘xy’, ‘ij’}, Cartesian (‘xy’, default) or matrix (‘ij’) indexing of output.

    To get a coordinate grid whose axes correspond one-to-one with the input order, you must pass indexing='ij':

    x = np.arange(0,10,1)
    y = np.arange(0,20,1)
    z = np.arange(0,30,1)
    x, y, z= np.meshgrid(x, y, z, indexing='ij')
    print(x.shape) # (10, 20, 30)

    There is also a sparse parameter: since the coordinates along each axis are just repeated copies, they can be stored sparsely, and the return values change accordingly:

    sparse : bool, If True a sparse grid is returned in order to conserve memory. Default is False.

    x = np.arange(0,10,1)
    y = np.arange(0,20,1)

    xx, yy = np.meshgrid(x, y)
    print(xx) # a 20x10 array

    xx, yy = np.meshgrid(x, y, sparse=True)
    print(xx) # a 1x10 array
    print(yy) # a 20x1 array
    # so together they still describe a 20x10 grid

    2D visualization:

    import matplotlib.pyplot as plt

    z = xx**2 + yy**2 # xx and yy can be either the dense or the sparse version
    h = plt.contourf(x,y,z)
    plt.show()
  1. np.tile(A,reps)

    This function is quite useful: it repeats an array along the specified axes, which is more elegant than stack/concat and can create new axes automatically (see the example below)

    • A:array_like, The input array.
    • reps:array_like, The number of repetitions of A along each axis.
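
    A small example of the behaviour described above, including the automatic creation of a new axis:

      import numpy as np

      a = np.array([1, 2, 3])       # shape (3,)
      print(np.tile(a, 2))          # [1 2 3 1 2 3], shape (6,)
      print(np.tile(a, (2, 1)))     # shape (2, 3): a new leading axis is created
      print(np.tile(a, (2, 2)))     # shape (2, 6)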
  1. np.reshape(a, newshape, order=’C’)

    This function is used all the time, but when it is used on 2-D arrays the reordering behaviour is usually ignored

    • order: {'C', 'F', 'A'}, optional. Roughly, reshape first flattens the whole array and then refills the data into the target shape; 'C' flattens and rebuilds starting from the innermost (last) dimension, 'F' starts from the outermost (first) dimension, and 'A' picks automatically

      a = np.arange(6)
      array([0, 1, 2, 3, 4, 5])

      # C-like index ordering
      np.reshape(a, (2, 3))
      array([[0, 1, 2],
      [3, 4, 5]])

      # Fortran-like index ordering
      np.reshape(a, (2, 3), order='F')
      array([[0, 2, 4],
      [1, 3, 5]])
    • tf and keras also have reshape; they have no order argument and always use 'C' order

MobileNets

发表于 (Published) 2020-04-16

preview

  1. Motivation

    • limited compute
    • model compression / use small models
  2. Depthwise Separable Convolution

    • splits a standard convolution into two operations: a depthwise convolution and a pointwise convolution
    • standard convolution: parameter count k*k*input_channel*output_channel
    • depthwise convolution: a separate kernel for each input channel, parameter count k*k*input_channel
    • pointwise convolution: an ordinary convolution with a 1x1 kernel, parameter count 1*1*input_channel*output_channel
    • with BN and ReLU:

    • DW cannot change the number of channels, so if the input has few channels it can only extract features in a low-dimensional space. V2 therefore first expands the input with a non-linear PW conv, then applies DW, and finally uses another PW conv to reduce the dimension again; notably, the second PW has no non-linear activation, because the authors argue that ReLU applied in a low-dimensional space causes information loss.

  3. Further reducing computation

    • channel reduction: width multiplier alpha
    • resolution reduction: resolution multiplier rho
  4. papers

    • [V1 CVPR2017] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, Google; main contribution: depthwise separable convolution
    • [V2 CVPR2018] MobileNetV2: Inverted Residuals and Linear Bottlenecks, Google; main contribution: inverted residual with linear bottleneck
    • [V3 ICCV2019] Searching for MobileNetV3, Google; the block gains SE (inverted-res-block + SE-block) and the structure is found by NAS rather than designed by hand
    • [EfficientNet-lite ICML2019] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks; the scaled-down version of the EfficientNet family: the original EfficientNet is built on MobileNetV3's basic block, while the Lite version has many mobile-specific changes (removing SE, switching to ReLU6, etc.)
    • [MobileOne 2022] An Improved One millisecond Mobile Backbone, Apple

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

  1. Motivation
    • efficient models:uses depthwise separable convolutions and two simple global hyper-parameters
    • resource and accuracy tradeoffs
    • a class of network architectures that allows a model developer to specifically choose a small network that matches the resource restrictions (latency, size) for their application
  2. Key points:

    • the general trend has been to make deeper and more complicated networks in order to achieve higher accuracy
    • not efficient on computationally limited platform
    • building small and efficient neural networks:either compressing pretrained networks or training small networks directly
    • Many papers on small networks focus only on size but do not consider speed
    • speed and size are not fully equivalent
    • size:depthwise separable convolutions, bottleneck approaches, compressing pretrained networks, distillation
  3. Methods

    • depthwise separable convolutions

      • a form of factorized convolutions:a standard conv splits into 2 layers
      • factorize the filtering and combination steps of standard conv
      • drastically reducing computation and model size to $\frac{1}{N} + \frac{1}{D_k^2}$
      • use both batchnorm and ReLU nonlinearities for both layers
      • MobileNet uses 3 × 3 depthwise separable convolutions which bring between 8 to 9 times less computation
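
      A quick check of the parameter-count reduction above, using tf.keras purely as a counting device with illustrative sizes (k=3, 128 input channels, 256 output channels, no biases):

      import tensorflow as tf
      from tensorflow.keras import layers

      inp = tf.keras.Input((32, 32, 128))

      # standard conv: 3*3*128*256 = 294,912 parameters
      std = tf.keras.Model(inp, layers.Conv2D(256, 3, padding="same", use_bias=False)(inp))

      # depthwise (3*3*128 = 1,152) + pointwise (1*1*128*256 = 32,768) = 33,920 parameters
      x = layers.DepthwiseConv2D(3, padding="same", use_bias=False)(inp)
      x = layers.Conv2D(256, 1, use_bias=False)(x)
      sep = tf.keras.Model(inp, x)

      print(std.count_params(), sep.count_params())   # ratio is roughly 1/N + 1/k^2 = 1/256 + 1/9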

    • MobileNet

      • the first layer is a full convolution, the rest depthwise separable convolutions
      • down sampling is handled with strided convolution
      • all layers are followed by a BN and ReLU nonlinearity
      • a final average pooling reduces the spatial resolution to 1 before the fully connected layer.
      • the final fully connected layer has no nonlinearity and feeds into a softmax layer for classification

    • training so few parameters

      • RMSprop
      • less regularization and data augmentation techniques because small models have less trouble with overfitting
      • it was important to put very little or no weight decay (l2 regularization)
      • do not use side heads or label smoothing or image distortions
    • Width Multiplier: Thinner Models

      • thin a network uniformly at each layer
      • the input channels $M$ and output channels $N$ becomes $\alpha M$ and $\alpha N$
      • $\alpha=1$:baseline MobileNet $\alpha<1$:reduced MobileNet
      • reduce the parameters roughly by $\alpha^2$
    • Resolution Multiplier: Reduced Representation

      • apply this to the input image
      • the input resolution of the network is typically 224, 192, 160 or 128
      • $\rho=1$:baseline MobileNet $\rho<1$:reduced MobileNet
      • reduce the parameters roughly by $\rho^2$

  4. Conclusions

    • using depthwise separable convolutions compared to full convolutions only reduces accuracy by 1% on ImageNet but saving tremendously on mult-adds and parameters

    • at similar computation and parameter counts, making MobileNets thinner is about 3% better than making them shallower

    • trade-offs based on the two hyper-parameters

MobileNetV2: Inverted Residuals and Linear Bottlenecks

  1. Motivation

    • a new mobile architecture
      • based on an inverted residual structure
      • remove non-linearities in the narrow layers in order to maintain representational power
    • prove on multiple tasks
      • object detection:SSDLite
      • semantic segmentation:Mobile DeepLabv3
  2. Methods

    • Depthwise Separable Convolutions

      • replace a full convolutional operator with a factorized version
      • depthwise convolution, it performs lightweight filtering per input channel
      • pointwise convolution, computing linear combinations of the input channels
    • Linear Bottlenecks

      • ReLU results in information loss in lower dimension space
      • expansion ratio:if we have lots of channels, information might still be preserved in the other channels
      • linear: the bottleneck contains no non-linear activation

    • Inverted residuals

      • bottlenecks actually contain all the necessary information
      • expansion layer acts merely as an implementation detail that accompanies a non-linear transformation

      • parameter count:
      • basic building block is a bottleneck depth-separable convolution with residuals

    • interpretation

      • provides a natural separation between the input/output
      • expansion: capacity
      • layer transformation: expressiveness

    • MobileNetV2 model architecture

      • initial filters: 32
      • ReLU6: use ReLU6 as the non-linearity because of its robustness when used with low-precision computation
      • use a constant expansion rate between 5 and 10, except for the first block: smaller networks lean toward smaller rates and larger networks toward larger ones

      <img src="MobileNets/MobileNetV2.png" width="40%" />

      • comparison with other architectures

      <img src="MobileNets/cmpV2.png" width="40%" />
  1. Experiments

    • Object Detection

      • evaluate the performance as feature extractors
      • replace all the regular convolutions with separable convolutions in the SSD prediction layers: the backbone is unchanged, only the head convolutions are replaced, reducing computation
      • achieves competitive accuracy with significantly fewer parameters and smaller computational complexity
    • Semantic Segmentation

      • build DeepLabv3 heads on top of the second last feature map of MobileNetV2
      • DeepLabv3 heads are computationally expensive and removing the ASPP module significantly reduces the MAdds
    • ablation

      • inverted residual connections: shortcuts connecting the bottlenecks perform better than shortcuts connecting the expanded layers, i.e. the skip connection is placed on the low-channel features
      • linear bottlenecks:linear bottlenecks improve performance, providing support that non-linearity destroys information in low-dimensional space
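
      A minimal tf.keras sketch of the inverted residual block with a linear bottleneck as described above (expansion ratio t; illustrative, not the exact paper configuration):

      import tensorflow as tf
      from tensorflow.keras import layers

      def inverted_residual(x, out_channels, stride=1, t=6):
          in_channels = x.shape[-1]
          y = layers.Conv2D(t * in_channels, 1, use_bias=False)(x)    # expand with a PW conv
          y = layers.BatchNormalization()(y)
          y = layers.ReLU(max_value=6)(y)
          y = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(y)
          y = layers.BatchNormalization()(y)
          y = layers.ReLU(max_value=6)(y)
          y = layers.Conv2D(out_channels, 1, use_bias=False)(y)       # linear bottleneck: no ReLU
          y = layers.BatchNormalization()(y)
          if stride == 1 and in_channels == out_channels:
              y = layers.Add()([x, y])                                # shortcut on the narrow features
          return y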

Searching for MobileNetV3

  1. Motivation

    • automated search algorithms and network design work together
    • classification & detection & segmentation
    • a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP)
    • new efficient versions of nonlinearities
  2. Key points

    • reducing
      • the number of parameters
      • the number of operations (MAdds)
      • inference latency
    • related work
      • SqueezeNet:1x1 convolutions
      • MobileNetV1:separable convolution
      • MobileNetV2:inverted residuals
      • ShuffleNet:group convolutions
      • CondenseNet:group convolutions
      • ShiftNet:shift operation
      • MnasNet:MobileNetV2+SE-block,attention modules are placed after the depthwise filters in the expansion
  3. Methods

    • base blocks

      • combination of ideas from [MobileNetV1, MobileNetV2, MnasNet]
      • inverted-res-block + SE-block
      • swish nonlinearity
      • hard sigmoid
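
      The nonlinearities named above, written out with plain TensorFlow ops (standard definitions):

      import tensorflow as tf

      def hard_sigmoid(x):
          return tf.nn.relu6(x + 3.0) / 6.0

      def swish(x):
          return x * tf.sigmoid(x)

      def hard_swish(x):
          # h-swish: swish with the sigmoid replaced by its piecewise-linear approximation
          return x * hard_sigmoid(x)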

    • Network Search

      • use platform-aware NAS to search for the global network structures
      • use the NetAdapt algorithm to search per layer for the number of filters
    • Network Improvements

      • redesign the computationally expensive layers at the beginning and the end of the network

        • the last block of MobileNetV2’s inverted bottleneck structure
        • move this layer past the final average pooling: it then operates on a 1x1 feature map instead of 7x7, a workaround that keeps the capacity at much lower cost
      • a new nonlinearity, h-swish

        • the initial set of filters are also expensive:usually start with 32 filters in a full 3x3 convolution to build initial filter banks for edge detection

        • reduce the number of filters to 16 and use the hard swish nonlinearity

        • most of the benefits of swish are realized by using it only in the deeper layers, i.e. only in the second half of the network

      • SE-block

        • ratio: fixed to 1/4 of the number of channels of the expansion layer in all blocks
    • MobileNetV3 architecture

  4. Experiments

    • Detection

      • use MobileNetV3 as a replacement for the backbone feature extractor in SSDLite: this time it serves as the backbone
      • reduce the channel counts of the C4 & C5 blocks: MobileNetV3 was originally designed to output 1000 classes, so it is somewhat redundant when transferred to the 90-class COCO dataset
    • Segmentation

      • as network backbone

      • compare two segmentation heads

        • R-ASPP:reduced design of the Atrous Spatial Pyramid Pooling module with only two branches
        • Lite R-ASPP: an SE-block-like design with a large kernel and a large stride

EfficientNet-lite

  1. No dedicated paper

    • an adaptation of EfficientNet toward mobile devices
    • https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet/lite
    • https://blog.tensorflow.org/2020/03/higher-accuracy-on-vision-models-with-efficientnet-lite.html
  2. model zoo

    | model | width | depth | resolution | droprate |
    | --- | --- | --- | --- | --- |
    | efficientnet-lite0 | 1.0 | 1.0 | 224 | 0.2 |
    | efficientnet-lite1 | 1.0 | 1.1 | 240 | 0.2 |
    | efficientnet-lite2 | 1.1 | 1.2 | 260 | 0.3 |
    | efficientnet-lite3 | 1.2 | 1.4 | 280 | 0.3 |
    | efficientnet-lite4 | 1.4 | 1.8 | 300 | 0.3 |

    • key numbers:

      • lite4 reaches 80.4% accuracy while still running in real time on a Pixel 4 CPU: 30 ms/image
      • latency:10-30ms
      • model size:5M-15M

<img src="MobileNets/lite-size.png" width="40%;" />

  3. challenges

    • Quantization: mobile devices have limited floating-point support; use post-training quantization to convert the float TF model into a TFLite model (full-integer int8 / half-precision float16)
    • Heterogeneous hardware: mobile devices vary widely and many ops are unsupported, so replace them with natively supported ops wherever possible

  4. modifications

    • Removed squeeze-and-excitation blocks: SE is not well supported on mobile hardware

    • Replaced swish with ReLU6: friendlier to post-training quantization

      • when floats were first replaced with integers, a huge accuracy drop was observed: 75 -> 48
      • the cause was that the float activations were too wide-ranged, so mapping them directly to int8 lost too much precision
      • hence the switch to ReLU6, whose activation range is bounded to [0, 6]

      <img src="MobileNets/int8.png" width="40%;" /><img src="MobileNets/quantization.png" width="40%;" />

    • Fixed the stem and head while scaling models up: the stem sees a large resolution and the head has many channels, so scaling them up strongly affects parameters/computation; only the middle stages are scaled up

MobileOne: An Improved One millisecond Mobile Backbone

  1. Motivation

    • metrics such as FLOPs and parameter count do not correlate directly with on-device latency
    • this paper

      • investigates the architectural and optimization bottlenecks of the various MobileNets
      • proposes MobileOne
    • accuracy

      • under 1 ms/image, top-1 accuracy 75.9%
      • eff-lite0 above needs about 10 ms at roughly 74% top-1, so a gain of 2.3 points
      • accuracy similar to Mobile-Former while being 38x faster

  2. Key points

    • previous methods
      • mostly focus on optimizing FLOPs
      • and often introduce new architecture designs & custom layers, e.g. hard-swish, which are usually not natively supported on mobile devices
    • existing metrics & latency
      • FLOPs does not account for memory cost and degree of parallelism
      • sharing parameters leads to higher FLOPS but smaller model size
      • skip-connections / branching incur memory costs
    • architecture optimization
      • identify the factors that actually limit on-device latency
    • training optimization
      • training a small model directly gives poor accuracy, so the common trick is decoupling the train-time & test-time architectures
      • regularization is additionally relaxed during training
    • the proposed MobileOne
      • uses basic operators
      • introduces linear branches which get re-parameterized at inference time: the difference from prior work is the over-parameterization branches (RepVGG folds a regular res-block into a single linear op, while here each block has k branches that are merged together, hence "over"?)
      • the inference-time model has only a feed-forward structure
      • generalizes well to other tasks: outperforming on classification, detection, and segmentation
  3. Methods

    • Metric Correlations

      • parameter count and FLOPs

      • many models have more parameters yet lower latency; any point whose y-coordinate is lower than a point to its left is such a case, e.g. 3 (efficientnet-b0) vs. 2 (shufflenet-v2)

      • at similar FLOPs and parameter counts, convolutional models usually have lower latency than their transformer counterparts; the transformer models all sit toward the lower right

      • CPU correlation

        • on mobile devices, latency is moderately correlated with FLOPs and largely uncorrelated with parameter count
        • desktop CPU latency is even less correlated
        • the architecture matters more: SE blocks and skip connections both have a sizable impact
    • Key Bottlenecks

      • Activation Functions

        • the same network architecture is benchmarked with different activation functions

        • the fancier the activation, the higher the latency, attributed to synchronization cost; a few activations have special hardware-accelerated implementations

        • for generality, this paper chooses ReLU
      • Architectural Blocks

        • the root factors are memory access cost & degree of parallelism
        • more branches mean higher memory access cost, because more intermediate activations must be kept and exchanged
        • ops like global pooling that force synchronization also hurt the overall run-time
        • (this is shown in the screenshot above)
    • MobileOne Architecture

      • MobileOne Block

        • the training-time and test-time architectures differ
        • training time
          • the basic block is still MobileNetV1-style depthwise + pointwise
          • each 3x3 dconv and 1x1 pconv becomes a multi-branch block
          • each block has k re-parameterizable branches, where k is a hyper-parameter between 1 and 5
          • in addition there are two regular branches: a 1x1 dconv and a BN branch
        • inference time

          • only a single data stream
          • the merging idea is the same as RepVGG: all linear computations can be folded together

      • Model Scaling

        • five model sizes are provided; the main variation is in channel width, the depth is unchanged

    • Training

      • small-model training needs less regularization, but weight decay is still important early in training
      • cosine schedule for both LR & weight decay
      • also uses the progressive learning from EfficientNetV2: start with an easier task and gradually increase the difficulty (resolution & data augmentation) and the regularization
      • plus EMA: an implicit model ensemble

verseg

发表于 (Published) 2020-04-15
  1. challenge

    Large Scale Vertebrae Segmentation Challenge

    • task 1: Vertebra Labelling, keypoint detection
    • task 2: Vertebra Segmentation, multi-class segmentation
  2. data

    1. variation: the affine axes, array sizes, scan ranges, and FOV regions of the data are all inconsistent
    2. the two main nii parsers: nibabel loads data with the xyz order matching the axcodes, e.g. an ['R','A','S'] orientation yields an xyz array, while sitk reads the opposite way, so a sitk array is zyx. When we previously wrote dicom into nii we specified an affine different from np.eye(4) precisely to transpose these three axes.
  3. model

    1. team paper

      • three stages: stage 1, due to the large FOV variation of the dataset, a coarse segmentation localizes the spine; stage 2, at higher resolution, multi-class keypoint detection locates the center of each vertebra; stage 3, binary segmentation for each located vertebra.

      • keywords: 1. uniform voxel spacing: do not resize arbitrarily (todo: trilinear interp); 2. on-the-fly data augmentation using SimpleITK

      • Stage 1: Spine Localization

        • Unet
        • regress the Gaussian heatmap of spinal centerline
        • L2-loss
        • uniform voxel spacing of 8mm
        • input shape: [64, 64, 128], pad?

      • Stage 2: Vertebrae Localization

        • SpatialConfiguration-Net
        • regress each located vertebra‘s heatmap in individual channel
        • resampling:bi/tricubic interpolation
        • norm: min-max over the whole dataset
        • uniform voxel spacing of 2mm
        • input shape: [96, 96, 128]; random crop along the z-axis; the xy-plane uses the ROI from stage 1

      • Stage 3: Vertebrae Segmentation

        • Unet
        • binary segmentation of the mask of each vertebra
        • sigmoid ce-loss
        • uniform voxel spacing of 1mm
        • input shape: [128, 128, 96]; crop the original image & heatmap image based on the centroids

    2. reference paper

      • core contributions: 1. MIP: combines the information across reformations (3D to 2D); 2. a discriminator-based training scheme: encodes local spine structure as an anatomical prior, reinforcing the spatial information about inter-vertebra classes & positions

      • MIP:
        • localisation and identification rely on a large context
        • large receptive field
        • in full-body scans where spine is not spatially centred or is obstructed by the ribcage, such cases are handled with a pre-processing stage detecting the occluded spine
      • adversarial learning:

        • an FCN for segmentation
        • an AE to evaluate segmentation quality
        • do not ‘pre-train’ it (the AE)
        • loss:an anatomically-inspired supervision instead of the usual binary adversarial supervision (vanilla GAN)
      • First the FCN: the Btrfly Network

        • modelled as regression: each keypoint corresponds to one channel's Gaussian heatmap, and the background channel is $1-\max_i (y_i)$

        • two inputs and two outputs (sagittal & coronal)

        • the feature maps of the two views are fused in the deeper layers to learn their inter-dependency

        • batch normalisation is used after every convolution layer, along with 20% dropout in the fused layers of Btrfly

        • loss: l2 distance + weighted CE

          $\omega$ is the median frequency weighting map, boosting the learning of less frequent classes (ECB)

      • Then the discriminator: an energy-based adversary for encoding the prior

        • fully-convolutional:its predictions across voxels are independent of each other owing to the spatial invariance of convolutions

        • to impose the anatomical prior of the spine’s shape onto the Btrfly net

        • look at $\hat{Y}_{sag}$ and $\hat{Y}_{cor}$ as a 3D volume and employ a 3D AE with a receptive field covering a part of the spine

        • $\hat{Y}_{sag}$ consists of Gaussians:less informative than an image, avoid using max-pooling by resorting to average pooling

        • employ spatially dilated convolution kernels

        • mission of AE:predict the l2 distance of input and its reconstruction, it learns to discriminate by predicting a low E for real annotations, while G learns to generate annotations that would trick D

      • inference:

        • The values below a threshold (T) are ignored in order to remove noisy predictions
        • take the outer product: $\hat{Y}=\hat{Y}_{sag}\otimes\hat{Y}_{cor}$
        • the maximum of each channel gives the centroid
      • experiments

        • [IMPORTANT] 10 MIPs are obtained from one 3D scan per view, each time randomly choosing half the slices of interest
        • i.e. for each view, half of the slices are randomly sampled each time to compute the MIP
  • similar local appearance:

  • strong spatial configuration: whenever vertebra-wise information is involved, start from global information

GoogLeNet Series

发表于 (Published) 2020-04-13

Overview

  1. papers

    • [V1] Going Deeper with Convolutions, 6.67% test error

    • [V2] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 4.8% test error

    • [V3] Rethinking the Inception Architecture for Computer Vision, 3.5% test error

    • [V4] Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, 3.08% test error

    • [Xception] Xception: Deep Learning with Depthwise Separable Convolutions

    • [EfficientNet] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

    • [EfficientDet] EfficientDet: Scalable and Efficient Object Detection

    • [EfficientNetV2] EfficientNetV2: Smaller Models and Faster Training

  1. Overall ideas

    • Inception V1: breaks with the traditional conv block by designing the Inception block, concatenating the outputs of 1x1, 3x3, and 5x5 convolutions to increase network width
    • Inception V2: adds BN layers to reduce internal covariate shift, and replaces 5x5 with two 3x3 convolutions to reduce the parameter count
    • Inception V3: proposes factorization, splitting 7x7 into 7x1 and 1x7, reducing parameters and speeding up computation while adding depth and non-linearity
    • Inception V4: combines Inception with residual connections
    • Xception: improves on Inception V3's factorized structure by using separable convolutions
    • EfficientNet: studies model scaling, scaling CNNs effectively along network depth, width, and image resolution
    • EfficientDet: extends EfficientNet from classification to object detection

review

  1. review0122: fusing conv and BN layers

    • reference:https://nenadmarkus.com/p/fusing-batchnorm-and-conv/

    • a frozen BN can be viewed as a 1x1 convolution

    • two linear operations can be merged (see the sketch below)

    • given $W_{conv} \in \mathbb{R}^{C \times C_{prev} \times k \times k}$, $b_{conv} \in \mathbb{R}^C$, $W_{bn} \in \mathbb{R}^{C \times C}$, $b_{bn} \in \mathbb{R}^C$
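
    A minimal NumPy sketch of the merge, treating the frozen BN as a per-channel affine map (shapes as in the definitions above; gamma, beta, mean, var are the BN weights and running statistics):

      import numpy as np

      def fuse_conv_bn(w_conv, b_conv, gamma, beta, mean, var, eps=1e-5):
          # w_conv: (C, C_prev, k, k), b_conv: (C,)
          # BN: y = gamma * (x - mean) / sqrt(var + eps) + beta
          scale = gamma / np.sqrt(var + eps)               # (C,)
          w_fused = w_conv * scale[:, None, None, None]    # scale each output channel's kernel
          b_fused = (b_conv - mean) * scale + beta         # fold the shift into the bias
          return w_fused, b_fused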

V1: Going deeper with convolutions

  1. Motivation

    • improved utilization of the computing resources
    • increasing the depth and width of the network while keeping the computational budget
  2. Key points

    • the recent trend has been to increase the number of layers and layer size, while using dropout to address the problem of overfitting
    • major bottleneck:large network,large number of params,limited dataset,overfitting
    • methods use filters of different sizes in order to handle multiple scales
    • NiN use 1x1 convolutional layers to easily integrate in the current CNN pipelines
    • we use 1x1 convs with a dual purpose of dimension reduction
  3. Methods

    • Architectural

      • 1x1 conv+ReLU for compute reductions
      • an alternative parallel pooling path since pooling operations have been essential for the success

      • overall architecture :

      • details:

        • rectified linear activation

        • mean subtraction

        • a move from fully connected layers to average pooling improves acc

        • the use of dropout remained essential

        • adding auxiliary classifiers (on 4a & 4d) with a discount weight

          • 5x5 avg pool, stride 3
          • 1x1 conv+relu, 128 filters
          • 1024 fc+relu
          • 70% dropout
          • 1000 fc+softmax

        • asynchronous stochastic gradient descent with 0.9 momentum

        • fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs)

        • photometric distortions useful to combat overfitting

        • random interpolation methods for resizing

V2: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

  1. Motivation

    • use much higher learning rates
    • be less careful about initialization
    • also acts as a regularizer, eliminating the need for Dropout
  2. Key points

    • SGD:optimizes the parameters $\theta$ of the network, so as to minimize the loss

      gradient update: use $\frac{1}{N} \sum_N \frac{\partial loss(\theta)}{\partial \theta}$, where $x_i$ ranges over the full training set

      batch approximation: use $\frac{1}{m} \sum_M \frac{\partial loss(\theta)}{\partial \theta}$, where $x_i$ ranges over the mini-batch

      • quality improves as the batch size increases
      • computation over a batch is much more efficient than m computations for individual examples
      • the learning rate and the initial values require careful tuning
    • Internal covariate shift

      • the input distribution of the layers changes
      • consider the gradient descent step above: when the distribution of $x$ changes, $\theta$ has to readjust to compensate for the change in the distribution of x
    • activation

      • for a neuron $z = sigmoid(Wx+b)$, parameter changes in the earlier layers can easily push its pre-activation out of the effective range, so the gradient vanishes after the activation and convergence slows down
      • In practice, using ReLU & careful initialization & small learning rates
      • if we can keep the distribution of the nonlinearity inputs more stable as the network trains, the neurons will not saturate
    • whitening

      • a preprocessing of the training set: linearly transformed to have zero means and unit variances, and decorrelated
      • keeps the input distribution stable (a normal distribution)
      • and removes correlations between the data dimensions
    • batch normalization

      • fixes the means and variances of layer inputs
      • reducing the dependence of gradients on the scale of the parameters or of their initial values
      • makes it possible to use saturating nonlinearities
  1. Methods

    • batch normalization

      • full whitening of each layer is costly
      • so we normalize each layer independently, full set—> mini-batch
      • a standard normal distribution is not what every neuron needs (e.g. the identity transform): introduce, for each activation $x^{(k)}$, a pair of parameters $\gamma^{(k)}$, $\beta^{(k)}$, which scale and shift the normalized value to maintain the representational ability of the neuron

    • bp:

    • at inference time:

      • the two learnable parameters $\gamma$ and $\beta$ are fixed
      • and the mean and variance are no longer computed from the input batch, but loaded from the statistics maintained during training (moving averages)

    • for convolutional networks

      • we add the BN transform immediately before the nonlinearity, $z = g(Wx+b)$ to $z = g(BN(Wx))$

      • since we normalize $Wx+b$, the bias b can be ignored

      • obey the convolutional property——different elements of the same feature map, at different locations, are normalized in the same way
      • We learn a pair of parameters $\gamma(k)$ and $\beta(k)$ per feature map, rather than per activation
    • properties

      • back-propagation through a layer is unaffected by the scale of its parameters
      • Moreover, larger weights lead to smaller gradients, thus stabilizing the parameter growth
      • regularizes the model: the examples within a mini-batch influence each other rather than being independent
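
    For reference, the BN transform and its inference-time form:

      $$\hat{x}^{(k)} = \frac{x^{(k)} - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$

      $$y = \frac{\gamma}{\sqrt{Var[x] + \epsilon}} \cdot x + \left(\beta - \frac{\gamma \, E[x]}{\sqrt{Var[x] + \epsilon}}\right) \quad \text{(inference, using the moving averages)}$$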

V3: Rethinking the Inception Architecture for Computer Vision

  1. Motivation

    • go deeper and wider:
      • enough labeled data
      • computational efficiency
      • parameter count
    • to scale up networks
      • utilizing the added computation as efficiently
      • give general design principles and optimization ideas
        • factorized convolutions
        • aggressive regularization
  2. Key points

    • GoogleNet does not provide a clear description about the contributing factors that lead to the various design
  3. Methods

    • General Design Principles

      • Avoid representational bottlenecks: the feature-map size should decrease gently, a drop in resolution should be accompanied by an increase in channels, and max pooling alone should be avoided for downsampling because it loses too much information
      • Higher dimensional representations are easier to process locally within a network. Increasing the activations per tile in a convolutional network allows for more disentangled features. The resulting networks will train faster. (I follow the first half: high-resolution feature maps focus on local information. The second half is unclear to me: per the previous paper, with batch norm, scaling up the neurons does not affect back-prop and leads to smaller gradients, so why would it train faster?)
      • Spatial aggregation can be done over lower dimensional embeddings: adjacent units are strongly correlated, so the dimension of the input representation can be reduced before the spatial aggregation without much information loss, and this promotes faster learning
      • The computational budget should therefore be distributed in a balanced way between the depth and width of the network.
    • Factorizing Convolutions Filter Size

      • into smaller convolutions
        • any large filter can be decomposed into several 3x3 convolutions
        • a purely equivalent linear decomposition would need no non-linear activation in between, but since batch norm is used (increasing variety), fitting is observed to be better with ReLU
      • into Asymmetric Convolutions
        • an nxn filter is decomposed into 1xn and nx1
        • this factorization does not work well on early layers, but gives very good results on medium grid sizes (between 12 and 20, using 1x7 and 7x1)
    • Utility of Auxiliary Classifiers

      • did not result in improved convergence early in the training: not useful at the start, only a small accuracy gain near convergence
      • removal of the lower auxiliary branch did not have any adverse effect on the final quality: taking it away does not change the final result
      • so the original conjecture (that they help evolve the low-level features) is wrong; they merely act as regularizers, and adding batch norm inside the auxiliary head makes the final result better
    • Efficient Grid Size Reduction: the downsampling module no longer relies on max pooling alone

      • dxdxk feature map expand to (d/2)x(d/2)x2k:

        • 1x1x2k conv,stride2 pool:kxdxdx2k computation
          • 1x1x2k stride-2 conv: k x (d/2) x (d/2) x 2k computation; less computation, but violates principle 1
          • parallel stride-2 P(ooling) and C(onv) blocks: k x (d/2) x (d/2) x k computation; satisfies principle 1: reduces the grid size while expanding the filter banks
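
        A minimal tf.keras sketch of the parallel reduction block (a stride-2 conv branch and a stride-2 pooling branch, concatenated so the grid shrinks while the filter bank expands; channel counts are illustrative):

        import tensorflow as tf
        from tensorflow.keras import layers

        def reduction_block(x, k):
            conv_branch = layers.Conv2D(k, 3, strides=2, padding="same", activation="relu")(x)
            pool_branch = layers.MaxPooling2D(3, strides=2, padding="same")(x)
            return layers.Concatenate()([conv_branch, pool_branch])   # (d/2) x (d/2) x (k + C_in)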

    • Inception-v3

      • the initial 7x7 conv has been replaced with several 3x3 convs
      • once the intermediate feature maps reach 17x17, asymmetric factorization blocks are used
      • at 8x8, the filter bank outputs are expanded

    • Label Smoothing (https://zhuanlan.zhihu.com/p/116466239)

      • used the uniform distribution $u(k)=1/K$

      • for the softmax $p(k)=\frac{exp(y_k)}{\sum exp(y_i)}$, training with the plain loss drives the true-class probability toward 1 and all the others toward 0

      • cross-entropy loss: $ce=\sum -y_{gt}\log(y_k)$; with label smoothing, the loss gains a regularization term on the negative classes, the optimal positive/negative outputs are bounded to finite values, and by suppressing the gap between positive and negative outputs the network generalizes better.
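
      For reference, the smoothed target and the resulting loss (standard label smoothing, with $\epsilon$ the smoothing weight, $K$ the number of classes, and $y$ the ground-truth class):

      $$q'(k) = (1-\epsilon)\,\delta_{k,y} + \epsilon\, u(k), \qquad \ell = -\sum_{k=1}^{K} q'(k)\,\log p(k) = (1-\epsilon)\,CE(\delta_y, p) + \epsilon\, CE(u, p)$$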

V4: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

  1. Motivation

    • residual: whether there is any benefit in combining the Inception architecture with residual connections
    • inceptionV4:simplify the inception blocks
  2. Key points

    • residual connections seem to improve training speed greatly, but deep networks can also be trained without them
    • made uniform choices for the Inception blocks for each grid size
      • Inception-A for 35x35
      • Inception-B for 17x17
      • Inception-C for 8x8
    • for residual versions
      • use cheaper Inception blocks for the residual versions: the modules are simplified because the identity path already carries rich feature information
      • no pooling is used
      • replace the filter concatenation stage of the Inception architecture with residual connections: the concatenation body of the original block is moved into the residual path
      • Each Inception block is followed by a filter-expansion layer (1 × 1 convolution without activation) to match the depth of the input for the addition: the channel counts must agree before adding
      • used batch-normalization only on top of the traditional layers, but not on top of the summations: BN on the summations would waste memory
      • when the number of filters exceeds 1000, training becomes unstable
      • scale down the residuals by factors between 0.1 and 0.3 before adding: keep the residual activations from getting too large
  3. blocks

    • V4 ABC:

    • Res ABC:

Xception: Deep Learning with Depthwise Separable Convolutions

  1. Motivation

    • Inception modules have been replaced with depthwise separable convolutions
    • significantly outperforms Inception V3 on a larger dataset
    • due to more efficient use of model parameters
  2. Key points

    • early LeNet-style models

      • simple stacks of convolutions for feature extraction and max-pooling operations for spatial sub-sampling
      • increasingly deeper
    • complex blocks

      • Inception modules inspired by NiN
      • capable of learning richer representations with fewer parameters
    • The Inception hypothesis

      • a single convolution kernel is tasked with simultaneously mapping cross-channel correlations and spatial correlations
      • while Inception factors it into a series of operations that independently look at cross-channel correlations(1x1 convs) and at spatial correlations(3x3/5x5 convs)
      • suggesting that cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly

    • the Inception block first uses 1x1 convs to map the input into 3-4 lower-dimensional spaces (cross-channel correlations), then applies regular convs on these small 3D spaces (mapping all correlations); the further hypothesis is complete decoupling, where the second step handles only spatial correlations

    • main differences between “extreme ” Inception and depthwise separable convolution

      • order of the operations:1x1 first or latter
      • non-linearity: depthwise separable convolutions are usually implemented without intermediate non-linearities [QUESTION: this differs from what the MobileNet paper says, where each depthwise layer is also followed by BN and ReLU]
  3. Key elements

    • Convolutional neural networks
    • The Inception family
    • Depthwise separable convolutions
    • Residual connections
  4. Methods

    • architecture

      • a linear stack of depthwise separable convolution layers with residual connections
      • all conv are followed by BN
      • Keras's SeparableConv and DepthwiseConv: the former is the latter plus a pointwise conv, with an activation at the end but none in between

      • cmp

        • Xception and Inception V3 have nearly the same number of parameters
        • marginally better on ImageNet
        • much larger performance increasement on JFT
        • Residual connections are clearly essential in helping with convergence, both in terms of speed and final classification performance.
        • Effect of intermediate activation:the absence of any non-linearity leads to both faster convergence and better final performance

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

  1. Motivation

    • common sense:scaled up the network for better accuracy
    • we systematically study model scaling
    • and identify that carefully balancing network depth, width, and resolution can lead to better performance
    • propose a new scaling method:using compound coefficient to uniformly scale all dimensions of depth/width/resolution
      • on MobileNets and ResNet
      • a new baseline network family EfficientNets
    • much better accuracy and efficiency
  2. Key points

    • previous work scale up one of the three dimensions
      • depth:more layers
      • width:more channels
      • image resolution:higher resolution
    • arbitrary scaling requires tedious manual tuning and often yields sub-optimal accuracy and efficiency
    • uniformly scaling:Our empirical study shows that it is critical to balance all dimensions of network width/depth/resolution, and surprisingly such balance can be achieved by simply scaling each of them with constant ratio.

    • neural architecture search:becomes increasingly popular in designing efficient mobile-size ConvNets

  3. Methods

    • problem formulation

      • ConvNets: $N = \bigodot_{i=1 \dots s} F_i^{L_i}(X_{\langle H_i, W_i, C_i \rangle})$, where s indexes the stages, L the repeat count, and F the layer function

      • simplify the design problem

        • fixing $F_i$
        • all layers must be scaled uniformly with constant ratio
      • an optimization problem:d for depth coefficients, w for width coefficients, r for resolution coefficients

    • observation 1

      • Scaling up any dimension of network (width, depth, or resolution) improves accuracy, but the accuracy gain diminishes for bigger models (accuracy rises in every case but eventually saturates)
      • depth:deeper ConvNet can capture richer and more complex features
      • width: wider networks tend to be able to capture more fine-grained features and are easier to train (commonly used for small-size models); but depth and width should be matched, since merely widening a shallow network makes it hard to extract high-level features
      • resolution:higher resolution input can potentially capture more fine-grained patterns

    • observation 2

      • compound scaling:it is critical to balance all dimensions of network width, depth, and resolution

      • different scaling dimensions are not independent: a higher input resolution calls for a deeper network to obtain a larger receptive field, and also a wider network to capture more fine-grained features

      • compound coefficient $\phi$:

        • $\alpha, \beta, \gamma$ are constants determined by a small grid search, controling the assign among the 3 dimensions [d,w,r]
        • $\phi$ controls how many more resources are available for model scaling
        • the total FLOPS will approximately increase by $2^\phi$
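
        For reference, the compound scaling rule these bullets describe (as given in the EfficientNet paper):

        $$d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}, \qquad \text{s.t.}\ \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2,\ \ \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1$$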
    • efficientNet architecture

      • having a good baseline network is also critical

      • thus we developed a new mobile-size baseline called EfficientNet by leveraging a multi-objective neural architecture search that optimizes both accuracy and FLOPS

      • compound scaling:fix $\phi=1$ and grid search $\alpha, \beta, \gamma$, fix $\alpha, \beta, \gamma$ and use different $\phi$

  4. Experiments

    • on MobileNets and ResNets

      • compared to other single-dimension scaling methods
      • compound scaling method improves the accuracy on all
    • on EfficientNet

      • model with compound scaling tends to focus on more relevant regions with more object details
      • while other models are either lack of object details or unable to capture all objects in the images

  5. implementing details

    • RMSProp: decay=0.9, momentum (rho)=0.9; LARS is used on TPU
    • BN: momentum=0.99
    • weight decay = 1e-5
    • lr: initial=0.256, decays by 0.97 every 2.4 epochs
    • SiLU activation
    • AutoAugment
    • Stochastic depth: survive_prob = 0.8
    • dropout rate: 0.2 to 0.5 for B0 to B7

EfficientDet: Scalable and Efficient Object Detection

  1. Motivation

    • model efficiency
    • for object detection:based on one-stage detector
    • feature fusion: propose a weighted bi-directional feature pyramid network (BiFPN)
    • network rescaling: uniformly scales the resolution, depth, and width for all networks (backbone, feature network, prediction heads)
    • achieve better accuracy with much fewer parameters and FLOPs
    • also test on Pascal VOC 2012 semantic segmentation
  2. Key points

    • previous work tends to achieve better efficiency by sacrificing accuracy
    • previous work fuse feature at different resolutions by simply summing up without distinction
    • EfficientNet
      • backbone:combine EfficientNet backbones with our propose BiFPN
      • scale up:jointly scales up the resolution/depth/width for all backbone, feature network, box/class prediction network
    • Existing object detectors
      • two-stage:have a region-of-interest proposal step
      • one-stage:have not, use predefined anchors
  3. Methods

    • BiFPN: efficient bidirectional cross-scale connections and weighted feature fusion

      • FPN: limited to a single top-down information flow

      • PANet: adds an extra bottom-up path; better accuracy but more parameters and computation

      • NAS-FPN: a structure found by network search, irregular and difficult to interpret or modify

      • BiFPN

        • remove nodes that have only one input edge: a node with a single input performs no feature fusion
        • add an extra edge from the original input to output node if they are at the same level:fuse more features without adding much cost
        • repeat blocks

      • Weighted Feature Fusion

        • since different input features are at different resolutions, they usually contribute to the output feature unequally
        • learnable weight that can be a scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel)
        • weight normalization

          • Softmax-based:$O=\sum_i \frac{e^{w_i}}{\sum_j e^{w_j}}*I_i$
          • Fast normalized: $O=\sum_i \frac{w_i}{\epsilon + \sum_j w_j}*I_i$; ReLU is applied to each $w_i$ to keep it non-negative
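
          A minimal TensorFlow sketch of the fast normalized fusion above (weights is one learnable scalar per input):

          import tensorflow as tf

          def fast_normalized_fusion(inputs, weights, eps=1e-4):
              # inputs: list of feature maps already resized to a common shape
              w = tf.nn.relu(weights)                # keep the weights non-negative
              w = w / (tf.reduce_sum(w) + eps)       # normalize by the sum
              return tf.add_n([w[i] * x for i, x in enumerate(inputs)])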
      • EfficientDet

        • ImageNet-pretrained EfficientNets as the backbone
        • BiFPN serves as the feature network
        • the fused features(level 3-7) are fed to a class and box network respectively

        • compound scaling

          • backbone:reuse the same width/depth scaling coefficients of EfficientNet-B0 to B6

          • feature network:

            • depth(layers):$D=3+\phi$
            • width(channes):$W=64 \cdot (1.35^{\phi}) $
          • box/class prediction network:

            • depth:$D=3+[\phi/3]$
            • width:same as FPN
          • resolution

            • use feature 3-7:must be dividable by $2^7$
            • $R=512+128*\phi$
          • EfficientDet-D0 ($\phi=0$) to D7 ($\phi=7$)

  4. Experiments

    • for object detection
      • train
        • Learning rate is linearly increased from 0 to 0.16 in the first training epoch and then annealed down
        • employ commonly-used focal loss
        • 3x3 anchors
      • compare
        • low-accuracy regime: EfficientDet-D0 is roughly on par with YOLOv3
        • mid-accuracy regime: EfficientDet-D1 is roughly on par with Mask R-CNN
        • EfficientDet-D7 achieves a new state-of-the-art
    • for semantic segmentation
      • modify
        • keep feature level {P2,P3,…,P7} in BiFPN
        • but only use P2 for the final per-pixel classification
        • set the channel size to 128 for BiFPN and 256 for classification head
        • Both BiFPN and classification head are repeated by 3 times
      • compare
        • compared against DeepLabv3 on the Pascal VOC 2012 semantic segmentation task
        • better accuracy and fewer FLOPs
    • ablation study
      • backbone improves accuracy v.s. resnet50
      • BiFPN improves accuracy v.s. FPN
      • BiFPN achieves similar accuracy as repeated FPN+PANet
      • BiFPN + weghting achieves the best accuracy
      • Normalization: the softmax and fast versions perform about the same; each node's weights change rapidly at the start of training (suggesting different features contribute to the feature fusion unequally)
      • Compound scaling: unsurprisingly better than scaling only a single dimension
  5. Hyper-parameters:

    EfficientNet and EfficientDet use different resolutions: detection adds a neck and head, making the network deeper, so the resolution is larger

EfficientNetV2: Smaller Models and Faster Training

  1. Motivation

    • faster training speed and better parameter efficiency
    • use a new op: Fused-MBConv
    • propose progressive learning: adaptively adjusts regularization & image size
  2. Methods

    • review of EfficientNet

      • large image size

        • large memory usage,small batch size,long training time
        • thus propose increasing image size gradually in V2
      • extensive depthwise conv

        • often cannot fully utilize modern accelerators
        • thus introduce Fused-MBConv in V2:When applied in early stage 1-3, Fused-MBConv can improve training speed with a small overhead on parameters and FLOPs

      • equally scaling up

        • proved sub-optimal in nfnets
        • since the stages are not equally contributed to the efficiency & accuracy
        • thus in V2
          • use a non-uniform scaling strategy:gradually add more layers to later stages(s5 & s6)
          • restrict the max image size
    • EfficientNet V2 Architecture

      • basic ConvBlock

        • use fused-MBConv in the early layers
        • use MBConv in the latter layers
      • expansion ratios

        • use smaller expansion ratios
        • because at the same channel count, Fused-MBConv has more parameters than MBConv
      • kernel size

        • 3x3 everywhere, no more 5x5
        • add more layers to compensate the reduced receptive field
      • last stride 1 stage

        • EfficientNetV1 has 7 stages
        • EfficientNetV2 has 6 stages: the last stride-1 stage is removed

      • scaling policy

        • compound scaling: R, W, and D are scaled together
        • but the maximum inference image size is capped at 480 (training size 384)
        • gradually add more layers to later stages (s5 & s6)
    • progressive learning

      • large models require stronger regularization

      • larger image sizes mean more computation and effectively larger capacity, so they also need stronger regularization

      • training process

        • in the early training epochs, we train the network with smaller images and weak regularization
        • gradually increase image size but also making learning more difficult by adding stronger regularization

      • adaptive params

        • image size
        • dropout rate
        • randAug magnitude
        • mixup alpha
        • given a minimum and a maximum value and N training stages, the per-stage values are obtained by linear interpolation (see the sketch below)
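      A minimal sketch of this per-stage schedule (the min/max values are illustrative placeholders, not the paper's exact settings):

      def stage_value(v_min, v_max, stage, num_stages):
          # linear interpolation between the given min and max over the training stages
          return v_min + (v_max - v_min) * stage / (num_stages - 1)

      num_stages = 4
      for stage in range(num_stages):
          print(stage, {
              'image_size': int(stage_value(128, 384, stage, num_stages)),
              'dropout': stage_value(0.1, 0.3, stage, num_stages),
              'randaug_magnitude': stage_value(5, 15, stage, num_stages),
              'mixup_alpha': stage_value(0.0, 0.2, stage, num_stages),
          })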

    • train&test details

      • RMSProp optimizer with decay 0.9 and momentum 0.9
      • batch norm momentum 0.99
      • weight decay 1e-5
      • trained for 350 epochs with total batch size 4096
      • Learning rate is first warmed up from 0 to 0.256, and then decayed by 0.97 every 2.4 epochs
      • exponential moving average with 0.9999 decay rate
      • stochastic depth with 0.8 survival probability

      • 4 stages (87 epochs per stage):early stage with weak regularization & later stronger

      • maximum image size for training is about 20% smaller than inference & no further finetuning

FCN

发表于 2020-03-28 |

FCN: Fully Convolutional Networks for Semantic Segmentation

  1. 动机

    • take input of arbitrary size
    • pixelwise prediction (semantic segmentation)
    • efficient inference and learning
    • end-to-end
    • with supervised pretraining
  2. 论点

    • fully connected layers bring heavy computation
    • patchwise/proposal training is less efficient (to classify one pixel, a patch around it must be cropped, so storage grows by a factor of k*k; overlapping neighboring patches cause a lot of redundant computation, and the small receptive field cannot exploit global information)
    • fully convolutional structures are used to build a feature extractor that yields a localized, fixed-length feature
    • Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. Deep feature hierarchies jointly encode location and semantics in a local-to-global pyramid.
    • other semantic works (RCNN) are not end-to-end
  3. 要素

    • replace the fully connected layers with 1*1 convolutions to extract features and form a heatmap
    • a deconvolution upsamples the small heatmap back to a segmentation map at the original resolution

    • a novel “skip” architecture to combine deep, coarse, semantic information and shallow, fine, appearance information
  4. 方法

    • fully convolutional network

      • receptive fields: Locations in higher layers correspond to the locations in the image they are path-connected to
      • typical recognition nets:
        • fixed-input
        • patches
        • the fully connected layers can be viewed as convolutions with kernels that cover their entire input regions
      • our structure:
        • arbitrary-input
        • the computation is saved by computing the overlapping regions of those patches only once
        • output size corresponds to the input(H/16, W/16)
          • heatmap: the (H/16 * W/16) high-dimensional feature map corresponds to the 1000 classes

    • coarse predictions to dense

      • OverFeat introduced
      • each element of the high-level feature map corresponds to a receptive-field region of the input image; its value is written to the center position of that receptive field
      • shifting the input image shifts the corresponding receptive fields, the high-level feature map output changes, and new center positions are filled in
      • shifting over a stride*stride range yields an output at the original image resolution
    • upsampling

      • simplest: bilinear interpolation
      • in-network upsampling: backwards convolution (deconvolution) with an output stride of f
      • A stack of deconvolution layers and activation functions can even learn a nonlinear upsampling
      • factor: in FCN the input and output sizes are linearly related; the factor is the product of the cumulative sampling strides of all conv/pooling layers
      • kernelsize:$2 * factor - factor \% 2$
      • stride:$factor$
      • padding:$ceil((factor - 1) / 2.)$

      The calculation here is a bit convoluted. $stride=factor$ is the easy part: it is the scale by which the feature map must be enlarged to recover the original size. The transposed convolution then inserts $s-1$ zeros between adjacent input elements, so an input of size $n$ grows to $(s-1)(n-1)+n = s \cdot n - (s-1)$; to reach $output\_size = s \cdot input\_size$, a padding of at least $\lceil (s-1)/2 \rceil$ is applied. Substituting into the transposed-convolution size relation $output = (input - 1) \cdot stride - 2 \cdot padding + kernel\_size$

      we have $kernel\_size = 2 \cdot factor - factor \,\%\, 2$, matching the formula above.

      In Keras this can be implemented with the library layer Conv2DTranspose:

      from tensorflow.keras.layers import Input, Conv2DTranspose
      from tensorflow.keras.models import Model

      x = Input(shape=(64,64,16))
      y = Conv2DTranspose(filters=16, kernel_size=20, strides=8, padding='same')(x)
      model = Model(x, y)
      model.summary()
      # input: (None, 64, 64, 16) output: (None, 512, 512, 16) params: 102,416

      x = Input(shape=(32,32,16))
      y = Conv2DTranspose(filters=16, kernel_size=48, strides=16, padding='same')(x)
      # input: (None, 32, 32, 16) output: (None, 512, 512, 16) params: 589,840

      x = Input(shape=(16,16,16))
      y = Conv2DTranspose(filters=16, kernel_size=80, strides=32, padding='same')(x)
      # input: (None, 16, 16, 16) output: (None, 512, 512, 16) params: 1,638,416

      # for reference: the original U-Net has 36,605,042 parameters in total
      # parameter counts of its transposed convs at each level:
      # (None, 16, 16, 512) 4,194,816
      # (None, 32, 32, 512) 4,194,816
      # (None, 64, 64, 256) 1,048,832
      # (None, 128, 128, 128) 262,272
      # (None, 256, 256, 32) 16,416

      Note that increasing kernel_size drastically increases the parameter count. (If kernel_size is set too small, each output only picks up a single input element; kernel_size should be at least larger than stride.)
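      A common companion trick (standard FCN practice rather than code from this post) is to initialize the transposed-convolution kernel with bilinear-interpolation weights, so the layer starts out as plain bilinear upsampling and is then fine-tuned. A minimal sketch:

      import numpy as np

      def bilinear_kernel(factor, channels):
          # kernel_size = 2 * factor - factor % 2, as in the formula above
          size = 2 * factor - factor % 2
          center = factor - 1 if size % 2 == 1 else factor - 0.5
          og = np.ogrid[:size, :size]
          filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
          # one bilinear filter per channel, no cross-channel mixing
          weights = np.zeros((size, size, channels, channels), dtype=np.float32)
          for c in range(channels):
              weights[:, :, c, c] = filt
          return weights

      # e.g. layer.set_weights([bilinear_kernel(8, 16), np.zeros(16)]) for a
      # Conv2DTranspose(16, kernel_size=16, strides=8, padding='same') layer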

    • Segmentation Architecture

      • use pre-trained model
      • convert all fully connected layers to convolutions
      • append a 1*1 conv whose channel dimension equals the number of classes (including background) to predict scores
      • followed by a deconvolution layer to upsample the coarse outputs to dense outputs
    • skips

      • the 32 pixel stride at the final prediction layer limits the scale of detail in the upsampled output
      • upsample stage by stage, fusing the feature maps of earlier layers by element-wise addition
    • finer layers: "As they see fewer pixels, the finer scale predictions should need fewer layers." This refers to the convolutional network before them: as the network deepens, the receptive field on the feature map grows, so more channels are needed to record more combinations of low-level features

      • add a 1*1 conv on top of pool4 (zero-initialized)
      • adding a 2x upsampling layer on top of conv7 (We initialize this 2xupsampling to bilinear interpolation, but allow the parameters to be learned)
      • sum the above two stride16 predictions (“Max fusion made learning difficult due to gradient switching”)
      • 16x upsampled back to the image
      • going further down than the third row (fusing even shallower layers) makes the results worse again, so the fusion stops here (a sketch of this skip fusion follows)
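      A minimal sketch of this FCN-16s-style skip fusion (the strided convolutions below are only stand-ins for the VGG features pool4 and conv7; shapes and channel counts are illustrative):

      from tensorflow.keras import layers, Model, Input

      n_classes = 21
      inp = Input(shape=(512, 512, 3))
      # stand-ins for backbone features at stride 16 and stride 32
      pool4 = layers.Conv2D(512, 3, strides=16, padding='same')(inp)    # (32, 32, 512)
      conv7 = layers.Conv2D(4096, 3, strides=32, padding='same')(inp)   # (16, 16, 4096)

      score_pool4 = layers.Conv2D(n_classes, 1, kernel_initializer='zeros')(pool4)    # zero-initialized 1x1 conv
      score_conv7 = layers.Conv2D(n_classes, 1)(conv7)
      up_conv7 = layers.Conv2DTranspose(n_classes, 4, strides=2, padding='same')(score_conv7)   # learned 2x upsampling

      fused = layers.Add()([score_pool4, up_conv7])                                    # sum the two stride-16 predictions
      out = layers.Conv2DTranspose(n_classes, 32, strides=16, padding='same')(fused)   # 16x back to image size
      model = Model(inp, out)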

  5. 总结

    • During upsampling, increasing the resolution in stages works better than doing it in one step

    • At each upsampling stage, features from the corresponding downsampling layer are used to assist

    • Although 8x upsampling is much better than 32x, the result is still blurry and over-smoothed and insensitive to fine details, so many researchers further refine the FCN output with MRF or CRF post-processing

    • Why x8 beats x32: 1. the x32 feature map has an overly large receptive field and is insensitive to small objects; 2. the x32 upsampling ratio introduces more distortion

    • Differences from U-Net:

      • U-Net does not use ImageNet-pretrained weights, since it targets medical images

      • U-Net fuses shallow features by concatenation rather than element-wise addition

      • stage-by-stage upsampling, x2 vs. x8/x32

      • the original U-Net uses no padding, so the output is smaller than the input, while FCN pads and then crops

      • data augmentation: FCN does not use this 'machinery', whereas medical images need strong augmentation

      • weighted loss


cs231n-RNN-review

发表于 2020-03-15 |

attention系列

发表于 2020-03-13 |

0. 综述

  1. Attention approaches fall into two broad categories, plus a local / non-local distinction (Reference)

    • Learning a weight distribution
      • weighting part of the input (hard attention) vs. weighting all of it (soft attention)
      • weighting on the original image vs. on feature maps
      • weighting over the spatial dimension / the channel dimension / the temporal dimension / mixed domains
      • the CAM family and SE-block family: assorted weighting schemes that learn weights; non-local-style modules acting on a particular dimension
    • Task decomposition
      • design different network structures (or branches), each focusing on a different sub-task
      • redistribute the network's learning capacity, lowering the difficulty of the original task and making the network easier to train
      • STN, deformable conv: add explicit modules responsible for learning deformations / receptive-field changes; local modules, applied per pixel
    • local / non-local
      • the output of a local module is pixel-specific
      • the output of a non-local module is computed jointly over the whole input
  2. Weight-based attention (Reference)

    • the attention mechanism is usually implemented by an extra network attached after the original network
    • the whole model remains end-to-end, so the attention module can be trained together with the original model
    • with soft attention the module is differentiable with respect to its input, so the whole model can still be optimized with gradient methods
    • hard attention instead makes a discrete selection over part of its input, so the system is no longer differentiable with respect to the input
  3. papers

    • [STN] Spatial Transformer Networks

    • [deformable conv] Deformable Convolutional Networks

    • [CBAM] CBAM: Convolutional Block Attention Module

    • [SE-Net] Squeeze-and-Excitation Networks

    • [SE-block variants] SC-SE (for segmentation), CMPE-SE (complex and not very useful)

    • [SK-Net] Selective Kernel Networks: an attention module, but the main improvement is in the receptive field; a grab-bag of tricks

    • [GC-Net] GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond

CBAM: Convolutional Block Attention Module

  1. 动机
    • attention module
    • lightweight and general
    • improvements in classification and detection
  2. 论点

    • deeper: can obtain richer representation
    • increased width:can outperform an extremely deep network
    • cardinality:results in stronger representation power than depth and width
    • attention:improves the representation of interests
      • humans exploit a sequence of partial glimpses and selectively focus on salient parts
      • Residual Attention Network:computes 3d attention map
      • we decompose the process that learns channel attention and spatial attention separately
      • SE-block:use global average-pooled features
      • we suggest to use max-pooled features as well
  3. 方法

    • sequentially infers a 1D channel attention map and a 2D spatial attention map
    • broadcast and element-wise multiplication

    • Channel attention module

      • focuses on ‘what’ is meaningful
      • squeeze the spatial dimension
      • use both average-pooled and max-pooled features simultaneously
      • both descriptors are then forwarded to a shared MLP to reduce dimension
      • [QUESTION] is the MLP linear? The figure does not show an activation, though the paper's text mentions a ReLU after the hidden layer
      • then use element-wise summation
      • sigmoid function
    • Spatial attention module
      • focuses on ‘where’
      • apply average-pooling and max-pooling along the channel axis and concatenate
      • 7x7 conv
      • sigmoid function
    • Arrangement of attention modules

      • in a parallel or sequential manner
      • we found sequential better than parallel
      • we found channel-first order slightly better than the spatial-first

    • integration

      • apply CBAM on the convolution outputs in each block
      • in residual path
      • before the add operation
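      Putting the two modules together, a minimal sketch of a CBAM block (channel attention followed by spatial attention); the reduction ratio of the shared MLP and its ReLU follow common implementations and are assumptions of this sketch:

      import tensorflow as tf
      from tensorflow.keras import layers

      def cbam_block(x, reduction=16):
          ch = x.shape[-1]
          # channel attention: squeeze the spatial dims with avg- and max-pooling,
          # pass both through a shared MLP, sum, then sigmoid
          mlp1 = layers.Dense(ch // reduction, activation='relu')
          mlp2 = layers.Dense(ch)
          avg = mlp2(mlp1(layers.GlobalAveragePooling2D()(x)))
          mx = mlp2(mlp1(layers.GlobalMaxPooling2D()(x)))
          ca = layers.Activation('sigmoid')(layers.Add()([avg, mx]))
          x = layers.Multiply()([x, layers.Reshape((1, 1, ch))(ca)])
          # spatial attention: avg- and max-pool along the channel axis, concat, 7x7 conv
          avg_sp = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(x)
          max_sp = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(x)
          sa = layers.Conv2D(1, 7, padding='same', activation='sigmoid')(
              layers.Concatenate()([avg_sp, max_sp]))
          return layers.Multiply()([x, sa])

      # applied on the residual path of each block, right before the addition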

  4. 实验

    • Ablation studies

      • Channel attention: both pooling paths help, and using them together is best
      • Spatial attention: a plain 1x1-conv squeeze also works, but avg+max pooling is better, and a 7x7 conv is slightly better than 3x3
      • arrangement: as stated above, better than SE's single-squeeze design; channel-first is slightly better than spatial-first, and sequential beats parallel
    • Classification results:outperform baselines and SE

    • Network Visualization
      • cover the target object regions better
      • the target class scores also increase accordingly
    • Object Detection results
      • apply to detectors:right before every classifier
      • apply to backbone

SK-Net: Selective Kernel Networks

  1. 动机

    • the receptive field of biological neurons changes as the stimulus changes
    • propose a selective kernel unit
      • adaptively adjust the RF
      • multiple branches with different kernel sizes
      • guided fusion
      • a grab-bag of tricks: multi-branch & multi-kernel, group conv, dilated conv, attention mechanism
    • SKNet
      • by stacking multiple SK units
      • validated on classification tasks
  2. 论点

    • multi-scale aggregation
      • already present in the Inception block
      • but linear aggregation approach may be insufficient
    • multi-branch network
      • two-branch: represented by ResNet, mainly to make networks easier to train
      • multi-branch: represented by Inception, mainly to obtain multifarious features
    • grouped/depthwise/dilated conv
      • grouped conv: reduces computation, improves accuracy
      • depthwise conv: reduces computation at the cost of accuracy
      • dilated conv: enlarges the RF with fewer parameters than a dense large kernel
    • attention mechanism
      • the weighting family:
        • SENet & CBAM:
        • compared with these, SKNet additionally has an adaptive RF
      • the dynamic-convolution family:
        • STN is hard to train, and once trained its transform is fixed
        • deformable conv can still vary the transform dynamically at inference time, but lacks multi-scale and nonlinear aggregation
    • thus we propose SK convolution
      • multi-kernels: the large conv kernels are implemented with dilated conv
      • nonlinear aggregation
      • computationally lightweight
      • could be successfully embedded into small models
      • workflow
        • split
        • fuse
        • select
      • main difference from inception
        • less customized
        • adaptive selection instead of equally addition
  3. 方法

    • selective kernel convolution

      • split

        • multi-branch with different kernel size
        • grouped/depthwise conv + BN + ReLU
        • 5x5 kernel can be further replaced with dilated conv
      • fuse

        • to learn the control of information flow from different branches
        • element-wise summation
        • global average pooling
        • fc-BN-ReLU:reduce dimension,at least 32
      • select

        • channel-wise weighting vectors A & B (& more): per channel, A + B (+ more) = 1

        • fc-softmax

        • with 2 branches a single weight matrix A is enough; B is redundant since it can be derived as 1 - A

        • reweighting
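      A minimal sketch of the whole selective kernel convolution with M=2 branches (G=32 and r=16 follow the defaults below; the rest is illustrative, and the input channel count is assumed to be divisible by the group number):

      from tensorflow.keras import layers

      def sk_conv(x, filters, groups=32, reduction=16, L=32):
          # split: two grouped 3x3 branches, the "5x5" one realized as a dilated 3x3
          u1 = layers.Conv2D(filters, 3, padding='same', groups=groups, use_bias=False)(x)
          u1 = layers.Activation('relu')(layers.BatchNormalization()(u1))
          u2 = layers.Conv2D(filters, 3, padding='same', dilation_rate=2, groups=groups, use_bias=False)(x)
          u2 = layers.Activation('relu')(layers.BatchNormalization()(u2))
          # fuse: element-wise sum -> global average pooling -> bottleneck fc (dim at least L)
          u = layers.Add()([u1, u2])
          s = layers.GlobalAveragePooling2D()(u)
          d = max(filters // reduction, L)
          z = layers.Activation('relu')(layers.BatchNormalization()(layers.Dense(d)(s)))
          # select: per-branch channel logits, softmax across the branch axis, reweight and sum
          logits = layers.Reshape((2, filters))(layers.Dense(2 * filters)(z))
          attn = layers.Softmax(axis=1)(logits)          # the two branch weights sum to 1 per channel
          a = layers.Reshape((1, 1, filters))(layers.Lambda(lambda t: t[:, 0])(attn))
          b = layers.Reshape((1, 1, filters))(layers.Lambda(lambda t: t[:, 1])(attn))
          return layers.Add()([layers.Multiply()([u1, a]), layers.Multiply()([u2, b])])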

    • network

      • start from resnext
      • repeated SK units: analogous to bottleneck blocks
        • 1x1 conv
        • SK conv
        • 1x1 conv
        • hyperparams
          • number of branches M=2
          • group number G=32:cardinality of each path
          • reduction ratio r=16: the dimension-reduction parameter in the fuse operator
      • embedding into lightweight architectures

        • MobileNet/shuffleNet
        • replace their 3x3 depthwise convolutions with SK conv

  4. 实验

    • better accuracy than the state-of-the-art ResNet, DenseNet, and ResNeXt models

GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond

  1. 动机

    • Non-Local Network (NLNet)
      • capture long-range dependencies
      • obtain query-specific global context
      • but we found global contexts are almost the same for different query positions
    • we propose
      • a query-independent formulation
      • with a structure similar to SE-Net
      • aims at global context modeling
  2. 论点

    • Capturing long-range dependency

      • mainly by deeply stacking conv layers: inefficient

      • non-local network

        • via self-attention mechanism
        • computes the pairwise relations between the query position and all positions, then aggregates
        • but the attention maps obtained for different query positions are almost identical

      • we simplify the non-local block

        • query-independent
        • maintain acc & save computation
    • our proposed GC-block

      • unifies both the NL block and the SE block
      • three steps
        • global context modeling:
        • feature transform module:capture channel-wise interdependency
        • fusion module:merge into the original features
    • gains across multiple tasks

      • though all comparisons are against ResNet-50 baselines
  3. revisit NLNet

    • non-local block

      • $f(x_i, x_j)$:

        • encodes the relationship between position i & j
        • instantiations include Gaussian, Embedded Gaussian, Dot product, and Concat
        • different instantiations achieve comparable performance
      • $C(x)$:norm factor

      • $z_i = x_i + \sum_{j=1}^{N_p} \frac{f(x_i, x_j)}{C(x)} F(x_j)$: aggregates a query-specific global feature onto $x_i$

      • widely-used Embedded Gaussian: $f(x_i, x_j)=\exp(\langle W_q x_i, W_k x_j \rangle)$, giving normalized attention weights $\omega_{ij} = \frac{\exp(\langle W_q x_i, W_k x_j \rangle)}{\sum_m \exp(\langle W_q x_i, W_k x_m \rangle)}$

      • integration:

        • Mask R-CNN with FPN and Res50
        • only add one non-local block right before the last residual block of res4
      • observations & inspirations

        • the distances among inputs show that the input features at different positions are discriminative
        • outputs & attention maps are almost the same:global context after training is actually independent of query position
        • inspirations
          • simplify the Non-local block
          • no need of query-specific
  4. 方法

    • simplified form of the NL block: SNL

      • compute one common global feature and share it with every position in the image

      • further simplification: move the 1x1 conv applied to $x_j$ outside the aggregation; FLOPs drop sharply because the conv then acts on a 1x1 feature instead of an HxW map

      • the SNL block achieves comparable performance to the NL block with significantly lower FLOPs

    • global context modeling

      • SNL can be abstracted into three parts:
        • global attention pooling: obtain attention weights via $W_k$ & softmax, then perform global (weighted) pooling
        • feature transform:1x1 conv
        • feature aggregation:broadcast element-wise add
      • the SE block can be decomposed into a similar abstraction

        • global attention pooling: plain global average pooling
        • feature transform: the squeeze & excite fc-relu-fc-sigmoid
        • feature aggregation:broadcast element-wise multiplication

    • Global Context Block

      • integrate the benefits of both
        • SNL global attention pooling:effective modeling on long-range dependency
        • SE bottleneck transform: light computation (as long as the ratio is larger than 2, it saves both parameters and computation)
      • additionally, BN is added on the squeeze layer of the SE-style transform
        • ease optimization
        • benefit generalization
      • fusion:add
      • integration:
        • GC-ResNet50
        • add GC-block to all layers (c3+c4+c5) in resnet50 with se ratio of 16
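      A minimal sketch of the GC block as abstracted above: softmax attention pooling for context modeling, a bottleneck transform with a normalization layer (described as BN in these notes; the official GCNet code uses LayerNorm, which is what this sketch uses), and a broadcast addition. A fixed input size is assumed so the spatial dims are known:

      from tensorflow.keras import layers

      def gc_block(x, ratio=16):
          h, w, c = x.shape[1], x.shape[2], x.shape[3]
          # global attention pooling: Wk (1x1 conv) -> softmax over all H*W positions
          attn = layers.Softmax(axis=1)(layers.Reshape((h * w, 1))(layers.Conv2D(1, 1)(x)))
          feat = layers.Reshape((h * w, c))(x)
          context = layers.Dot(axes=(1, 1))([feat, attn])        # (B, C, 1) pooled context
          context = layers.Reshape((1, 1, c))(context)
          # bottleneck transform: 1x1 conv -> norm -> ReLU -> 1x1 conv
          t = layers.Conv2D(c // ratio, 1)(context)
          t = layers.Activation('relu')(layers.LayerNormalization()(t))
          t = layers.Conv2D(c, 1)(t)
          # fusion: broadcast element-wise addition onto every position
          return layers.Lambda(lambda xt: xt[0] + xt[1])([x, t])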
    • relationship to SE-block

      • first, the fusion method reflects different goals
        • SE rescales the channels based on global information, using it indirectly
        • GC uses it directly, adding the long-range dependency onto every position
      • second, the normalization layer
        • ease optimization
      • third, the global attention pooling
        • SE's GAP is a special case of it
        • learned weighting factors prove superior

Deeplab系列

发表于 2020-02-24 |

综述

  1. papers

    • deeplabV1: SEMANTIC IMAGE SEGMENTATION WITH DEEP CONVOLUTIONAL NETS AND FULLY CONNECTED CRFS. Main contribution: atrous convolution, which keeps the feature maps produced by the feature-extraction stage at a relatively high resolution
    • deeplabV2: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. Main contribution: the multi-scale ASPP structure
    • deeplabV3: Rethinking Atrous Convolution for Semantic Image Segmentation. Proposes cascaded and parallel ResNet-based structures, introduces the multi-grid detail, and improves the ASPP module
    • deeplabV3+: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
    • Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation
  1. Why the segmentation results are coarse

    • pooling: abstracts the whole image and lowers the resolution, losing detail; its translation invariance blurs boundary information
    • the probabilistic relations between labels are not exploited: a CNN lacks constraints on spatial and edge information

    To address this, deeplabV1 introduces

    • atrous convolution: replacing a large kernel with several small ones (as in VGG) only grows the receptive field linearly, whereas stacking several atrous convolutions grows it exponentially
    • the fully connected CRF: used as a second stage to improve the model's ability to capture detail and the accuracy of the segmented boundaries
  2. Segmenting large and small objects simultaneously

    deeplabV2 introduces

    • multi-scale ASPP (Atrous Spatial Pyramid Pooling): atrous convolutions with several sampling rates extract features in parallel, and the features are then fused
    • backbone change: VGG16 is replaced by ResNet
    • different learning rates
  3. Further improvements to the architecture

    deeplabV3 introduces

    • ASPP embedded into the last few ResNet blocks
    • CRF removed
  4. With plain conv/pool operations the score map has low resolution: strided pooling discards part of the information, and upsampling then yields a badly distorted image. Atrous convolution keeps all the information in the feature map while maintaining its resolution, reducing the information loss

  5. Compared with V2, DeeplabV3's ASPP adds a 1x1 conv path and an image-pooling path. The GAP path is added because experiments show that as the rate grows, the number of valid weights shrinks (part of the kernel falls outside the boundary and long-range information cannot be captured effectively), so image-level features extracted by global average pooling are concatenated with the ASPP features to compensate for the information lost to dilation

    On the atrous paths, V2 applies each path's atrous convolution followed by two 1x1 convs (without BN), whereas V3 combines atrous convolution with BatchNormalization

    For fusion, V2 uses sum fusion, while V3 concatenates all paths and applies a 1x1 conv to obtain the final score map

  6. DeeplabV3's cascaded version: "In order to maintain original image size, convolutions are replaced with atrous convolutions with rates that differ from each other by a factor of 2". The slides say the later blocks are copies of block4, each with three conv layers whose last conv has stride 2; to maintain the output size the atrous rate is then doubled. This point is not entirely clear to me.

    multi-grid method: use different atrous rates for the three convolutions inside each block, unit rates (e.g. (1,2,4)) * rate (e.g. 2)

deeplabV1: SEMANTIC IMAGE SEGMENTATION WITH DEEP CONVOLUTIONAL NETS AND FULLY CONNECTED CRFS

  1. 动机

    • brings together methods from Deep Convolutional Neural Networks and probabilistic graphical models
    • poor localization property of deep networks
    • combine a fully connected Conditional Random Field (CRF)
    • be able to localize segment boundaries beyond previous accuracies
    • speed: atrous
    • accuracy:
    • simplicity: cascade modules
  2. 论点

    • DCNN learns hierarchical abstractions of data, which is desirable for high-level vision tasks (classification)
    • but it hampers low-level tasks, such as pose estimation and semantic segmentation, where we want precise localization, rather than abstraction of spatial details
    • two technical hurdles in DCNNs when applying to image labeling tasks
      • pooling, loss of resolution: we employ the ‘atrous’ (with holes) for efficient dense computation
      • spatial invariance: we use the fully connected pairwise CRF to capture fine edge details
    • Our approach

      • treats every pixel as a CRF node
      • exploits long-range dependencies
      • and uses CRF inference to directly optimize a DCNN-driven cost function

  3. 方法

    • structure

      • fully convolutional VGG-16
      • keep the first 3 subsampling blocks for a target stride of 8
      • use hole algorithm conv filters for the last two blocks
      • keep the pooling layers for the purpose of fine-tuning, but change their strides from 2 to 1
      • for the dense map (h/8), the first fully convolutional 7*7*4096 layer is computationally expensive, so it is changed to 4*4 / 3*3 convs
      • further computation decreasement: reduce the fc channels from 4096 to 1024
    • train

      • label:ground truth subsampled by 8
      • loss function:cross-entropy
    • test

      • x8:simply bilinear interpolation
      • fcn:stride32 forces them to use learned upsampling layers, significantly increasing the complexity and training time
    • CRF

      • short-range:used to smooth noisy

      • fully connected model:to recover detailed local structure rather than further smooth it

      • energy function: $E(x) = \sum_i \theta_i(x_i) + \sum_{i,j} \theta_{ij}(x_i, x_j)$, with unary term $\theta_i(x_i) = -\log P(x_i)$ and pairwise term $\theta_{ij}(x_i, x_j) = \mu(x_i, x_j) \sum_m w_m k^m(f_i, f_j)$

        $P(x_i)$ is the bi-linear interpolated probability output of DCNN.

        $k^m(f_i, f_j)$ is the Gaussian kernel depends on features (involving pixel positions & pixel color intensities)

    • multi-scale prediction

      • to increase the boundary localization accuracy
      • we attach to the input image and the output of each of the first four max pooling layers a two-layer MLP (first layer: 128 3x3 convolutional filters, second layer: 128 1x1 convolutional filters)
      • the feature maps above is concatenated to the main network’s last layer feature map
      • the final feature map is thus enhanced by 128*5=640 extra channels
      • we only adjust the newly added weights
      • introducing these extra direct connections from fine-resolution layers improves localization performance, yet the effect is not as dramatic as the one obtained with the fully-connected CRF
  4. Dilated (atrous) convolution

    • Compared with an ordinary convolution, a dilated convolution has one extra hyper-parameter, the dilation rate: the spacing between kernel elements (an ordinary convolution has dilation rate 1)

    • fcn: pooling followed by upsampling loses information along the way; can we design an operation that obtains a large receptive field and sees more context without pooling?

    • the 2-dilated conv in figure (b) has a kernel size of only 3x3, but its receptive field already grows to 7x7 (assuming the previous layer is a 3x3 1-dilated conv)

    • the 4-dilated conv in figure (c) has a kernel size of only 3x3, but its receptive field already grows to 15x15 (assuming the previous two layers are a 3x3 1-dilated conv and a 3x3 2-dilated conv)

    • stacking three ordinary 3x3 1-dilated convs, by contrast, only reaches a 7x7 receptive field

    • dilation therefore enlarges the receptive field without the information loss of pooling, so every convolution output covers a wider range of the input

    • The Gridding Effect: as in the figure, repeatedly stacking 3x3 2-dilated convs effectively discretizes the original input (many pixels are never used). Hence the dilation rates of stacked convolutions must not share a common factor greater than 1.

    • Long-range information: increasing the dilation rate helps large objects but may do small objects more harm than good

    • HDC (Hybrid Dilated Convolution) design rules

      • the stacked dilation rates must not have a common factor greater than 1 (e.g. [2,4,6] is bad)

      • arrange the dilation rates in a sawtooth pattern, e.g. a repeating [1, 2, 5, 1, 2, 5] cycle; the sawtooth serves small and large objects at the same time (small dilation rates attend to nearby information, large dilation rates to distant information)

      • satisfy $M_i = \max[M_{i+1}-2r_i,\ M_{i+1}-2(M_{i+1}-r_i),\ r_i]$, where $M_i$ is the maximum distance between two non-zero values covered at layer $i$ (with $M_n = r_n$ for the last layer)

      • a feasible scheme, [1,2,5]:

deeplabV2: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

  1. 动机

    • atrous convolution:control the resolution
    • atrous spatial pyramid pooling (ASPP) :multiple sampling rates
    • fully connected Conditional Random Field (CRF)
  2. 论点

    • three challenges in the application of DCNNs to semantic image segmentation

      • reduced feature resolution:max-pooling and downsampling (‘striding’) —> atrous convolution
      • existence of objects at multiple scales:multi input scale —> ASPP
      • reduced localization accuracy due to DCNN invariance:skip-layers —> CRF

    • improvements compared to its first version

      • better segment objects at multiple scales
      • ResNet replaces VGG16
      • a more comprehensive experimental evaluation on models & dataset
    • related works
      • jointly learning of the DCNN and CRF to form an end-to-end trainable feed-forward network
      • while ours is still a two-stage process
      • use a series of atrous convolutional layers with increasing rates to aggregate multiscale context
      • while in our structure using parallel instead of serial
  3. 方法

    • atrous convolution

      • running an ordinary convolution on a downsampled feature map is equivalent to running an upsampled (hole-filled) filter on the original-resolution map
        • the 1-D illustration shows that the two have the same receptive field
        • while high resolution is preserved
      • while both the number of filter parameters and the number of operations per position stay constant

      • set the stride of a downsampling layer (pooling/conv) in the backbone to 1, then change all the subsequent conv layers to 2-dilated convs: this allows us to compute feature responses at the original image resolution

      • efficiency/accuracy trade-off:using atrous convolution to increase the resolution by a factor of 4
      • followed by fast bilinear interpolation by a factor of 8 to the original image resolution
      • Bilinear interpolation is sufficient in this setting because the class score maps are quite smooth unlike FCN

      • Atrous convolution offers easy control of the field-of-view and finds the best trade-off between accurate localization (small field-of-view) and context assimilation (large field-of-view): a large field-of-view abstracts and fuses context, while a small field-of-view keeps low-level local information accurate

      • implementation: (1) by definition, upsample the filter by inserting zeros; or (2) subsample the feature map into rate*rate reduced-resolution maps, run the original conv on each, and re-interleave the shifted results
    • ASPP

      • multi input scale:

        • run parallel DCNN branches that share the same parameters
        • fuse by taking at each position the maximum response across scales
        • computationally costly
      • spatial pyramid pooling

        • run multiple parallel filters with different rates
        • multi-scale features are further processed in separate branches:fc7&fc8
        • fuse:sum fusion

    • CRF:keep the same as V1

deeplabV3: Rethinking Atrous Convolution for Semantic Image Segmentation

  1. 动机

    • for segmenting objects at multiple scales
      • employ atrous convolution in cascade or in parallel with multiple atrous rates
      • augment ASPP with image-level features encoding global context and further boost performance
    • without DenseCRF
  2. 论点

    • our proposed module consists of atrous convolution with various rates and batch normalization layers
    • modules in cascade or in parallel:when applying a 3*3 atrous convolution with an extremely large rate, it fails to capture long range information due to image boundary effects
  3. 方法

    • Atrous Convolution

      for each location $i$ on the output $y$ and a filter $w$, an $r$-rate atrous convolution over the input feature map $x$ is defined as $y[i] = \sum_{k} x[i + r \cdot k]\, w[k]$

    • in cascade

      • duplicate several copies of the last ResNet block (block4)
      • extra block5, block6, block7 as replicas of block4

      • multi-rates

    • ASPP

      • we include batch normalization within ASPP

      • as the sampling rate becomes larger, the number of valid filter weights becomes smaller (beyond boundary)

      • to incorporate global context information:we adopt image-level features by GAP on the last feature map of the model

        GAP —> 1*1*256 conv —> BN —> bilinearly upsample

      • fusion: concatenated + 1*1 conv

      • seg:final 1*1*n_classes conv
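      A minimal Keras sketch of this ASPP module (rates 6/12/18 are the commonly used setting at output stride 16 and are an assumption here; a fixed input size is assumed so the image-pooling path can be upsampled back):

      from tensorflow.keras import layers

      def conv_bn_relu(x, filters, kernel, rate=1):
          x = layers.Conv2D(filters, kernel, padding='same', dilation_rate=rate, use_bias=False)(x)
          return layers.Activation('relu')(layers.BatchNormalization()(x))

      def aspp(x, filters=256, rates=(6, 12, 18)):
          h, w, c = x.shape[1], x.shape[2], x.shape[3]
          branches = [conv_bn_relu(x, filters, 1)]                          # 1x1 conv path
          branches += [conv_bn_relu(x, filters, 3, rate=r) for r in rates]  # atrous 3x3 paths
          # image-level features: GAP -> 1x1x256 conv -> BN -> bilinear upsample
          img = layers.Reshape((1, 1, c))(layers.GlobalAveragePooling2D()(x))
          img = conv_bn_relu(img, filters, 1)
          img = layers.UpSampling2D(size=(h, w), interpolation='bilinear')(img)
          branches.append(img)
          # fusion: concatenate all paths, then a 1x1 conv
          return conv_bn_relu(layers.Concatenate()(branches), filters, 1)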

    • training details

      • large crop size required to make sure the large atrous rates effective
      • upsample the output: it is important to keep the groundtruths intact and instead upsample the final logits
  4. 结论

    • output stride = 8 is better than 16, but several times slower

deeplabV3+: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

  1. 动机

    • spatial pyramid pooling module captures rich contextual information
    • encode-decoder structure captures sharp object boundaries
    • combine the above two methods
    • propose a simple yet effective decoder module
    • explore Xception backbone
  2. 论点

    • even though rich semantic information is encoded through ASPP, detailed information related to object boundaries is missing due to striding operations
    • atrous convolution could alleviate but suffer the computational balance
    • while encoder-decoder models lend themselves to faster computation (since no features are dilated) in the encoder path and gradually recover sharp object boundaries in the decoder path
    • the so-called encoder-decoder structure integrates features of different scales through short connections between the encoder and decoder; adding such shortcuts while increasing the downsampling rate (no atrous convolution on the encoder path, so more pooling is needed for the same receptive field, keeping only the ASPP block at the very bottom) both reduces computation and enriches local boundary detail

    • applying the atrous separable convolution to both the ASPP and decoder modules: finally, separable convolutions are brought in to further improve computational efficiency

  3. 方法

    • atrous separable convolution

      • significantly reduces the computation complexity while maintaining similar (or better) performance

    • DeepLabv3 as encoder

      • output_stride=16/8:remove the striding of the last 1/2 blocks
      • atrous convolution:apply atrous convolution to the blocks without striding
      • ASPP:run 1x1 conv in the end to set the output channel to 256
    • proposed decoder

      • naive decoder:bilinearly upsampled by 16
      • proposed:first bilinearly upsampled by 4, then concatenated with the corresponding low-level features
      • low-level features:
        • apply a 1x1 conv on the low-level features to reduce their channel count, so they do not outweigh the encoder features
        • the last feature map in res2x residual block before striding
      • combined features:apply 3x3 conv(2 layers, 256 channels) to obtain sharper segmentation results
      • more shortcut:observed no significant improvement
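      A minimal sketch of this decoder (the 48-channel reduction of the low-level features follows the paper's usual setting and is an assumption here; `aspp_out` and `low_level` are stand-ins for the encoder features):

      from tensorflow.keras import layers, Model, Input

      n_classes = 21
      aspp_out = Input(shape=(32, 32, 256))      # encoder/ASPP output at stride 16
      low_level = Input(shape=(128, 128, 256))   # res2x feature at stride 4

      x = layers.UpSampling2D(4, interpolation='bilinear')(aspp_out)     # first x4 upsampling
      ll = layers.Conv2D(48, 1, use_bias=False)(low_level)               # reduce low-level channels
      ll = layers.Activation('relu')(layers.BatchNormalization()(ll))
      x = layers.Concatenate()([x, ll])
      for _ in range(2):                                                 # two 3x3 refinement convs, 256 channels
          x = layers.Conv2D(256, 3, padding='same', use_bias=False)(x)
          x = layers.Activation('relu')(layers.BatchNormalization()(x))
      x = layers.Conv2D(n_classes, 1)(x)
      out = layers.UpSampling2D(4, interpolation='bilinear')(x)          # back to input resolution
      decoder = Model([aspp_out, low_level], out)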

    • modified Xception backbone

      • deeper
      • all the max pooling operations are replaced with depthwise separable convolutions with striding
      • DWconv-BN-ReLU-PWconv-BN-ReLU

  4. 实验

    1. decoder effect on border


Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation
