FCN: Fully Convolutional Networks for Semantic Segmentation
Motivation
- take input of arbitrary size
- pixelwise prediction (semantic segmentation)
- efficient inference and learning
- end-to-end
- with supervised pre-training
Arguments
- fully connected layers bring heavy computation
- patchwise/proposal-based training is less efficient (to classify one pixel, the patch around it has to be cropped out, so the storage for one image grows by roughly a factor of k*k; the overlap between adjacent patches introduces a lot of redundant computation; and the receptive field is too small to make effective use of global information)
- fully convolutional structures are used as feature extractors that yield a localized, fixed-length feature
- Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. Deep feature hierarchies jointly encode location and semantics in a local-to-global pyramid.
- other semantic segmentation approaches (e.g. R-CNN) are not end-to-end
Key Elements
- replace the fully connected layers with 1*1 convolutions that extract features and produce a heatmap
- deconvolution upsamples the small heatmap back to a full-size semantic segmentation map
- a novel “skip” architecture to combine deep, coarse, semantic information and shallow, fine, appearance information
Method
fully convolutional network
- receptive fields: Locations in higher layers correspond to the locations in the image they are path-connected to
- typical recognition nets:
- fixed-input
- patches
- the fully connected layers can be viewed as convolutions with kernels that cover their entire input regions (see the first sketch after this list)
- our structure:
- arbitrary-input
- computation is shared: the overlapping regions of those patches are computed only once
- the output size scales with the input (H/16, W/16)
- heatmap: the (H/16 * W/16) feature map holds one channel of scores per class (e.g. the 1000 ImageNet classes)
- shift-and-stitch, a trick introduced by OverFeat (see the second sketch after this list):
- each element of the coarse feature map corresponds to a receptive-field region of the original image; its value is written to the center position of that receptive field
- shifting the original image shifts the corresponding receptive fields, so the coarse output changes and so does the value written to each center position
- sweeping the shifts over a stride*stride range yields an output at the original image resolution
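To make the fully-connected-as-convolution point concrete, here is a minimal numpy sketch (my own illustration with small, made-up dimensions, not the paper's code): a fully connected layer applied to a flattened feature map computes exactly the same thing as a "valid" convolution whose kernel covers the whole input; slid over a larger input, the same kernel produces a spatial grid of scores, i.e. a heatmap.

```python
import numpy as np

np.random.seed(0)
feat = np.random.randn(7, 7, 64)        # pooled feature map of one fixed-size crop
W = np.random.randn(7 * 7 * 64, 10)     # fully connected weights (10 output units)
b = np.random.randn(10)

fc_out = feat.reshape(-1) @ W + b       # ordinary fully connected output

# same weights viewed as a 7x7 conv kernel: (kh, kw, in_channels, out_channels)
kernel = W.reshape(7, 7, 64, 10)
conv_out = np.tensordot(feat, kernel, axes=([0, 1, 2], [0, 1, 2])) + b

print(np.allclose(fc_out, conv_out))    # True: identical computation
```

On an input larger than 7*7, sliding this kernel yields one score vector per location instead of a single vector, which is exactly the heatmap described above.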
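And a toy numpy illustration of shift-and-stitch (again my own sketch, with a stride-2 max pool standing in for the coarse network): the "net" is run on every shifted copy of the input and the f*f coarse outputs are interlaced into a full-resolution map.

```python
import numpy as np

def coarse_net(x, f=2):
    # stand-in coarse network: f*f max pooling with stride f
    h, w = x.shape
    return x[:h - h % f, :w - w % f].reshape(h // f, f, w // f, f).max(axis=(1, 3))

f = 2
img = np.arange(36, dtype=float).reshape(6, 6)
padded = np.pad(img, ((0, f - 1), (0, f - 1)), mode='edge')

stitched = np.zeros_like(img)
for dy in range(f):
    for dx in range(f):
        shifted = padded[dy:dy + 6, dx:dx + 6]            # shift the input by (dy, dx)
        stitched[dy::f, dx::f] = coarse_net(shifted, f)   # interlace the coarse outputs

print(stitched.shape)   # (6, 6): same resolution as the input
```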
upsampling
- simplest: bilinear interpolation
- in-network upsampling: backwards convolution (deconvolution) with an output stride of f
- A stack of deconvolution layers and activation functions can even learn a nonlinear upsampling
- factor: in FCN the input size and the output size are linearly related; the factor is the product of the subsampling strides of all the convolution/pooling layers
- kernel_size: $2 \cdot factor - (factor \bmod 2)$
- stride: $factor$
- padding: $\lceil (factor - 1) / 2 \rceil$
The calculation here is a bit convoluted. $stride = factor$ is the easiest to pin down: it is the scale $s$ by which the feature map has to be enlarged to recover the original size. Viewing the transposed convolution as an ordinary convolution over a zero-inserted input, $s-1$ zeros are inserted between adjacent input elements, which brings the size to $(s-1)(\text{input\_size}-1)+\text{input\_size} = s \cdot \text{input\_size} - (s-1)$, so at least a padding of $\lceil (s-1)/2 \rceil$ is still needed to reach $\text{output\_size} = s \cdot \text{input\_size}$. The easiest consistency check is the standard transposed-convolution output-size formula $\text{output\_size} = (\text{input\_size} - 1) \cdot stride - 2 \cdot padding + \text{kernel\_size}$, which with the three settings above gives $\text{output\_size} = factor \cdot \text{input\_size}$.
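As a quick sanity check of this rule (my own sketch, using only the standard transposed-convolution output-size formula, not any FCN code):

```python
import math

def deconv_params(factor):
    # kernel size, stride and padding as a function of the upsampling factor
    kernel_size = 2 * factor - factor % 2
    stride = factor
    padding = math.ceil((factor - 1) / 2)
    return kernel_size, stride, padding

for factor in (2, 3, 4, 8, 16, 32):
    k, s, p = deconv_params(factor)
    for input_size in (7, 16, 63):
        # output size of a transposed convolution
        output_size = (input_size - 1) * s - 2 * p + k
        assert output_size == factor * input_size
print("output_size == factor * input_size for every tested case")
```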
In Keras this can be implemented with the Conv2DTranspose layer:
```python
from keras.layers import Input, Conv2DTranspose
from keras.models import Model

x = Input(shape=(64, 64, 16))
y = Conv2DTranspose(filters=16, kernel_size=20, strides=8, padding='same')(x)
model = Model(x, y)
model.summary()
# input: (None, 64, 64, 16) output: (None, 512, 512, 16) params: 102,416

x = Input(shape=(32, 32, 16))
y = Conv2DTranspose(filters=16, kernel_size=48, strides=16, padding='same')(x)
# input: (None, 32, 32, 16) output: (None, 512, 512, 16) params: 589,840

x = Input(shape=(16, 16, 16))
y = Conv2DTranspose(filters=16, kernel_size=80, strides=32, padding='same')(x)
# input: (None, 16, 16, 16) output: (None, 512, 512, 16) params: 1,638,416

# For reference: the original U-Net has 36,605,042 parameters in total.
# The transposed-conv layers at each level contribute:
# (None, 16, 16, 512)     4,194,816
# (None, 32, 32, 512)     4,194,816
# (None, 64, 64, 256)     1,048,832
# (None, 128, 128, 128)     262,272
# (None, 256, 256, 32)       16,416
```

As the comparison shows, increasing kernel_size has a huge effect on the parameter count. (If kernel_size is set too small, each output position only covers a single input element; in my view kernel_size should at least be larger than stride.)
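Related to the bilinear initialization mentioned later (the paper initializes its 2x upsampling to bilinear interpolation and then lets it learn), here is a sketch of the standard bilinear kernel that FCN-style deconvolutions are commonly initialized with; the function name and the use of Keras' (kh, kw, filters, in_channels) weight layout are my assumptions, not the paper's code.

```python
import numpy as np

def bilinear_kernel(factor, channels):
    """Weights that make a stride-`factor` Conv2DTranspose act as bilinear upsampling."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
    # one independent bilinear filter per channel, no cross-channel mixing
    weights = np.zeros((size, size, channels, channels), dtype=np.float32)
    for c in range(channels):
        weights[:, :, c, c] = filt
    return weights

# e.g. initialize an 8x Conv2DTranspose with 16 channels before training
# (assuming Keras' [kernel, bias] weight ordering):
# layer.set_weights([bilinear_kernel(8, 16), np.zeros(16)])
```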
Segmentation Architecture
- use pre-trained model
- convert all fully connected layers to convolutions
- append a 1*1 conv with channel dimension equal to the number of classes (including background) to predict scores
- followed by a deconvolution layer to upsample the coarse outputs to dense outputs (a rough Keras sketch follows this list)
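Below is a rough Keras sketch of this recipe (my reconstruction under assumed shapes, not the authors' released model): VGG16 as the pretrained backbone, fc6/fc7 convolutionalized, a 1*1 scoring layer, and a single 32x deconvolution, i.e. an FCN-32s-style head.

```python
from keras.applications import VGG16
from keras.layers import Conv2D, Conv2DTranspose
from keras.models import Model

n_classes = 21   # e.g. PASCAL VOC: 20 classes + background

backbone = VGG16(include_top=False, weights='imagenet', input_shape=(512, 512, 3))
x = backbone.output                                          # (16, 16, 512), overall stride 32

x = Conv2D(4096, 7, padding='same', activation='relu')(x)    # fc6 as a convolution
x = Conv2D(4096, 1, activation='relu')(x)                    # fc7 as a 1x1 convolution

score = Conv2D(n_classes, 1)(x)                              # per-class score heatmap

# one 32x deconvolution back to input resolution (FCN-32s)
out = Conv2DTranspose(n_classes, kernel_size=64, strides=32, padding='same')(score)

fcn32s = Model(backbone.input, out)                          # output: (512, 512, n_classes) logits
```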
skips
- the 32 pixel stride at the final prediction layer limits the scale of detail in the upsampled output
- upsample stage by stage, fusing the feature maps of earlier layers by element-wise addition
finer layers: "As they see fewer pixels, the finer scale predictions should need fewer layers." This refers to the convolutional network in front: as the network deepens, the receptive field on the feature map grows, so more channels are needed to record more combinations of low-level features
- add a 1*1 conv on top of pool4 (zero-initialized)
- add a 2x upsampling layer on top of conv7 ("We initialize this 2x upsampling to bilinear interpolation, but allow the parameters to be learned")
- sum the above two stride-16 predictions ("Max fusion made learning difficult due to gradient switching")
- 16x upsample back to the image (a Keras sketch of this fusion follows below)
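A Keras sketch of this fusion (my reconstruction; `pool4_feat` and `score32` are stand-in tensors with assumed shapes for a 512*512 input):

```python
from keras.layers import Add, Conv2D, Conv2DTranspose, Input
from keras.models import Model

n_classes = 21
pool4_feat = Input(shape=(32, 32, 512))       # stride-16 feature map (pool4)
score32 = Input(shape=(16, 16, n_classes))    # stride-32 class scores (conv7 + 1x1 scoring)

score_pool4 = Conv2D(n_classes, 1, kernel_initializer='zeros')(pool4_feat)  # zero-initialized 1x1 scores
upscore2 = Conv2DTranspose(n_classes, kernel_size=4, strides=2,
                           padding='same')(score32)        # 2x upsampling (bilinear-initialized in the paper)
fuse = Add()([score_pool4, upscore2])                       # element-wise sum of the two stride-16 predictions
out = Conv2DTranspose(n_classes, kernel_size=32, strides=16,
                      padding='same')(fuse)                 # 16x back to image resolution

fcn16s_head = Model([pool4_feat, score32], out)             # (512, 512, n_classes)
```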
Going past the third row of fusions and further down, the results get worse again, so they stop at this point
Summary
During upsampling, enlarging in stages works better than doing it in a single step
At each upsampling stage, use the features of the corresponding downsampling layer to help
Although 8x upsampling is much better than 32x, the result is still rather blurry and smooth and insensitive to fine details in the image; many researchers apply MRF or CRF algorithms to further refine the FCN output
Why x8 beats x32: 1. the receptive field of the x32 feature map is too large, so it is insensitive to small objects; 2. the larger magnification of x32 causes more distortion
Differences from U-Net:
U-Net does not use an ImageNet-pretrained model, because it targets medical images
U-Net uses concat rather than element-wise add when fusing shallow features (contrasted in the sketch after this list)
layer-by-layer upsampling, x2 vs. x8/x32
the original U-Net uses no padding, so the output is smaller than the input, while FCN pads and then crops
data augmentation: FCN does not use this 'machinery', whereas medical images need strong augmentation
weighted loss
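A tiny illustration of the fusion difference (my own sketch with made-up shapes): FCN sums the upsampled prediction with the skip connection, U-Net concatenates along the channel axis.

```python
from keras.layers import Add, Concatenate, Conv2DTranspose, Input

skip = Input(shape=(64, 64, 64))        # shallow encoder feature map
deep = Input(shape=(32, 32, 64))        # deeper feature map at half the resolution

up = Conv2DTranspose(64, kernel_size=4, strides=2, padding='same')(deep)   # 2x upsampling

fcn_style = Add()([up, skip])            # FCN: element-wise sum   -> (64, 64, 64)
unet_style = Concatenate()([up, skip])   # U-Net: channel concat   -> (64, 64, 128)
```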