transform in CNN

Overview

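  1. Geometric transformations
    • STN:
      • An ordinary CNN implicitly learns some translation and rotation invariance: the downsampling structure itself makes the network less sensitive to such transformations
      • On the data side, we also add various augmentations to strengthen the network's invariance to transformations
      • DeepMind designed an explicit transformation module that learns these transformations and warps the distorted input back, so the rest of the network has an easier problem to learn
      • Parameter count: just the parameters of the transformation matrix, usually a 2x3 affine transformation matrix, i.e. 6 parameters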
    • deformable conv:
      • based on STN
      • Proposes deformable convolution and deformable RoI pooling, for classification and detection respectively
      • deformable RoI pooling feels like the same thing as the feature adaption in Guided Anchoring
      • Parameter count: regular kernel params 3x3 + deformable offsets 3x3x2
      • what’s new?
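        • My take: the extra parameters are what bring the extra kinds of transformation
        • First, STN maps from output to input; the transformation matrix M can usually express only depictable (explicitly parameterised) transformations, and there is a single transformation for the whole map
        • Second, STN's sampling kernel is a predefined algorithm that applies the same transformation to every pixel inside the kernel, i.e. 1 weight factor
        • deformable conv maps from input to output; the mapping can be an arbitrary transformation, and the 3x3x2 parameters can hold up to 3x3 different transformations
        • The sampling kernel can also weight each point inside the kernel differently, i.e. 3x3 weight factors
    • What else is related to deformation?
  2. Attention mechanisms
    • spatial attention: STN, sSE
    • channel attention: SENet
    • spatial and channel attention together: CBAM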
  3. papers

    • [STN] STN: Spatial Transformer Networks. The transformation is pre-defined and applies to the whole feature map
    • [DCN 2017] Deformable Convolutional Networks. The transformation is more free-form and is applied separately per local kernel, adding a location-specific shift to the convolution kernel
    • [DCNv2 2018] Deformable ConvNets v2: More Deformable, Better Results. Further removes irrelevant context by adding a weighted location-specific shift to the convolution kernel, improving performance
    • [attention paper series] [SENet & SKNet & CBAM & GC-Net][https://amberzzzz.github.io/2020/03/13/attention%E7%B3%BB%E5%88%97/]

STN: Spatial Transformer Networks

  1. Motivation

    • conventional convolution: lacks the ability to be spatially invariant
    • propose a new learnable module
      • can be inserted into CNN
      • spatially manipulate the data
      • without any extra supervision
      • models learn to be invariant to transformations
  2. Arguments

    • spatially invariant
      • the ability of being invariant to large transformations of the input data
    • max-pooling
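      • spatially invariant to some extent
      • but only to a limited extent, because its receptive fields are fixed, local, and small
      • the invariance only emerges after stacking many layers; intermediate feature layers cope poorly with large transformations
      • a pre-defined mechanism, independent of the sample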
    • spatial transformation module
      • conditioned on individual samples
      • dynamic mechanism
      • produce a transformation and perform it on the entire feature map
    • task scenarios
      • distorted digit classification: transforming the input simplifies the subsequent classification task
      • co-localisation
      • spatial attention
    • related work
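      • a generator produces transformed images, so that a discriminator can learn the classification task from transformation supervision
      • some methods try to obtain invariant representations through the network architecture or the feature extractors, while STN aims to achieve this by manipulating the data
      • manipulating the data is usually based on an attention mechanism; cropping raises a differentiability problem
  3. Method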

    • formulation

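      • localisation network: predicts the transformation parameters
      • grid generator: builds the sampling grid from the predicted params
      • sampler: samples the input at the grid points with an interpolation kernel (element-wise multiply and sum)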

    • localisation network

      • input feature map $U \in \mathbb{R}^{H \times W \times C}$
      • same transformation is applied to each channel
      • generate parameters of transformation $\theta$: 1-d vector
      • fc / conv + final regression layer
    • parameterised sampling grid

      • sampling kernel

      • applied by pixel

      • general affine transformation:cropping,translation,rotation,scale,skew

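      • every point on the output map must come from some point before the transformation, but not vice versa: a point on the input map may be background that gets cropped away, so the pointwise transformation is written in the reverse direction (target to source): $\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$, where $A_\theta$ is the 2x3 affine matrix

      • the source points computed from all target grid points form the set of sampling points on the input feature map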
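
The target-to-source mapping above can be sketched in a few lines of plain Python. This is only an illustration: the function name `affine_grid` and the choice of normalized coordinates in [-1, 1] are assumptions made here, not something fixed by these notes.

```python
def affine_grid(theta, out_h, out_w):
    """Map every target (output) pixel to a source (input) coordinate.

    theta: 2x3 affine matrix as nested lists, acting on normalized
           coordinates in [-1, 1] (an assumed convention).
    Returns out_h rows of out_w (x_s, y_s) pairs in normalized coordinates.
    """
    grid = []
    for i in range(out_h):
        # normalized target y; a single-row grid sits at the centre
        y_t = -1.0 + 2.0 * i / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for j in range(out_w):
            x_t = -1.0 + 2.0 * j / (out_w - 1) if out_w > 1 else 0.0
            # (x_s, y_s)^T = A_theta (x_t, y_t, 1)^T
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
zoom_in = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]  # samples only the central half: a crop

print(affine_grid(identity, 3, 3)[0])
print(affine_grid(zoom_in, 3, 3)[0])
```

With `zoom_in`, the output corners map to (±0.5, ±0.5) on the input, which is why cropping falls out of the affine family without a non-differentiable crop operation.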

    • differentiable image sampling

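      • the matrix transformation from the previous step gives the set of source points to sample on the input map

      • apply the sampling kernel at each point in the set

      • general interpolation expression: $V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \, k(x_i^s - m; \Phi_x) \, k(y_i^s - n; \Phi_y)$

      • the nearest-neighbour kernel is a pulse (Kronecker delta) function: $k(d) = \delta(\lfloor d + 0.5 \rfloor)$

      • the bilinear kernel, $k(d) = \max(0, 1-|d|)$, zeroes out every pixel at distance 1 or more along either axis; it is piecewise differentiable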
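
A minimal sketch of the two sampling kernels for a single-channel map, assuming the source coordinates are already in pixel units and that out-of-range pixels count as zero (both are assumptions for illustration; the function names are made up here).

```python
import math


def bilinear_sample(U, x_s, y_s):
    """V = sum_n sum_m U[n][m] * max(0, 1-|x_s-m|) * max(0, 1-|y_s-n|).

    Only the (up to) four pixels around (x_s, y_s) have non-zero weight,
    so the double sum is restricted to them.
    """
    H, W = len(U), len(U[0])
    x0, y0 = math.floor(x_s), math.floor(y_s)
    value = 0.0
    for n in (y0, y0 + 1):
        for m in (x0, x0 + 1):
            if 0 <= n < H and 0 <= m < W:
                wx = max(0.0, 1.0 - abs(x_s - m))
                wy = max(0.0, 1.0 - abs(y_s - n))
                value += U[n][m] * wx * wy
    return value


def nearest_sample(U, x_s, y_s):
    """Pulse kernel: pick the single pixel at floor(coord + 0.5)."""
    n, m = math.floor(y_s + 0.5), math.floor(x_s + 0.5)
    if 0 <= n < len(U) and 0 <= m < len(U[0]):
        return U[n][m]
    return 0.0


U = [[0.0, 1.0],
     [2.0, 3.0]]
print(bilinear_sample(U, 0.5, 0.5))  # average of the four pixels
print(nearest_sample(U, 0.6, 0.2))   # rounds to pixel (row 0, col 1)
```

The bilinear weights vary linearly with `x_s` and `y_s` between pixel centres, which is what lets gradients flow back to the grid and hence to $\theta$; the nearest-neighbour version gives zero gradient almost everywhere.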

    • STN: Spatial Transformer Networks

      • insert the spatial transformer into a CNN: the network learns how to actively transform the features to help minimize the overall cost
      • computationally fast
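      • several ways to use it
        • feed the output of the localization network $\theta$ to the rest of the network: the transformation parameters explicitly encode the position and pose of the object
        • place multiple spatial transformers at increasing depth: in series, the deeper transformers can learn transformations of more abstract representations
        • place multiple spatial transformers in parallel: each parallel transformer can focus on a different object
  4. Experiments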

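    • R, RTS, P, E: the distortion applied to the input beforehand (rotation; rotation, translation and scale; projective; elastic)
    • aff, proj, TPS: the pre-defined transformation type used by the transformer
      • aff: a given angle??
      • TPS: thin plate spline interpolation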

Deformable Convolutional Networks

  1. Motivation

    • CNN: fixed geometric structures
    • enhance the transformation modeling capability
      • deformable convolution
      • deformable RoI pooling
    • without additional supervision
    • shares a similar spirit with STN
  2. Arguments

    • to accommodate geometric variations

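      • data augmentation is limited in its ability to model large, unknown transformations
      • fixed receptive fields are undesirable for high-level CNN layers that encode the semantics
      • with heavy augmentation the cases can never be fully enumerated, convergence is slow, and the network needs more parameters
      • for the high-level layers that extract semantic features, a fixed receptive field suits different objects poorly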
    • introduce two new modules

      • deformable convolution
        • learning offsets for each kernel via additional convolutional layers
      • deformable RoI pooling

        • learning offset for each bin partition of the previous RoI pooling

  3. Method

    • overview

      • operate on the 2D spatial domain
      • remains the same across the channel dimension
    • deformable convolution

      • regular convolution:
        • $y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n)$
        • $R = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$
      • deformable conv: with offsets $\Delta p_n$
        • $y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$
        • offset value is typically fractional
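        • bilinear interpolation:
          • $x(p) = \sum_q G(q,p)x(q)$
          • where $G(q,p)$ is the bilinear interpolation kernel: $G(q,p)=\max(0, 1-|q_x-p_x|) \cdot \max(0, 1-|q_y-p_y|)$
          • only neighbours less than 1 unit away from the offset point contribute
      • implementation
        • the offset conv uses the same kernel configuration as the feature-extraction conv: same spatial resolution and dilation (N positions)
        • the offset field has channel dimension 2N: one x offset and one y offset per position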
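
The deformable convolution formula can be traced at a single output location for a single channel. The sketch below is illustrative only: the function names, the `(dy, dx)` ordering, and zero padding outside the map are assumptions made here, and in a real layer the offsets would come from the extra offset conv rather than being passed in by hand.

```python
import math


def bilinear(x, py, px):
    """x(p) = sum_q G(q, p) x(q); pixels outside the map count as zero."""
    H, W = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    total = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < H and 0 <= qx < W:
                g = max(0.0, 1 - abs(qx - px)) * max(0.0, 1 - abs(qy - py))
                total += g * x[qy][qx]
    return total


# Regular 3x3 sampling grid R, as (dy, dx) pairs.
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def deform_conv_at(x, w, offsets, p0):
    """y(p0) = sum_n w(p_n) * x(p0 + p_n + dp_n) for one channel.

    w:       9 kernel weights, one per entry of R
    offsets: 9 (dy, dx) pairs, i.e. the 2N = 18 offset values at p0
    """
    y0, x0 = p0
    return sum(
        w_n * bilinear(x, y0 + dy + ody, x0 + dx + odx)
        for w_n, (dy, dx), (ody, odx) in zip(w, R, offsets)
    )


x = [[float(i * 5 + j) for j in range(5)] for i in range(5)]
w = [1.0] * 9
print(deform_conv_at(x, w, [(0.0, 0.0)] * 9, (2, 2)))  # zero offsets: regular conv
print(deform_conv_at(x, w, [(0.0, 0.5)] * 9, (2, 2)))  # every sample shifted half a pixel right
```

With all offsets at zero the result matches a regular 3x3 convolution, which is also the behaviour at the start of training when the offset conv is zero-initialized.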
    • deformable RoI pooling

      • RoI pooling converts an input feature map of arbitrary size into fixed size features

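      • regular RoI pooling

        • divides the RoI into k*k bins, and for each bin: $y(i,j) = \sum_{p \in bin(i,j)} x(p_0+p)/n_{ij}$
        • the sum runs over all feature-map points that fall into the bin; $n_{ij}$ is the number of points in the bin and $p_0$ is the top-left corner of the RoI
      • deformable RoI pooling: with offsets $\Delta p_{ij}$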

        • $y(i,j) = \sum_{p \in bin(i,j)} x(p_0+p+\Delta p_{ij})/n_{ij}$
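        • scaled normalized offsets: $\Delta p_{ij} = \gamma \cdot \Delta \hat{p}_{ij} \circ (w,h)$, where $\Delta \hat{p}_{ij}$ is the normalized offset predicted by the network and $(w,h)$ is the RoI size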
        • normalized offset value is fractional
        • bilinear interpolation on the pooled map as above
      • implementation

        • fc layer: outputs k*k*2 elements (sigmoid?)
      • position sensitive RoI Pooling

        • fully convolutional
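        • the input feature map is first expanded by a conv to k*k*(C+1) channels
        • for each of the C+1 classes (each holding k*k feature maps), a conv produces the offset fields over the whole map (2*k*k of them)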
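
The (non position-sensitive) deformable RoI pooling formula can be sketched as below. Everything here is an illustrative assumption rather than the paper's code: the function names, the `(top, left, h, w)` RoI layout, the `(dy, dx)` ordering, the bin boundaries (floor/ceil of the bin edges), and the default `gamma=0.1`. In a real model the normalized offsets would come from the fc layer.

```python
import math


def bilinear(x, py, px):
    """Bilinear read of x at a fractional location; zero outside the map."""
    H, W = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    total = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < H and 0 <= qx < W:
                g = max(0.0, 1 - abs(qx - px)) * max(0.0, 1 - abs(qy - py))
                total += g * x[qy][qx]
    return total


def deform_roi_pool(x, roi, k, norm_offsets, gamma=0.1):
    """y(i,j) = sum_{p in bin(i,j)} x(p0 + p + dp_ij) / n_ij.

    roi:          (top, left, h, w) in feature-map pixels (assumed layout)
    norm_offsets: k x k grid of normalized (dy_hat, dx_hat) pairs
    gamma:        offset scale; 0.1 is an assumed default
    """
    top, left, h, w = roi
    bin_h, bin_w = h / k, w / k
    out = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            dy_hat, dx_hat = norm_offsets[i][j]
            # dp_ij = gamma * dp_hat_ij scaled by the RoI size
            dy, dx = gamma * dy_hat * h, gamma * dx_hat * w
            ys = range(math.floor(i * bin_h), math.ceil((i + 1) * bin_h))
            xs = range(math.floor(j * bin_w), math.ceil((j + 1) * bin_w))
            total = sum(bilinear(x, top + py + dy, left + px + dx)
                        for py in ys for px in xs)
            out[i][j] = total / (len(ys) * len(xs))
    return out


x = [[float(i * 6 + j) for j in range(6)] for i in range(6)]
zero = [[(0.0, 0.0)] * 2 for _ in range(2)]
print(deform_roi_pool(x, (1, 1, 4, 4), 2, zero))  # plain average pooling per bin
```

Because the offset is normalized and then multiplied by the RoI size, the same predicted value moves a bin proportionally further in a large RoI than in a small one.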

    • deformable convNets

      • initialized with zero weights
      • learning rates are set to $\beta$ times of the learning rate for the existing layers
        • $\beta=1.0$ for conv
        • $\beta=0.01$ for fc
      • feature extraction
        • backbone: ResNet-101 & Aligned-Inception-ResNet
        • without the top (the average pooling and fc layers are removed): a randomly initialized 1x1 conv is added at the end to reduce the channel dimension to 1024
        • last block
          • stride is changed from 2 to 1
          • the dilation of all the convolution filters with kernel size>1 is changed from 1 to 2
        • Optionally last block
          • use deformable conv in res5a,b,c
      • segmentation and detection
        • DeepLab: a 1x1 conv predicts the per-pixel score maps
        • Category-Aware RPN: runs region proposal with class-specific predictions
        • modified Faster R-CNN: adds RoI pooling at the last conv
        • optional Faster R-CNN: uses deformable RoI pooling
        • R-FCN: state-of-the-art detector at the time
        • optional R-FCN: uses deformable RoI pooling

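  4. Experiments

    • Accuracy steadily improves when more deformable convolution layers are used: the more deformable conv layers the better, and 3 was chosen empirically

    • the learned offsets are highly adaptive to the image content: the sampling points are spread further apart on large objects, because the receptive field is larger; this is consistent across layers

    • atrous convolution also improves: default networks have too small receptive fields, but the dilation has to be hand-tuned to its optimum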

    • using deformable RoI pooling alone already produces noticeable performance gains, using both obtains significant accuracy improvements

Deformable ConvNets v2: More Deformable, Better Results

  1. Motivation

    • DCN adapts to some geometric variations, but its sampling can still extend beyond the region of interest and pick up irrelevant image content
    • to focus on pertinent image regions
      • increased modeling power
        • more deformable layers
        • updated DCNv2 modules
      • stronger training
        • propose feature mimicking scheme
    • verified on
      • incorporated into Faster-RCNN & Mask RCNN
      • COCO for detection & instance segmentation
    • still lightweight and easy to incorporate
  2. Arguments

    • DCNv1
      • deformable conv: on top of a standard conv, generates location-specific offsets which are learned from the preceding feature maps
      • deformable pooling: offsets are learned for the bin positions in RoI pooling
      • scatter-plot visualisations show that some of the sampling points fall outside the object
    • propose DCNv2
      • equip more convolutional layers with offset
      • modified module
        • each sample not only undergoes a learned offset
        • but also a learned feature amplitude
      • effective training
        • use RCNN as the teacher network since RCNN learns features unaffected by irrelevant info outside the ROI
        • feature mimicking loss
  3. Method

    • stacking more deformable conv layers

      • replace more regular conv layers by their deformable counterparts:
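        • all 3x3 convs in stages 3, 4 and 5 of ResNet-50 are replaced with deformable convs: 13 conv layers
        • DCNv1 only replaced the 3x3 convs in the 3 residual blocks of stage 5: 3 deformable conv layers
      • DCNv1's experiments on PASCAL found that accuracy saturated beyond that many deformable convs, whereas DCNv2 picks the best accuracy-efficiency trade-off on the harder COCO dataset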
    • modulated deformable conv

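      • modulate the input feature amplitudes from different spatial locations/bins
        • learnable offset and modulation scalar for the k-th location: $\Delta p_k$ and $\Delta m_k$
        • $p_k$ is the pre-specified offset of the k-th kernel location and encodes the kernel dilation, which is 1 throughout ResNet
        • the value at location p is: $y(p) = \sum_{k=1}^K w_k \cdot x(p+p_k+\Delta p_k) \cdot \Delta m_k$, with $x$ sampled by bilinear interpolation
      • the goal is to suppress irrelevant signal
      • learnable offset & scalar obtained via a separate conv layer over the same input feature map x
      • the output has 3K channels: 2K for the xy-offsets, K for the scalars
        • no activation after the offset conv, since offsets are unbounded
        • the scalar conv is followed by a sigmoid, which limits the range to [0,1]
        • both convs are zero-initialized
        • the learning rate of both conv layers is an order of magnitude smaller than that of the existing layers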
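
A single-location, single-channel sketch of the modulated formula. The channel layout of the 3K predicted values (interleaved `(dy, dx)` offsets first, then K modulation logits), the function names, and zero padding are assumptions made here for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def bilinear(x, py, px):
    """Bilinear read of x at a fractional location; zero outside the map."""
    H, W = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    total = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < H and 0 <= qx < W:
                g = max(0.0, 1 - abs(qx - px)) * max(0.0, 1 - abs(qy - py))
                total += g * x[qy][qx]
    return total


# Pre-specified kernel offsets p_k for a 3x3 kernel with dilation 1.
P = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def modulated_deform_conv_at(x, w, raw, p):
    """y(p) = sum_k w_k * x(p + p_k + dp_k) * dm_k for one channel.

    raw: the 3K values predicted at p. Assumed layout: 2K offsets
         (dy, dx interleaved) followed by K pre-sigmoid modulation logits.
    """
    K = len(P)
    y0, x0 = p
    out = 0.0
    for k in range(K):
        ody, odx = raw[2 * k], raw[2 * k + 1]   # unbounded, no activation
        m_k = sigmoid(raw[2 * K + k])           # squashed into [0, 1]
        dy, dx = P[k]
        out += w[k] * bilinear(x, y0 + dy + ody, x0 + dx + odx) * m_k
    return out


x = [[float(i * 5 + j) for j in range(5)] for i in range(5)]
w = [1.0] * 9
print(modulated_deform_conv_at(x, w, [0.0] * 27, (2, 2)))  # zero-init: every dm_k is 0.5
```

With the predicting conv zero-initialized, every offset starts at 0 and every modulation scalar at sigmoid(0) = 0.5, so the layer begins as a regular convolution scaled by one half. Driving one logit strongly negative removes that sample from the sum, which is the suppression mechanism described above.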
    • modulated deformable RoIpooling

      • given an input ROI
      • split into K(7x7) spatial bins
      • the bin value is computed by average pooling over the sampling points of each bin
      • the bin value is: $y(k) = \sum_{j=1}^{n_k} x(p_{kj}+\Delta p_k) \cdot \Delta m_k /n_k$, with $x$ sampled by bilinear interpolation
      • a sibling branch
        • two 1024-d fc layers: Gaussian initialization with 0.01 std dev
        • one 3K-d fc layer: zero-initialized
        • the last K channels go through a sigmoid
        • the learning rate is the same as for the existing layers
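
How the 3K-d output of the sibling branch feeds the bin formula can be sketched as follows. The layout (2K offset values first, last K channels as modulation logits) follows the bullets above; the interleaved `(dy, dx)` ordering and both function names are assumptions for illustration.

```python
import math


def split_branch_output(v, K):
    """Split the 3K-d fc output into K offsets and K modulation scalars.

    Assumed layout: [2K offset values, (dy, dx) interleaved | K logits].
    """
    assert len(v) == 3 * K
    offsets = [(v[2 * k], v[2 * k + 1]) for k in range(K)]
    modulation = [1.0 / (1.0 + math.exp(-z)) for z in v[2 * K:]]
    return offsets, modulation


def modulated_bin_value(samples, m_k):
    """y(k) = sum_j x(p_kj + dp_k) * dm_k / n_k.

    samples: the n_k values x(p_kj + dp_k), already read bilinearly.
    """
    return sum(samples) * m_k / len(samples)


K = 7 * 7
offsets, modulation = split_branch_output([0.0] * (3 * K), K)
print(offsets[0], modulation[0])             # zero-initialized fc: no shift, scalar 0.5
print(modulated_bin_value([2.0, 4.0], 0.5))  # mean 3.0 scaled by 0.5
```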
    • RCNN feature mimicking

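      • for both regular conv and deformable conv, the error-bounded saliency regions turn out to be large
      • although DCNv2 is designed with the ability to mute irrelevant context, in practice it does not manage to
      • this suggests that such a representation cannot be learned well through the standard Faster R-CNN training procedure:
        • put plainly, the supervision is not strong enough
        • additional guidance is needed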
    • feature mimic loss

      • enforced only on positive ROIs: the background class often needs longer-range / wider context

      • architecture

        • add an additional RCNN branch
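        • the R-CNN branch takes cropped images as input, generates 14x14 feature maps, and passes them through two fc layers to get a 1024-d feature
        • cosine similarity is computed against the corresponding counterpart feature in Faster R-CNN
        • I find this part a stretch, so I won't expand on it here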
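
For reference, the cosine-similarity comparison can be sketched as below, assuming the loss takes the form 1 - cos per positive RoI, summed over RoIs. The function names and the toy 2-d features are made up for illustration; the real features would be the 1024-d vectors from the two branches.

```python
import math


def cosine_similarity(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def mimic_loss(frcnn_feats, rcnn_feats):
    """Sum over positive RoIs of 1 - cos(f_RCNN(b), f_FRCNN(b)).

    Both arguments are lists of feature vectors, one per positive RoI,
    in the same order (assumed form of the loss).
    """
    return sum(1.0 - cosine_similarity(f, r)
               for f, r in zip(frcnn_feats, rcnn_feats))


same = mimic_loss([[3.0, 4.0]], [[3.0, 4.0]])        # identical features
orthogonal = mimic_loss([[1.0, 0.0]], [[0.0, 1.0]])  # unrelated features
print(same, orthogonal)
```

The loss depends only on the direction of the features, not their magnitude, so the Faster R-CNN branch is pushed to agree with the teacher on what the RoI content looks like rather than on its scale.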