FCIS | Less is More

Fully Convolutional Instance-aware Semantic Segmentation

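Motivation
- instance segmentation:
  - compared with detection, it needs more precise boundary information for each object
  - compared with semantic segmentation, it needs to tell different object instances apart
- detects and segments simultaneously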
- FCN + instance mask proposal
Arguments
- FCNs do not work for the instance-aware semantic segmentation task
  - convolution is translation invariant: weights are shared, so each pixel gets a single response regardless of its position
- instance segmentation operates on region level
  - the same pixel can have different semantics in different regions
  - a certain degree of translation variance is required
- prevalent method
  - step1: an FCN is applied on the whole image to generate shared feature maps
  - step2: a pooling layer warps each region of interest into fixed-size per-ROI feature maps
  - step3: use fc layers to convert the per-ROI feature maps to per-ROI masks
  - the translation-variant property is introduced in the fc layer(s) in the last step
  - drawbacks
    - the ROI pooling step loses spatial details
    - the fc layers over-parametrize the task
- InstanceFCN
  - position-sensitive score maps
  - sliding windows
  - sub-tasks are separated and the solution is not end-to-end
  - blind to the object categories: it only separates foreground from background
- In this work
  - extends InstanceFCN
  - end-to-end
  - fully convolutional
  - operates on box proposals instead of sliding windows
  - per-ROI computation does not involve any warping or resizing operations
Method
- position-sensitive score map
  - FCN
    - predict a single score map
    - predict each pixel’s likelihood score of belonging to each category
  - at instance level
    - the same pixel can be foreground on one object but background on another
    - a single score map per-category is insufficient to distinguish these two cases
  - a fully convolutional solution for instance mask proposal
    - k x k evenly partitioned cells of object
    - thus obtain k x k position-sensitive score maps
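    - each score represents the likelihood that the current pixel belongs to an object instance at the current relative position (the position of that score map among the k x k cells)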
    - assembling (copy-paste)
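
The copy-paste assembly described above can be sketched in NumPy. This is a minimal illustration, not the FCIS implementation: the function name `assemble_roi`, the `(k*k, H, W)` layout, integer ROI coordinates with exclusive upper bounds, and the row-major map index `i * k + j` are all assumptions made for the example.

```python
import numpy as np

def assemble_roi(score_maps, roi, k):
    """Assemble one ROI from k*k position-sensitive score maps (copy-paste).

    score_maps: array of shape (k*k, H, W); map i*k + j is tied to ROI cell (i, j).
    roi: (x0, y0, x1, y1) in score-map pixels, x1 and y1 exclusive (assumed).
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    out = np.empty((h, w), dtype=score_maps.dtype)
    # Boundaries that split the ROI into k x k roughly even cells.
    ys = np.linspace(0, h, k + 1).round().astype(int)
    xs = np.linspace(0, w, k + 1).round().astype(int)
    for i in range(k):
        for j in range(k):
            cell = (slice(ys[i], ys[i + 1]), slice(xs[j], xs[j + 1]))
            # Crop the ROI from the map that belongs to cell (i, j) ...
            src = score_maps[i * k + j, y0:y1, x0:x1]
            # ... and copy only that cell's pixels; no warping or resizing.
            out[cell] = src[cell]
    return out

# Toy check: map m is filled with the constant m, so each ROI cell shows its map index.
maps = np.stack([np.full((8, 8), m, dtype=np.float32) for m in range(4)])
out = assemble_roi(maps, roi=(2, 2, 6, 6), k=2)
```

Because every pixel is only read from one of the shared maps, two overlapping ROIs can assign different scores to the same image pixel, which is the translation-variant behaviour the notes ask for.
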
- jointly and simultaneously
  - The same set of score maps are shared for the two sub-tasks
  - For each pixel in a ROI, there are two tasks:
    - detection: whether it belongs to an object bounding box
    - segmentation: whether it is inside an object instance’s boundary
    - separate: two 1x1 conv heads
    - fuse: inside and outside
      - high inside score and low outside score: detection+, segmentation+
      - low inside score and high outside score: detection+, segmentation-
      - low inside score and low outside score: detection-, segmentation-
      - detection score
        
        average pooling over all pixels' likelihoods for each class
        
        the category with the max detection score represents the object
      - segmentation
        
        softmax(inside, outside) for each pixel to distinguish fg/bg
  - All the per-ROI components are implemented through convs
    - local weight sharing property: a regularization mechanism
    - without involving any feature warping, resizing or fc layers
    - the per-ROI computation cost is negligible
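
The inside/outside fusion above can be sketched as follows. This is a hedged illustration rather than the authors' code: `fuse_inside_outside` and the `(C+1, h, w)` layout are assumptions for the example, and the pixel-wise max before average pooling follows my reading of the paper, since the notes only mention the pooling step.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_inside_outside(inside, outside):
    """Fuse assembled inside/outside ROI score maps of shape (C+1, h, w).

    Returns (predicted category, per-category detection probabilities,
    per-pixel foreground probability for the predicted category).
    """
    # Detection: a pixel supports the box if either score is high (assumed
    # pixel-wise max), then average over the ROI and softmax across categories.
    pixel_det = np.maximum(inside, outside)
    det_prob = softmax(pixel_det.mean(axis=(1, 2)), axis=0)      # (C+1,)
    # Segmentation: per-pixel softmax over (inside, outside); keep P(inside).
    fg_prob = softmax(np.stack([inside, outside]), axis=0)[0]    # (C+1, h, w)
    cls = int(det_prob.argmax())
    return cls, det_prob, fg_prob[cls]

# Toy ROI: category 1 is "inside" on the left half and "outside" on the right half.
inside = np.zeros((3, 4, 4))
outside = np.zeros((3, 4, 4))
inside[1, :, :2] = 5.0
outside[1, :, 2:] = 5.0
cls, det_prob, fg = fuse_inside_outside(inside, outside)
```

In the toy ROI every pixel of category 1 has one high score, so the whole box counts toward detection, while only the left half is labelled foreground by the per-pixel softmax.
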
- architecture
  - ResNet backbone produces features with 2048 channels
  - a 1x1 conv reduces the dimension to 1024
  - output stride 16: the conv5 stride is decreased from 2 to 1, and the dilation is increased from 1 to 2
  - head1: joint detection confidence & segmentation
    - 1x1 conv, generates $2k^2(C+1)$ score maps
    - 2 for inside/outside
    - $k^2$ for the $k^2$ relative positions
    - $(C+1)$ for the $C$ object categories plus background
  - head2: bbox regression
    - 1x1 conv, $4k^2$ channels
  - RPN to generate ROIs
  - inference
    - 300 ROIs
    - pass them through the bbox regression branch to obtain another 300 ROIs
    - pass all ROIs through the joint head to obtain detection scores and fg masks for all categories
    - mask voting: each ROI (taken with its max-scoring category) only holds that category's foreground; the in-box background from other categories still has to be filled in
      - for the current ROI, find all the ROIs (from the 600) with IoU higher than 0.5
      - their fg masks are averaged per pixel, weighted by their classification scores
  - training
    - positive/negative ROIs: an ROI is positive if its IoU with the nearest ground-truth box is larger than 0.5, otherwise negative
    - loss
      - softmax detection loss over C+1 categories
      - softmax segmentation loss over the gt fg mask, on positive ROIs
      - bbox regression loss, on positive ROIs
    - OHEM: among the 300 proposed ROIs on one image, the 128 ROIs with the highest losses are selected to back-propagate their error gradients
    - RPN:
      - 9 anchors
      - sharing feature between FCIS and RPN
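
The mask voting step from the inference notes above can be sketched in NumPy. This is an illustration under stated assumptions, not the reference implementation: `box_iou` and `mask_vote` are hypothetical helper names, boxes use `(x0, y0, x1, y1)`, and each ROI's foreground mask is assumed to be already pasted onto a full-size `(H, W)` canvas so masks can be averaged pixel by pixel.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, format (x0, y0, x1, y1)."""
    ix0 = np.maximum(box[0], boxes[:, 0])
    iy0 = np.maximum(box[1], boxes[:, 1])
    ix1 = np.minimum(box[2], boxes[:, 2])
    iy1 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(ix1 - ix0, 0, None) * np.clip(iy1 - iy0, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def mask_vote(idx, boxes, scores, masks, iou_thresh=0.5):
    """Score-weighted per-pixel average of the masks whose boxes overlap boxes[idx].

    masks: (N, H, W) foreground probabilities on the image canvas (assumed layout).
    """
    keep = box_iou(boxes[idx], boxes) > iou_thresh   # includes idx itself (IoU = 1)
    w = scores[keep]
    # Weighted sum over the kept ROIs, normalised by the total weight.
    return np.tensordot(w, masks[keep], axes=1) / w.sum()

# Toy example: ROI 1 overlaps ROI 0 (IoU 0.75); ROI 2 is far away and is ignored.
boxes = np.array([[0, 0, 4, 4], [0, 0, 4, 3], [10, 10, 12, 12]], dtype=float)
scores = np.array([0.6, 0.2, 0.9])
masks = np.zeros((3, 16, 16))
masks[0, 0:4, 0:4] = 1.0
masks[2, 10:12, 10:12] = 1.0
voted = mask_vote(0, boxes, scores, masks)
```

The result is a soft mask; turning it into a binary mask (for example by thresholding) is a separate step that this sketch leaves to the caller.
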
Experiments
- metric: mAP
- FCIS (translation invariant):
  - set k=1, achieves the worst mAP
  - indicating that the position-sensitive score maps are vital for this method
- backbone
  - ResNet-50 to ResNet-101: accuracy increases
  - ResNet-101 to ResNet-152: accuracy saturates
- tricks

* r