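[SOLO] SOLO: Segmenting Objects by Locations: from ByteDance. Most current methods obtain instance segmentation results indirectly, either by semantic segmentation inside detected boxes or by clustering a full-image semantic segmentation. The main cause is a formulation issue: it is hard to define instance segmentation as a structured problem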

[SOLOv2] SOLOv2: Dynamic, Faster and Stronger:best 41.7% AP

SOLO: Segmenting Objects by Locations

  1. Motivation

    • challenging:arbitrary number of instances
    • form the task into a classification-solvable problem
    • direct & end-to-end & one-stage & using mask annotations solely
    • on par accuracy with Mask R-CNN
    • outperforming recent single-shot instance segmenters
  2. Arguments

    • formulating
      • Objects in an image belong to a fixed set of semantic categories, so semantic segmentation can be easily formulated as a dense per-pixel classification problem
      • the number of instances varies
    • existing methods
      • detection-based / clustering-based: step-wise and indirect
      • errors accumulate across the steps
    • core idea
      • in most cases two instances in an image either have different center locations or have different object sizes
      • location:
        • think of the image as divided into a grid of cells
        • an object instance is assigned to one of the grid cells as its center location category
        • encode center location categories as the channel axis
      • size
        • FPN
        • assign objects of different sizes to different levels of feature maps
      • SOLO converts coordinate regression into classification by discrete quantization
      • One feat of doing so is the avoidance of heuristic coordination normalization and log-transformation typically used in detectors [??? not sure what this sentence is trying to say]
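
The location idea above (a grid of cells, with the center cell encoded on the channel axis) can be sketched in a few lines. This is only an illustration: the function name `center_to_cell` and the pixel-coordinate convention are assumptions, and the flattening `k = i * S + j` is the row-major indexing assumed here for mapping a cell to a channel.

```python
def center_to_cell(cx, cy, img_w, img_h, S):
    """Quantize an object center (in pixels) into a grid cell and a channel index.

    Hypothetical helper: returns (row i, column j, channel k) for an S x S grid.
    """
    j = min(int(cx / img_w * S), S - 1)  # column index from x, clamped to the grid
    i = min(int(cy / img_h * S), S - 1)  # row index from y, clamped to the grid
    k = i * S + j                        # "center location category" on the channel axis
    return i, j, k


print(center_to_cell(cx=320, cy=100, img_w=640, img_h=480, S=5))  # (1, 2, 7)
```

The continuous center becomes one of `S * S` discrete classes, which is what lets the location be handled by classification instead of regression.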
  3. Method

    • problem formulation

      • divided grids
      • simultaneous task

        • category-aware prediction
        • instance-aware mask generation

      • category prediction

        • predict the category of the instance in each grid cell: output shape $S \times S \times C$
        • grid size: $S \times S$
        • number of classes:$C$
        • based on the assumption that each cell must belong to one individual instance
        • C-dim vec indicates the class probability for each object instance in each grid
      • mask prediction
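        • predict an instance mask for each positive cell: output shape $H \times W \times S^2$
        • the $k$-th channel is responsible for the grid cell $(i, j)$, where $k = i \cdot S + j$
        • position sensitive: the mask segmented for each grid cell must map to its own channel, so the feature map should be spatially variant
          • the most direct way to make the feature map spatially variant is to add spatially variant information as extra channels
          • inspired by CoordConv: add two channels, normed_x and normed_y, with values in $[-1, 1]$
          • original feature tensor $H \times W \times D$ becomes $H \times W \times (D+2)$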
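
A minimal sketch of the CoordConv-style step, using plain nested lists so it runs without any tensor library (a real implementation would concatenate tensors instead). The helper names `normalized_coords` and `add_coord_channels` are hypothetical.

```python
def normalized_coords(n):
    """n evenly spaced values from -1 to 1 (a single position maps to 0)."""
    if n == 1:
        return [0.0]
    return [-1.0 + 2.0 * t / (n - 1) for t in range(n)]


def add_coord_channels(feat):
    """feat: nested list of shape H x W x D. Returns a new H x W x (D+2) list
    with normed_x and normed_y appended to every position."""
    H, W = len(feat), len(feat[0])
    xs, ys = normalized_coords(W), normalized_coords(H)
    return [[feat[r][c] + [xs[c], ys[r]] for c in range(W)] for r in range(H)]


feat = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]  # H=2, W=3, D=4
out = add_coord_channels(feat)
print(len(out), len(out[0]), len(out[0][0]))  # 2 3 6
print(out[0][0][-2:], out[1][2][-2:])         # [-1.0, -1.0] [1.0, 1.0]
```

The two extra channels differ at every position, which is what makes the following convolutions spatially variant.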
      • final results
        • gather category prediction & mask prediction
        • NMS
    • network

      • backbone:resnet
      • FPN: 256-d
      • heads:weights are shared across different levels except for the last 1x1 conv

    • learning

      • positive grid cell: a cell that falls into the center region of a ground-truth mask
        • mask: mask center $(c_x, c_y)$, mask size $(h, w)$
        • center region: $(c_x, c_y, \epsilon w, \epsilon h)$, with $\epsilon = 0.2$
      • loss: $L = L_{cate} + \lambda L_{mask}$
        • cate loss: focal loss
        • seg loss: Dice loss, $L_{mask} = \frac{1}{N_{pos}} \sum_k \mathbb{1}_{\{p^*_{i,j} > 0\}} \, \mathrm{dice}(m_k, m^*_k)$, where starred symbols denote the ground truth
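
The two learning ingredients above can be sketched as follows. This is a simplified illustration, not the reference implementation: `positive_cells` marks every cell touched by the center region (further restrictions are possible), masks are flat lists of values, and the Dice form $1 - \frac{2\sum pq}{\sum p^2 + \sum q^2}$ with a small `smooth` term is an assumption for numerical safety. All function names are hypothetical.

```python
def positive_cells(cx, cy, w, h, img_w, img_h, S, eps=0.2):
    """Grid cells (i, j) overlapped by the center region (cx, cy, eps*w, eps*h)."""
    half_w, half_h = 0.5 * eps * w, 0.5 * eps * h
    j0 = max(0, int((cx - half_w) / img_w * S))
    j1 = min(S - 1, int((cx + half_w) / img_w * S))
    i0 = max(0, int((cy - half_h) / img_h * S))
    i1 = min(S - 1, int((cy + half_h) / img_h * S))
    return [(i, j) for i in range(i0, i1 + 1) for j in range(j0, j1 + 1)]


def dice_loss(pred, target, smooth=1e-6):
    """Dice loss between a soft mask and a ground-truth mask (flat lists)."""
    inter = sum(p * q for p, q in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(q * q for q in target)
    return 1.0 - 2.0 * inter / (denom + smooth)


def mask_loss(preds, targets):
    """Average Dice loss over the positive cells only (N_pos = len(preds))."""
    return sum(dice_loss(p, t) for p, t in zip(preds, targets)) / max(len(preds), 1)


print(positive_cells(320, 240, 200, 100, 640, 480, S=8))
# [(3, 3), (3, 4), (4, 3), (4, 4)]
print(dice_loss([1, 1, 0, 0], [0, 0, 1, 1]))  # 1.0 (no overlap)
```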
    • inference

      • use a confidence threshold of 0.1 to filter out predictions with low confidence

      • use a threshold of 0.5 to binarize the soft masks

      • select the top 500 scoring masks

      • NMS

        • Only one instance will be activated at each grid
        • and one instance may be predicted by multiple adjacent mask channels

      • keep top 100
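
The inference steps listed above, as a rough sketch on flat-list masks. The greedy mask NMS and its `iou_thr=0.5` are assumptions made for illustration (the notes do not give the NMS threshold), and `solo_postprocess` is a hypothetical name.

```python
def mask_iou(a, b):
    """IoU of two binary masks given as flat lists of 0/1."""
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0


def solo_postprocess(scores, soft_masks, score_thr=0.1, mask_thr=0.5,
                     pre_nms_top=500, iou_thr=0.5, max_keep=100):
    # 1) drop low-confidence cells, 2) binarize the surviving soft masks
    cands = [(s, [int(v > mask_thr) for v in m])
             for s, m in zip(scores, soft_masks) if s > score_thr]
    # 3) keep the top-scoring candidates before NMS
    cands.sort(key=lambda c: c[0], reverse=True)
    cands = cands[:pre_nms_top]
    # 4) greedy mask NMS: adjacent cells often predict the same instance
    kept = []
    for s, m in cands:
        if all(mask_iou(m, km) <= iou_thr for _, km in kept):
            kept.append((s, m))
    # 5) final cap on the number of detections
    return kept[:max_keep]


scores = [0.9, 0.8, 0.05, 0.6]
soft = [[0.9, 0.8, 0.1, 0.2], [0.7, 0.9, 0.6, 0.1],
        [0.9, 0.9, 0.9, 0.9], [0.1, 0.2, 0.1, 0.9]]
print(solo_postprocess(scores, soft))
# [(0.9, [1, 1, 0, 0]), (0.6, [0, 0, 0, 1])]
```

In the example, the second mask overlaps the first with IoU 2/3 and is suppressed, and the third is dropped by the 0.1 score threshold.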

  4. Experiments

    • grid number

      • moderately increasing the grid number helps, but the main gain comes from FPN
    • fpn

      • five FPN pyramid levels
      • large feature maps have small receptive fields and are assigned small objects, so their grid number should be larger

    • feature alignment

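      • in the category branch, the $H \times W$ feature map must be converted to an $S \times S$ feature map
        • interpolation: bilinear interpolation
        • adaptive-pool: apply a 2D adaptive max-pool
        • region-grid interpolation: for each cell, sample several points with bilinear interpolation, then average them
      • there is no noticeable performance gap between these variants
      • (possibly because the final task is classification)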
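
As an illustration of the adaptive-pool variant, here is a single-channel sketch that reduces an $H \times W$ map to $S \times S$. The floor/ceil bin boundaries are one common convention for adaptive pooling and are an assumption here; `adaptive_max_pool2d` is a hypothetical pure-Python helper, not a library call.

```python
def adaptive_max_pool2d(feat, S):
    """feat: H x W nested list (one channel). Returns an S x S nested list.

    Bin i covers rows floor(i*H/S) .. ceil((i+1)*H/S) - 1 (same rule for columns),
    so bins may overlap when H or W is not divisible by S.
    """
    H, W = len(feat), len(feat[0])
    out = []
    for i in range(S):
        r0, r1 = (i * H) // S, ((i + 1) * H + S - 1) // S
        row = []
        for j in range(S):
            c0, c1 = (j * W) // S, ((j + 1) * W + S - 1) // S
            row.append(max(feat[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        out.append(row)
    return out


feat = [[r * 6 + c for c in range(6)] for r in range(4)]  # H=4, W=6
print(adaptive_max_pool2d(feat, 2))  # [[8, 11], [20, 23]]
```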
    • head depth

      • increasing the head depth from 4 to 7 improves AP
      • so the paper uses 7
  5. decoupled SOLO

    • the mask branch predicts $S^2$ channels, most of which contribute nothing and only waste memory

    • prediction is somewhat redundant as in most cases the objects are located sparsely in the image

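    • the $S^2$-channel output is replaced by two $S$-channel tensors $X$ and $Y$; the mask for grid cell $(i, j)$ is the element-wise multiplication of channel maps $x_j$ and $y_i$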

    • experimental results

      • achieves the same performance
      • efficient and equivalent variant
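
A small sketch of the decoupled head's element-wise multiplication, under the assumption that the branch outputs two groups of `S` channel maps (`X` indexed by column `j`, `Y` indexed by row `i`) and that a sigmoid is applied before multiplying. Masks are flat lists of logits and `decoupled_mask` is a hypothetical name.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decoupled_mask(X, Y, i, j):
    """Soft mask for grid cell (i, j): sigmoid(x_j) * sigmoid(y_i), element-wise.

    X, Y: lists of S channel maps, each a flat list of H*W logits.
    """
    return [sigmoid(a) * sigmoid(b) for a, b in zip(X[j], Y[i])]


S = 2
X = [[10.0, -10.0], [-10.0, 10.0]]  # S maps over a 2-pixel image
Y = [[10.0, 10.0], [-10.0, -10.0]]
m = decoupled_mask(X, Y, i=0, j=1)
print([round(v, 3) for v in m])      # [0.0, 1.0]
print(S * S, "channels vs", 2 * S)   # 4 channels vs 4
```

With `S = 40` the same comparison is 1600 channels versus 80, which is where the memory saving comes from.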

SOLOv2: Dynamic, Faster and Stronger

  1. Motivation

    • take one step further on the mask head
      • dynamically learning the mask head
      • decoupled into mask kernel branch and mask feature branch
    • propose Matrix NMS
      • faster & better results
    • try object detection and panoptic segmentation
  2. Arguments

    • instance segmentation
      • requires instance-level and pixel-level predictions simultaneously
      • most existing instance segmentation methods build on the top of bounding boxes
      • SOLO develops a pure (box-free) instance segmentation method
    • SOLOv2 improves on SOLO
      • mask learning:dynamic scheme
      • mask NMS:parallel matrix operations,outperforms Fast NMS
    • Dynamic Convolutions
      • STN:adaptively transform feature maps conditioned on the input
      • Deformable Convolutional Networks:learn location
  3. Method

    • revisit SOLOv1

      • redundant mask prediction
      • decouple
      • dynamic: dynamically pick the valid ones from the predicted $S^2$ classifiers and perform the convolution

    • SOLOv2

      • dynamic mask segmentation head

        • mask kernel branch
        • mask feature branch
      • mask kernel branch

        • prediction heads:4 convs + 1 final conv,shared across scale
        • no activation on the output
        • concat normalized coordinates in two additional input channels at start
        • outputs a $D$-dimensional kernel weight vector for each grid cell: e.g. for a 3x3 conv with $E$ input channels, $D = 9E$ and the output shape is $S \times S \times 9E$
      • mask feature branch

        • predict instance-aware feature: $F \in \mathbb{R}^{H \times W \times E}$

        • unified and high-resolution mask feature: outputs a feature map at a single scale; the 1/32-scale feature is encoded with coordinate information

          • we feed normalized pixel coordinates to the deepest FPN level (at 1/32 scale)
          • repeated 【3x3 conv, group norm, ReLU, 2x bilinear upsampling】
          • element-wise sum
          • last layer:1x1 conv, group norm, ReLU

      • instance mask

        • the mask feature $F$ is convolved with the kernels predicted by the mask kernel branch: final conv output $H \times W \times S^2$
        • mask NMS
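
A minimal sketch of the dynamic mask head, restricted to the 1x1 kernel case ($D = E$) to keep it short; this restriction, the flat-list layout, and the name `dynamic_masks` are assumptions for illustration. Only kernels of grid cells that survive the category threshold are applied.

```python
import math


def dynamic_masks(F, G, keep):
    """Apply predicted 1x1 kernels to the shared mask feature.

    F:    list of H*W feature vectors, each of length E (mask feature branch)
    G:    list of S*S kernel vectors, each of length E (mask kernel branch)
    keep: indices of grid cells whose kernels should be applied
    Returns {cell index: soft mask as a flat list of H*W values in (0, 1)}.
    """
    masks = {}
    for k in keep:
        g = G[k]
        logits = [sum(f_e * g_e for f_e, g_e in zip(f, g)) for f in F]  # 1x1 conv
        masks[k] = [1.0 / (1.0 + math.exp(-z)) for z in logits]         # sigmoid
    return masks


F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                   # 3 pixels, E=2
G = [[5.0, -5.0], [-5.0, 5.0], [0.0, 0.0], [5.0, 5.0]]     # S=2, so 4 kernels
out = dynamic_masks(F, G, keep=[0, 3])
print(sorted(out))                        # [0, 3]
print([round(v, 2) for v in out[0]])      # [0.99, 0.01, 0.5]
```

Because the kernels are picked per image, the number of mask channels actually computed equals the number of kept cells rather than a fixed $S^2$.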
      • train

        • loss: $L = L_{cate} + \lambda L_{mask}$
          • cate loss: focal loss
          • seg loss: Dice loss, $L_{mask} = \frac{1}{N_{pos}} \sum_k \mathbb{1}_{\{p^*_{i,j} > 0\}} \, \mathrm{dice}(m_k, m^*_k)$, where starred symbols denote the ground truth
      • inference

        • category score:first use a confidence threshold of 0.1 to filter out predictions with low confidence
        • mask branch:run convolution based on the filtered category map
        • sigmoid
        • use a threshold of 0.5 to convert predicted soft masks to binary masks
        • Matrix NMS
      • Matrix NMS

        • decremented functions
          • linear: $f(iou_{i,j}) = 1 - iou_{i,j}$
          • gaussian: $f(iou_{i,j}) = \exp(-\frac{iou_{i,j}^2}{\sigma})$
        • the most overlapped prediction for $m_i$: the largest IoU with any higher-scoring mask
          • $f(iou_{*,i}) = \min_{\forall s_k > s_i} f(iou_{k,i})$
        • decay factor
          • $decay_j = \min_{\forall s_i > s_j} \frac{f(iou_{i,j})}{f(iou_{*,i})}$