[SOLO] SOLO: Segmenting Objects by Locations:字节,目前绝大多数方法实例分割的结构都是间接得到——检测框内语义分割/全图语义分割聚类,主要原因是formulation issue,很难把实例分割定义成一个结构化的问题
[SOLOv2] SOLOv2: Dynamic, Faster and Stronger:best 41.7% AP
SOLO: Segmenting Objects by Locations
- 动机 - challenging:arbitrary number of instances
- form the task into a classification-solvable problem
- direct & end-to-end & one-stage & using mask annotations solely
- on par accuracy with Mask R-CNN
- outperforming recent single-shot instance segmenters
 
- 论点 - formulating- Objects in an image belong to a fixed set of semantic categories——semantic segmentation can be easily formulated as a dense per-pixel classification problem
- the number of instances varies
 
- existing methods- 检测/聚类:step-wise and indirect
- 累积误差
 
- core idea - in most cases two instances in an image either have different center locations or have different object sizes
- location:- think image as a divided grid of cells
- an object instance is assigned to one of the grid cells as its center location category
- encode center location categories as the channel axis
 
- size- FPN
- assign objects of different sizes to different levels of feature maps
 
- SOLO converts coordinate regression into classification by discrete quantization
- One feat of doing so is the avoidance of heuristic coordination normalization and log-transformation typically used in detectors【???不懂这句话想表达啥】
 
 
- formulating
- 方法 - problem formulation - divided grids
- simultaneous task - category-aware prediction
- instance-aware mask generation
  
- category prediction - predict instance for each grid:$SSC$
- grid size:$S*S$
- number of classes:$C$
- based on the assumption that each cell must belong to one individual instance
- C-dim vec indicates the class probability for each object instance in each grid
 
- mask prediction- predict instance mask for each positive cell:$HWS^2$
- the channel corresponding to the location
- position sensitive:因为每个grid中分割的mask是要映射到对应的channel的,因此我们希望特征图是spatially variant- 让特征图spatially variant的最直接办法就是加一维spatially variant的信息
- inspired by CoordConv:添加两个通道,normed_x和normed_y,[-1,1]
- original feature tensor $HWD$ becomes $HW(D+2)$
 
 
- final results- gather category prediction & mask prediction
- NMS
 
 
- network - backbone:resnet
- FCN:256-d
- heads:weights are shared across different levels except for the last 1x1 conv  
 
- learning - positive grid:falls into a center region- mask:mask center $(c_x, c_y)$,mask size $(h,w)$
- center region:$(c_x,c_y,\epsilon w, \epsilon h)$,set $\epsilon = 0.2$
 
- loss:$L = L_{cate} + \lambda L_{seg}$- cate loss:focal loss
- seg loss:dice,$L_{mask} = \frac{1}{N_{pos}}\sum_k 1_{p^_{i,j}>0} dice(m_k, m^_k) $,带星号的是groud truth
 
 
- positive grid:falls into a center region
- inference - use a confidence threshold of 0.1 to filter out low spacial predictions 
- use a threshold of 0.5 to binary the soft masks 
- select the top 500 scoring masks 
- NMS - Only one instance will be activated at each grid
- and one in- stance may be predicted by multiple adjacent mask channels  
 
- keep top 100 
 
 
- 实验 - grid number - 适当增加有提升,主要提升还是在FPN
 
 
- fpn - 五个FPN pyramids
- 大特征图,小感受野,用来分配小目标,grid数量要增大 
 
 
- feature alignment - 在分类branch,$HW$特征图要转换成$SS$的特征图- interpolation:bilinear interpolating
- adaptive-pool:apply a 2D adaptive max-pool
- region-grid- interpolation:对每个cell,采样多个点做双线性插值,然后取平均
 
- is no noticeable performance gap between these variants
- (可能因为最终是分类任务
 
- 在分类branch,$HW$特征图要转换成$SS$的特征图
- head depth - 4-7有涨点
- 所以本文选了7
 
 
 
- decoupled SOLO - mask branch预测的channel数是$S^2$,其中大部分channel其实是没有贡献的,空占内存 
- prediction is somewhat redundant as in most cases the objects are located sparsely in the image 
- element-wise multiplication  
- 实验下来 - achieves the same performance
- efficient and equivalent variant
 
 
SOLOv2: Dynamic, Faster and Stronger
- 动机 - take one step further on the mask head- dynamically learning the mask head
- decoupled into mask kernel branch and mask feature branch
 
- propose Matrix NMS- faster & better results
 
- try object detection and panoptic segmentation
 
- take one step further on the mask head
- 论点 - SOLO develop pure instance segmentation
- instance segmentation - requires instance-level and pixel-level predictions simultaneously
- most existing instance segmentation methods build on the top of bounding boxes
- SOLO develop pure instance segmentation
 
- SOLOv2 improve SOLO- mask learning:dynamic scheme
- mask NMS:parallel matrix operations,outperforms Fast NMS
 
- Dynamic Convolutions - STN:adaptively transform feature maps conditioned on the input
- Deformable Convolutional Networks:learn location
 
 
- 方法 - revisit SOLOv1 - redundant mask prediction
- decouple
- dynamic:dynamically pick the valid ones from predicted $s^2$ classifiers and perform the convolution  
 
- SOLOv2 - dynamic mask segmentation head - mask kernel branch
- mask feature branch
 
- mask kernel branch - prediction heads:4 convs + 1 final conv,shared across scale
- no activation on the output
- concat normalized coordinates in two additional input channels at start
- ouputs D-dims kernel weights for each grid:e.g. for 3x3 conv with E input channels, outputs $SS9E$
 
- mask feature branch - predict instance-aware feature:$F \in R^{HWE}$ 
- unified and high-resolution mask feature:只输出一个尺度的特征图,encoded x32 feature with coordinates info - we feed normalized pixel coordinates to the deepest FPN level (at 1/32 scale)
- repeated 【3x3 conv, group norm, ReLU, 2x bilinear upsampling】
- element-wise sum
- last layer:1x1 conv, group norm, ReLU  
 
 
- instance mask - mask feature branch conved by the mask kernel branch:final conv $HWS^2$
- mask NMS
 
- train - loss:$L = L_{cate} + \lambda L_{seg}$- cate loss:focal loss
- seg loss:dice,$L_{mask} = \frac{1}{N_{pos}}\sum_k 1_{p^_{i,j}>0} dice(m_k, m^_k) $,带星号的是groud truth
 
 
- loss:$L = L_{cate} + \lambda L_{seg}$
- inference - category score:first use a confidence threshold of 0.1 to filter out predictions with low confidence
- mask branch:run convolution based on the filtered category map
- sigmoid
- use a threshold of 0.5 to convert predicted soft masks to binary masks
- Matrix NMS
 
- Matrix NMS - decremented functions  - linear:$f(iou_{i,j}=1-iou_{i,j})$
- gaussian:$f(iou_{i,j}=exp(-\frac{iou_{i,j}^2}{\sigma})$
 
- the most overlapped prediction for $m_i$:max iou- $f(iou_{*,i}) = min_{s_k}f(iou_{k,i})$
 
- decay factor  - $decay_i = min \frac{f(iou_{i,j})}{f(iou_{*,i})}$
 
 
 
- decremented functions