FCOS: Fully Convolutional One-Stage Object Detection
Motivation
- anchor free
- proposal free
- avoids the complicated computation related to anchor boxes
- e.g. calculating overlaps (IoU) during training
- avoid all hyper-parameters related to anchor boxes
- size & shape
- positive/ignored/negative
- leverage as many foreground samples as possible
Arguments
anchor-based detectors
- detection performance is sensitive to anchor settings
- encounter difficulties in cases with large shape variations
- hamper the generalization ability of detectors
- dense proposals: the excessive number of negative samples aggravates the class imbalance between positives and negatives during training
- involve complicated computation, such as calculating IoU with ground-truth boxes
FCN-based detector
- predict a 4D vector plus a class category at each spatial location on a level of feature maps
- do not work well when applied to overlapped bounding boxes
with FPN this ambiguity can be largely eliminated
anchor-free detector
- YOLOv1: only points near the object center are used to predict boxes, leading to low recall
- CornerNet:complicated post-processing to match the pairs of corners
- DenseBox:difficulty in handling overlapping bounding boxes
this method (FCOS)
- use FPN to deal with ambiguity
- dense prediction: use all points inside a ground-truth bounding box to predict that box
- introduce “center-ness” branch to predict the deviation of a pixel to the center of its corresponding bounding box
- can be used as a RPN in two-stage detectors and can achieve significantly better performance
Method
ground-truth boxes: $B_i=(x_0, y_0, x_1, y_1, c)$, i.e. the top-left and bottom-right corners plus the class label
anchor-free: each location $(x, y)$ on a feature map with stride $s$ maps back to the input image as $(\lfloor s/2 \rfloor + xs, \lfloor s/2 \rfloor + ys)$
positive sample: any location that falls into a ground-truth box
ambiguous sample: a location that falls into multiple gt boxes; choose the box with minimal area
regression target: the distances $(l^*, t^*, r^*, b^*)$ from the location to the four sides of the box, i.e. $l^* = x - x_0$, $t^* = y - y_0$, $r^* = x_1 - x$, $b^* = y_1 - y$ (see the sketch below)
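A minimal PyTorch sketch of this target assignment, not taken from the official code; the function name, tensor shapes, and the (h, w, stride, gt_boxes) interface are assumptions for illustration:

```python
import torch

def fcos_location_targets(h, w, stride, gt_boxes):
    """Per-location (l*, t*, r*, b*) targets on one h x w feature map.

    gt_boxes: (M, 4) tensor of (x0, y0, x1, y1) in input-image coordinates.
    Returns: locations (h*w, 2), reg targets (h*w, M, 4), and an inside-mask (h*w, M).
    """
    # map each feature-map cell back to the input image: (floor(s/2) + x*s, floor(s/2) + y*s)
    ys = torch.arange(h, dtype=torch.float32) * stride + stride // 2
    xs = torch.arange(w, dtype=torch.float32) * stride + stride // 2
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    locations = torch.stack([xx.reshape(-1), yy.reshape(-1)], dim=1)   # (h*w, 2)

    x, y = locations[:, 0:1], locations[:, 1:2]                        # (h*w, 1) each
    l = x - gt_boxes[:, 0]            # distance to the left side,  (h*w, M)
    t = y - gt_boxes[:, 1]            # distance to the top side
    r = gt_boxes[:, 2] - x            # distance to the right side
    b = gt_boxes[:, 3] - y            # distance to the bottom side
    reg_targets = torch.stack([l, t, r, b], dim=2)                     # (h*w, M, 4)

    # a location is a positive sample for box j if it falls inside box j
    inside = reg_targets.min(dim=2).values > 0
    return locations, reg_targets, inside
```

When `inside` is true for several boxes at one location (the ambiguous case), the note above picks the box with minimal area; that tie-break is omitted here for brevity.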
cls branch
- C binary classifiers
- C-dims vector p
- focal loss
- $\frac{1}{N_{pos}} \sum_{x,y}L_{cls}(p_{x,y}, c_{x,y}^*)$
- computed over all locations, i.e. both positive and negative samples (focal-loss sketch below)
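A hedged sketch of this classification term: C sigmoid-based binary classifiers trained with focal loss, normalized by the number of positive locations. The alpha/gamma values (0.25, 2.0) are the usual focal-loss defaults, assumed rather than quoted from this note:

```python
import torch
import torch.nn.functional as F

def fcos_cls_loss(cls_logits, cls_targets, num_pos, alpha=0.25, gamma=2.0):
    """cls_logits: (N, C) raw scores over all locations (positives and negatives).
    cls_targets: (N, C) float one-hot labels; all-zero rows are background locations.
    """
    p = torch.sigmoid(cls_logits)
    ce = F.binary_cross_entropy_with_logits(cls_logits, cls_targets, reduction="none")
    p_t = p * cls_targets + (1 - p) * (1 - cls_targets)              # prob of the true label
    alpha_t = alpha * cls_targets + (1 - alpha) * (1 - cls_targets)
    loss = alpha_t * (1 - p_t) ** gamma * ce
    return loss.sum() / max(num_pos, 1)                              # normalized by N_pos
```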
box reg branch
- 4-dims vector t
- IOU loss
- $\frac{1}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c_{x,y}^* > 0\}} L_{reg}(t_{x,y}, t_{x,y}^*)$
- computed on positive samples only (IoU-loss sketch below)
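A sketch of the regression term as an IoU loss computed directly from the (l, t, r, b) distances of positive locations, so boxes never need to be decoded during training; the plain $-\log(\mathrm{IoU})$ form is assumed here:

```python
import torch

def iou_loss_ltrb(pred, target, eps=1e-7):
    """pred, target: (P, 4) tensors of (l, t, r, b) distances for positive locations."""
    pred_area = (pred[:, 0] + pred[:, 2]) * (pred[:, 1] + pred[:, 3])
    target_area = (target[:, 0] + target[:, 2]) * (target[:, 1] + target[:, 3])
    # intersection of two boxes predicted from the same location
    inter_w = torch.min(pred[:, 0], target[:, 0]) + torch.min(pred[:, 2], target[:, 2])
    inter_h = torch.min(pred[:, 1], target[:, 1]) + torch.min(pred[:, 3], target[:, 3])
    inter = inter_w * inter_h
    union = pred_area + target_area - inter
    iou = inter / union.clamp(min=eps)
    return -torch.log(iou.clamp(min=eps)).mean()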
inference
choose locations with $p_{x,y} > 0.05$ as positive samples and decode their $(l, t, r, b)$ predictions into boxes (sketch below)
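A sketch of the per-level decoding at inference; `locations` are the mapped-back centers from the earlier sketch, and the other tensor names are made up. The 0.05 threshold comes from the note; center-ness re-weighting (described later) and NMS would follow:

```python
import torch

def decode_boxes(locations, ltrb, scores, score_thresh=0.05):
    """locations: (N, 2) mapped-back (x, y); ltrb: (N, 4) predicted distances; scores: (N, C)."""
    cls_scores, labels = scores.max(dim=1)
    keep = cls_scores > score_thresh                           # locations kept as positives
    x, y = locations[keep].unbind(dim=1)
    l, t, r, b = ltrb[keep].unbind(dim=1)
    boxes = torch.stack([x - l, y - t, x + r, y + b], dim=1)   # (x0, y0, x1, y1)
    return boxes, cls_scores[keep], labels[keep]
```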
two possible issues
- a large stride could make the best possible recall (BPR) low, which is actually not a problem in FCOS
- overlapping gt boxes cause ambiguity, which can be greatly resolved with multi-level prediction
FPN
- P3, P4, P5: produced by 1x1 conv on C3, C4, C5 with top-down connections
- P6, P7: produced by stride-2 conv on P5 and P6
limit the range of bbox regression for each level
- $m_i$:maximum distance for each level
- if a location’s regression target satisfies $\max(l^*, t^*, r^*, b^*) > m_i$ or $\max(l^*, t^*, r^*, b^*) < m_{i-1}$, it is set as a negative sample and does not regress a box at the current level
- objects of different sizes are assigned to different feature levels, which largely alleviates part of the box-overlapping problem
for the remaining overlapping cases: simply choose the gt box with minimal area (level-assignment sketch below)
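A sketch of the per-level range filter; the thresholds $(0, 64, 128, 256, 512, \infty)$ are the $m_i$ values reported in the paper, while the function and variable names are assumptions:

```python
import torch

# (m_{i-1}, m_i) regression ranges for P3..P7
LEVEL_RANGES = [(0, 64), (64, 128), (128, 256), (256, 512), (512, float("inf"))]

def allowed_at_level(reg_targets, level):
    """reg_targets: (N, 4) of (l*, t*, r*, b*) for locations on feature level `level` (0 = P3).
    Returns a mask of locations allowed to regress at this level; the rest become negatives."""
    lo, hi = LEVEL_RANGES[level]
    max_dist = reg_targets.max(dim=1).values
    return (max_dist >= lo) & (max_dist <= hi)
```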
sharing heads between different feature levels
to regress a different size range at each level while sharing the heads: apply $\exp(s_i x)$ to the regression output
- trainable scalar $s_i$
- slightly improves the detection performance
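A sketch of the per-level trainable scalar, assuming a small PyTorch module wrapped around the shared regression head's raw output; the class name is made up:

```python
import torch
import torch.nn as nn

class ScaleExp(nn.Module):
    """One trainable scalar s_i per feature level; the shared head's raw output x
    is mapped to exp(s_i * x) so each level can cover a different (positive) size range."""
    def __init__(self, init_value=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(float(init_value)))

    def forward(self, x):
        return torch.exp(self.scale * x)
```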
center-ness
low-quality predicted bounding boxes are produced by locations far away from the center of an object
predict the “center-ness” of a location
normalized distance: $\text{centerness}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}$
sqrt is used to slow down the decay away from the center
the target lies in $[0, 1]$, so it is trained with BCE loss
at inference, center-ness is multiplied with the classification score: this down-weights the scores of bounding boxes far from the center of an object, which are then filtered out by the final NMS
- an alternative to center-ness: use only the central portion of the ground-truth bounding box as positive samples; experiments show that combining the two methods works best (sketch below)
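A sketch of the center-ness target and its use at inference, following the definition above; tensor names are assumptions:

```python
import torch

def centerness_target(reg_targets, eps=1e-7):
    """reg_targets: (P, 4) of (l*, t*, r*, b*) for positive locations; returns values in [0, 1]."""
    l, t, r, b = reg_targets.unbind(dim=1)
    lr = torch.min(l, r) / torch.max(l, r).clamp(min=eps)
    tb = torch.min(t, b) / torch.max(t, b).clamp(min=eps)
    return torch.sqrt(lr * tb)   # sqrt slows down the decay away from the center

# at inference (hypothetical tensor names): down-weight off-center boxes before NMS
# final_scores = cls_scores * centerness_logits.sigmoid()
```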
architecture
- two minor differences from the standard RetinaNet
- use Group Normalization in the newly added convolutional layers except for the last prediction layers
- use P5 instead of C5 to produce P6&P7