RetinaNet

[det] RetinaNet: Focal Loss for Dense Object Detection
[det+instance seg] RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free
[det+semantic seg] Retina U-Net: Embarrassingly Simple Exploitation of Segmentation Supervision for Medical Object Detection

Focal Loss for Dense Object Detection

Motivation
- dense prediction(one-stage detector)
- focal loss：address the class imbalance problem
- RetinaNet：design and train a simple dense detector
Arguments
- one-stage detectors have trailed two-stage detectors in accuracy
  - two-stage：classifier is applied to a sparse set of candidate object locations
  - one-stage：dense sampling of possible object locations
  - the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause
- loss
  - standard cross entropy loss：reshaped so that it down-weights the loss assigned to well-classified examples
  - proposed focal loss：the reshaped loss, which focuses training on a sparse set of hard examples
- two-stage framework (R-CNN family)
  - proposal-driven
  - the first stage generates a sparse set of candidate object locations
  - the second stage classifies each candidate location as one of the foreground classes or as background
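  - class imbalance：stage 1 filters out most of the background; stage 2 training enforces a fixed foreground-to-background sample ratio, combined with online hard example mining (OHEM)
  - faster：reducing input image resolution and the number of proposals
  - even faster：one-stage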
- one-stage detectors
  - One stage detectors are applied over a regular, dense sampling of object locations, scales, and aspect ratios
  - dense：regular sampling (in contrast to selection), based on grids, anchors, and multiple scales
  - the training procedure is still dominated by easily classified background examples
  - class imbalance：usually mitigated with bootstrapping and hard example mining
- Object Detectors
  - Classic：sliding-window + HOG-based classifier, dense prediction
  - Two-stage：Selective Search + CNN-based classifier; later the RPN shares the network with the classifier
  - One-stage：‘anchors’ introduced by RPN, FPN
- loss
  - Huber loss：down-weighting the loss of outliers (hard examples)
  - focal loss：down-weighting inliers (easy examples)
Method
- focal loss
  - CE：$CE(p_t)=-log(p_t)$
    - even examples that are easily classified ($p_t \gg 0.5$) incur a loss with non-trivial magnitude
    - summed CE loss over a large number of easy examples can overwhelm the rare class
  - WCE：$WCE(p_t)=-\alpha_t log(p_t)$
    - balances the importance of positive/negative examples
    - does not differentiate between easy/hard examples
  - FL：$FL(p_t)=-\alpha_t(1-p_t)^\gamma log(p_t)$
    - as $\gamma$ increases, the effect of the modulating factor $(1-p_t)^\gamma$ is likewise increased, i.e. easy examples are down-weighted more strongly
    - $\gamma=2$ works best in the paper's experiments
  - two-stage detectors usually do not use WCE or FL
    - the cascade stage filters out most easy negatives
    - second-stage training uses biased minibatch sampling
    - Online Hard Example Mining (OHEM)
      - construct minibatches using high-loss examples
      - scored by loss + nms
      - completely discards easy examples
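
The CE / WCE / FL formulas above can be sketched in a few lines of plain Python. This is only a minimal per-example illustration, assuming a binary (sigmoid) output per class; the function names `focal_loss` and `cross_entropy` are made up for this note.

```python
import math


def cross_entropy(p, y):
    """Plain CE for one prediction: -log(p_t)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p: predicted foreground probability in (0, 1)
    y: label, 1 (foreground) or 0 (background)
    """
    p_t = p if y == 1 else 1.0 - p               # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (p_t = 0.9) versus a hard positive (p_t = 0.1).
for p in (0.9, 0.1):
    print(p, cross_entropy(p, 1), focal_loss(p, 1))
```

With $\gamma=2$ the modulating factor for $p_t=0.9$ is $0.1^2=0.01$, so that example contributes 100x less than under CE (before the $\alpha_t$ weight), while the hard example at $p_t=0.1$ keeps a factor of $0.81$.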
- RetinaNet
  - compose：backbone network + two task-specific subnetworks
  - backbone：convolutional feature map over the entire input image
  - subnet1：object classification
  - subnet2：bounding box regression
  - ResNet-FPN backbone
    - rich, multi-scale feature pyramid; FPN is also used with the RPN in two-stage detectors
    - each level can be used for detecting objects at a different scale
    - P3 - P7：8x - 128x downsampling
    - FPN channels：256
  - anchors
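    - anchor ratios：{1:2, 1:1, 2:1}, the aspect ratios
    - anchor scales：{$2^0$, $2^\frac{1}{3}$, $2^\frac{2}{3}$}, the sizes; anchors of the same scale share the same area (size*size), and width and height follow from the ratio
    - anchor size per level：[32, 64, 128, 256, 512], the side length of the base square anchor
    - anchors per spatial location at each level：A=9 (3 ratios x 3 scales)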
    - KA：each anchor is assigned a length K one-hot vector of classification targets
    - 4A：and a 4-vector of box regression targets
    - anchors are assigned to ground-truth object boxes using an intersection-over-union (IoU) threshold of 0.5
    - anchors are assigned background if their IoU is in [0, 0.4)
    - anchors with IoU in [0.4, 0.5) are unassigned and ignored during training
    - each anchor is assigned to at most one object box
    - for each anchor
      - classification targets：one-hot vector
      - box regression targets：the offset between each anchor and its assigned gt box, or omitted if there is no assignment
  - rpn offset：center point, width, height
    $$
    t_x = (x - x_a) / w_a\\
    t_y = (y - y_a) / h_a\\
    t_w = \log(w / w_a)\\
    t_h = \log(h / h_a)
    $$
  - [QUESTION] the anchor state {-1:ignore, 0:negative, 1:positive} only applies to the cls loss, which amounts to deliberately discarding some near-neutral samples. Does this actually improve classification?
  - classification subnet
    - for each spatial position，for each anchor，predict one among K classes，one-hot
    - input：C channels feature map from FPN
    - structure：four 3x3 conv + ReLU，each with C filters
    - head：3x3 conv + sigmoid，with KA filters
    - share across levels
    - parameters are not shared with the box regression subnet
    - focal loss：
      - summed over all ～100k anchors
      - and normalized by the number of anchors assigned to a ground-truth box
      - because it is a sum, it has to be normalized; the normalizer is the number of assigned anchors (does this include both foreground and background?)
      - the vast majority of anchors are **easy negatives** and receive negligible loss values under the focal loss (so the sum does include background anchors, while the normalizer counts only anchors assigned to a ground-truth box)
      - $\alpha$：in general $\alpha$ should be decreased slightly as $\gamma$ is increased
      - strong effect on negatives：FL can effectively discount the effect of easy negatives, focusing all attention on the hard negative examples
  - box regression subnet
    - class-agnostic bounding box regressor
    - same structure：four 3x3 conv + ReLU, each with C filters
    - head：4A linear outputs
    - smooth L1 loss
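
To make the anchor layout and the regression targets above concrete, here is a rough sketch in plain Python. It assumes the ratio is height / width and that boxes are given as (center_x, center_y, width, height); both conventions, and the names `anchor_shapes` and `encode_box`, are assumptions of this note rather than something fixed by the paper.

```python
import math

RATIOS = (0.5, 1.0, 2.0)                        # assumed to mean height / width
SCALES = (2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3))
BASE_SIZES = {3: 32, 4: 64, 5: 128, 6: 256, 7: 512}   # P3..P7


def anchor_shapes(level):
    """Return the A = 9 (width, height) anchor shapes of one pyramid level."""
    shapes = []
    for scale in SCALES:
        size = BASE_SIZES[level] * scale
        area = size * size                      # area is fixed for a given scale
        for ratio in RATIOS:
            w = math.sqrt(area / ratio)
            h = w * ratio
            shapes.append((w, h))
    return shapes


def encode_box(gt, anchor):
    """Offsets (t_x, t_y, t_w, t_h) from an anchor to its assigned gt box."""
    x, y, w, h = gt
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


shapes = anchor_shapes(3)
print(len(shapes), shapes[:3])
print(encode_box((12.0, 10.0, 40.0, 20.0), (10.0, 10.0, 32.0, 32.0)))
```

Each spatial position gets all 9 shapes, so the classification head needs KA outputs and the regression head 4A outputs per position.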

inference
- keep top 1k predictions per FPN level
- all levels are merged and non-maximum suppression with a threshold of 0.5 is applied

train
- initialization：
  - cls head bias initialization：every anchor starts with a low foreground probability (a prior for the rare class) instead of an even foreground/background split
  - prevents the large number of background anchors from generating a large, destabilizing loss
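
The bias initialization can be checked numerically. In the paper the final conv of the classification subnet gets bias $b = -\log((1-\pi)/\pi)$ with $\pi = 0.01$, so that every anchor is initially predicted as foreground with probability of about $\pi$. A small sketch (the helper names are made up for this note):

```python
import math


def prior_bias(pi=0.01):
    """Bias that makes sigmoid(bias) equal to the foreground prior pi."""
    return -math.log((1.0 - pi) / pi)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


b = prior_bias()
print(b, sigmoid(b))                     # initial foreground probability per anchor

# Initial CE of one background anchor, with and without the prior.
print(-math.log(1.0 - sigmoid(b)))       # with the prior
print(-math.log(1.0 - sigmoid(0.0)))     # zero bias, i.e. p = 0.5
```

With a zero bias every one of the ~100k background anchors would start at $p=0.5$ and contribute $\log 2$ each, which is the large, destabilizing loss the initialization is meant to avoid.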

network design
- anchors
  - one-stage detectors use a fixed sampling grid to generate positions
  - use multiple ‘anchors’ at each spatial position to cover boxes of various scales and aspect ratios
  - going beyond 6-9 anchors did not show further gains in AP
- speed/accuracy trade-off
  - outperforms all previous methods
  - higher input resolution gives higher AP
  - Retina-101-600 matches the AP of ResNet101-FRCNN while being faster

gradient：
- the gradient is bounded
- the derivative is small as soon as $x_t > 0$

  <img src="RetinaNet/gradient.png" width="70%" />
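
The gradient claim can be verified with a short script. Following the paper's appendix, let $x_t = yx$ with $y \in \{\pm 1\}$ and $p_t = \sigma(x_t)$; differentiating $FL = -(1-p_t)^\gamma \log(p_t)$ (the $\alpha_t$ weight is left out here) gives $\frac{dFL}{dx_t} = (1-p_t)^\gamma(\gamma p_t \log(p_t) + p_t - 1)$. The sketch below compares this closed form against a finite difference; the function names are made up for this note.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def focal_loss_xt(x_t, gamma=2.0):
    """Focal loss as a function of x_t = y * x (alpha omitted)."""
    p_t = sigmoid(x_t)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def focal_grad_xt(x_t, gamma=2.0):
    """Closed-form derivative of focal_loss_xt with respect to x_t."""
    p_t = sigmoid(x_t)
    return (1.0 - p_t) ** gamma * (gamma * p_t * math.log(p_t) + p_t - 1.0)


def numeric_grad(f, x, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


for x_t in (-3.0, 0.0, 1.0, 3.0):
    print(x_t, focal_grad_xt(x_t), numeric_grad(focal_loss_xt, x_t))
```

The printed magnitudes shrink quickly once $x_t > 0$, i.e. once the example is classified correctly.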

RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free

Motivation
- improve single-shot detectors to the same level as current two-stage techniques
- improve on RetinaNet
  - integrating instance mask prediction
  - adaptive loss
  - additional hard examples
  - Group Normalization
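- same computational cost as the original RetinaNet but more accurate：at a comparable parameter count it beats the original RetinaNet; the overall parameter count is larger than yolov3, and the accuracy comes close to the two-stage Mask R-CNN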

Arguments
- part of the improvement of two-stage detectors is due to architectures like Mask R-CNN that involve multiple prediction heads
- an additional segmentation task had only been added to two-stage detectors in the past
- two-stage detectors pay the cost of resampling (ROI-Align)：features have to be aligned after the RPN
- adding additional heads during training keeps the structure of the detector unchanged at test time
- potential improvement directions
  - data：OHEM
  - context：FPN
  - additional task：segmentation branch
- this paper’s contribution
  - add a mask prediction branch
  - propose a new self-adjusting loss function
  - include more positive samples：those with low overlap
Method
- best matching policy
  - special case：an outlier gt box whose IoU with every anchor is at most 0.5 would never be treated as a positive sample
  - for such a box, the best matching anchor with any nonzero overlap is used in place of the threshold
- self-adjusting Smooth L1 loss
  - bbox regression
  - smooth L1：
    - L1 loss is used beyond $\beta$ to avoid over-penalizing outliers
    - the choice of control point is heuristic and is usually done by hyper parameter search
      $f(x) = \begin{cases} 0.5 \frac{x^2}{\beta} \text{, if } |x| < \beta \\ |x| - 0.5\beta \text{, otherwise } \end{cases}$
  - self-adjusting control point
    - mini-batch mean & variance of the absolute loss
      $\mu_B = \frac{1}{n}\sum_{i=1}^n |x_i|\\ \sigma_B^2 = \frac{1}{n}\sum_{i=1}^n(|x_i|-\mu_B)^2$
    - running mean & variance, updated per minibatch with momentum m=0.9
      $\mu_R = \mu_R * m + \mu_B*(1-m)\\ \sigma_R^2 = \sigma_R^2*m+\sigma_B^2*(1-m)$
    - control point：clipped to $[0, \hat \beta]$ to avoid instability
      $\beta = max(0, min(\hat \beta, \mu_R-\sigma_R^2))$
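
The update rules above fit in a small stateful class. This is only a sketch under assumptions: the class name is made up, $\hat\beta = 0.11$ is taken from the training settings in these notes, and initializing the running mean to $\hat\beta$ and the running variance to 0 is my own choice.

```python
class SelfAdjustingSmoothL1:
    """Smooth L1 whose control point beta follows running loss statistics."""

    def __init__(self, beta_hat=0.11, momentum=0.9):
        self.beta_hat = beta_hat
        self.momentum = momentum
        self.running_mean = beta_hat   # assumed initial value
        self.running_var = 0.0         # assumed initial value

    def update(self, errors):
        """Update the running statistics and return the clipped beta."""
        abs_err = [abs(e) for e in errors]
        n = len(abs_err)
        mean = sum(abs_err) / n
        var = sum((a - mean) ** 2 for a in abs_err) / n
        m = self.momentum
        self.running_mean = self.running_mean * m + mean * (1.0 - m)
        self.running_var = self.running_var * m + var * (1.0 - m)
        return max(0.0, min(self.beta_hat, self.running_mean - self.running_var))

    def __call__(self, errors):
        beta = self.update(errors)
        losses = []
        for e in errors:
            a = abs(e)
            if a < beta:                       # quadratic zone (empty if beta == 0)
                losses.append(0.5 * a * a / beta)
            else:                              # linear zone
                losses.append(a - 0.5 * beta)
        return losses, beta


loss_fn = SelfAdjustingSmoothL1()
print(loss_fn([0.05, -0.05, 0.05, -0.05]))
```

As the regression errors shrink during training, the running mean drops and beta shrinks with it, so the loss behaves more and more like plain L1.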
- mask module
  - detection predictions are treated as mask proposals
  - extract the top N scored predictions
  - distribute the mask proposals to sample features from the appropriate layers
    - $k_0=4$：a proposal smaller than 224*224 is assigned to P3, one larger than 448*448 is assigned to P5
    - using more feature layers shows no performance boost
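
The thresholds above correspond to the FPN-style assignment rule $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$. A small sketch, where clamping to P3..P5 is assumed from the statement that only these layers are used and the function name is made up for this note:

```python
import math


def mask_feature_level(w, h, k0=4, k_min=3, k_max=5):
    """Pick the pyramid level whose features a mask proposal is sampled from."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224.0))
    return max(k_min, min(k_max, k))   # clamp to the available levels (assumed)


for side in (100, 224, 300, 448, 1000):
    print(side, "-> P%d" % mask_feature_level(side, side))
```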
- architecture
  - r50&r101 back：freezing all of the Batch Normalization layers
  - fpn feature channel：256
  - classification branch
    - 4 conv layers：conv3x3+relu，channel256
    - head：conv3x3+sigmoid，channel n_anchors*n_classes
  - regression branch
    - 4 conv layers：conv3x3+relu，channel256
    - head：conv3x3，channel n_anchors*4
  - aggregate the boxes to the FPN layers
  - ROI-Align yielding 14x14 resolution features
  - mask head
    - 4 conv layers：conv3x3
    - a single transposed convolutional layer：convtranspose2d 2x2，to 28*28 resolution
    - prediction head：conv1x1
- training
  - min side & max side：800&1333
  - limited GPU：reduce the batch size，increasing the number of training iterations and reducing the learning rate accordingly
  - positive/ignore/negative：IoU >= 0.5 is positive, IoU < 0.4 is negative, anything in between is ignored
  - focal loss for classification
    - gaussian initialization
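    - $\alpha=0.25, \gamma=2.0$
    - $FL=-\alpha_t(1-p_t)^\gamma log(p_t)$
      $FL = \left\{ \begin{array}{lr} -\alpha (1-p)^{\gamma}log(p), \ \ y=1\\ -(1-\alpha) p^{\gamma}log(1-p), \ \ y=0\\ \end{array} \right.$
    - the gamma term controls how fast the loss of easy samples decays, and the alpha term balances positive against negative samples. With the default values the positive weight is 0.25 and the negative weight is 0.75, which is the opposite of the intuition that positives should get more weight; the reason is that alpha and gamma act together (maybe hard negatives are rarer than positives in detection? maybe background is simply easier to learn than foreground? not sure...)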
  - self-adjusting L1 loss for box regression
    - limit running params：[0, 0.11]
  - mask loss
    - top-100 predicted boxes + ground truth boxes
- inference
  - box confidence threshold 0.05
  - nms threshold 0.4
  - use top-50 boxes for mask prediction
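
The inference settings above can be strung together as in the sketch below. It is class-agnostic for brevity (a real detector would typically run NMS per class), boxes are assumed to be (x1, y1, x2, y2), and the function names are made up for this note.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(boxes, scores, score_thr=0.05, nms_thr=0.4, top_n=50):
    """Return indices of the boxes passed on to the mask head, best first."""
    # 1. drop low-confidence boxes, 2. sort by score
    order = sorted(
        (i for i, s in enumerate(scores) if s > score_thr),
        key=lambda i: scores[i],
        reverse=True,
    )
    # 3. greedy NMS: keep a box only if it does not overlap a kept box too much
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= nms_thr for j in keep):
            keep.append(i)
    # 4. only the top-N survivors get a mask prediction
    return keep[:top_n]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.01]
print(postprocess(boxes, scores))
```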

Retina U-Net: Embarrassingly Simple Exploitation of Segmentation Supervision for Medical Object Detection

Motivation
- localization
  - pixel-level predict
  - ad-hoc heuristics when mapping back to object-level scores
- semantic segmentation
  - auxiliary task
  - overall one-stage
  - leveraging available supervision signals
Arguments
- monitoring pixel-wise predictions is clinically required
- medical annotation is commonly performed pixel-wise
- full semantic supervision
  - fully exploiting the available semantic segmentation signal results in significant performance gains
- one-stage
  - explicit scale invariance enforced by the resampling operation in two-stage detectors is not helpful in the medical domain
- two-stage methods
  - predict proposal-based segmentations
  - mask loss is only evaluated on cropped proposal：no context gradients
  - ROI-Align：not suggested in medical image
  - depends on the results of region proposal：serial vs parallel
  - gradients of the mask loss do not flow through the entire model
Method
- model
  - back：
    - ResNet50
  - fpn：
    - shift p3-p6 to p2-p5
    - change sigmoid to softmax
    - 3d head channels：64
    - anchor size：$\{P_2: 4^2, P_3: 8^2, P_4: 16^2, P_5: 32^2\}$
    - 3d z-scale：{1, 2, 4, 8}, to account for the low resolution along the z axis
  - segmentation supervision
    - p0 & p1
    - with skip connections
    - without detection heads
    - segmentation loss calculates on p0 logits
    - dice + ce
- weighted box clustering
  - patch crop
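  - tiling strategies & model ensembling cause multiple predictions per location
  - nms picks the highest-scoring box of a class, then suppresses every box of the same class whose IoU with it exceeds a threshold
  - weighted box clustering instead uses all boxes of that class and computes one fused result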
    - coordinates confidence：$o_c = \frac{\sum c_i s_i w_i}{\sum s_i w_i}$
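    - score confidence：$o_s = \frac{\sum s_i w_i}{\sum w_i + n_{missing} \cdot \overline w}$
    - $w_i$：$w=f a p$
      - overlap factor f：overlap with the highest scoring box
      - area factor a：higher weights to larger boxes (empirical)
      - patch center factor p：a normal distribution around the patch center
    - the denominator of the score confidence contains a down-weighting term $n_{missing}$：derived from the total number of predictions expected from prior knowledge
  - from the paper's example, the points where this seems better than nms
    - each class ends up with exactly one box per cluster：fixes the nms case where a large box of a class contains a smaller one
    - a class with clearly fewer predictions at a location than prior knowledge expects gets its confidence pulled down significantly：handles likely false-positive boxes at a location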
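
The two fusion formulas can be sketched for a single cluster as follows. The weights $w_i = f a p$ are assumed to be computed by the caller, $\overline w$ is assumed to be the mean weight of the cluster members, and `n_expected` (the number of predictions expected at this location from tiling and ensembling) as well as the function name are hypothetical names for this note.

```python
def weighted_box_cluster(boxes, scores, weights, n_expected):
    """Fuse one cluster of same-class boxes into a single box and score.

    boxes:      coordinate tuples c_i (any fixed length, e.g. 4 in 2D or 6 in 3D)
    scores:     confidences s_i
    weights:    w_i = f * a * p, one per box
    n_expected: number of predictions expected at this location
    """
    sw = [s * w for s, w in zip(scores, weights)]
    total_sw = sum(sw)
    # o_c: coordinates averaged with weights s_i * w_i
    fused_box = tuple(
        sum(box[d] * v for box, v in zip(boxes, sw)) / total_sw
        for d in range(len(boxes[0]))
    )
    # o_s: missing predictions enlarge the denominator and lower the score
    mean_w = sum(weights) / len(weights)
    n_missing = max(0, n_expected - len(boxes))
    fused_score = total_sw / (sum(weights) + n_missing * mean_w)
    return fused_box, fused_score


boxes = [(0.0, 0.0, 10.0, 10.0), (2.0, 2.0, 12.0, 12.0)]
print(weighted_box_cluster(boxes, [0.8, 0.4], [1.0, 1.0], n_expected=2))
print(weighted_box_cluster(boxes, [0.8, 0.4], [1.0, 1.0], n_expected=4))
```

In the second call two of the four expected predictions are missing, so the fused score is halved while the fused coordinates stay the same.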