ATSS | Less is More

ATSS: Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection

动机
- anchor-based和anchor-free方法的本质区别是对正负样本的定义，这也直接导致了performance gap
- we propose ATSS
  - adaptive training sample selection
  - automatically select positive and negative samples according to statistical characteristics of objects
  - anchor-based&anchor-free模型上都涨点
- discuss tiling multiple anchors
论点
- 主流anchor-based方法
  - one-stage/two-stage
  - tile a large number of preset anchors on the image
  - output these refined anchors as detection results
- anchor-free detectors主要分成两种
  - key-point based：预测角点/轮廓点/heatmap，然后bound轮廓得到框
  - center-based：预测中心点，然后基于中心点回归4个距离
  - 消除pre-defined anchors的hyper-params：强化generalization ability
- 举例对比RetinaNet&FCOS
  - RetinaNet：one-stage anchor-based
  - FCOS：center-based anchor-free
  - 主要区别1：anchor数量，RetinaNet是hxwx9，FCOS是hxwx1
  - 主要区别2：正样本定义，RetinaNet是与gt box的IOU大于一定阈值的anchor，FCOS是featuremap上所有落进框内的格子点
  - 主要区别3：回归方式，RetinaNet是回归gt相对pos anchor的相对偏移量，FCOS是预测四条边相对中心点的绝对距离
Difference Analysis of Anchor-based and Anchor-free Detection
- we focus on the last two differences：正负样本定义 & 回归starting status
- 设定RetinaNet也是one square anchor per location，和FCOS保持一致
- experiment setting
  - MS COCO：80类前景，common split
  - ImageNet pretrained ResNet-50
  - resize input
  - SGD，90K iterations，0.9 momentum，1e-4 weight decay，16 batch size，0.01 lr with 0.1 lr decay/60K
  - testing：
    - 0.05 score to filter out bg boxes
    - output top 1000 detections per feature pyramid
    - 0.6 IoU thresh per class NMS to give final top 100 detections per image
- inconsistency removal
  - 五大improvements加在FCOS上进一步boost the gap
  - 我们将其逐步加在RetinaNet上，能够拉到37%，和FCOS还有0.8个点的差距
- 分析essential difference
  - 训练一个检测模型，首先要分出正负样本，然后用正样本来回归
  - Classification
    - RetinaNet用anchor boxes与gt box的IoU决定正负样本：best match anchor和大于一定IoU thresh的anchor是正样本，小于一定IoU thresh的anchor是负样本，其他的是ignore样本
    - FCOS用spatial and scale constraints选择正负样本：gt box以内的所有像素作为候选正样本，然后去掉部分尺度不匹配的候选样本，正样本以外都是负样本，没有ignore
    - 两个模型在两种样本选择策略上实验：Spatial and Scale Constraint相比较于IoU都会显著提点
    - 当两种方法都使用Spatial and Scale Constraint策略选择正负样本，模型精度就没啥差别了
  - Regression
    - RetinaNet regresses from the anchor box with 4 offsets：回归gt box相对于anchor box的偏移量，regression starting status是个box
    - FCOS regresses from the anchor point with 4 distances：回归gt box四条边相对于anchor center的距离，regression starting status是个point
    - 上面那个表说明了选择同样的正负样本，regression starting status就是个无关项，不影响精度
Adaptive Training Sample Selection （ATSS）
- 影响检测模型精度的essential difference在于how to define positive and negative training samples
- previous strategies都有sensitive hyperparameters（anchors/scale），some outer objects may be neglected
- we propose ATSS
  - almost no hyper
  - divides pos/neg samples according to data statistical characteristics
  - 对每个gt box，首先在每个level上，基于L2 center distance，找到k-closest anchor——k*L个candidates per gt box
  - 计算每个candidates的mean & var
  - 基于mean & var 计算这个gt box的IoU thresh
  - 在candidates里面选取大于等于IoU thresh，同时anchor center在gt box内的，留作正样本
  - 如果一个acnhor box匹配了多个gt box，选择IoU大的那个作为标签
- 基于center distance选择anchor box：因为越靠近目标中心，越容易produce高品质框
- 用mean+var作为IoU thresh：
  - higher mean indicates high-quality candidates，对应的IoU thresh应该高一点
  - higher variation indicates level specific，mean+var作为thresh能将candidates里面IoU较高的筛选出来
- limit the anchor center in object：anchor中心不在目标框内显然不是个好框，用于筛掉前两步里的漏网之鱼，双保险
- fair between different objects
  - 统计下来每类目标都有差不多0.2kL个正样本，与尺度无关
  - 但是RetinaNet和FCOS都是大目标正样本多，小目标正样本少
- hyperparam-free：只有一个k，【还有anchor-setting呢？？？】
- verification
  - lite version：被FCOS官方引用并称作center sampling，scale limit still exists in this version
  - full version：本文版本
  - 两个方法选择candidates的方法完全一致，就是select final postives的方法不同
- hyperparam的鲁棒性
  - k在一定范围内（7-17）相对insensitive，太多了低质量框太多，太少了less statistical
  - 尝试不同的fix-ratio anchor scale和fix-scale anchor ratio，发现精度相对稳定，说明robust to anchor settings
  - multi-anchors settings
    - RetinaNet在不同的anchor setting下，精度基本不变，说明主要正样本选的好，不管一个location绑定几个anchor结果都一样