guided anchoring

原作者知乎reference：https://zhuanlan.zhihu.com/p/55854246

不完全是anchor-free，因为还是有decision grid to choose from的，应该说是adaptive anchor instead of hand-picked
为了特征和adaptive anchor对齐，引入deformable conv

Region Proposal by Guided Anchoring

动机
- most methods
  - predefined anchors
  - do a uniformed dense prediction
- our method
  - use sematic features to guide the anchoring
  - anchor size也是网络预测参数，compute from feature map
  - arbitrary aspect ratios
- feature inconsistency
  - 不同的anchor loc都是对应feature map上某一个点
  - 变化的anchor size和固定的位置向量之间存在inconsistency
  - 引入feature adaption module
- use high-quality proposals
  - GA-RPN提升了proposal的质量
  - 因此我们对proposal进入stage2的条件更严格
- adopt in Fast R-CNN, Faster R-CNN and RetinaNet均涨点
  - RPN提升显著：9.1
  - MAP也有涨点：1.2-2.7
- 还可以boosting trained models
  - boosting a two-stage detector by a fine-tuning schedule
论点
- alignment & consistency
  - 我们用feature map的pixels作为anchor representations，那么anchor centers必须跟feature pixels保持align
  - 不同pixel的reception field必须跟对应的anchor size保持匹配
  - previous sliding window scheme对每个pixel都做一样的操作，用同样一组anchor，因此是align和consist的
  - previous progressly refining scheme对anchor的位置大小做了refinement，ignore the alignment & consistency issue，是不对的！！
- disadvantage of predefined anchors
  - hard hyperparams
  - huge pos/neg imbalance & computation
- we propose GA-RPN
  - learnable anchor shapes to mitigate the hand-picked issue
  - feature adaptation to solve the consistency issue
  - key concerns in this paper
    - learnable anchors
    - joint anchor distribution
    - alignment & consistency
    - high-quality proposals
方法
- formulation
  - $p(x,y,w,h|I) = p(x,y|I)p(w,h|x,y,I)$
  - 将问题解耦成位置和尺寸的预测，首先anchor的loc服从full image的均匀分布，anchor的size建立在loc存在的基础上
  - two branches for loc & shape prediction
    - loc：binary classification，hxwx1
    - shape：location-dependent shapes，hxwx2
    - anchors：loc probabilities above a certain threshold & correponding ‘most probable’ anchor shape
  - multi-scale
    - the anchor generation parameters are shared
  - feature adaptation module
    - adapts the feature according to the anchor shape
- anchor location prediction
  - indicates the probability of an object’s center
  - 一层卷积：1x1 conv，channel1，sigmoid
  - transform back：each grid(i,j) corresponds to coords ((i+0.5)*s, (j+0.5)*s) on the origin map
  - filter out 90% of the regions
  - thus replace the ensuing conv layers by masked convs
  - groud truth
    - binary label map
    - each level：center region & ignore region & outside region，基于object center的方框
      - $\sigma_1=0.2，\sigma_2=0.5$：region box的长宽系数
      - ？？？用centerNet的heatmap会不会更好？？？
  - focal loss $L_{loc}$
- anchor shape prediction
  - predicts the best shape for each location
  - best shape：a shape that lead to best iou with the nearest gt box
  - 一层卷积：1x1 conv，channel2，[-1,1]
  - transform layer：transform direct [-1,1] outputs to real box shape
    - $w = \sigma s e^{dw}$
    - $h = \sigma s e^{dh}$
    - s：stride
    - $\sigma$：经验参数，8 in experiments
  - set 9 pairs of (w,h) as RetinaNet，calculate the IoU of these sampled anchors with gt，take the max as target value
  - bounded iou loss：$L_{shape} = L_1(1-min(\frac{w}{w_g}, \frac{w_g}{w})) + L_1(1-min(\frac{h}{h_g}, \frac{h_g}{h}))$
- feature adaptation
  - intuition：the feature corresponding to different size of anchor shapes应该encode different content region
  - inputs：feature map & anchor shape
  - location-dependent transformation：3x3 deformable conv
  - deformable conv的offset是anchor shape得到的
  - outputs：adapted features
- with adapted features
  - then perform further classification and bounding-box regression
- training
  - jointly optimize：$L = \lambda_1 L_{loc} + \lambda_2 L_{shape} + L_{cls} + L_{reg}$
  - $\lambda_1=0.2，\lambda_2=0.5$
  - each level of feature map should only target objects of a specific scale range：但是ASFF论文主张说这种arrange by scale的模式会引入前背景inconsistency？？
- High-quality Proposals
  - set a higher positive/negative threshold
  - use fewer samples