guided anchoring

原作者知乎reference:https://zhuanlan.zhihu.com/p/55854246

  • 不完全是anchor-free,因为还是有decision grid to choose from的,应该说是adaptive anchor instead of hand-picked
  • 为了特征和adaptive anchor对齐,引入deformable conv

Region Proposal by Guided Anchoring

  1. 动机

    • most methods
      • predefined anchors
      • do a uniformed dense prediction
    • our method
      • use sematic features to guide the anchoring
      • anchor size也是网络预测参数,compute from feature map
      • arbitrary aspect ratios
    • feature inconsistency
      • 不同的anchor loc都是对应feature map上某一个点
      • 变化的anchor size和固定的位置向量之间存在inconsistency
      • 引入feature adaption module
    • use high-quality proposals
      • GA-RPN提升了proposal的质量
      • 因此我们对proposal进入stage2的条件更严格
    • adopt in Fast R-CNN, Faster R-CNN and RetinaNet均涨点
      • RPN提升显著:9.1
      • MAP也有涨点:1.2-2.7
    • 还可以boosting trained models
      • boosting a two-stage detector by a fine-tuning schedule
  2. 论点

    • alignment & consistency

      • 我们用feature map的pixels作为anchor representations,那么anchor centers必须跟feature pixels保持align

      • 不同pixel的reception field必须跟对应的anchor size保持匹配

      • previous sliding window scheme对每个pixel都做一样的操作,用同样一组anchor,因此是align和consist的
      • previous progressly refining scheme对anchor的位置大小做了refinement,ignore the alignment & consistency issue,是不对的!!
    • disadvantage of predefined anchors

      • hard hyperparams
      • huge pos/neg imbalance & computation
    • we propose GA-RPN

      • learnable anchor shapes to mitigate the hand-picked issue
      • feature adaptation to solve the consistency issue
      • key concerns in this paper
        • learnable anchors
        • joint anchor distribution
        • alignment & consistency
        • high-quality proposals
  3. 方法

    • formulation

      • $p(x,y,w,h|I) = p(x,y|I)p(w,h|x,y,I)$
      • 将问题解耦成位置和尺寸的预测,首先anchor的loc服从full image的均匀分布,anchor的size建立在loc存在的基础上
      • two branches for loc & shape prediction
        • loc:binary classification,hxwx1
        • shape:location-dependent shapes,hxwx2
        • anchors:loc probabilities above a certain threshold & correponding ‘most probable’ anchor shape
      • multi-scale
        • the anchor generation parameters are shared
      • feature adaptation module

        • adapts the feature according to the anchor shape

    • anchor location prediction

      • indicates the probability of an object’s center
      • 一层卷积:1x1 conv,channel1,sigmoid
      • transform back:each grid(i,j) corresponds to coords ((i+0.5)*s, (j+0.5)*s) on the origin map
      • filter out 90% of the regions
      • thus replace the ensuing conv layers by masked convs
      • groud truth
        • binary label map
        • each level:center region & ignore region & outside region,基于object center的方框
          • $\sigma_1=0.2,\sigma_2=0.5$:region box的长宽系数
          • ???用centerNet的heatmap会不会更好???
      • focal loss $L_{loc}$
    • anchor shape prediction
      • predicts the best shape for each location
      • best shape:a shape that lead to best iou with the nearest gt box
      • 一层卷积:1x1 conv,channel2,[-1,1]
      • transform layer:transform direct [-1,1] outputs to real box shape
        • $w = \sigma s e^{dw}$
        • $h = \sigma s e^{dh}$
        • s:stride
        • $\sigma$:经验参数,8 in experiments
      • set 9 pairs of (w,h) as RetinaNet,calculate the IoU of these sampled anchors with gt,take the max as target value
      • bounded iou loss:$L_{shape} = L_1(1-min(\frac{w}{w_g}, \frac{w_g}{w})) + L_1(1-min(\frac{h}{h_g}, \frac{h_g}{h}))$
    • feature adaptation
      • intuition:the feature corresponding to different size of anchor shapes应该encode different content region
      • inputs:feature map & anchor shape
      • location-dependent transformation:3x3 deformable conv
      • deformable conv的offset是anchor shape得到的
      • outputs:adapted features
    • with adapted features
      • then perform further classification and bounding-box regression
    • training
      • jointly optimize:$L = \lambda_1 L_{loc} + \lambda_2 L_{shape} + L_{cls} + L_{reg}$
      • $\lambda_1=0.2,\lambda_2=0.5$
      • each level of feature map should only target objects of a specific scale range:但是ASFF论文主张说这种arrange by scale的模式会引入前背景inconsistency??
    • High-quality Proposals
      • set a higher positive/negative threshold
      • use fewer samples