
[MimicDet] ResNeXt-101 backbone on COCO: 46.1 mAP

MimicDet: Bridging the Gap Between One-Stage and Two-Stage Object Detection

  1. Motivation

    • mimic task: knowledge distillation
    • mimic the two-stage features
      • a shared backbone
      • two heads for mimicking
    • end-to-end training
    • specialized designs to facilitate mimicking
      • dual-path mimicking
      • staggered feature pyramid
    • reach two-stage accuracy
  2. Arguments

    • one-stage detectors adopt a straightforward fully convolutional architecture
    • two-stage detectors use RPN + R-CNN
    • advantages of two-stage detectors
      • avoid class imbalance
      • fewer proposals allow a larger classification network and richer features
      • RoIAlign extracts location-consistent features -> better representation
      • the object location is regressed twice -> better refinement
    • one-stage detectors that imitate two-stage designs
      • RefineDet: cascade detection flow
      • AlignDet: RoIConv layer
      • still leaves a big gap
    • network mimicking
      • knowledge distillation
      • use a well-trained large teacher model to supervise
      • difference
        • mimic in heads instead of backbones
        • teacher branch instead of model
        • trained jointly
    • this method
      • not only mimic the structure design, but also imitate in the feature level
      • contains both one-stage detection head and two-stage detection head during training
        • share the same backbone
        • two-stage detection head, called T-head
        • one-stage detection head, called S-head
        • similarity loss for matching features: guided deformable conv layer
        • together with detection losses
      • specialized designs
        • decomposed detection heads
        • conduct mimicking in classification and regression branches individually
        • staggered feature pyramid
  3. Method

    • overview

    • backbone & FPN

      • RetinaNet FPN: with P6 & P7
      • crucial modification: P2 ~ P7
      • staggered feature pyramid
        • high-res set {P2 to P6}: for T-head & accuracy
        • low-res set {P3 to P7}: for S-head & computation speed
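The two pyramid sets above can be illustrated with a short Python sketch. It is not the authors' code: the stride rule (level P_l has stride 2^l) is the standard FPN convention, while the helper names and the slot-wise pairing of the two sets are assumptions made for this example.

```python
import math

# Standard FPN convention: level P_l has stride 2**l w.r.t. the input image.
STRIDES = {f"P{l}": 2 ** l for l in range(2, 8)}

HIGH_RES = ["P2", "P3", "P4", "P5", "P6"]  # read by the T-head (accuracy)
LOW_RES = ["P3", "P4", "P5", "P6", "P7"]   # read by the S-head (speed)


def staggered_pair(slot):
    """Return (T-head level, S-head level) for one pyramid slot.

    Assumption: slot i of the high-res set is paired with slot i of the
    low-res set, so the T-head always sees a 2x finer map than the S-head.
    """
    return HIGH_RES[slot], LOW_RES[slot]


def feature_size(image_hw, level):
    """Spatial size (H, W) of a pyramid level for a given input size."""
    h, w = image_hw
    s = STRIDES[level]
    return math.ceil(h / s), math.ceil(w / s)


if __name__ == "__main__":
    for slot in range(len(LOW_RES)):
        t_level, s_level = staggered_pair(slot)
        print(slot,
              t_level, feature_size((800, 1216), t_level),
              s_level, feature_size((800, 1216), s_level))
```

Under this pairing, the dense S-head runs on maps with a quarter of the positions that the T-head's maps have, which is where the speed benefit noted above comes from.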
    • refinement module

      • filter out easy negatives: mitigate the class imbalance issue
      • adjust the location and size of pre-defined anchor boxes: anchor initialization
      • module
        • on top of the feature pyramid
        • one 3x3 conv
        • two sibling 1x1 convs
          • binary classification: BCE loss
          • bounding box regression: the same as Faster R-CNN, L1 loss
        • top-ranked boxes transferred to T-head and S-head
      • one anchor at each position: avoid feature sharing among proposals
      • assign the objects to feature pyramid according to their scale
      • positive area: the gt box shrunk to 0.3 times its size around its center
      • positive sample:
        • valid scale range: gt target belongs to this level
        • central point of anchor lies in the positive area
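The positive-sample rule above (valid scale range plus anchor center inside the shrunk gt box) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the object scale is taken as sqrt(w * h), and the per-level scale range is passed in by the caller because these notes do not record the actual ranges.

```python
import math


def positive_area(gt_box, shrink=0.3):
    """Shrink a gt box (x1, y1, x2, y2) to `shrink` times its size around its center."""
    x1, y1, x2, y2 = gt_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * shrink / 2.0
    half_h = (y2 - y1) * shrink / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def is_positive(anchor_center, gt_box, scale_range, shrink=0.3):
    """Check both conditions for a positive anchor on one pyramid level.

    scale_range is a hypothetical (low, high) interval for this level;
    measuring the object scale as sqrt(w * h) is also an assumption.
    """
    x1, y1, x2, y2 = gt_box
    scale = math.sqrt((x2 - x1) * (y2 - y1))
    low, high = scale_range
    if not (low <= scale < high):
        return False  # the gt target does not belong to this level
    ax1, ay1, ax2, ay2 = positive_area(gt_box, shrink)
    px, py = anchor_center
    return ax1 <= px <= ax2 and ay1 <= py <= ay2
```

For a 100x100 gt box, the positive area is the central 30x30 region, so only anchors whose centers fall near the middle of the object are labeled positive.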
    • detection heads

      • T-head
        • heavy head
        • run on a sparse set of anchor boxes
        • use the staggered feature pyramid
        • generate 7x7 location-sensitive features for each anchor box
        • cls branch
          • two 1024-d fc layers
          • one 81-d fc layer + softmax: CE loss
        • reg branch
          • four 3x3 convs, ch256
          • flatten
          • 1024-d fc
          • 4-d fc: L1 loss
        • mimicking target
          • 81-d classification logits
          • 1024-d regression feature
      • S-head
        • light-weight
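        • dense detection directly on the FPN
        • [not fully clear] introducing the refinement module will break the location consistency between the anchor box and its corresponding features. My understanding: after refinement, the refined anchor is misaligned with the feature-map location of the original anchor; the T-head uses the refined anchor while the S-head uses the original grid, hence the misalignment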
        • use deformable convolution to capture the misaligned feature
          • deformation offset is computed by a micro-network
          • takes the regression output of the refinement module as input
          • three 1x1 convs, ch64/128/18 (50 for the 5x5 Dconv)
          • 3x3 Dconv for P3 and 5x5 for others, ch256
        • two sibling 1x1 convs, ch1024
          • cls branch: 1x1 conv, ch80
          • reg branch: 1x1 conv, ch4
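Two of the sizes in the head descriptions above follow from simple arithmetic, sketched here in Python. The offset channel count relies on the usual deformable-convolution convention of one (dy, dx) pair per kernel sampling point. The flattened size assumes the four 3x3 convs in the T-head reg branch keep the 7x7 resolution, which these notes do not state. Function names are hypothetical.

```python
def offset_channels(kernel_size):
    """Offset channels for a k x k deformable conv: one (dy, dx) pair per sampling point."""
    return 2 * kernel_size * kernel_size


def flattened_roi_dim(channels=256, pooled=7):
    """Input size of the 1024-d fc after flattening a channels x pooled x pooled feature.

    Assumption: the preceding 3x3 convs preserve the 7x7 spatial size.
    """
    return channels * pooled * pooled


if __name__ == "__main__":
    print(offset_channels(3))   # micro-network output for the 3x3 Dconv on P3
    print(offset_channels(5))   # micro-network output for the 5x5 Dconv on other levels
    print(flattened_roi_dim())  # fc input size in the T-head reg branch
```

This explains the "18 (50)" in the micro-network: 2 * 3 * 3 = 18 offsets for the 3x3 kernel and 2 * 5 * 5 = 50 for the 5x5 kernel.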
    • head mimicking

      • cosine similarity
      • cls logits & refine params
      • To get the S-head feature of an adjusted anchor box
        • trace back to its initial position
        • extract the pixel at that position in the feature map
      • loss: $L_{mimic} = 1 - cosine(F_i^T, F_i^S)$, where $F_i^T$ and $F_i^S$ are the T-head and S-head features of the $i$-th adjusted anchor box
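The head-mimicking steps above (look up the S-head pixel at the anchor's initial position, then compare it with the T-head feature by cosine similarity) can be sketched in plain Python. The names are hypothetical, the mapping from an anchor center to a feature-map cell by integer division with the stride is an assumption, and averaging the per-box terms over all sampled boxes is also an assumption rather than something recorded in these notes.

```python
import math


def s_head_feature(feature_map, initial_center, stride):
    """Fetch the S-head feature vector at an anchor's initial (pre-refinement) position.

    feature_map is a nested list indexed as [row][col] -> feature vector.
    Assumption: the anchor center (x, y) maps to cell (y // stride, x // stride).
    """
    x, y = initial_center
    return feature_map[int(y // stride)][int(x // stride)]


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def mimic_loss(t_feats, s_feats):
    """Mean of 1 - cosine(F_i^T, F_i^S) over boxes (the averaging is an assumption)."""
    terms = [1.0 - cosine(t, s) for t, s in zip(t_feats, s_feats)]
    return sum(terms) / len(terms)
```

The loss is 0 when the two heads produce features pointing in the same direction and grows to 2 when they point in opposite directions, regardless of feature magnitude.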
    • multi-task training loss
      • $L = L_R + L_S + L_T + L_{mimic}$
      • $L_R$: refinement module loss, BCE + L1
      • $L_S$: S-head loss, CE + L1
      • $L_T$: T-head loss, CE + L1
      • $L_{mimic}$: mimic loss
    • training details
      • network: ResNet-50/101, images resized so that the shorter side is 800
      • refinement module
        • run NMS with 0.8 IoU threshold on anchor boxes
        • select top 2000 boxes
      • T-head
        • sample 128 boxes from the proposals
        • positive/negative ratio: 1:3
      • S-head
        • hard mining: select the 128 boxes with the highest loss values
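The two sampling schemes in the training details can be sketched as follows. This is a minimal illustration with hypothetical function names. Topping up with negatives when fewer than 32 positives are available follows common Faster R-CNN practice and is an assumption here, as is the fixed random seed.

```python
import random


def sample_t_head(labels, num=128, pos_fraction=0.25, rng=None):
    """Sample proposal indices for the T-head at a 1:3 positive/negative ratio.

    labels: one entry per proposal, 0 for negative and > 0 for positive.
    Assumption: if positives are scarce, the remaining slots go to negatives.
    """
    rng = rng or random.Random(0)
    pos = [i for i, y in enumerate(labels) if y > 0]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = min(len(pos), int(num * pos_fraction))
    n_neg = min(len(neg), num - n_pos)
    return rng.sample(pos, n_pos) + rng.sample(neg, n_neg)


def hard_mine(losses, num=128):
    """Return indices of the `num` boxes with the highest loss (S-head hard mining)."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:num]
```

With 128 samples and a 1:3 ratio, the T-head sees at most 32 positives per image, while the S-head picks its 128 boxes purely by loss value.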
    • inference
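      • take the top 1000 boxes from the refinement module
      • NMS with 0.6 IoU threshold and 0.005 score threshold
      • [??] finally keep the top 100 scoring boxes: I don't quite understand this part; the final result should not be a structured output, it should be the re-refined output of the one-stage detection head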
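The inference post-processing listed above can be sketched with a plain-Python greedy NMS. This is not the authors' implementation: the function names are hypothetical, the order of operations (score filter, then NMS, then top-100 cut) is an assumption, and the sketch is class-agnostic for brevity whereas detectors usually run NMS per class.

```python
def top_k(scores, k=1000):
    """Indices of the k highest-scoring boxes (e.g. from the refinement module)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def postprocess(boxes, scores, score_thr=0.005, iou_thr=0.6, max_dets=100):
    """Score filter, greedy NMS, then keep at most max_dets boxes.

    Returns indices into `boxes`, sorted by descending score.
    """
    order = [i for i in top_k(scores, len(scores)) if scores[i] > score_thr]
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
        if len(keep) == max_dets:
            break
    return keep
```

In this sketch the final cut to 100 boxes is just a cap on the number of detections reported per image after NMS.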