[MimicDet] ResNeXt-101 backbone on COCO: 46.1 mAP
MimicDet: Bridging the Gap Between One-Stage and Two-Stage Object Detection
Motivation
- mimic task: knowledge distillation
 - mimic the two-stage features 
- a shared backbone
 - two heads for mimicking
 
 - end-to-end training
 - specialized designs to facilitate mimicking 
- dual-path mimicking
 - staggered feature pyramid
 
 - reach two-stage accuracy
 
Arguments
- one-stage detectors adopt a straightforward fully convolutional architecture
 - two-stage detectors use RPN + R-CNN
 - advantages of two-stage detectors 
- avoid class imbalance
- fewer proposals enable a larger cls net and richer features
 - RoIAlign extracts location-consistent features -> better representation
 - regress the object location twice -> better refined
 
 - one-stage detectors’ imitation
- RefineDet: cascaded detection flow
 - AlignDet: RoIConv layer
 - still leaves a big gap
 
 - network mimicking
- knowledge distillation
 - use a well-trained large teacher model to supervise
 - difference
- mimic in heads instead of backbones
 - a teacher branch instead of a separate teacher model
 - trained jointly
 
 
 - this method
- not only mimics the structural design, but also imitates at the feature level
 - contains both a one-stage and a two-stage detection head during training 
- share the same backbone
 - two-stage detection head, called T-head
 - one-stage detection head, called S-head
 - similarity loss for matching features: guided deformable conv layer
 - together with detection losses
 
 - specialized designs 
- decomposed detection heads
 - conduct mimicking in classification and regression branches individually
 - staggered feature pyramid
 
 
 
Method
overview

backbone & FPN
- RetinaNet FPN: with P6 & P7
 - crucial modification: extend to P2 ~ P7
 - staggered feature pyramid 
- high-res set {P2 to P6}: for the T-head & accuracy
 - low-res set {P3 to P7}: for the S-head & computation speed
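A minimal sketch of how the two staggered level sets pair up; the helper name `staggered_sets` and the explicit pairing dict are mine, not from the paper:

```python
# Staggered feature pyramid: from the full pyramid P2..P7, the T-head reads the
# high-resolution subset and the S-head the low-resolution subset, so for the
# same proposal the T-head sees a feature map with twice the resolution.
def staggered_sets(levels=("P2", "P3", "P4", "P5", "P6", "P7")):
    t_set = levels[:-1]   # {P2 .. P6}: T-head, accuracy
    s_set = levels[1:]    # {P3 .. P7}: S-head, computation speed
    # a proposal handled at S-head level P(k) maps to T-head level P(k-1)
    pairing = dict(zip(s_set, t_set))
    return t_set, s_set, pairing

t_set, s_set, pairing = staggered_sets()
print(pairing["P5"])  # P4: one level higher resolution for the T-head
```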
 
 
refinement module
- filter out easy negatives: mitigates the class imbalance issue
 - adjust the location and size of pre-defined anchor boxes: anchor initialization
 - module
- on top of the feature pyramid
 - one 3x3 conv
 - two sibling 1x1 convs
- binary classification: BCE loss
 - bounding box regression: same parameterization as Faster R-CNN, L1 loss
 
 - top-ranked boxes transferred to T-head and S-head
 
- one anchor per position: avoids feature sharing among proposals
 - assign objects to pyramid levels according to their scale
 - positive area: gt box shrunk by a factor of 0.3 around its center
 - positive sample:
- valid scale range: the gt target belongs to this level
 - central point of the anchor lies in the positive area
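The positive-sample rule can be sketched in plain Python; the helper names are mine, while the 0.3 shrink factor and the scale-range test follow the notes:

```python
def shrink_box(box, factor=0.3):
    """Shrink a gt box around its center by the given factor."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    hw, hh = (x2 - x1) * factor / 2.0, (y2 - y1) * factor / 2.0
    return cx - hw, cy - hh, cx + hw, cy + hh

def is_positive(anchor_center, gt_box, level_range, gt_scale, factor=0.3):
    """An anchor is positive iff the gt falls in this pyramid level's valid
    scale range and the anchor's center lies inside the shrunk positive area."""
    lo, hi = level_range
    if not (lo <= gt_scale < hi):
        return False
    x1, y1, x2, y2 = shrink_box(gt_box, factor)
    px, py = anchor_center
    return x1 <= px <= x2 and y1 <= py <= y2
```

For a 100x100 gt box at the origin the positive area is [35, 35, 65, 65], so only anchors whose centers fall near the object center are positives.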
 
 
detection heads
- T-head
- heavy head
 - run on a sparse set of anchor boxes
 - use the staggered feature pyramid
 - generate 7x7 location-sensitive features for each anchor box
 - cls branch
- two 1024-d fc layers
 - one 81-d fc layer + softmax: CE loss
 
 - reg branch
- four 3x3 convs, 256 channels
 - flatten
 - 1024-d fc
 - 4-d fc: L1 loss
 
 - mimicking target 
- 81-d classification logits
 - 1024-d regression feature
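A hypothetical PyTorch sketch of the T-head with the layer sizes listed above; the 256 input channels and the ReLU placement are assumptions, not from the notes:

```python
import torch
import torch.nn as nn

class THead(nn.Module):
    """Sketch of the heavy T-head. Input: 7x7 RoI features per anchor box."""
    def __init__(self, in_ch=256, num_classes=81):
        super().__init__()
        # cls branch: two 1024-d fc layers + 81-d logits (softmax/CE at train time)
        self.cls_fcs = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_ch * 7 * 7, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
        )
        self.cls_logits = nn.Linear(1024, num_classes)
        # reg branch: four 3x3 convs (256 ch) -> flatten -> 1024-d fc -> 4-d deltas
        reg_convs, ch = [], in_ch
        for _ in range(4):
            reg_convs += [nn.Conv2d(ch, 256, 3, padding=1), nn.ReLU(inplace=True)]
            ch = 256
        self.reg_convs = nn.Sequential(*reg_convs)
        self.reg_fc = nn.Sequential(
            nn.Flatten(), nn.Linear(256 * 7 * 7, 1024), nn.ReLU(inplace=True))
        self.reg_delta = nn.Linear(1024, 4)  # L1 loss against gt offsets

    def forward(self, roi_feat):
        logits = self.cls_logits(self.cls_fcs(roi_feat))   # 81-d mimicking target
        reg_feat = self.reg_fc(self.reg_convs(roi_feat))   # 1024-d mimicking target
        return logits, self.reg_delta(reg_feat), reg_feat

# 8 RoIs worth of 7x7x256 location-sensitive features
logits, deltas, reg_feat = THead()(torch.randn(8, 256, 7, 7))
```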
 
 
 - S-head
- light-weight
 - dense detection directly on the FPN
 - [not fully understood] introducing the refinement module will break the location consistency between the anchor box and its corresponding features. My understanding: after refinement the anchors no longer align with the feature map at their original grid positions; the T-head extracts features from the refined anchors while the S-head reads the original grid locations, hence the misalignment
 - use deformable convolution to capture the misaligned features 
- deformation offset is computed by a micro-network
 - takes the regression output of the refinement module as input
 - three 1x1 convs, channels 64/128/18 (50 for the 5x5 kernel)
 - 3x3 deformable conv for P3, 5x5 for the other levels, 256 channels
 
 - two sibling 1x1 convs, 1024 channels
- cls branch: 1x1 conv, 80 channels
 - reg branch: 1x1 conv, 4 channels
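A hedged PyTorch sketch of the offset micro-network; the channel counts follow the notes, while the class name is mine and the deformable conv it feeds (e.g. torchvision.ops.DeformConv2d) is left out:

```python
import torch
import torch.nn as nn

class OffsetMicroNet(nn.Module):
    """Sketch of the micro-network that predicts deformation offsets for the
    S-head's deformable conv. Input: the refinement module's 4-d regression map.
    A 3x3 kernel needs 2*3*3 = 18 offset channels, a 5x5 kernel 2*5*5 = 50."""
    def __init__(self, kernel=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 64, 1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 2 * kernel * kernel, 1),
        )

    def forward(self, reg_map):
        # the predicted offsets then drive a deformable conv that samples the
        # misaligned features corresponding to the refined anchor boxes
        return self.net(reg_map)

offsets = OffsetMicroNet(kernel=3)(torch.randn(1, 4, 32, 32))
```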
 
 
 
 head mimicking
- cosine similarity
 - cls logits & reg features
 - To get the S-head feature of an adjusted anchor box 
- trace back to its initial position
 - extract the pixel at that position in the feature map
 
 - loss:$L_{mimic} = 1 - cosine(F_i^T, F_i^S)$
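The mimicking loss is simple enough to sketch directly (plain NumPy, one feature pair; the function name is mine):

```python
import numpy as np

def mimic_loss(f_t, f_s, eps=1e-8):
    """L_mimic = 1 - cosine(F_T, F_S) per anchor box; approaches 0 when the
    S-head feature points in the same direction as the T-head target."""
    f_t, f_s = np.asarray(f_t, float), np.asarray(f_s, float)
    cos = f_t @ f_s / (np.linalg.norm(f_t) * np.linalg.norm(f_s) + eps)
    return 1.0 - cos
```

Identical features give a loss near 0; opposite features give a loss near 2.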
 
- multi-task training loss  
- $L = L_R + L_S + L_T + L_{mimic}$
 - $L_R$: refinement module loss, BCE + L1
 - $L_S$: S-head loss, CE + L1
 - $L_T$: T-head loss, CE + L1
 - $L_{mimic}$: mimicking loss
 
 - training details
- network: ResNet-50/101; resize images so the shorter side is 800
 - refinement module
- run NMS with 0.8 IoU threshold on anchor boxes
 - select top 2000 boxes
 
 - T-head
- sample 128 boxes from the proposals
 - positive/negative ratio: 1/3
 
 - S-head
 - hard mining: select the 128 boxes with the highest loss values
 
 
 - inference
- take the top 1000 boxes from the refinement module
 - NMS with a 0.6 IoU threshold and a 0.005 score threshold
 - [??] finally keep the top 100 scoring boxes: I don't quite understand this part; the final output should not be a structured two-stage output, it should be the re-refined predictions of the one-stage detection head (S-head)
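The post-processing step can be sketched as greedy NMS with the thresholds above (plain NumPy; a generic implementation, not the authors' code):

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.6, score_thr=0.005, topk=100):
    """Greedy NMS with the inference thresholds from the notes.
    boxes: (N, 4) as x1, y1, x2, y2; scores: (N,). Returns kept indices."""
    keep_idx = np.where(scores > score_thr)[0]
    order = keep_idx[np.argsort(scores[keep_idx])[::-1]]  # high score first
    keep = []
    while order.size > 0 and len(keep) < topk:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU of the kept box against all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]  # drop heavily overlapping boxes
    return np.array(keep)
```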