[MimicDet] ResNeXt-101 backbone on COCO: 46.1 mAP
MimicDet: Bridging the Gap Between One-Stage and Two-Stage Object Detection
Motivation
- mimic task: knowledge distillation
- mimic the two-stage features
- a shared backbone
- two heads for mimicking
- end-to-end training
- specialized designs to facilitate mimicking
- dual-path mimicking
- staggered feature pyramid
- reach two-stage accuracy
Arguments
- one-stage detectors adopt a straightforward fully convolutional architecture
- two-stage detectors use RPN + R-CNN
- advantages of two-stage detectors
- avoid class imbalance
- fewer proposals enable a larger cls net and richer features
- RoIAlign extracts location-consistent features -> better representation
- regress the object location twice -> better refinement
- one-stage detectors that imitate two-stage designs
- RefineDet: cascade detection flow
- AlignDet: RoIConv layer
- still leaves a big gap
- network mimicking
- knowledge distillation
- use a well-trained large teacher model to supervise
- difference
- mimic in heads instead of backbones
- teacher branch instead of model
- trained jointly
- this method
- not only mimic the structure design, but also imitate in the feature level
- contains both one-stage detection head and two-stage detection head during training
- share the same backbone
- two-stage detection head, called T-head
- one-stage detection head, called S-head
- similarity loss for matching features: guided deformable conv layer
- together with detection losses
- specialized designs
- decomposed detection heads
- conduct mimicking in classification and regression branches individually
- staggered feature pyramid
Method
overview
backbone & FPN
- RetinaNet FPN: with P6 & P7
- crucial modification: P2 to P7
- staggered feature pyramid
- high-res set {P2 to P6}: for T-head & accuracy
- low-res set {P3 to P7}: for S-head & computation speed
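A minimal sketch of how the two level sets line up. It assumes the usual FPN strides (P2 has stride 4, doubling at each level) and assumes that the i-th level of the low-res set is paired with the i-th level of the high-res set. The names `STRIDES`, `HIGH_RES`, `LOW_RES`, and `staggered_pairs` are illustrative and do not come from the paper.

```python
# Assumed FPN strides: P2 -> 4, P3 -> 8, ..., P7 -> 128.
STRIDES = {f"P{level}": 2 ** level for level in range(2, 8)}

HIGH_RES = ["P2", "P3", "P4", "P5", "P6"]  # fed to the T-head
LOW_RES = ["P3", "P4", "P5", "P6", "P7"]   # fed to the S-head


def staggered_pairs():
    """Pair each S-head level with the T-head level one step finer (assumed pairing)."""
    return list(zip(LOW_RES, HIGH_RES))


for s_level, t_level in staggered_pairs():
    print(f"S-head {s_level} (stride {STRIDES[s_level]}) <-> "
          f"T-head {t_level} (stride {STRIDES[t_level]})")
```

Under these assumptions the T-head always sees a feature map with twice the resolution of the one used by the S-head for the same anchor.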
refinement module
- filter out easy negatives: mitigate the class imbalance issue
- adjust the location and size of pre-defined anchor boxes: anchor initialization
- module
- on top of the feature pyramid
- one 3x3 conv
- two sibling 1x1 convs
- binary classification: BCE loss
- bounding box regression: the same as Faster R-CNN, L1 loss
- top-ranked boxes transferred to T-head and S-head
- one anchor on each position: avoid feature sharing among proposals
- assign the objects to feature pyramid according to their scale
- positive area: 0.3 times shrinking of gt boxes from center
- positive sample:
- valid scale range: gt target belongs to this level
- central point of anchor lies in the positive area
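The two positive-sample conditions above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: boxes are `(x1, y1, x2, y2)`, "0.3 times shrinking" is read as scaling width and height to 0.3 of the original about the box center, the object scale is taken to be `sqrt(w * h)`, and the per-level `scale_range` is a hypothetical parameter.

```python
import math


def positive_area(gt_box, shrink=0.3):
    """Shrink a ground-truth box (x1, y1, x2, y2) about its center by `shrink`."""
    x1, y1, x2, y2 = gt_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * shrink / 2.0
    half_h = (y2 - y1) * shrink / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def is_positive(anchor_center, gt_box, scale_range, shrink=0.3):
    """True if the gt scale fits this level and the anchor center is in the positive area."""
    x1, y1, x2, y2 = gt_box
    scale = math.sqrt((x2 - x1) * (y2 - y1))  # assumed definition of object scale
    lo, hi = scale_range                      # hypothetical valid range for this level
    if not (lo <= scale < hi):
        return False
    ax1, ay1, ax2, ay2 = positive_area(gt_box, shrink)
    px, py = anchor_center
    return ax1 <= px <= ax2 and ay1 <= py <= ay2
```

For a 100x100 box at the origin, an anchor centered at (50, 50) passes the area test while one centered at (80, 50) does not, because the shrunken area only spans roughly 35 to 65 on each axis.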
detection heads
- T-head
- heavy head
- run on a sparse set of anchor boxes
- use the staggered feature pyramid
- generate 7x7 location-sensitive features for each anchor box
- cls branch
- two 1024-d fc layers
- one 81-d fc layer + softmax: CE loss
- reg branch
- four 3x3 convs, ch256
- flatten
- 1024-d fc
- 4-d fc: L1 loss
- mimicking target
- 81-d classification logits
- 1024-d regression feature
- S-head
- light-weight
- directly dense detection on fpn
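- [not fully understood] introducing the refinement module will break the location consistency between the anchor box and its corresponding features: my understanding is that a refined anchor no longer aligns with the feature-map location of its original anchor; the T-head uses the refined anchors while the S-head uses the original grid, hence the misalignment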
- use deformable convolution to capture the misaligned feature
- deformation offset is computed by a micro-network
- takes the regression output of the refinement module as input
- three 1x1 convs, ch 64/128/18 (50)
- 3x3 Dconv for P3 and 5x5 for others, ch256
- two sibling 1x1 convs, ch1024
- cls branch: 1x1 conv, ch80
- reg branch: 1x1 conv, ch4
- T-head
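The output widths 18 and 50 of the offset micro-network noted above match the usual deformable-convolution convention of predicting one (dx, dy) pair per kernel sampling point. A small check, with `offset_channels` as an illustrative helper name:

```python
def offset_channels(kernel_size):
    """Offset channels for a k x k deformable conv: one (dx, dy) pair per sampling point."""
    return 2 * kernel_size * kernel_size


print(offset_channels(3))  # 3x3 Dconv (used on P3)
print(offset_channels(5))  # 5x5 Dconv (used on the other levels)
```

This is why the last 1x1 conv of the micro-network has 18 channels when it feeds the 3x3 deformable conv and 50 channels when it feeds the 5x5 one.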
head mimicking
- cosine similarity
- cls logits & refine params
- To get the S-head feature of an adjusted anchor box
- trace back to its initial position
- extract the pixel at that position in the feature map
- loss: $L_{mimic} = 1 - \mathrm{cosine}(F_i^T, F_i^S)$
- multi-task training loss
- $L = L_R + L_S + L_T + L_{mimic}$
- $L_R$: refinement module loss, BCE + L1
- $L_S$: S-head loss, CE + L1
- $L_T$: T-head loss, CE + L1
- $L_{mimic}$: mimic loss
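A minimal pure-Python sketch of the mimic loss and the summed training loss written above. Averaging the per-box loss over boxes and the small `eps` guard are assumptions of this sketch; the notes only give the per-box form and an unweighted sum.

```python
import math


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity of two equal-length vectors; eps guards against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def mimic_loss(t_feats, s_feats):
    """Mean of 1 - cosine(F_i^T, F_i^S) over boxes (the averaging is an assumption)."""
    losses = [1.0 - cosine_similarity(t, s) for t, s in zip(t_feats, s_feats)]
    return sum(losses) / len(losses)


def total_loss(l_r, l_s, l_t, l_mimic):
    """L = L_R + L_S + L_T + L_mimic, unweighted as written in the notes."""
    return l_r + l_s + l_t + l_mimic
```

Because the loss depends only on the angle between the two feature vectors, an S-head feature that is a positive multiple of the T-head feature already gives zero mimic loss.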
- training details
- network: ResNet-50/101, images resized so that the shorter side is 800
- refinement module
- run NMS with 0.8 IoU threshold on anchor boxes
- select top 2000 boxes
- T-head
- sample 128 boxes from proposal
- positive/negative ratio: 1/3
- S-head
- hard mining: select 128 boxes with top loss value
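The two sampling strategies above can be sketched as follows. The helper names are illustrative, a 1/3 positive/negative ratio is read as 32 positives and 96 negatives out of 128, and filling with however many positives exist when there are fewer than 32 is an assumption of this sketch.

```python
import random


def sample_t_head(labels, num_samples=128, pos_fraction=0.25, seed=0):
    """Randomly sample proposals at a 1:3 positive/negative ratio (label 0 = background)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y > 0]
    neg = [i for i, y in enumerate(labels) if y == 0]
    num_pos = min(int(num_samples * pos_fraction), len(pos))
    num_neg = min(num_samples - num_pos, len(neg))
    return rng.sample(pos, num_pos), rng.sample(neg, num_neg)


def hard_mine_s_head(losses, k=128):
    """Return indices of the k boxes with the largest loss values."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:k]
```

For example, `hard_mine_s_head([0.1, 0.9, 0.5], k=2)` keeps the boxes at indices 1 and 2.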
- inference
- take top 1000 boxes from refine module
- NMS with 0.6 IoU threshold and 0.005 score threshold
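- [??] finally, keep the top 100 scoring boxes: I do not fully understand this part; the final result should no longer be a structured output, it should be the re-refined output of the one-stage detection head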
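One way to read the inference steps above is score filtering, then greedy NMS, then a cap on the number of detections. The sketch below is class-agnostic for brevity, which is an assumption; the thresholds are the ones listed in the notes, and the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(boxes, scores, score_thr=0.005, iou_thr=0.6, max_dets=100):
    """Drop low scores, run greedy NMS, and keep at most `max_dets` boxes."""
    order = sorted(
        (i for i, s in enumerate(scores) if s > score_thr),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
        if len(keep) == max_dets:
            break
    return keep
```

In this reading the "top 100" step only caps the number of final detections per image after NMS; the boxes themselves are still the refined outputs of the detection head.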