SSD: Single Shot MultiBox Detector
Motivation
- single network
 - speed & accuracy
 - 59 FPS / 74.3% mAP
 
Arguments
previous methods
- two-stage: generate sparse candidate boxes (proposals), then classify and regress them
 - one-stage: densely sample positions over the image at different scales and aspect ratios, then extract CNN features and classify/regress directly
 
fundamental speed improvement
- eliminating bounding box proposals
 - eliminating feature resampling
 
- other improvements 
- small convolutional filters for bbox categories and offsets (in contrast to the fully connected layers in YOLOv1)
 - separate predictors by aspect ratio
 - multiple scales
 - none of these techniques is novel on its own
 
 - The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps.
 
Method
Model
Multi-scale feature maps for detection: feature maps at multiple scales, progressively downsampled with stride-2 layers; the larger feature maps have more cells and are used to regress small objects
Convolutional predictors for detection: convolutional heads instead of the fc layers used in YOLOv1
Default boxes and aspect ratios: each cell has prior (default) boxes of 4 sizes; for each default box a set of 4 + (c+1) values is predicted, where the extra 1 can be read as a background class or as an objectness confidence; the offsets and the class scores each use a 3x3 conv head (see the sketch below)
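A minimal sketch of one such prediction head (PyTorch, not the authors' original Caffe code; `num_boxes` and `num_classes` are illustrative parameter names):

```python
import torch.nn as nn

class PredictionHead(nn.Module):
    """One 3x3 conv head per feature map: 4 offsets plus (c+1) class scores
    (the +1 is the background class) for each of the k default boxes per cell."""
    def __init__(self, in_channels, num_boxes, num_classes):
        super().__init__()
        self.num_outputs = num_classes + 1        # +1 for background
        self.loc = nn.Conv2d(in_channels, num_boxes * 4, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, num_boxes * self.num_outputs, 3, padding=1)

    def forward(self, x):                         # x: (B, C, H, W)
        b = x.size(0)
        loc = self.loc(x).permute(0, 2, 3, 1).reshape(b, -1, 4)
        cls = self.cls(x).permute(0, 2, 3, 1).reshape(b, -1, self.num_outputs)
        return loc, cls                           # (B, H*W*k, 4), (B, H*W*k, c+1)
```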
backbone
- keep the first four conv blocks of VGG16
- no dropout and no fc layers
 - conv5's pooling is changed from 2x2-s2 to 3x3-s1
 - conv6 is a 3x3x1024 atrous conv and conv7 a 1x1x1024 conv, output 19x19x1024
 - conv8 is a 1x1x256 followed by a 3x3x512 s2 conv, output 10x10x512
 - conv9 is a 1x1x128 followed by a 3x3x256 s2 conv, output 5x5x256
 - conv10 and conv11 are each a 1x1x128 followed by a 3x3x256 s1 p0 conv, outputs 3x3x256 and 1x1x256 (layer shapes sketched below)
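A rough PyTorch sketch of these extra layers, assuming a 300x300 input (only the layer shapes are meant to match the list above; in the actual model the detection heads tap conv4_3, conv7 and the intermediate conv8–conv11 outputs, not just the final tensor):

```python
import torch.nn as nn

extras = nn.Sequential(
    # conv6 (3x3x1024, atrous, dilation 6) + conv7 (1x1x1024): 19x19x512 -> 19x19x1024
    nn.Conv2d(512, 1024, 3, padding=6, dilation=6), nn.ReLU(inplace=True),
    nn.Conv2d(1024, 1024, 1), nn.ReLU(inplace=True),
    # conv8: 1x1x256 then 3x3x512 s2 -> 10x10x512
    nn.Conv2d(1024, 256, 1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    # conv9: 1x1x128 then 3x3x256 s2 -> 5x5x256
    nn.Conv2d(512, 128, 1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    # conv10: 1x1x128 then 3x3x256 s1 p0 -> 3x3x256
    nn.Conv2d(256, 128, 1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, 3), nn.ReLU(inplace=True),
    # conv11: 1x1x128 then 3x3x256 s1 p0 -> 1x1x256
    nn.Conv2d(256, 128, 1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, 3), nn.ReLU(inplace=True),
)
```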
 
 
- Training 
- Matching strategy: match default boxes to gt boxes
- first, for each gt box, find the single default box with the highest overlap
 - then, additionally match every default box whose overlap with a gt box exceeds 0.5
 - one gt box may therefore be matched to multiple default boxes
 - one default box is matched to at most one gt box (the one with the highest overlap); see the sketch below
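A small numpy sketch of this matching, assuming an IoU matrix between default boxes and gt boxes has already been computed (function and variable names are illustrative):

```python
import numpy as np

def match_default_boxes(iou, threshold=0.5):
    """iou: (num_defaults, num_gts). Returns, for each default box, the index
    of the gt box it is assigned to, or -1 for background."""
    num_defaults, num_gts = iou.shape
    matched = np.full(num_defaults, -1, dtype=int)

    # step 2: every default box whose best overlap exceeds the threshold is a positive
    best_gt = iou.argmax(axis=1)
    best_gt_iou = iou.max(axis=1)
    matched[best_gt_iou > threshold] = best_gt[best_gt_iou > threshold]

    # step 1 (applied last so it always wins): each gt box keeps its best default box,
    # even if that overlap is below the threshold
    best_default = iou.argmax(axis=0)
    matched[best_default] = np.arange(num_gts)
    return matched
```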
 
 - Objective loss 
- loc loss: smooth L1 on the offsets, encoded as in Faster R-CNN
 - cls loss: softmax loss
 - weighted sum: $L = \frac{1}{N} (L_{cls} + \alpha L_{loc})$ (see the sketch below)
- N is the number of matched default boxes
 - loss=0 when N=0
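A sketch of this objective in PyTorch (illustrative only: hard negative mining, described further below, is omitted, and the loc targets are assumed to be already encoded as offsets):

```python
import torch.nn.functional as F

def ssd_loss(loc_pred, cls_pred, loc_target, cls_target, alpha=1.0):
    # loc_pred/loc_target: (B, num_boxes, 4); cls_pred: (B, num_boxes, c+1)
    # cls_target: (B, num_boxes) with 0 = background
    pos = cls_target > 0                      # matched (positive) default boxes
    num_pos = pos.sum()
    if num_pos == 0:                          # the paper sets the loss to 0 when N = 0
        return loc_pred.sum() * 0.0

    loc_loss = F.smooth_l1_loss(loc_pred[pos], loc_target[pos], reduction='sum')
    cls_loss = F.cross_entropy(cls_pred.reshape(-1, cls_pred.size(-1)),
                               cls_target.reshape(-1), reduction='sum')
    return (cls_loss + alpha * loc_loss) / num_pos
```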
 
 
 - Choosing scales and aspect ratios for default boxes 
- each level's feature map has a different receptive field, so the default box scale differs per level
 - the number of boxes per cell also differs: conv4, conv10 and conv11 use 4, conv7, conv8 and conv9 use 6
 - aspect ratios: {1, 2, 3, 1/2, 1/3}, plus an extra ratio-1 box at scale $\sqrt{s_k s_{k+1}}$ (giving 6); the 4-box layers drop 3 and 1/3
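A sketch of the scale/aspect-ratio scheme: the paper spaces scales linearly, $s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1)$ with $s_{min} = 0.2$, $s_{max} = 0.9$ over $m$ feature maps (the exact per-layer settings in the released code differ slightly; this is illustrative):

```python
import math

def default_box_shapes(k, m=6, s_min=0.2, s_max=0.9, ratios=(1, 2, 3, 1/2, 1/3)):
    """Relative (width, height) of the default boxes for the k-th feature map
    (k is 1-based). For ratio 1 an extra box at scale sqrt(s_k * s_{k+1}) is
    added, so the 5 ratios yield 6 boxes; 4-box layers drop ratios 3 and 1/3."""
    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)
    s_next = s_min + (s_max - s_min) * k / (m - 1)
    boxes = [(s_k * math.sqrt(ar), s_k / math.sqrt(ar)) for ar in ratios]
    boxes.append((math.sqrt(s_k * s_next),) * 2)   # the extra ratio-1 box
    return boxes
```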
 - L2 normalization for conv4:
- $y_i = \frac{x_i}{\sqrt{\sum_{k=1}^n x_k^2}}$
- the effect is to normalize the feature vector at every location to unit norm, regardless of its original magnitude
 - the scale factor can be a fixed value or a learnable parameter
 - why only conv4? In the authors' other paper (ParseNet), conv4's feature magnitudes were found to differ markedly from those of the other layers (see the sketch below)
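A PyTorch sketch of that L2 normalization layer (the paper initializes the learnable per-channel scale to 20):

```python
import torch
import torch.nn as nn

class L2Norm(nn.Module):
    """Normalize each spatial position's feature vector to unit L2 norm,
    then rescale by a learnable per-channel factor."""
    def __init__(self, channels, init_scale=20.0):
        super().__init__()
        self.scale = nn.Parameter(torch.full((channels,), init_scale))

    def forward(self, x):                                  # x: (B, C, H, W)
        x = x / x.norm(p=2, dim=1, keepdim=True).clamp(min=1e-12)
        return x * self.scale.view(1, -1, 1, 1)
```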
 
 
 - predictions
- all default boxes, over different scales and aspect ratios, from all locations of several feature maps
 - significant imbalance between positives and negatives
 - Hard negative mining
- sort the negatives by confidence loss, highest first
 - pick the top ones so that the negative:positive ratio is at most 3:1 (see the sketch below)
 - faster optimization and a more stable training
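A sketch of the mining step in PyTorch (assumes the same tensor shapes as in the loss sketch above; the returned mask selects the boxes whose confidence loss is kept):

```python
import torch
import torch.nn.functional as F

def hard_negative_mask(cls_pred, cls_target, neg_pos_ratio=3):
    # cls_pred: (B, num_boxes, c+1); cls_target: (B, num_boxes), 0 = background
    with torch.no_grad():
        loss = F.cross_entropy(cls_pred.reshape(-1, cls_pred.size(-1)),
                               cls_target.reshape(-1), reduction='none')
        loss = loss.reshape(cls_target.shape)
        pos = cls_target > 0
        loss[pos] = 0                                    # rank negatives only
        num_neg = neg_pos_ratio * pos.sum(dim=1, keepdim=True)
        rank = loss.argsort(dim=1, descending=True).argsort(dim=1)
        neg = rank < num_neg                             # hardest negatives, at most 3:1
    return pos | neg
```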
 
 
 - Data augmentation 
- sample a patch with a specified minimum IoU with the objects (or keep the whole image, or take a random patch)
 - resize to the fixed input size (sketched below)
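A loose sketch of the patch-sampling options (per the paper: whole image, a patch with a minimum IoU of 0.1/0.3/0.5/0.7/0.9 with the objects, or an unconstrained random patch, with patch size in [0.1, 1] of the original and aspect ratio in [1/2, 2]; the re-sampling loop and the final resize/flip are left to the caller):

```python
import random

SAMPLING_MODES = ['whole', 0.1, 0.3, 0.5, 0.7, 0.9, 'random']

def sample_patch(img_w, img_h):
    """Return a patch rectangle (x, y, w, h) and the minimum-IoU constraint the
    caller should enforce (None means no constraint)."""
    mode = random.choice(SAMPLING_MODES)
    if mode == 'whole':
        return (0, 0, img_w, img_h), None
    scale = random.uniform(0.1, 1.0)                 # relative patch size
    ratio = random.uniform(0.5, 2.0)                 # patch aspect ratio
    w = max(1, min(img_w, int(img_w * scale * ratio ** 0.5)))
    h = max(1, min(img_h, int(img_h * scale / ratio ** 0.5)))
    x = random.randint(0, img_w - w)
    y = random.randint(0, img_h - h)
    return (x, y, w, h), (None if mode == 'random' else mode)
```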
 
 
 
- Properties
- much worse performance on smaller objects; increasing the input size helps
 - data augmentation is crucial, giving an 8.8% mAP improvement
 - the atrous version is faster: keeping pool5 unchanged gives about the same result while being about 20% slower