SSD

SSD: Single Shot MultiBox Detector

  1. Motivation

    • single network
    • speed & accuracy
    • 59 FPS / 74.3% mAP (SSD300 on VOC2007 test)
  2. Key points

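    • previous methods

      • two-stage: generate a sparse set of candidate boxes, then classify and regress each candidate
      • one-stage: densely sample boxes uniformly over image locations at different scales and aspect ratios, extract features with a CNN, then classify and regress directly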
    • fundamental speed improvement

      • eliminating bounding box proposals
      • eliminating feature resampling
    • other improvements
      • small convolutional filters to predict bbox categories and offsets (in contrast to the fully connected layers of YOLOv1)
      • separate predictors by aspect ratio
      • multiple scales
      • none of these techniques is new in itself
    • The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps.
  3. Method

    • Model

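      • Multi-scale feature maps for detection: feature maps at several scales are used, downsampled progressively with stride-2 layers; the larger (higher-resolution) feature maps have more cells and are used to regress small objects

      • Convolutional predictors for detection: these replace the fc layers used in YOLOv1

      • Default boxes and aspect ratios: each cell has 4 default boxes of different sizes (6 on some layers); for every default box the network predicts 4 + (c+1) values, where the extra 1 can be read either as a background class or as an objectness confidence; the box offsets and the class scores each use one conv3x3 head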

      • backbone

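      • the first four VGG16 conv blocks are kept unchanged
        • no dropout and no fc layers
        • the conv5 pooling (pool5) changes from 2x2-s2 to 3x3-s1
        • conv6 is a 3x3x1024 atrous conv and conv7 is a 1x1x1024 conv (replacing fc6 and fc7), output 19x19x1024
        • conv8 is a 1x1x256 conv followed by a 3x3x512 s2 conv, output 10x10x512
        • conv9 is a 1x1x128 conv followed by a 3x3x256 s2 conv, output 5x5x256
        • conv10 and conv11 are each a 1x1x128 conv followed by a 3x3x256 s1 p0 conv, outputs 3x3x256 and 1x1x256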
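The spatial sizes listed above follow from the usual convolution output-size formula. The sketch below reproduces the 19 → 10 → 5 → 3 → 1 chain. The padding of 1 on the stride-2 convs is an assumption (the notes only state p0 for conv10 and conv11), and `conv_out`, `EXTRA_LAYERS` and `feature_map_sizes` are illustrative names, not part of any SSD codebase.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution (floor convention)."""
    return (size + 2 * pad - kernel) // stride + 1


# (name, kernel, stride, pad) of the 3x3 conv that ends each extra block.
# The 1x1 conv in front of it leaves the spatial size unchanged.
# pad=1 for the stride-2 convs is an assumption, not stated in the notes.
EXTRA_LAYERS = [
    ("conv8", 3, 2, 1),
    ("conv9", 3, 2, 1),
    ("conv10", 3, 1, 0),
    ("conv11", 3, 1, 0),
]


def feature_map_sizes(conv7_size=19):
    """Side length of each detection feature map, starting from conv7."""
    sizes = {"conv7": conv7_size}
    size = conv7_size
    for name, kernel, stride, pad in EXTRA_LAYERS:
        size = conv_out(size, kernel, stride, pad)
        sizes[name] = size
    return sizes


print(feature_map_sizes())
```

Under these assumptions the stride-2 blocks roughly halve the map and the unpadded 3x3 convs shrink it by 2 per layer, which is how the network ends at a 1x1 map.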
    • Training
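      • Matching strategy: match default boxes to gt boxes
        • first, for each gt box, find the default box with the largest overlap
        • then match every default box whose overlap with a gt box is greater than 0.5
        • one gt box may be matched to several default boxes
        • one default box is matched to only one gt box (the one with the largest overlap)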
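The two matching steps can be sketched in plain Python as below. This is a minimal illustration, not the reference implementation: boxes are assumed to be `(xmin, ymin, xmax, ymax)` tuples, `iou` and `match` are hypothetical helper names, and the rule that a later gt box wins when two gt boxes claim the same best default box is an arbitrary choice made here.

```python
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(default_boxes, gt_boxes, threshold=0.5):
    """For each default box, return the index of its gt box, or -1 (negative)."""
    if not default_boxes or not gt_boxes:
        return [-1] * len(default_boxes)
    overlaps = [[iou(d, g) for g in gt_boxes] for d in default_boxes]
    assignment = [-1] * len(default_boxes)

    # Threshold step: a default box takes its best-overlapping gt box
    # if that overlap exceeds the threshold.
    for i, row in enumerate(overlaps):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] > threshold:
            assignment[i] = j

    # Best-match step: every gt box claims the default box it overlaps most,
    # even below the threshold, so no gt box is left unmatched.
    for j in range(len(gt_boxes)):
        i = max(range(len(default_boxes)), key=lambda i: overlaps[i][j])
        assignment[i] = j
    return assignment


defaults = [(0, 0, 10, 10), (0, 0, 9, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
gts = [(0, 0, 10, 10), (22, 22, 34, 34)]
print(match(defaults, gts))  # [0, 0, 1, -1]
```

In the example the second gt box overlaps its best default box by only about 0.36, yet it is still matched through the best-match step, while the first gt box picks up two default boxes through the threshold step.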
      • Objective loss
        • loc loss: smooth L1 on box offsets, parameterized as in Faster R-CNN
        • cls loss: softmax loss
        • weighted sum: $L = \frac{1}{N} (L_{cls} + \alpha L_{loc})$, where
          • N is the number of matched default boxes
          • the loss is set to 0 when N = 0
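A small numeric sketch of the weighted sum may help. It assumes label 0 means background, that the caller passes only the boxes that take part in the classification loss, and that alpha defaults to 1; the function names are illustrative only.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_cross_entropy(logits, label):
    """Numerically stable -log softmax(logits)[label]."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def ssd_loss(cls_logits, labels, loc_preds, loc_targets, alpha=1.0):
    """L = (L_cls + alpha * L_loc) / N, with N the number of matched boxes."""
    num_pos = sum(1 for y in labels if y > 0)  # label 0 is background (assumption)
    if num_pos == 0:
        return 0.0  # loss is defined as 0 when nothing is matched
    l_cls = sum(softmax_cross_entropy(z, y) for z, y in zip(cls_logits, labels))
    # Localization loss is accumulated over matched (positive) boxes only.
    l_loc = sum(
        smooth_l1(p - t)
        for y, preds, targets in zip(labels, loc_preds, loc_targets)
        if y > 0
        for p, t in zip(preds, targets)
    )
    return (l_cls + alpha * l_loc) / num_pos


example = ssd_loss([[0.0, 0.0]], [1], [[0.5, 0.0, 0.0, 2.0]], [[0.0, 0.0, 0.0, 0.0]])
print(example)  # log(2) + 1.625
```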
      • Choosing scales and aspect ratios for default boxes
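        • each feature-map level has a different receptive field, so its default boxes have a different size
        • the number per cell also differs: 4 for conv4, conv10 and conv11; 6 for conv7, conv8 and conv9
        • ratios: {1, 2, 3, 1/2, 1/3}, plus one extra box of ratio 1 at a larger scale; the 4-box layers drop 3 and 1/3
        • L2 normalization for conv4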
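As a sanity check on the per-layer counts, the sketch below totals the default boxes and builds the box shapes for one cell. Several details are assumptions taken from the SSD paper rather than from these notes: the 38x38 size of conv4 for a 300x300 input, the linear scale rule with s_min = 0.2 and s_max = 0.9, the shape rule w = s·sqrt(a), h = s/sqrt(a), and the extra ratio-1 box with scale sqrt(s_k · s_{k+1}). All names are illustrative.

```python
import math

# (layer, feature-map side, default boxes per cell).
# The 38x38 size of conv4 assumes a 300x300 input.
LAYERS = [
    ("conv4", 38, 4),
    ("conv7", 19, 6),
    ("conv8", 10, 6),
    ("conv9", 5, 6),
    ("conv10", 3, 4),
    ("conv11", 1, 4),
]


def total_default_boxes(layers=LAYERS):
    """Number of default boxes summed over all cells of all layers."""
    return sum(side * side * k for _, side, k in layers)


def linear_scales(m, s_min=0.2, s_max=0.9):
    """Scales spaced evenly from s_min to s_max over m feature maps."""
    return [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]


def box_shapes(scale, next_scale, ratios):
    """(w, h) pairs relative to the image size for one cell."""
    shapes = [(scale * math.sqrt(r), scale / math.sqrt(r)) for r in ratios]
    extra = math.sqrt(scale * next_scale)  # extra ratio-1 box at a larger scale
    shapes.append((extra, extra))
    return shapes


print(total_default_boxes())  # 8732
print(len(box_shapes(0.2, 0.34, [1, 2, 1 / 2])))  # 4
print(len(box_shapes(0.34, 0.48, [1, 2, 3, 1 / 2, 1 / 3])))  # 6
```

Every box built from a ratio keeps the area scale², so changing the aspect ratio reshapes the box without changing how much of the image it covers.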
      • predictions
        • predictions are made for all default boxes, with different scales and aspect ratios, at all locations of several feature maps
        • significant imbalance between positive and negative examples, since most default boxes are negatives
        • Hard negative mining
          • sort the negative boxes by confidence loss, highest first
          • keep the top ones so that the negative:positive ratio is at most 3:1
          • leads to faster optimization and more stable training
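Hard negative mining as summarized above can be sketched in a few lines. The per-box confidence losses and the positive mask are assumed to be given, and `hard_negative_mining` is a hypothetical helper name.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return sorted indices of the boxes kept for the classification loss."""
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first: highest confidence loss.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    kept_negatives = negatives[: neg_pos_ratio * len(positives)]
    return sorted(positives + kept_negatives)


losses = [0.1, 2.0, 0.5, 3.0, 0.2, 1.0, 0.05]
positive = [True, False, False, False, False, False, False]
print(hard_negative_mining(losses, positive))  # [0, 1, 3, 5]
```

With one positive box, only the three negatives with the largest loss survive; the easy negatives (indices 2, 4 and 6) do not contribute to the loss.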
      • Data augmentation
        • sample a patch with a specified minimum IoU with the objects
        • resize the sampled patch to the fixed input size
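A rough sketch of the patch sampling step follows, using relative coordinates in [0, 1]. The patch size range (0.1 to 1 of the image side) and the aspect-ratio limits (1/2 to 2) are taken from the SSD paper rather than from these notes. The retry limit, the whole-image fallback, and the function names are assumptions made for this illustration.

```python
import random


def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sample_patch(gt_boxes, min_iou, rng, max_tries=50):
    """Sample a crop whose IoU with at least one gt box is >= min_iou."""
    for _ in range(max_tries):
        w = rng.uniform(0.1, 1.0)
        h = rng.uniform(0.1, 1.0)
        if not 0.5 <= w / h <= 2.0:
            continue  # reject extreme aspect ratios
        x = rng.uniform(0.0, 1.0 - w)
        y = rng.uniform(0.0, 1.0 - h)
        patch = (x, y, x + w, y + h)
        if any(iou(patch, g) >= min_iou for g in gt_boxes):
            return patch
    return (0.0, 0.0, 1.0, 1.0)  # fallback (assumption): use the whole image


gt = [(0.2, 0.2, 0.8, 0.8)]
patch = sample_patch(gt, 0.3, random.Random(0))
print(patch)
```

The returned patch would then be cropped from the image and resized to the network input size, with the gt boxes clipped and rescaled accordingly (not shown here).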
  4. Properties
    • much worse performance on small objects; increasing the input size helps
    • data augmentation is crucial, giving an 8.8% mAP improvement
    • atrous is faster: if pool5 is kept unchanged (no atrous conv), the result is about the same while the speed is about 20% slower