SSD: Single Shot MultiBox Detector
Motivation
- single network
 - speed & accuracy
 - 59 FPS / 74.3% mAP
 
Arguments
previous methods
- two-stage: generate sparse candidate boxes (proposals), then classify and regress them
 - one-stage: densely sample positions over the image at different scales and aspect ratios, then extract CNN features and classify/regress directly
 
fundamental speed improvement
- eliminating bounding box proposals
 - eliminating feature resampling
 
- other improvements 
- small convolutional filters for bbox categories and offsets (in contrast to the fully connected layers in YOLOv1)
 - separate predictors by aspect ratio
 - multiple scales
 - none of these techniques is novel on its own
 
 - The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps.
 
Method
Model
Multi-scale feature maps for detection: feature maps at multiple scales, progressively downsampled with stride-2 layers; the larger feature maps have more cells and are used to regress small objects
Convolutional predictors for detection: convolutional heads instead of the fc layers used in YOLOv1
Default boxes and aspect ratios: each cell has prior (default) boxes of 4 sizes; for each default box a set of 4 + (c+1) values is predicted, where the extra 1 can be read as a background class or as an objectness confidence; the offsets and the class scores each use a 3x3 conv head (see the sketch below)
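A minimal sketch of one such prediction head (PyTorch, not the authors' original Caffe code; `num_boxes` and `num_classes` are illustrative parameter names):

```python
import torch.nn as nn

class PredictionHead(nn.Module):
    """One 3x3 conv head per feature map: 4 offsets plus (c+1) class scores
    (the +1 is the background class) for each of the k default boxes per cell."""
    def __init__(self, in_channels, num_boxes, num_classes):
        super().__init__()
        self.num_outputs = num_classes + 1        # +1 for background
        self.loc = nn.Conv2d(in_channels, num_boxes * 4, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, num_boxes * self.num_outputs, 3, padding=1)

    def forward(self, x):                         # x: (B, C, H, W)
        b = x.size(0)
        loc = self.loc(x).permute(0, 2, 3, 1).reshape(b, -1, 4)
        cls = self.cls(x).permute(0, 2, 3, 1).reshape(b, -1, self.num_outputs)
        return loc, cls                           # (B, H*W*k, 4), (B, H*W*k, c+1)
```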
backbone
- keep the first four conv blocks of VGG16
- no dropout and no fc layers
 - conv5's pooling is changed from 2x2-s2 to 3x3-s1
 - conv6 is a 3x3x1024 atrous conv and conv7 a 1x1x1024 conv, output 19x19x1024
 - conv8 is a 1x1x256 followed by a 3x3x512 s2 conv, output 10x10x512
 - conv9 is a 1x1x128 followed by a 3x3x256 s2 conv, output 5x5x256
 - conv10 and conv11 are each a 1x1x128 followed by a 3x3x256 s1 p0 conv, outputs 3x3x256 and 1x1x256 (layer shapes sketched below)
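A rough PyTorch sketch of these extra layers, assuming a 300x300 input (only the layer shapes are meant to match the list above; in the actual model the detection heads tap conv4_3, conv7 and the intermediate conv8–conv11 outputs, not just the final tensor):

```python
import torch.nn as nn

extras = nn.Sequential(
    # conv6 (3x3x1024, atrous, dilation 6) + conv7 (1x1x1024): 19x19x512 -> 19x19x1024
    nn.Conv2d(512, 1024, 3, padding=6, dilation=6), nn.ReLU(inplace=True),
    nn.Conv2d(1024, 1024, 1), nn.ReLU(inplace=True),
    # conv8: 1x1x256 then 3x3x512 s2 -> 10x10x512
    nn.Conv2d(1024, 256, 1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    # conv9: 1x1x128 then 3x3x256 s2 -> 5x5x256
    nn.Conv2d(512, 128, 1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    # conv10: 1x1x128 then 3x3x256 s1 p0 -> 3x3x256
    nn.Conv2d(256, 128, 1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, 3), nn.ReLU(inplace=True),
    # conv11: 1x1x128 then 3x3x256 s1 p0 -> 1x1x256
    nn.Conv2d(256, 128, 1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, 3), nn.ReLU(inplace=True),
)
```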
 
 
- Training 
- Matching strategy: match default boxes to gt boxes
- first, for each gt box, find the single default box with the highest overlap
 - then, additionally match every default box whose overlap with a gt box exceeds 0.5
 - one gt box may therefore be matched to multiple default boxes
 - one default box is matched to at most one gt box (the one with the highest overlap); see the sketch below
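A small numpy sketch of this matching, assuming an IoU matrix between default boxes and gt boxes has already been computed (function and variable names are illustrative):

```python
import numpy as np

def match_default_boxes(iou, threshold=0.5):
    """iou: (num_defaults, num_gts). Returns, for each default box, the index
    of the gt box it is assigned to, or -1 for background."""
    num_defaults, num_gts = iou.shape
    matched = np.full(num_defaults, -1, dtype=int)

    # step 2: every default box whose best overlap exceeds the threshold is a positive
    best_gt = iou.argmax(axis=1)
    best_gt_iou = iou.max(axis=1)
    matched[best_gt_iou > threshold] = best_gt[best_gt_iou > threshold]

    # step 1 (applied last so it always wins): each gt box keeps its best default box,
    # even if that overlap is below the threshold
    best_default = iou.argmax(axis=0)
    matched[best_default] = np.arange(num_gts)
    return matched
```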
 
 - Objective loss 
- loc loss: smooth L1 on the offsets, encoded as in Faster R-CNN
 - cls loss: softmax loss
 - weighted sum: $L = \frac{1}{N} (L_{cls} + \alpha L_{loc})$ (see the sketch below)
- N is the number of matched default boxes
 - loss=0 when N=0
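A sketch of this objective in PyTorch (illustrative only: hard negative mining, described further below, is omitted, and the loc targets are assumed to be already encoded as offsets):

```python
import torch.nn.functional as F

def ssd_loss(loc_pred, cls_pred, loc_target, cls_target, alpha=1.0):
    # loc_pred/loc_target: (B, num_boxes, 4); cls_pred: (B, num_boxes, c+1)
    # cls_target: (B, num_boxes) with 0 = background
    pos = cls_target > 0                      # matched (positive) default boxes
    num_pos = pos.sum()
    if num_pos == 0:                          # the paper sets the loss to 0 when N = 0
        return loc_pred.sum() * 0.0

    loc_loss = F.smooth_l1_loss(loc_pred[pos], loc_target[pos], reduction='sum')
    cls_loss = F.cross_entropy(cls_pred.reshape(-1, cls_pred.size(-1)),
                               cls_target.reshape(-1), reduction='sum')
    return (cls_loss + alpha * loc_loss) / num_pos
```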
 
 
 - Choosing scales and aspect ratios for default boxes 
- each level's feature map has a different receptive field, so the default box scale differs per level
 - the number of boxes per cell also differs: conv4, conv10 and conv11 use 4, conv7, conv8 and conv9 use 6
 - aspect ratios: {1, 2, 3, 1/2, 1/3}, plus an extra ratio-1 box at scale $\sqrt{s_k s_{k+1}}$ (giving 6); the 4-box layers drop 3 and 1/3
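A sketch of the scale/aspect-ratio scheme: the paper spaces scales linearly, $s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1)$ with $s_{min} = 0.2$, $s_{max} = 0.9$ over $m$ feature maps (the exact per-layer settings in the released code differ slightly; this is illustrative):

```python
import math

def default_box_shapes(k, m=6, s_min=0.2, s_max=0.9, ratios=(1, 2, 3, 1/2, 1/3)):
    """Relative (width, height) of the default boxes for the k-th feature map
    (k is 1-based). For ratio 1 an extra box at scale sqrt(s_k * s_{k+1}) is
    added, so the 5 ratios yield 6 boxes; 4-box layers drop ratios 3 and 1/3."""
    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)
    s_next = s_min + (s_max - s_min) * k / (m - 1)
    boxes = [(s_k * math.sqrt(ar), s_k / math.sqrt(ar)) for ar in ratios]
    boxes.append((math.sqrt(s_k * s_next),) * 2)   # the extra ratio-1 box
    return boxes
```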
 - L2 normalization for conv4:
- $y_i = \frac{x_i}{\sqrt{\sum_{k=1}^n x_k^2}}$
- the effect is to normalize the feature vector at every location to unit norm, regardless of its original magnitude
 - the scale factor can be a fixed value or a learnable parameter
 - why only conv4? In the authors' other paper (ParseNet), conv4's feature magnitudes were found to differ markedly from those of the other layers (see the sketch below)
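A PyTorch sketch of that L2 normalization layer (the paper initializes the learnable per-channel scale to 20):

```python
import torch
import torch.nn as nn

class L2Norm(nn.Module):
    """Normalize each spatial position's feature vector to unit L2 norm,
    then rescale by a learnable per-channel factor."""
    def __init__(self, channels, init_scale=20.0):
        super().__init__()
        self.scale = nn.Parameter(torch.full((channels,), init_scale))

    def forward(self, x):                                  # x: (B, C, H, W)
        x = x / x.norm(p=2, dim=1, keepdim=True).clamp(min=1e-12)
        return x * self.scale.view(1, -1, 1, 1)
```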
 
 
 - predictions
- all default boxes, over different scales and aspect ratios, from all locations of several feature maps
 - significant imbalance between positives and negatives
 - Hard negative mining
- sort the negatives by confidence loss, highest first
 - pick the top ones so that the negative:positive ratio is at most 3:1 (see the sketch below)
 - faster optimization and a more stable training
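A sketch of the mining step in PyTorch (assumes the same tensor shapes as in the loss sketch above; the returned mask selects the boxes whose confidence loss is kept):

```python
import torch
import torch.nn.functional as F

def hard_negative_mask(cls_pred, cls_target, neg_pos_ratio=3):
    # cls_pred: (B, num_boxes, c+1); cls_target: (B, num_boxes), 0 = background
    with torch.no_grad():
        loss = F.cross_entropy(cls_pred.reshape(-1, cls_pred.size(-1)),
                               cls_target.reshape(-1), reduction='none')
        loss = loss.reshape(cls_target.shape)
        pos = cls_target > 0
        loss[pos] = 0                                    # rank negatives only
        num_neg = neg_pos_ratio * pos.sum(dim=1, keepdim=True)
        rank = loss.argsort(dim=1, descending=True).argsort(dim=1)
        neg = rank < num_neg                             # hardest negatives, at most 3:1
    return pos | neg
```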
 
 
 - Data augmentation 
- sample a patch with a specified minimum IoU with the objects (or keep the whole image, or take a random patch)
 - resize to the fixed input size (sketched below)
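A loose sketch of the patch-sampling options (per the paper: whole image, a patch with a minimum IoU of 0.1/0.3/0.5/0.7/0.9 with the objects, or an unconstrained random patch, with patch size in [0.1, 1] of the original and aspect ratio in [1/2, 2]; the re-sampling loop and the final resize/flip are left to the caller):

```python
import random

SAMPLING_MODES = ['whole', 0.1, 0.3, 0.5, 0.7, 0.9, 'random']

def sample_patch(img_w, img_h):
    """Return a patch rectangle (x, y, w, h) and the minimum-IoU constraint the
    caller should enforce (None means no constraint)."""
    mode = random.choice(SAMPLING_MODES)
    if mode == 'whole':
        return (0, 0, img_w, img_h), None
    scale = random.uniform(0.1, 1.0)                 # relative patch size
    ratio = random.uniform(0.5, 2.0)                 # patch aspect ratio
    w = max(1, min(img_w, int(img_w * scale * ratio ** 0.5)))
    h = max(1, min(img_h, int(img_h * scale / ratio ** 0.5)))
    x = random.randint(0, img_w - w)
    y = random.randint(0, img_h - h)
    return (x, y, w, h), (None if mode == 'random' else mode)
```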
 
 
 
- Properties
- much worse performance on smaller objects; increasing the input size helps
 - data augmentation is crucial, giving an 8.8% mAP improvement
 - the atrous version is faster: keeping pool5 unchanged gives about the same result while being about 20% slower