SSD: Single Shot MultiBox Detector
Motivation
- single network
- speed & accuracy
- 59 FPS with 74.3% mAP on VOC2007 test (SSD300)
Key arguments
Previous methods
- two-stage: generate a sparse set of candidate boxes (proposals), then classify and regress each proposal
- one-stage: densely sample boxes at different positions, scales, and aspect ratios over the image, extract features with a CNN, then classify and regress directly
fundamental speed improvement
- eliminating bounding box proposals
- eliminating feature resampling
- other improvements
- small convolutional filters for bbox categories and offsets (in contrast to YOLOv1's fully connected layers)
- separate predictors by aspect ratio
- multiple scales
- none of these techniques is original on its own
- The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps.
Method
Model
Multi-scale feature maps for detection: use feature maps at multiple scales, progressively downsampled with stride 2; larger feature maps have more cells and are used to detect smaller objects
Convolutional predictors for detection: in contrast to YOLOv1's fc layers
Default boxes and aspect ratios: each cell has default boxes of several sizes (4 or 6 depending on the layer); for each default box predict 4 offsets + (c+1) scores, where the extra 1 can be viewed as a background class or as an objectness confidence; the class and offset predictions each use a 3x3 conv head
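A minimal sketch of the prediction-head channel arithmetic described above, assuming VOC's 20 classes plus background and k = 6 default boxes per cell (the two 3x3 conv heads are applied per feature map location):

```python
def head_channels(num_classes=21, k=6):
    """Output channels of the two 3x3 conv prediction heads at one feature map:
    the classification head predicts k * num_classes scores (background included),
    the localization head predicts k * 4 box offsets."""
    return k * num_classes, k * 4

cls_channels, loc_channels = head_channels()  # VOC: 20 classes + background, k = 6
```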
backbone
- the first four conv blocks of VGG16 are kept unchanged
- no dropout and no fc layers
- conv5's pooling is changed from 2x2-s2 to 3x3-s1
- conv6 is a 3x3x1024 dilated (atrous) conv and conv7 is a 1x1x1024 conv; output 19x19x1024
- conv8 is a 1x1x256 conv followed by a 3x3x512 s2 conv; output 10x10x512
- conv9 is a 1x1x128 conv followed by a 3x3x256 s2 conv; output 5x5x256
- conv10 and conv11 are each a 1x1x128 conv followed by a 3x3x256 s1 p0 conv; outputs 3x3x256 and 1x1x256
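The spatial sizes above can be checked with the standard conv output-size formula; the padding of 1 for the stride-2 3x3 convs is an assumption, since the notes only list kernel and stride:

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution: floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# walking the extra layers from conv7's 19x19 output
s8 = conv_out(19, 3, 2, 1)    # conv8's 3x3 s2 conv (p1 assumed): 19 -> 10
s9 = conv_out(s8, 3, 2, 1)    # conv9: 10 -> 5
s10 = conv_out(s9, 3, 1, 0)   # conv10: 5 -> 3
s11 = conv_out(s10, 3, 1, 0)  # conv11: 3 -> 1
```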
Training
- Matching strategy: match default boxes to gt boxes
- first, for each gt box, find the default box with the highest overlap
- then, take every default box whose overlap with a gt box exceeds 0.5
- one gt box may be matched to multiple default boxes
- one default box is matched to at most one gt box (the one with the highest overlap)
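A NumPy sketch of the two matching steps, assuming a precomputed IoU matrix; the rare conflict where one default box is the best candidate for two gt boxes is resolved by running the per-gt force-match last:

```python
import numpy as np

def match_default_boxes(iou, threshold=0.5):
    """iou: (num_gt, num_default) overlap matrix.
    Returns, per default box, the matched gt index (-1 = background)."""
    num_gt, num_default = iou.shape
    matches = np.full(num_default, -1, dtype=int)
    # step 2: every default box whose best overlap exceeds the threshold
    # is matched to that gt box
    best_gt = iou.argmax(axis=0)
    best_iou = iou.max(axis=0)
    matches[best_iou > threshold] = best_gt[best_iou > threshold]
    # step 1 (applied last so it wins): each gt keeps its single best
    # default box, even when the overlap is below the threshold
    for g in range(num_gt):
        matches[iou[g].argmax()] = g
    return matches
```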
- Objective loss
- loc loss: smooth L1 on box offsets, parameterized as in Faster R-CNN
- cls loss: softmax loss
- weighted sum: $L = \frac{1}{N} (L_{cls} + \alpha L_{loc})$
- N is the number of matched default boxes
- loss = 0 when N = 0
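A NumPy sketch of the weighted objective, assuming log-softmax class scores and precomputed offset targets; hard-negative selection is omitted here, and the N = 0 case returns 0 as in the paper:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * ax ** 2, ax - 0.5)

def multibox_loss(log_probs, target_cls, loc_pred, loc_gt, alpha=1.0):
    """log_probs: (D, C) log-softmax scores, class 0 = background;
    target_cls: (D,) labels; loc_pred / loc_gt: (D, 4) offsets."""
    pos = target_cls > 0
    n = pos.sum()
    if n == 0:
        return 0.0  # the loss is defined as 0 when no default box matches
    conf = -log_probs[np.arange(len(target_cls)), target_cls].sum()
    loc = smooth_l1(loc_pred[pos] - loc_gt[pos]).sum()
    return (conf + alpha * loc) / n
```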
- Choosing scales and aspect ratios for default boxes
- the feature maps at different levels have different receptive fields, so the default box sizes differ
- the number per cell also differs: conv4, conv10, and conv11 use 4; conv7, conv8, and conv9 use 6
- aspect ratios: {1, 2, 3, 1/2, 1/3}; layers with 4 boxes drop 3 and 1/3; an extra ratio-1 box at scale $\sqrt{s_k s_{k+1}}$ brings the total from 5 to 6
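The paper's linear scale rule and box shapes can be sketched as follows (s_min = 0.2, s_max = 0.9; the smaller conv4 scale that the paper sets separately is ignored here); the 5 aspect ratios plus the extra ratio-1 box give 6 boxes per cell:

```python
import math

def default_scales(m=6, s_min=0.2, s_max=0.9):
    """s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1), for k = 1..m."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def boxes_for_layer(s_k, s_k_next, ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """(w, h) = (s_k * sqrt(ar), s_k / sqrt(ar)) per ratio, plus the extra
    ratio-1 box at scale sqrt(s_k * s_{k+1})."""
    boxes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in ratios]
    extra = math.sqrt(s_k * s_k_next)
    boxes.append((extra, extra))
    return boxes

# SSD300 default box count across the six feature maps (4 or 6 boxes per cell)
total_boxes = 38 * 38 * 4 + 19 * 19 * 6 + 10 * 10 * 6 + 5 * 5 * 6 + 3 * 3 * 4 + 1 * 1 * 4
```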
- L2 normalization for conv4:
- $y_i = \frac{x_i}{\sqrt{\sum_{k=1}^n x_k^2}}$
- the effect is to normalize features of different magnitudes into unit-norm vectors
- scale: can be a fixed value or a learnable parameter
- why only conv4? in the authors' other paper (ParseNet), they found that the feature magnitudes of conv4 differ from those of the other layers
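A NumPy sketch of the channel-wise L2 normalization with a per-channel scale (learnable in SSD):

```python
import numpy as np

def l2norm(x, gamma, eps=1e-10):
    """x: (C, H, W) feature map; gamma: (C,) per-channel scale.
    Normalizes each spatial position's channel vector to unit norm, then rescales."""
    norm = np.sqrt((x ** 2).sum(axis=0, keepdims=True)) + eps
    return gamma[:, None, None] * (x / norm)
```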
- predictions
- all default boxes with different scales and aspect ratios from all locations of many feature maps
- this creates a significant positive/negative imbalance
- Hard negative mining
- sort the negatives by their confidence loss, highest first
- keep the top ones so that the negative:positive ratio is at most 3:1
- leads to faster optimization and more stable training
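A NumPy sketch of hard negative mining, assuming per-box confidence losses are already computed; negatives are ranked by loss and kept up to 3x the number of positives:

```python
import numpy as np

def hard_negative_mining(conf_loss, pos_mask, neg_pos_ratio=3):
    """conf_loss: (D,) per-default-box confidence loss; pos_mask: (D,) bool.
    Returns a bool mask keeping all positives and the hardest negatives."""
    num_neg = min(neg_pos_ratio * int(pos_mask.sum()), int((~pos_mask).sum()))
    neg_loss = np.where(pos_mask, -np.inf, conf_loss)  # exclude positives from ranking
    hardest = np.argsort(-neg_loss)[:num_neg]
    keep = pos_mask.copy()
    keep[hardest] = True
    return keep
```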
- Data augmentation
- randomly sample a patch with a constrained minimum IoU against the objects (the paper uses {0.1, 0.3, 0.5, 0.7, 0.9}), or use the original image
- resize each sampled patch to the fixed input size
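A minimal sketch of the IoU-constrained patch sampling; the patch size range, trial count, and single-box constraint are simplifying assumptions (the paper also samples aspect ratios and constrains the IoU against every object):

```python
import random

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) form."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def sample_patch(gt_box, min_iou, trials=50, seed=0):
    """Sample a patch (normalized coords) whose IoU with gt_box is at least
    min_iou; fall back to the whole image if no trial succeeds."""
    rng = random.Random(seed)
    for _ in range(trials):
        w, h = rng.uniform(0.3, 1.0), rng.uniform(0.3, 1.0)
        x, y = rng.uniform(0.0, 1.0 - w), rng.uniform(0.0, 1.0 - h)
        patch = (x, y, x + w, y + h)
        if box_iou(patch, gt_box) >= min_iou:
            return patch
    return (0.0, 0.0, 1.0, 1.0)
```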
Properties
- performance is much worse on smaller objects; increasing the input size helps
- Data augmentation is crucial, yielding an 8.8% mAP improvement
- Atrous is faster: keeping pool5 unchanged (2x2-s2) gives about the same result while being about 20% slower