CornerNet

CornerNet: Detecting Objects as Paired Keypoints

  1. Motivation

    • corner formulation
      • top-left corner
      • bottom-right corner
    • anchor-free
    • corner pooling
    • no multi-scale
  2. Arguments

    • anchor box drawbacks

      • a huge set of anchor boxes is needed to ensure sufficient overlap with ground truth, which causes a huge positive/negative imbalance
      • introduces many hyperparameters and design choices (how many boxes, their sizes and aspect ratios)
    • CornerNet

      • detect and group

        • heatmap to predict corners
          • mathematically, w·h possible top-left corners and w·h possible bottom-right corners over the whole image can express on the order of w²h² boxes
        • anchor-based: w·h center locations with ~9 anchor sizes can only express a limited set of boxes, and some ground-truth shapes may fail to match any anchor
        • embeddings to group pairs of corners

      • corner pooling

        • helps localize corners, which usually lie outside the object and lack local visual evidence

      • modified hourglass architecture

      • a novel variant of focal loss to train the corner heatmaps

  3. Method

    • two prediction modules

      • heatmaps

        • C channels, where C is the number of categories (no background channel)

        • binary mask

        • each corner has only one ground-truth positive

        • reduce the penalty for negatives within a radius of each positive, since nearby corner pairs can still produce boxes with high IoU

          • the radius is chosen so that any corner pair within it yields a box with at least 0.3 IoU with the ground-truth box
          • penalty reduction $=e^{-\frac{x^2+y^2}{2\sigma^2}}$, with $\sigma$ set to 1/3 of the radius
        • variant focal loss

          • $\alpha=2, \beta=4$

          • N is the number of objects in the image (the loss is written out below)
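
          Written out from the definitions above (the formulation in the paper, with $p_{cij}$ the predicted score and $y_{cij}$ the Gaussian-reduced ground truth at location $(i,j)$ for class $c$):

          $$
          L_{det} = \frac{-1}{N} \sum_{c=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W}
          \begin{cases}
          (1-p_{cij})^{\alpha}\log(p_{cij}) & \text{if } y_{cij}=1 \\
          (1-y_{cij})^{\beta}(p_{cij})^{\alpha}\log(1-p_{cij}) & \text{otherwise}
          \end{cases}
          $$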

      • embeddings

        • associative embedding
        • use 1-dimension embedding
        • pull and push loss on gt positives
          • $L_{pull} = \frac{1}{N} \sum_{k=1}^{N} [(e_{tk}-e_k)^2 + (e_{bk}-e_k)^2]$
          • $L_{push} = \frac{1}{N(N-1)} \sum_{j=1}^{N}\sum_{k\neq j}^{N} \max(0, \Delta - |e_k-e_j|)$
          • $e_k$ is the average of $e_{tk}$ and $e_{bk}$
          • $\Delta = 1$
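
        A minimal PyTorch sketch of the pull/push losses above, assuming the 1-D embeddings have already been gathered at the N ground-truth corner locations (tensor names are illustrative):

        ```python
        import torch

        def embedding_losses(e_tl, e_br, delta=1.0):
            # e_tl, e_br: (N,) 1-D embeddings of the top-left / bottom-right corner
            # of each of the N ground-truth objects
            e_k = (e_tl + e_br) / 2                              # per-object mean embedding
            pull = ((e_tl - e_k) ** 2 + (e_br - e_k) ** 2).mean()

            n = e_k.numel()
            if n < 2:                                            # push needs at least two objects
                return pull, e_k.new_zeros(())
            dist = (e_k.unsqueeze(0) - e_k.unsqueeze(1)).abs()   # (N, N) pairwise |e_k - e_j|
            push = torch.clamp(delta - dist, min=0)
            off_diag = ~torch.eye(n, dtype=torch.bool, device=e_k.device)
            push = push[off_diag].sum() / (n * (n - 1))          # exclude the j == k terms
            return pull, push
        ```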
      • offsets

        • remapping from heatmap resolution back to the original resolution loses precision

        • this greatly affects the IoU of small bounding boxes

        • offsets are shared among all categories

        • smooth L1 loss on gt positives

          $$
          L_{off} = \frac{1}{N} \sum_{k=1}^{N} \text{SmoothL1}(o_k, \hat o_k)
          $$
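
        For completeness, the offset target for object $k$ (with downsampling factor $n$) is the fractional part discarded by the floor when mapping to the heatmap:

        $$
        o_k = \left(\frac{x_k}{n} - \left\lfloor\frac{x_k}{n}\right\rfloor,\ \frac{y_k}{n} - \left\lfloor\frac{y_k}{n}\right\rfloor\right)
        $$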

    • corner pooling

      • top-left corner pooling layer (see the PyTorch sketch below):
        • starting from the current location $(i,j)$,
        • element-wise max over all feature vectors below $(i,j)$ in the same column gives $t_{i,j}$
        • element-wise max over all feature vectors to the right of $(i,j)$ in the same row gives $l_{i,j}$
        • finally add the two vectors

      • bottom-right corner pooling: max to the left and upward instead
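
      A short PyTorch sketch of top-left corner pooling via a cumulative max (the two input feature maps come from separate conv branches; shapes and names are illustrative):

      ```python
      import torch

      def top_left_corner_pool(t_feat, l_feat):
          # t_feat, l_feat: (N, C, H, W) feature maps from the two branches
          # t_{i,j} = max over all rows k >= i in column j (element-wise max "downward")
          t = torch.flip(torch.cummax(torch.flip(t_feat, dims=[2]), dim=2).values, dims=[2])
          # l_{i,j} = max over all columns k >= j in row i (element-wise max "to the right")
          l = torch.flip(torch.cummax(torch.flip(l_feat, dims=[3]), dim=3).values, dims=[3])
          return t + l  # bottom-right pooling is the same computation with the flips removed
      ```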

    • Hourglass Network

      • hourglass modules
        • series of convolution and max pooling layers
        • series of upsampling and convolution layers
        • skip layers
      • multiple hourglass modules stacked:reprocess the features to capture higher-level information

      • intermediate supervision

        • standard intermediate supervision:

          the input to the next hourglass module consists of three parts

          • the input of the previous module
          • the output of the previous module
          • the output of the intermediate supervision head
        • this paper also applies intermediate supervision, but does not add the intermediate predictions back into the network

          • hourglass2 input: 1x1 Conv-BN applied to both the input and the output of hourglass1, element-wise add, then ReLU
    • Our backbone

      • 2 hourglasses
      • downsamples 5 times, with channels [256, 384, 384, 384, 512]
      • uses stride-2 convs instead of max-pooling for downsampling
      • upsampling path: 2 residual modules + nearest neighbor upsampling
      • skip connection: 2 residual modules, then element-wise add
      • mid connection: 4 residual modules
      • stem: 7x7 stride-2 conv (128 channels) + stride-2 residual block (256 channels)
      • hourglass2 input: 1x1 Conv-BN applied to both the input and the output of hourglass1, element-wise add, then ReLU
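
      A compact, simplified PyTorch sketch of the recursive hourglass described above; the Residual block and exact channel bookkeeping are illustrative rather than the paper's implementation:

      ```python
      import torch.nn as nn

      class Residual(nn.Module):
          """Basic residual block (3x3-3x3, projection shortcut when shape changes)."""
          def __init__(self, c_in, c_out, stride=1):
              super().__init__()
              self.main = nn.Sequential(
                  nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False),
                  nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
                  nn.Conv2d(c_out, c_out, 3, 1, 1, bias=False),
                  nn.BatchNorm2d(c_out))
              self.skip = (nn.Sequential(nn.Conv2d(c_in, c_out, 1, stride, bias=False),
                                         nn.BatchNorm2d(c_out))
                           if (stride != 1 or c_in != c_out) else nn.Identity())
              self.relu = nn.ReLU(inplace=True)

          def forward(self, x):
              return self.relu(self.main(x) + self.skip(x))

      class Hourglass(nn.Module):
          """Recursive hourglass: stride-2 residual downsampling (no max-pooling),
          2-residual skip connections, 4 residuals at the bottleneck, and
          residuals + nearest-neighbor upsampling on the way back up."""
          def __init__(self, dims):
              # dims[0] is the input dim; dims[1:] are the channels after each of
              # the 5 downsamplings, e.g. [256, 256, 384, 384, 384, 512]
              super().__init__()
              cur, nxt = dims[0], dims[1]
              self.skip = nn.Sequential(Residual(cur, cur), Residual(cur, cur))
              self.down = Residual(cur, nxt, stride=2)
              self.inner = (Hourglass(dims[1:]) if len(dims) > 2
                            else nn.Sequential(*[Residual(nxt, nxt) for _ in range(4)]))
              self.up = nn.Sequential(Residual(nxt, nxt), Residual(nxt, cur),
                                      nn.Upsample(scale_factor=2, mode="nearest"))

          def forward(self, x):
              return self.skip(x) + self.up(self.inner(self.down(x)))
      ```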
  4. Experiments

    • training details
      • randomly initialized, no pretrained
      • bias: set the biases in the conv layers that predict the corner heatmaps (initialized as in the focal loss paper, so that initial corner scores are low)
      • input:511x511
      • output:128x128
      • apply PCA-based color augmentation to the input image
      • full loss:$L = L_{det} + \alpha L_{pull} + \beta L_{push} + \gamma L_{off}$
        • grouping (pull/push) losses: $\alpha=\beta=0.1$
        • offset loss: $\gamma=1$
      • batch size = 49 (4 images on the master GPU + 5 images on each of the other 9 GPUs)
    • test details
      • NMS:3x3 max pooling on heatmaps
      • pick:top100 top-left corners & top100 bottom-right corners
      • filter pairs (see the decoding sketch below):
        • reject pairs whose embedding L1 distance is greater than 0.5
        • reject pairs whose corners come from different categories
      • fusion: combine the detections from the original and horizontally flipped images, then apply soft-NMS
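
      A rough PyTorch sketch of the decoding steps above (3x3 max-pool NMS, top-100 corners, embedding-distance and geometry filtering); offsets, multi-category handling and flipped-image fusion are left out, and the tensor names are hypothetical:

      ```python
      import torch
      import torch.nn.functional as F

      def decode_single_class(tl_heat, br_heat, tl_emb, br_emb, K=100, emb_thr=0.5):
          """Decode corners into boxes for one image and one category."""
          def nms(heat):
              # "NMS" via 3x3 max pooling: keep only local maxima of the heatmap
              pooled = F.max_pool2d(heat[None, None], 3, stride=1, padding=1)[0, 0]
              return heat * (pooled == heat).float()

          def topk(heat):
              scores, idx = heat.flatten().topk(K)
              ys = torch.div(idx, heat.shape[1], rounding_mode="floor")
              return scores, ys, idx % heat.shape[1]

          tl_s, tl_y, tl_x = topk(nms(tl_heat))          # top K top-left corners
          br_s, br_y, br_x = topk(nms(br_heat))          # top K bottom-right corners

          # consider all K x K corner pairs, then filter
          e_dist = (tl_emb[tl_y, tl_x].unsqueeze(1) - br_emb[br_y, br_x].unsqueeze(0)).abs()
          geom = (tl_y.unsqueeze(1) <= br_y.unsqueeze(0)) & (tl_x.unsqueeze(1) <= br_x.unsqueeze(0))
          keep = (e_dist < emb_thr) & geom               # embedding match + valid geometry

          scores = (tl_s.unsqueeze(1) + br_s.unsqueeze(0)) / 2
          x1, y1, x2, y2 = torch.broadcast_tensors(
              tl_x.unsqueeze(1), tl_y.unsqueeze(1), br_x.unsqueeze(0), br_y.unsqueeze(0))
          boxes = torch.stack([x1, y1, x2, y2], dim=-1).float()
          return boxes[keep], scores[keep]
      ```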
    • Ablation Study
      • corner pooling is especially helpful for medium and large objects
      • penalty reduction especially benefits medium and large objects
      • CornerNet achieves a much higher AP at 0.9 IoU than other detectors: it is better at generating tightly localized, high-quality boxes
      • error analysis:the main bottleneck is detecting corners

CornerNet-Lite: Efficient Keypoint-Based Object Detection

  1. Motivation

    • keypoint-based methods

      • detecting and grouping
      • accurate, but at a high processing cost
    • propose CornerNet-Lite

      • CornerNet-Saccade:attention mechanism
      • CornerNet-Squeeze:a new compact backbone
    • performance: both variants improve the accuracy-efficiency trade-off over CornerNet

  2. Arguments

    • main drawbacks of CornerNet
      • inference speed
      • reducing the number of scales or the image resolution cause a large accuracy drop
    • two orthogonal directions
      • reduce the number of pixels to process: CornerNet-Saccade
      • reduce the amount of processing per pixel: CornerNet-Squeeze
    • CornerNet-Saccade
      • downsized attention map
      • select a subset of crops to examine in high resolution
      • for offline use: 43.2% AP at 190 ms per image
    • CornerNet-Squeeze
      • inspired by squeezeNet and mobileNet
      • 1x1 convs
      • bottleneck layers
      • depth-wise separable convolution
      • for real-time use: 34.4% AP at 30 ms per image
    • combined?
      • CornerNet-Squeeze-Saccade turns out slower and less accurate than CornerNet-Squeeze
    • Saccades (quick glances at selected regions)
      • to generate interesting crops
      • R-CNN series: single-type & single-object saccades
      • AutoFocus: adds a crop-prediction branch on top of an existing detector, thus multi-type & mixed saccades (both single-object and multi-object crops)
      • CornerNet-Saccade:
        • single-type & multi-object saccades
        • the number of crops can be much smaller than the number of objects
  3. Method

    • CornerNet-Saccade

      • step1:obtain possible locations

        • downsize: two scales, longer side 255 & 192 pixels (the smaller scale is zero-padded so both can be processed together)
        • predicts 3 attention maps, one per object size (see the sketch below)
          • small object: longer side < 32 pixels
          • medium object: longer side 32-96 pixels
          • large object: longer side > 96 pixels
          • separate maps let us control the zoom-in factor: zoom in more for smaller objects
          • input feature maps: taken at different scales from the upsampling layers of the hourglass
          • attention head: 3x3 conv-ReLU + 1x1 conv-sigmoid
          • process locations where the attention score > 0.3
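
        A tiny sketch of how the thresholded attention maps could be turned into zoom-in proposals (function, threshold, and zoom names are illustrative, not the paper's code):

        ```python
        import torch

        def select_locations(att_small, att_medium, att_large, thr=0.3, zooms=(4, 2, 1)):
            # small objects get the largest zoom-in factor, large objects the smallest
            proposals = []
            for att, zoom in zip((att_small, att_medium, att_large), zooms):
                ys, xs = torch.nonzero(att > thr, as_tuple=True)   # locations with score > 0.3
                proposals += [(int(y), int(x), zoom) for y, x in zip(ys, xs)]
            return proposals
        ```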
      • step2:finer detection

        • zoom-in scales: 4, 2, 1 for small, medium, and large objects respectively
        • apply CornerNet-Saccade on the ROI
          • 255x255 window
          • centered at the location
      • step3:NMS

        • soft-nms
        • remove the bounding boxes which touch the crop boundary
      • CornerNet-Saccade uses the same network for attention maps and bounding boxes

        • in step 1, the downsized images may already yield boxes for some large objects
        • those locations are still zoomed in and re-detected to refine the boxes
      • efficiency

        • regions / cropped images are processed in batch, in parallel
        • resize / crop operations are implemented on the GPU
        • redundant regions are suppressed with an NMS-like procedure before detection

    • new hourglass backbone

      • 3 hourglass modules, total depth 54 (Hourglass-54)
      • downsizes twice before the hourglass modules
      • downsizes 3 times in each module, with channels [384, 384, 512]
      • one residual module in each downsampling stage and in each skip connection
      • mid connection: one residual module with 512 channels
    • CornerNet-Squeeze

      • replaces the heavy Hourglass-104 backbone
      • uses fire modules (see the sketch below) in place of residual blocks
      • downsizes 3 times before the hourglass modules
      • downsizes 4 times in each module
      • replaces the 3x3 convs in the prediction heads with 1x1 convs
      • replaces the nearest neighbor upsampling with 4x4 transposed convs
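
      A hedged PyTorch sketch of a fire module with the depth-wise separable 3x3 branch described above (channel split and normalization details are illustrative, not the exact CornerNet-Squeeze configuration):

      ```python
      import torch
      import torch.nn as nn

      class FireModule(nn.Module):
          def __init__(self, c_in, c_squeeze, c_expand):
              super().__init__()
              # squeeze: reduce channels with 1x1 convs
              self.squeeze = nn.Sequential(
                  nn.Conv2d(c_in, c_squeeze, 1, bias=False),
                  nn.BatchNorm2d(c_squeeze), nn.ReLU(inplace=True))
              # expand branch 1: plain 1x1 conv
              self.expand1x1 = nn.Conv2d(c_squeeze, c_expand, 1, bias=False)
              # expand branch 2: 3x3 depth-wise conv + 1x1 point-wise conv
              # (the depth-wise separable replacement for the original 3x3 conv)
              self.expand3x3 = nn.Sequential(
                  nn.Conv2d(c_squeeze, c_squeeze, 3, padding=1, groups=c_squeeze, bias=False),
                  nn.Conv2d(c_squeeze, c_expand, 1, bias=False))
              self.bn = nn.BatchNorm2d(2 * c_expand)
              self.relu = nn.ReLU(inplace=True)

          def forward(self, x):
              x = self.squeeze(x)
              x = torch.cat([self.expand1x1(x), self.expand3x3(x)], dim=1)
              return self.relu(self.bn(x))
      ```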