

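  • [centerNet] The real centerNet: Objects as Points, utexas. This is the true centerNet: built on a segmentation-style architecture, it predicts a center-point heatmap plus 2 to N extra channels that regress the other object properties

  • [corner-centerNet] centerNet: Keypoint Triplets for Object Detection. This paper claimed the name centerNet first, but I think corner-centerNet fits better: it derives from cornerNet and adds one more check on top, keeping a box only if the center point of its corner pair is foreground

  • [centerNet2] Probabilistic two-stage detection, utexas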

centerNet: Objects as Points

  1. Motivation

    • anchor-based

      • exhaustive list of potential locations
      • wasteful, inefficient, requires additional post-processing
    • our detector

      • center: use keypoint estimation to find center points
      • other properties: regress
    • tasks

      • object detection
      • 3d object detection
      • multi-person human pose estimation

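  2. Key arguments

    • compared with traditional one-stage and two-stage detectors
      • anchor:
        • box & kp: anchors are boxes matched by overlap, whereas a center point is matched by the grid cell it falls in
        • nms: take local peaks, no need for NMS
        • larger resolution: hourglass architecture with a stride-4 output heatmap, which eliminates the need for multiple anchors
    • compared with keypoint estimation networks
      • them: require a grouping stage
      • ours: locate only one center point per object, no need for grouping or post-processing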
  3. Method

    • loss

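      • keypoint loss

        • center keypoint definition: each object has exactly one gt point; centered on it, apply an object size-adaptive Gaussian penalty reduction, taking the element-wise max where Gaussians overlap

        • focal loss: essentially the same as cornerNet

          • $\alpha=2, \beta=4$
          • background points are penalized, with the penalty reduced according to the Gaussian falloff around each gt point
      • offset loss

        • only two channels (x_offset & y_offset): shared among categories
        • the gt offset is the discretization error introduced by flooring: with gt center $p$ and output stride $R$, the target is $p/R - \lfloor p/R \rfloor$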
        • L1 loss
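The keypoint and offset targets described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, `sigma` is assumed to be given (the paper derives it from the object size), and the loss is normalized by the number of positive points.

```python
import math
import numpy as np

def draw_gaussian(heatmap, cx, cy, sigma):
    """Splat an unnormalized Gaussian centered at (cx, cy) onto a [H, W] heatmap.
    Overlapping Gaussians are merged with an element-wise max."""
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)
    return heatmap

def center_targets(px, py, stride=4):
    """Map an input-resolution center to its heatmap cell and the
    sub-cell offset lost by flooring (the offset regression target)."""
    qx, qy = px / stride, py / stride
    ix, iy = math.floor(qx), math.floor(qy)
    return (ix, iy), (qx - ix, qy - iy)

def penalty_reduced_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-12):
    """Focal loss with Gaussian penalty reduction on negatives.
    pred, gt: arrays of the same shape with values in [0, 1]."""
    pos = gt == 1.0
    pos_term = ((1 - pred) ** alpha * np.log(pred + eps))[pos].sum()
    # Negatives close to a gt center have gt near 1, so (1 - gt)^beta shrinks their penalty.
    neg_term = ((1 - gt) ** beta * pred ** alpha * np.log(1 - pred + eps))[~pos].sum()
    num_pos = max(int(pos.sum()), 1)  # avoid dividing by zero on empty images
    return -(pos_term + neg_term) / num_pos
```

For example, a center at input pixel (14, 9) with stride 4 lands in heatmap cell (3, 2) with offset target (0.5, 0.25).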
    • centerNet

      • output

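        • part one: center points, [h,w,c], one binary mask per category
        • part two: offset, [h,w,2], shared among categories
        • part three: size, [h,w,2], shared among categories
          • L1 loss, use raw pixel coordinates
        • overall
          • C+4 channels; the formulation matches traditional detection, except that traditional detectors compute gt as values relative to anchors, while this paper regresses absolute values directly
          • $L_{det} = L_k + \lambda_{size} L_{size} + \lambda_{off} L_{off}$
          • for the formulation of the other tasks, see the first figure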
      • inference workflow

        • local peaks:
          • for each category channel
          • all responses greater than or equal to their 8-connected neighbors: 3x3 max pooling
          • keep the top 100
        • generate bounding boxes
          • combine the offset & size predictions
          • open question: is there really no further post-processing? what about false positives?
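The inference workflow above (3x3 local-maximum test, top-k selection, then combining offset and size) can be sketched like this. Several details are assumptions made for illustration: the function names are hypothetical, tensors use the [h,w,c] layout from these notes, top-k is taken over all categories at once, channel 0/1 of `offset` and `size` hold x/y and width/height, and both are expressed in heatmap cells so the boxes are scaled by the stride at the end.

```python
import numpy as np

def local_peaks(heat, k=100):
    """heat: [H, W, C] scores. A cell is a peak if it is >= its 8 neighbors
    in the same category channel (equivalent to comparing with a 3x3 max pool)."""
    h, w, _ = heat.shape
    padded = np.pad(heat, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    # Max over the 9 shifted views = 3x3 max pooling with stride 1.
    pooled = np.max([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)], axis=0)
    ys, xs, cls = np.nonzero(heat >= pooled)
    scores = heat[ys, xs, cls]
    order = np.argsort(-scores)[:k]  # keep the k highest-scoring peaks
    return scores[order], cls[order], ys[order], xs[order]

def decode_boxes(heat, offset, size, k=100, stride=4):
    """Turn peaks into (x1, y1, x2, y2) boxes in input-image pixels."""
    scores, cls, ys, xs = local_peaks(heat, k)
    cx = xs + offset[ys, xs, 0]
    cy = ys + offset[ys, xs, 1]
    bw = size[ys, xs, 0]
    bh = size[ys, xs, 1]
    boxes = np.stack([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2], axis=1)
    return boxes * stride, scores, cls
```

Because a peak must dominate its own 3x3 neighborhood, two adjacent cells of the same category cannot both fire unless they tie, which is what lets peak picking stand in for NMS here.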
      • encoder-decoder backbone: output stride 4

        • hourglass104
          • stem: stride 4
          • modules: two
        • resnet18/101+deformable conv upsampling
          • 3x3 deformable conv, 256/128/64
          • bilinear interpolation
        • DLA34+deformable conv upsampling

      • heads

        • independent heads
        • one 3x3 conv, 256 channels
        • 1x1 conv for prediction
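  4. Summary


    • center regression plays the role of confidence regression; the difference lies in the targets: Gaussian / [0,1] / [0,-1,1]
    • size regression now uses raw pixels and is no longer anchor-based
    • the hourglass structure is essentially an FPN, and stacked hourglasses are comparable to BiFPN
    • multi-scale prediction becomes a single high-resolution feature map; multi-scale prediction is still possible, but it then needs NMS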

centerNet2: Probabilistic two-stage detection

  1. Motivation

    • two-stage
    • probabilistic interpretation
    • the suggested pipeline
      • stage 1: infer a proper object-vs-background likelihood, focusing on foreground/background separation
      • stage 2: inform the overall score
    • verified on COCO
      • faster and more accurate than both one-stage and two-stage detectors
      • outperforms yolov4
      • extremely large model: 56.4 mAP
      • standard ResNeXt-32x8d-101-DCN backbone: 50.2 mAP
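  2. Key arguments

    • one-stage detectors
      • dense prediction
      • jointly predict class & location
      • anchor-based: RetinaNet uses focal loss to deal with foreground/background imbalance
      • anchor-free: FCOS & CenterNet are grid-based rather than anchor-based, which alleviates the imbalance
      • deformable conv: AlignDet adds a deformable conv layer before the output to get richer features
      • sound probabilistic interpretation
      • heavier separate classification and regression branches than two-stage models: with a very large number of classes the heads become very heavy and seriously hurt performance
      • misalignment issue: one-stage predictions rely on local features, so the receptive field and the anchor settings both affect how well the features align with the object
    • two-stage detectors
      • first, an RPN generates coarse object proposals
      • then a per-region head classifies and refines them
      • ROI heads: Faster-RCNN uses two fc layers as ROI heads
      • cascaded: CascadeRCNN uses three cascaded Faster-RCNN stages, each with a different positive threshold
      • semantic branch: HTC uses an extra segmentation branch to enhance the inter-stage feature flow
      • decouple: TSD decouples the cls & pos ROI heads
      • weak RPN: because it pushes recall as high as possible, the proposal scores are inaccurate and a clear probabilistic interpretation is lost
      • independent probabilistic interpretation: the two stages are trained separately, and the final cls score uses only the second stage
      • slow: the large number of proposals slows things down
    • other detectors
      • point-based: cornerNet predicts & groups two corner points; centerNet predicts a center point and regresses width and height from it
      • transformer: DETR directly predicts a set of bounding boxes instead of the traditional structured dense output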
    • network architectures

      • one/two-stage detectors: image classification network + lightweight upsampling layers + heads
      • point-based: FCN with symmetric downsampling and upsampling layers, predicting a heatmap at a small stride
      • DETR: feature extraction + transformer decoder
    • our method

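      • first stage
        • a one-stage detector doing binary classification to extract foreground
        • implemented with region-level features + a classifier (FCN-based)
      • second stage
        • position-based class prediction
        • can be implemented either with a Faster-RCNN or with a classifier
      • the final loss combines both stages, rather than training the stages separately
      • main differences from former two-stage frameworks
        • adds a joint probabilistic objective over both stages
        • in earlier two-stage detectors the RPN mainly serves to maximize recall and does not produce accurate likelihoods
      • faster and more accurate
        • first, the first stage yields fewer and more accurate proposals
        • second, the second stage makes full use of years of progress in two-stage detection, standing on the shoulders of giants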
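  3. Method

    • joint class distribution: first, how the first and second stages are coupled
      • [first-stage foreground/background score] times [second-stage class score]
      • $P(C_k) = \sum_o P(C_k|O_k=o)P(O_k=o)$
      • maximum likelihood estimation
        • for annotated objects
          • since $P(C_k|O_k=0)=0$ for any object class, the objective reduces to independent maximum-likelihood objectives for the two stages
          • $\log P(C_k) = \log P(C_k|O_k=1) + \log P(O_k=1)$
        • for background class
          • does not factorize
          • $\log P(bg) = \log( P(bg|O_k=1) P(O_k=1) + P(O_k=0))$
          • two lower bounds: the first from Jensen's inequality, the second from dropping the non-negative term $P(bg|O_k=1) P(O_k=1)$
          • $\log P(bg) \ge P(O_k=1) \log P(bg|O_k=1)$: when the first-stage foreground probability is very large ($P(O_k=1) \to 1$), both sides approach $\log P(bg|O_k=1)$, so the bound is tight
          • $\log P(bg) \ge \log P(O_k=0)$
          • optimizing both bounds jointly works better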
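A small numeric check makes the joint distribution and the two background bounds concrete. This is only a sketch: the function names and the example probabilities are made up, and the inputs are assumed to lie strictly between 0 and 1 so every logarithm is defined.

```python
import math

def joint_class_probs(p_obj, p_cls_given_obj):
    """Combine the first-stage objectness P(O=1) with second-stage class
    probabilities P(C | O=1), given as a dict that includes a 'bg' entry."""
    probs = {c: p * p_obj for c, p in p_cls_given_obj.items() if c != "bg"}
    # Background absorbs both "rejected by stage 2" and "rejected by stage 1".
    probs["bg"] = p_cls_given_obj["bg"] * p_obj + (1.0 - p_obj)
    return probs

def background_bounds(p_obj, p_bg_given_obj):
    """Return (exact log P(bg), Jensen lower bound, first-stage lower bound)."""
    exact = math.log(p_bg_given_obj * p_obj + (1.0 - p_obj))
    jensen = p_obj * math.log(p_bg_given_obj)  # concavity of log
    first_stage = math.log(1.0 - p_obj)        # drop the non-negative term
    return exact, jensen, first_stage

probs = joint_class_probs(0.8, {"cat": 0.7, "dog": 0.2, "bg": 0.1})
exact, jensen, first_stage = background_bounds(0.8, 0.1)
```

With these numbers the joint probabilities sum to 1, and the exact background log-likelihood is bounded from below by both surrogate terms.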
    • network design: how to turn a one-stage detector into a two-stage probabilistic detector

      • experiment with 4 different designs for the first-stage RPN
      • RetinaNet
        • RetinaNet is actually very similar to a two-stage RPN; the core differences are:
          • a heavier head design: 4-conv vs 1-conv
            • RetinaNet is backbone + FPN + individual heads
            • RPN is backbone + FPN + shared convs + individual heads
          • a stricter positive and negative anchor definition: both use IoU-based anchor selection, but with different thresholds
          • focal loss
        • first-stage design
          • all three points above are kept in the probabilistic model
          • the separate heads are then changed to shared heads
      • centerNet
        • model upgrade
          • upgraded to multi-scale: uses a ResNet-FPN backbone, P3-P7
          • FCOS-style head: individual heads with no shared conv; the cls branch predicts centerness + cls, the reg branch predicts the regression parameters
          • positives are also assigned with the FCOS strategy: position & scale-based
        • the upgraded model, used for the one-stage & two-stage experiments, is denoted centerNet*
      • ATSS
        • a method with an adaptive IoU threshold; centerness represents the score of a grid cell
        • the proposal score of this model is defined as centerness * classification score
        • in addition, in the two-stage setting the RPN's cls & reg heads are again merged
      • GFL: not read yet, skipping for now
      • second-stage designs: FasterRCNN & CascadeRCNN
      • deformable conv: used in both the ResNet and DLA backbones of centerNet v1; v2 mainly uses ResNeXt-32x8d-101-DCN
    • hyperparameters for two-stage probabilistic model

      • two-stage models usually use P2-P6 and one-stage models usually use P3-P7: we use P3-P7
      • increase the positive IoU threshold: 0.5 to [0.6,0.7,0.8]
      • maximum of 256 proposals (vs. 1k in the original)
      • increase NMS threshold from 0.5 to 0.7
      • SGD, 90K iterations
      • base learning rate: 0.02 for two-stage & 0.01 for one-stage, decayed by 0.1
      • multi-scale training: short side in [640,800], long side at most 1333
      • fixed-scale testing: short side 800, long side at most 1333
      • first-stage loss weight: 0.5, because one-stage detectors are usually trained starting from a 0.01 lr
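The short-side / long-side resize rule used for training and testing can be written as a small helper. This is an illustrative sketch: the function name is hypothetical, and rounding to the nearest integer is an assumption rather than something stated in the notes.

```python
def resize_shape(h, w, short=800, max_long=1333):
    """Scale (h, w) so the short side equals `short`, unless that would push
    the long side past `max_long`; in that case cap the long side instead."""
    scale = short / min(h, w)
    if max(h, w) * scale > max_long:
        scale = max_long / max(h, w)
    return int(h * scale + 0.5), int(w * scale + 0.5)
```

For multi-scale training, `short` would be sampled from [640, 800] per image, while testing fixes it at 800.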
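  4. Experiments

    • comparison of the 4 designs

      • every probabilistic model beats its one-stage counterpart and is even faster (because the head is simplified)
      • every probabilistic FasterRCNN beats the original RPN-based FasterRCNN and is also faster (because P3-P7 costs about half the computation of P2-P6, and the second stage sees fewer proposals)
      • the CascadeRCNN-CenterNet design performs best: from here on it is called CenterNet2

    • comparison with other real-time models

      • most real-time models are one-stage models
      • the two-stage model can be both faster and more accurate than one-stage models

    • SOTA comparison

      • reports a SOTA of 56.4%, but by word of mouth it seems to perform poorly in practice
      • figures omitted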

corner-centerNet: Keypoint Triplets for Object Detection

  1. Motivation

    • based on cornerNet

    • triplet

      • corner keypoints: weak grouping ability causes false positives
      • correct predictions can be determined by checking the central parts

    • cascade corner pooling and center pooling

  2. Key arguments

    • what's new in CenterNet
      • triplet inference workflow
        • after a proposal is generated as a pair of corner keypoints
        • check whether there is a center keypoint of the same class
      • center pooling
        • for predicting center keypoints
        • each location's response is the sum of the maximum responses along its horizontal and vertical directions
      • cascade corner pooling
        • equips the original corner pooling module with the ability to perceive internal information
        • considers not only the boundary directions but also the internal directions
    • cornerNet pain points
      • high FP rate
      • the FP rate is especially high for small objects
      • one idea: a cornerNet-based RPN
        • but a native RPN is shared and reused
        • computational efficiency?
  3. Method

    • center pooling

      • geometric centers & semantic centers
      • center pooling effectively propagates the most semantically informative points (semantic centers) to the physical center points (geometric centers), i.e. the central region
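The core operation of center pooling can be sketched on a single 2D feature map. This is a simplified illustration with a hypothetical function name: in the actual network the horizontal and vertical maxima are taken on learned conv features, which is omitted here.

```python
import numpy as np

def center_pooling(feat):
    """feat: [H, W]. Each output cell is the max of its row plus the max of
    its column, so a strong response anywhere along either axis reaches the
    geometric center."""
    row_max = feat.max(axis=1, keepdims=True)  # [H, 1], horizontal direction
    col_max = feat.max(axis=0, keepdims=True)  # [1, W], vertical direction
    return row_max + col_max                   # broadcasts to [H, W]
```

A cell with a weak response of its own still gets a high pooled value when semantically strong points lie in its row and column, which is how semantic centers inform the geometric center.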