YOLACT

  • [YOLACT] Real-time Instance Segmentation: 33 FPS / 30 mAP
  • [YOLACT++] Better Real-time Instance Segmentation: 33.5 FPS / 34.1 mAP

YOLACT: Real-time Instance Segmentation

  1. Motivation

    • create a real-time instance segmentation model on top of a fast, one-stage detection model

    • forgoes an explicit localization step (e.g., feature repooling)

      • doesn’t depend on repooling (RoI Pooling)
      • produces very high-quality masks
    • sets up two parallel subtasks

      • prototypes: conv
      • mask coefficients: fc
      • the prototype masks and per-instance mask coefficients are then linearly combined to obtain each instance's mask
  • ‘prototypes’: analogous to a shared vocabulary (cf. Bag of Features below)

  • fully-convolutional

    • localization is still translation variant
  • Fast NMS

  2. Arguments

    • State-of-the-art approaches to instance segmentation like Mask R-CNN and FCIS directly build off of advances in object detection like Faster R-CNN and R-FCN

      • focus primarily on performance over speed
      • these methods “re-pool” features in some bounding box region
      • inherently sequential, and therefore difficult to accelerate
    • One-stage instance segmentation methods generate position-sensitive maps

      • still require repooling or other non-trivial computations
    • prototypes

      • related works use prototypes to represent features (Bag of Features)
      • we use them to assemble masks for instance segmentation
      • we learn prototypes that are specific to each image, rather than global prototypes shared across the entire dataset
    • Bag of Features

      • BoF treats an image like a text document: the different local regions or features of an image act as its words, drawn from a shared codebook

      • all samples share one vocabulary; for each image, counting the frequency of each word yields the image's feature vector (a minimal sketch follows)
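
A minimal numpy sketch of this shared-codebook idea; the codebook size, descriptor count, and dimensionality are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 128))      # 16 shared "visual words", 128-d each
descriptors = rng.normal(size=(500, 128))  # local descriptors of one image

# assign each descriptor to its nearest codeword (L2 distance)
dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=-1)
words = dists.argmin(axis=1)

# the image's feature vector is the normalized word-frequency histogram
hist = np.bincount(words, minlength=len(codebook)).astype(float)
hist /= hist.sum()
print(hist.shape)  # (16,), a fixed-length vector regardless of image size
```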

  3. Method

    • parallel tasks

      • The first branch uses an FCN to produce a set of image-sized “prototype masks” that do not depend on any one instance.
      • The second adds an extra head to the object detection branch to predict a vector of “mask coefficients” for each anchor that encode an instance’s representation in the prototype space.
      • linearly combining
    • Rationale

      • masks are spatially coherent: neighboring pixels are likely to belong to the same instance
      • conv layers can exploit this spatial coherence, but fc layers cannot
      • yet the heads of one-stage detectors emit class and box coefficients per anchor, which is essentially an fc-style output
      • so the design uses both: fc layers, which are good at producing semantic vectors (the mask coefficients),
      • and conv layers, which are good at producing spatially coherent masks (the prototypes)
    • Prototype

      • attach an FCN (protonet) to backbone feature layer P3
        • taking protonet from deeper backbone features produces more robust masks
        • higher resolution prototypes result in both higher quality masks and better performance on smaller objects
        • upsample to one fourth of the input resolution to increase performance on small objects
      • the protonet head outputs k channels, one prototype per channel (see the sketch after this list)

        • gradients come from the loss on the final assembled masks; there is no loss on this branch itself
        • the output is unbounded: ReLU or no nonlinearity
        • We choose ReLU for more interpretable prototypes
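
A minimal protonet sketch in PyTorch, assuming a 550x550 input so that P3 is 256-channel and 69x69; the layer count and widths are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProtoNet(nn.Module):
    """FCN on P3 that emits k image-sized prototype masks."""
    def __init__(self, in_ch=256, k=32):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.proto = nn.Conv2d(256, k, 1)  # one channel per prototype

    def forward(self, p3):
        x = self.convs(p3)
        # upsample to one fourth of the input resolution (helps small objects)
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        # ReLU keeps prototypes non-negative but unbounded above
        return F.relu(self.proto(x))

protos = ProtoNet()(torch.randn(1, 256, 69, 69))
print(protos.shape)  # torch.Size([1, 32, 138, 138])
```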

    • Mask Coefficients

      • a third branch in parallel with the detection heads
      • nonlinearity: coefficients must be able to take both positive and negative values, hence tanh (a minimal sketch follows)
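
A minimal sketch of such a coefficient branch; the widths and anchor count are assumptions matching the network settings below:

```python
import torch
import torch.nn as nn

k, num_anchors = 32, 3
coef_head = nn.Conv2d(256, num_anchors * k, 3, padding=1)

feat = torch.randn(1, 256, 69, 69)   # one FPN level, e.g. P3
coefs = torch.tanh(coef_head(feat))  # tanh: coefficients can be positive or negative
coefs = coefs.permute(0, 2, 3, 1).reshape(1, -1, k)
print(coefs.shape)  # torch.Size([1, 14283, 32]), one k-vector per anchor
```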

    • Mask Assembly

      • linear combination + sigmoid: $M = \sigma(P C^T)$, where $P$ is the $h \times w \times k$ prototype tensor and $C$ is the $n \times k$ coefficient matrix
      • loss
        • cls loss: w=1, same as SSD, softmax over c+1 classes
        • box reg loss: w=1.5, same as SSD, smooth-L1
        • mask loss: w=6.125, pixel-wise BCE
      • crop mask
        • eval: crop with the predicted box
        • train: crop with the gt box, and divide the mask loss by the gt box area to preserve small objects (see the sketch below)
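
A minimal sketch of the assembly and mask loss with made-up shapes; the crop is shown for a single hypothetical gt box:

```python
import torch
import torch.nn.functional as F

h, w, k, n = 138, 138, 32, 5   # prototype resolution, #prototypes, #instances
P = torch.rand(h, w, k)        # prototypes (ReLU output, non-negative)
C = torch.rand(n, k) * 2 - 1   # tanh mask coefficients, one row per instance

# M = sigma(P C^T): one assembled mask per instance
M = torch.sigmoid(P.view(-1, k) @ C.t()).view(h, w, n)

# training: pixel-wise BCE against the gt mask, cropped with the gt box and
# normalized by the gt box area to preserve small objects
gt = (torch.rand(h, w, n) > 0.5).float()  # stand-in for real gt masks
x1, y1, x2, y2 = 10, 20, 80, 90           # one illustrative gt box
pred_crop, gt_crop = M[y1:y2, x1:x2, 0], gt[y1:y2, x1:x2, 0]
area = (x2 - x1) * (y2 - y1)
mask_loss = F.binary_cross_entropy(pred_crop, gt_crop, reduction="sum") / area
# the total loss would weight this by 6.125, next to cls (1.0) and box (1.5)
```
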
    • Emergent Behavior

      • medium and large objects are segmented well even without cropping:

        • YOLACT learns how to localize instances on its own via different activations in its prototypes
        • rather than relying on the localization (box) results
      • translation variant

        • the consistent rim of padding in modern FCNs like ResNet gives the network the ability to tell how far a pixel is from the image border, so feeding a solid-color image reveals which regions each prototype kernel actually highlights
        • the same kernel gives different responses to the same five-pointed star placed at different positions in the image, showing that an FCN can extract location information
      • prototypes are compressible:

        • increasing the number of prototypes is, counterintuitively, not very effective, because predicting coefficients is difficult
        • the network has to play a balancing act to produce the right coefficients, and adding more prototypes makes this harder
        • the paper chooses 32 prototypes for its mix of performance and speed

    • Network

      • speed as well as feature richness
      • backbone follows RetinaNet: ResNet-101 + FPN
        • 550x550 input (images are resized)
        • drop P2, add P6 & P7
        • 3 anchors per level, aspect ratios [1, 1/2, 2]
        • P3 anchors are 24x24; each subsequent level doubles the scale (worked out below)
        • prediction head: one shared conv followed by parallel conv branches for class, box, and mask coefficients
        • OHEM
      • single GPU: batch size 8 starting from ImageNet weights, with no extra bn layers added
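
A quick worked example of the anchor sizes, assuming the common convention that the three aspect ratios share a constant area per scale:

```python
# scale 24 at P3, doubled at each subsequent FPN level
scales = {f"P{l}": 24 * 2 ** (l - 3) for l in range(3, 8)}
print(scales)  # {'P3': 24, 'P4': 48, 'P5': 96, 'P6': 192, 'P7': 384}

for level, s in scales.items():
    # aspect ratio r = w/h at constant area s*s
    anchors = [(s * r ** 0.5, s / r ** 0.5) for r in (1.0, 0.5, 2.0)]
    print(level, [(round(aw, 1), round(ah, 1)) for aw, ah in anchors])
```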

    • Fast NMS

      • build a c×n×n IoU matrix: for each of the c classes, compute pairwise IoUs between the n score-sorted detections
      • zero out the lower triangle, then take the column-wise max
      • discard detections whose max IoU with a higher-scoring box exceeds the threshold (a minimal sketch follows)
      • 15.0 ms faster with a performance loss of 0.3 mAP
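
A minimal single-class Fast NMS sketch in PyTorch; the paper batches the IoU computation over classes into a c×n×n tensor, and boxes are assumed pre-sorted by descending score:

```python
import torch

def box_iou(a, b):
    """Pairwise IoU between two (n, 4) sets of x1y1x2y2 boxes."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=-1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def fast_nms(boxes, scores, iou_thr=0.5):
    iou = box_iou(boxes, boxes).triu(diagonal=1)  # keep "better vs worse" pairs only
    keep = iou.max(dim=0).values <= iou_thr       # column-wise max, then threshold
    return boxes[keep], scores[keep]

boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [20., 20., 30., 30.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(fast_nms(boxes, scores))  # the second box is suppressed
```

Unlike sequential NMS, a detection that is itself suppressed can still suppress other detections here; everything is done in one matrix pass, which is where both the speedup and the small mAP loss come from.
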
    • Semantic Segmentation Loss
      • extra training-time supervision from modules that are not executed at test time
      • a 1x1 conv with c channels on P3, with sigmoid instead of softmax since a pixel may belong to more than one class (a minimal sketch follows)
      • w=1
      • +0.4 mAP boost
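
A minimal sketch of this train-time-only head; the target construction here is a stand-in (in practice the gt masks would be rasterized at P3 resolution):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

c = 80                           # e.g. the COCO classes
seg_head = nn.Conv2d(256, c, 1)  # 1x1 conv on P3

p3 = torch.randn(2, 256, 69, 69)
logits = seg_head(p3)            # (2, 80, 69, 69)
# sigmoid/BCE rather than softmax: a pixel may belong to several classes
target = (torch.rand(2, c, 69, 69) > 0.95).float()  # illustrative target
seg_loss = F.binary_cross_entropy_with_logits(logits, target)  # weight w = 1
```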

YOLACT++: Better Real-time Instance Segmentation