real-time semantic segmentation

Cityscapes leaderboard:

  • [PIDNet 2022] PIDNet: A Real-time Semantic Segmentation Network Inspired from PID Controller
  • [SFNet v1 2020] Semantic Flow for Fast and Accurate Scene Parsing
  • [SFNet v2 2022] SFNet: Faster, Accurate, and Domain Agnostic Semantic Segmentation via Semantic Flow
  • [PP-LiteSeg 2022] PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model
  • [DDRNet 2021] Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes
  • [STDC-Seg 2021] Rethinking BiSeNet For Real-time Semantic Segmentation

PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model

  1. main contributions

    • propose a Flexible and Lightweight Decoder (FLD): essentially a lighter channel configuration for the FPN-style decoder
    • propose a Unified Attention Fusion Module (UAFM): strengthens feature representation
    • propose a Simple Pyramid Pooling Module (SPPM): a simplified PPM with low computation cost
  2. design ideas behind real-time networks

    • Strengthening feature representations: how low-level and high-level features are fused in the decoder
    • Contextual aggregation: assorted PPM variants
  3. overview

  4. method

    • Flexible and Lightweight Decoder

      • in recent lightweight models, the decoder keeps the channel count unchanged while recovering resolution, which causes computation redundancy
      • FLD instead gradually reduces the channel count from the high-level to the low-level stages (see the sketch below)
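      As a reading aid, a minimal PyTorch sketch of the FLD idea. The channel schedule (128 -> 96 -> 64), the 1x1-projection-plus-add skip fusion, and the kernel sizes are illustrative assumptions, not the paper's exact configuration (the paper fuses skips with UAFM):

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class FLD(nn.Module):
          """Decoder whose channel count shrinks as its resolution grows."""

          def __init__(self, enc_chs=(512, 256, 128), dec_chs=(128, 96, 64)):
              super().__init__()
              ins = (enc_chs[0],) + dec_chs[:-1]
              self.stages = nn.ModuleList(
                  nn.Conv2d(ci, co, 3, padding=1) for ci, co in zip(ins, dec_chs))
              # 1x1 projections so skip features match the decoder channels;
              # plain add stands in for the paper's UAFM fusion
              self.skips = nn.ModuleList(
                  nn.Conv2d(ce, cd, 1) for ce, cd in zip(enc_chs[1:], dec_chs[:-1]))

          def forward(self, feats):
              # feats: encoder features ordered deepest (lowest resolution) first
              x = self.stages[0](feats[0])
              for i, skip in enumerate(feats[1:]):
                  x = F.interpolate(x, size=skip.shape[2:], mode="bilinear",
                                    align_corners=False)
                  x = x + self.skips[i](skip)
                  x = self.stages[i + 1](x)
              return x

      feats = [torch.randn(1, c, s, s) for c, s in zip((512, 256, 128), (16, 32, 64))]
      print(FLD()(feats).shape)  # torch.Size([1, 64, 64, 64])
      ```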
    • Unified Attention Fusion Module

      • use an attention module to produce a weight map
      • then take a weighted sum of the features from the different scales

      • Spatial Attention Module

        • this is the variant used in the released code
        • take the mean & max along the channel dimension; the two input features yield 4 maps of shape [1,h,w], concatenated into [4,h,w]
        • then conv + sigmoid gives the weight map [1,h,w]
        • then element-wise mul & add
      • Channel Attention Module
        • apply avg pooling and max pooling over the spatial map; the two input features yield 4 vectors of shape [c,1,1], concatenated into [4c,1,1]
        • then conv + sigmoid gives the channel-importance vector [c,1,1]
        • then channel-wise mul & add (both attention variants are sketched below)
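      A minimal sketch of the two UAFM attention variants described above, assuming F_high has already been upsampled to F_low's size and both carry the same channel count C; the conv kernel sizes are assumptions:

      ```python
      import torch
      import torch.nn as nn

      class SpatialAttentionFusion(nn.Module):
          def __init__(self):
              super().__init__()
              # 4 stacked [1,h,w] statistics -> 1-channel weight map
              self.conv = nn.Conv2d(4, 1, 3, padding=1)

          def forward(self, f_high, f_low):
              stats = []
              for f in (f_high, f_low):
                  stats.append(f.mean(dim=1, keepdim=True))        # [n,1,h,w]
                  stats.append(f.max(dim=1, keepdim=True).values)  # [n,1,h,w]
              alpha = torch.sigmoid(self.conv(torch.cat(stats, dim=1)))  # [n,1,h,w]
              # element-wise mul & add, i.e. a weighted sum of the two features
              return alpha * f_high + (1 - alpha) * f_low

      class ChannelAttentionFusion(nn.Module):
          def __init__(self, channels):
              super().__init__()
              # 4 stacked [c,1,1] statistics -> c-channel importance vector
              self.conv = nn.Conv2d(4 * channels, channels, 1)

          def forward(self, f_high, f_low):
              stats = []
              for f in (f_high, f_low):
                  stats.append(f.mean(dim=(2, 3), keepdim=True))   # avg pool [n,c,1,1]
                  stats.append(f.amax(dim=(2, 3), keepdim=True))   # max pool [n,c,1,1]
              alpha = torch.sigmoid(self.conv(torch.cat(stats, dim=1)))  # [n,c,1,1]
              return alpha * f_high + (1 - alpha) * f_low          # channel-wise
      ```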
    • Simple Pyramid Pooling Module

      • main changes
        • reduce intermediate and output channels
        • remove the short-cut
        • replace concat with add
      • just 3 global average poolings producing 1x1, 2x2, and 4x4 maps, each followed by conv-bn-relu, then resized back to the input scale
      • then add & conv (see the sketch below)
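      A minimal sketch of SPPM as summarized above; the channel counts are illustrative assumptions:

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class SPPM(nn.Module):
          def __init__(self, c_in=512, c_mid=128, c_out=128, bins=(1, 2, 4)):
              super().__init__()
              self.branches = nn.ModuleList(
                  nn.Sequential(
                      nn.AdaptiveAvgPool2d(b),                 # global pool to b x b
                      nn.Conv2d(c_in, c_mid, 1, bias=False),
                      nn.BatchNorm2d(c_mid),
                      nn.ReLU(inplace=True))
                  for b in bins)
              self.out_conv = nn.Sequential(
                  nn.Conv2d(c_mid, c_out, 3, padding=1, bias=False),
                  nn.BatchNorm2d(c_out),
                  nn.ReLU(inplace=True))

          def forward(self, x):
              size = x.shape[2:]
              # resize each branch back to the input scale, add (not concat), conv
              out = sum(F.interpolate(b(x), size=size, mode="bilinear",
                                      align_corners=False)
                        for b in self.branches)
              return self.out_conv(out)
      ```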

  5. experiments

    • datasets
      • Cityscapes: 19 classes (1+18), 5,000 images (2975/500/1525 for train/val/test), original size 2048x1024
      • CamVid: 11 classes, 701 images (367/101/233 for train/val/test), original size 960x720
    • training settings
      • SGD: momentum=0.9
      • lr: warmup + poly decay (a sketch of the schedule follows this list)
      • Cityscapes: 160k iterations, batch size=16, base lr=0.005, weight decay=5e-4
      • CamVid: 1k iterations, batch size=24, base lr=0.01, weight decay=1e-4
      • dataaug
        • random scale: scale range [0.125, 1.5] / [0.5, 2.5] for Cityscapes/CamVid
        • random crop: crop size [1024,512] / [960,720]
        • random horizontal flipping / color jittering / normalization
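      For reference, a minimal sketch of the warmup + poly schedule; the linear warmup, its length, and power=0.9 are common defaults assumed here, not values taken from the paper:

      ```python
      def poly_lr(step, base_lr, max_steps, warmup_steps=1000, power=0.9):
          """Learning rate at a given step under linear warmup + poly decay."""
          if step < warmup_steps:
              return base_lr * (step + 1) / warmup_steps       # linear warmup
          t = (step - warmup_steps) / (max_steps - warmup_steps)
          return base_lr * (1.0 - t) ** power                  # poly decay
      ```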

SFNet: Faster, Accurate, and Domain Agnostic Semantic Segmentation via Semantic Flow

  1. motivation

    • the widely used atrous convolution & feature pyramid fusion are either computation intensive or ineffective
    • we propose SFNet & SFNet-Lite
      • Flow Alignment Module (FAM) to learn semantic flow
      • Gated Dual Flow Alignment Module (GD-FAM): directly align the highest & lowest resolution feature maps
    • speed & accuracy
      • verified on 4 driving datasets (Cityscapes, Mapillary, IDD and BDD)
      • SFNet-Lite-R18 back: 80.1 mIoU / 60 FPS
      • SFNet-Lite-STDC back: 78.8 mIoU / 120 FPS
  2. overview

    • previous methods

      • FCN
        • pioneering work
        • lacks detailed object boundary information due to down-sampling
      • DeepLab
        • atrous convolutions (in the last stages, i.e. stages 4/5)
        • multi-scale feature representation (ASPP)
      • vision transformers
        • model segmentation as query-based per-segment prediction
        • strong performance, but real-time inference speed is unsatisfactory
    • trade-off

      • maintaining detailed resolution: atrous convolution keeps feature maps large, but large resolution means heavy computation, which makes real-time inference hard
      • obtaining features with strong semantic representation: BiFPN-style feature merging brings some improvement, but still trails methods that hold large feature maps
      • this paper's conjecture: semantics propagate ineffectively from deep layers to shallow layers; semantics across levels are not well aligned, and crude downsampling followed by upsampling cannot recover detailed boundary information
    • this paper

      • propose FAM & SFNet
        • between features of different stages, first perform alignment: warp the low-resolution (deep) feature, then merge
        • R18 back: 79.8% mIoU / 33 FPS on Cityscapes
        • DF2 back: 77.8% mIoU / 103 FPS
      • propose GD-FAM & SFNet-Lite

        • for speed, replace the dense per-stage FAMs with a single GD-FAM applied once
        • only merge the highest- and lowest-resolution features
        • ResNet-18 back: 80.1% mIoU / 49 FPS on Cityscapes
        • STDCv1 back: 78.7 mIoU / 120 FPS

  3. method

    • starting point: the misalignment between feature maps caused by residual connections and repeated downsampling and upsampling operations

    • inspiration: dynamic upsampling interpolation

    • FAM

      • built into the FPN framework

      • defines Semantic Flow: a flow field between different levels of a feature pyramid

      • pipeline (sketched below)

        • first align the channel counts
        • then upsample the deep feature to align the spatial sizes, $F \in R^{H\times W \times C}$
        • then concat the features of the two levels (but, unlike true optical-flow alignment as in FlowNet, grid coordinates are not concatenated in)
        • then predict the semantic flow field with two 3x3 conv layers, $\Delta \in R^{H\times W\times 2}$
        • then warp the low-resolution (deep) feature with the flow
        • then add & halve (i.e. average the two)
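      A minimal sketch of this FAM pipeline; the channel counts and the exact conv stack are assumptions, and flow_warp implements the warping via grid_sample with pixel-unit offsets:

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      def flow_warp(feat, flow):
          # sample `feat` at (integer grid + pixel-unit `flow` offsets)
          n, _, h, w = feat.shape
          ys, xs = torch.meshgrid(
              torch.arange(h, device=feat.device, dtype=feat.dtype),
              torch.arange(w, device=feat.device, dtype=feat.dtype),
              indexing="ij")
          grid = torch.stack((xs, ys), dim=-1).unsqueeze(0) + flow.permute(0, 2, 3, 1)
          # normalize sampling locations to [-1, 1] for grid_sample
          norm = torch.tensor([w - 1, h - 1], device=feat.device, dtype=feat.dtype)
          return F.grid_sample(feat, 2.0 * grid / norm - 1.0, align_corners=True)

      class FAM(nn.Module):
          def __init__(self, c_fine, c_deep, c=128):
              super().__init__()
              self.down_fine = nn.Conv2d(c_fine, c, 1, bias=False)  # channel alignment
              self.down_deep = nn.Conv2d(c_deep, c, 1, bias=False)
              self.flow = nn.Sequential(                            # two 3x3 convs
                  nn.Conv2d(2 * c, c, 3, padding=1, bias=False),
                  nn.Conv2d(c, 2, 3, padding=1, bias=False))

          def forward(self, f_fine, f_deep):
              # f_fine: shallow, high resolution; f_deep: deep, low resolution
              f_fine, f_deep = self.down_fine(f_fine), self.down_deep(f_deep)
              f_deep = F.interpolate(f_deep, size=f_fine.shape[2:], mode="bilinear",
                                     align_corners=False)           # scale alignment
              flow = self.flow(torch.cat([f_deep, f_fine], dim=1))  # [n,2,h,w]
              warped = flow_warp(f_deep, flow)   # align the coarse feature
              return 0.5 * (warped + f_fine)     # add & halve
      ```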

      • differences from deformable conv

        • first, this offset is predicted from the fusion of features at two scales, whereas DCN predicts offsets from the feature itself
        • second, DCN aims for a larger / freer receptive field, acting more like attention, while this work aims to align features
      • after warp & add, the feature map is visibly more structurally neat than with direct upsampling followed by add, and objects get a more consistent representation

    • the whole network

      • backbone
        • ImageNet pretrained ResNet / DF series
        • 4 stages
        • a stride-2 conv at the start of each stage
        • Pyramid Pooling Module (PPM); ASPP/NL were also tried, and the experiments section reports lower accuracy than PPM
      • Aligned FPN decoder
        • the low-level features from encoder stages 2/3/4 are aligned by FAMs and fused into the levels below them
        • the final x4 prediction feature concatenates features from all scales; since misalignment exists there too, a PPM was also added experimentally, but it is omitted in the real-time setting
    • Gated Dual Flow Alignment Module and SFNet-Lite

      • the SFNet version above is still slower than BiSeNet, thus a more compact decoder is explored

      • takes F1 & F4 as inputs

      • outputs a refined high-resolution map

      • pipeline

        • first upsample F4 to F1's resolution
        • then concat, then 3x3 convs to predict the offsets, $\Delta \in R^{H\times W\times 4}$
        • then split into $\Delta_{F1}$ and $\Delta_{F4}$, which align F1 and F4 respectively

        • then generate a gate map from F1 and F4 with an attention-style structure (pooling, 1x1 conv, and sigmoid), and take a gated sum of the two warped features; the idea is to make full use of the high-level semantic feature and let the low-level feature act as its supplement (a sketch follows)
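      A minimal sketch of GD-FAM along the pipeline above; the gate's pooling scheme and the channel counts are assumptions, and flow_warp is the same helper as in the FAM sketch, repeated so this block runs standalone:

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      def flow_warp(feat, flow):
          # sample `feat` at (integer grid + pixel-unit `flow` offsets)
          n, _, h, w = feat.shape
          ys, xs = torch.meshgrid(
              torch.arange(h, device=feat.device, dtype=feat.dtype),
              torch.arange(w, device=feat.device, dtype=feat.dtype),
              indexing="ij")
          grid = torch.stack((xs, ys), dim=-1).unsqueeze(0) + flow.permute(0, 2, 3, 1)
          norm = torch.tensor([w - 1, h - 1], device=feat.device, dtype=feat.dtype)
          return F.grid_sample(feat, 2.0 * grid / norm - 1.0, align_corners=True)

      class GDFAM(nn.Module):
          def __init__(self, c1, c4, c=128):
              super().__init__()
              self.proj1 = nn.Conv2d(c1, c, 1, bias=False)   # channel alignment
              self.proj4 = nn.Conv2d(c4, c, 1, bias=False)
              self.flow = nn.Sequential(                     # offsets for both maps
                  nn.Conv2d(2 * c, c, 3, padding=1, bias=False),
                  nn.Conv2d(c, 4, 3, padding=1, bias=False))
              self.gate = nn.Conv2d(2 * c, 1, 1)             # attention-style gate

          def forward(self, f1, f4):
              # f1: highest-resolution feature; f4: lowest-resolution feature
              f1, f4 = self.proj1(f1), self.proj4(f4)
              f4 = F.interpolate(f4, size=f1.shape[2:], mode="bilinear",
                                 align_corners=False)
              x = torch.cat([f1, f4], dim=1)
              d1, d4 = self.flow(x).chunk(2, dim=1)          # split into Δ_F1, Δ_F4
              w1, w4 = flow_warp(f1, d1), flow_warp(f4, d4)  # align both features
              g = torch.sigmoid(self.gate(F.avg_pool2d(x, 3, 1, 1)))
              # gated sum: high-level semantics dominate, low level supplements
              return g * w4 + (1 - g) * w1
      ```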

      • SFNet-Lite structure

  4. experiments

STDC-Seg: Rethinking BiSeNet For Real-time Semantic Segmentation

  1. motivation

    • BiSeNet
      • uses an extra path to encode (low-level) spatial information
      • time-consuming
      • not convenient for reusing pretrained backbones
      • the auxiliary path always lacks guidance from low-level information
    • this paper
      • return to a single-stream design at test time
      • use a Detail Guidance module to push the backbone's low-level stages to learn spatial features, with direct supervision and zero extra cost at test time
      • design the STDC backbone, built mainly from the STDC module (Short-Term Dense Concatenate), somewhat similar to a DenseNet block
    • verified on
      • ImageNet
      • Cityscapes: STDC1-Seg50 / 71.9% mIoU / 250.4 FPS, STDC2-Seg75 / 76.8% mIoU / 97.0 FPS
      • CamVid
  2. overview

    • single-stream

    • STDC backbone

  3. method

    • Short-Term Dense Concatenate Module

      • each module is separated into several blocks: block 1 is always 1x1, blocks 2/3/4 are 3x3
      • the output gathers multi-scale information
      • one variant uses stride 1 and the other stride 2; the receptive fields are illustrated in the paper (a sketch follows)
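      A minimal sketch of the module, assuming the common 4-block configuration in which block i keeps out_ch / 2^i channels (the last block matching the one before it, so the pieces sum to out_ch); pooling block 1's output in the stride-2 variant is my reading of the paper's figure:

      ```python
      import torch
      import torch.nn as nn

      class ConvBNReLU(nn.Module):
          def __init__(self, c_in, c_out, k, s=1):
              super().__init__()
              self.op = nn.Sequential(
                  nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                  nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

          def forward(self, x):
              return self.op(x)

      class STDCModule(nn.Module):
          def __init__(self, c_in, c_out, stride=1):
              super().__init__()
              chs = [c_out // 2, c_out // 4, c_out // 8, c_out // 8]  # sums to c_out
              blocks = [ConvBNReLU(c_in, chs[0], k=1)]                # block 1: 1x1
              for i in range(1, 4):                                   # blocks 2-4: 3x3
                  s = stride if i == 1 else 1    # only block 2 carries the stride
                  blocks.append(ConvBNReLU(chs[i - 1], chs[i], k=3, s=s))
              self.blocks = nn.ModuleList(blocks)
              self.pool = nn.AvgPool2d(3, 2, 1) if stride == 2 else nn.Identity()

          def forward(self, x):
              outs = []
              for i, blk in enumerate(self.blocks):
                  x = blk(x)
                  # in the stride-2 variant, block 1's map is pooled so every piece
                  # shares the same spatial size at concat time
                  outs.append(self.pool(x) if i == 0 else x)
              return torch.cat(outs, dim=1)  # the concat gathers multi-scale info
      ```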

    • Classification Architecture

      • stages 1/2 are each a single conv-bn-relu
      • stages 3/4/5 are stacks of STDC modules; the first module of each stage uses stride 2

    • Segmentation Architecture

      • STDC backbone: use the feature maps of stages 3/4/5 (x8/x16/x32)
      • stage 3's feature serves as the low-level feature
      • the features of stages 4/5, plus a globally pooled stage-5 feature, serve as high-level context features, organized FPN-style: stage 4/5 features pass through an Attention Refinement Module (SE-like), are added to the previous feature level, then upsampled, then conv
      • the above features are fused by a Feature Fusion Module (also SE-like); both modules are sketched below
      • SegHead: 3x3 conv followed by 1x1 conv
      • a DetailHead is also attached to stage 3's feature
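      Minimal sketches of the two SE-like blocks referenced above, following their BiSeNet-style definitions; hidden sizes and the reduction ratio are assumptions:

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class AttentionRefinementModule(nn.Module):
          def __init__(self, channels):
              super().__init__()
              self.conv = nn.Conv2d(channels, channels, 1, bias=False)
              self.bn = nn.BatchNorm2d(channels)

          def forward(self, x):
              # global context -> per-channel weights -> rescale the feature
              w = torch.sigmoid(self.bn(self.conv(F.adaptive_avg_pool2d(x, 1))))
              return x * w

      class FeatureFusionModule(nn.Module):
          def __init__(self, c_low, c_high, c_out):
              super().__init__()
              self.fuse = nn.Sequential(
                  nn.Conv2d(c_low + c_high, c_out, 1, bias=False),
                  nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
              self.attn = nn.Sequential(                     # SE-style bottleneck
                  nn.Conv2d(c_out, c_out // 4, 1), nn.ReLU(inplace=True),
                  nn.Conv2d(c_out // 4, c_out, 1))

          def forward(self, f_low, f_high):
              x = self.fuse(torch.cat([f_low, f_high], dim=1))
              w = torch.sigmoid(self.attn(F.adaptive_avg_pool2d(x, 1)))
              return x + x * w   # reweighted features plus a residual add
      ```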

    • Detail Guidance of Low-level Features

      • the detail path encodes spatial detail (boundaries/corners)
      • modeled as a binary segmentation task
      • first pass the ground-truth map through the Detail Aggregation module to obtain the detail map (sketched after this list)
        • a Laplacian operator (conv kernel) applied at strides 1/2/4
        • then upsampling
        • then fuse and 1x1 conv
        • finally threshold at 0.1 to convert it into a binary detail mask
      • detail loss: dice + BCE
      • with this detail guidance, the backbone's stage-3 feature is forced to retain more detailed low-level information
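      A minimal sketch of the Detail Aggregation step and the detail loss; the fixed equal-weight fusion (the paper's 1x1 fusion conv is trainable) and the exact dice formulation are assumptions:

      ```python
      import torch
      import torch.nn.functional as F

      # 3x3 Laplacian kernel used for edge extraction
      LAPLACIAN = torch.tensor([[-1., -1., -1.],
                                [-1.,  8., -1.],
                                [-1., -1., -1.]]).view(1, 1, 3, 3)

      def detail_ground_truth(gt, thresh=0.1):
          # gt: [n,1,h,w] float semantic label map -> [n,1,h,w] binary detail mask
          maps = []
          for s in (1, 2, 4):                 # Laplacian at strides 1/2/4
              d = F.conv2d(gt, LAPLACIAN.to(gt.device), stride=s, padding=1).abs()
              maps.append(F.interpolate(d, size=gt.shape[2:], mode="bilinear",
                                        align_corners=False))
          fused = sum(maps) / len(maps)       # stand-in for the trainable 1x1 conv
          return (fused > thresh).float()     # threshold at 0.1

      def detail_loss(logits, target, eps=1.0):
          # dice + bce on the DetailHead prediction
          prob = torch.sigmoid(logits)
          inter = (prob * target).sum(dim=(1, 2, 3))
          union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
          dice = 1.0 - (2.0 * inter + eps) / (union + eps)
          return dice.mean() + F.binary_cross_entropy_with_logits(logits, target)
      ```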

  4. experiments

    • backbone experiments

      • benchmarked against MobileNetV3 and EfficientNet-B0: better on both accuracy and speed

      • ImageNet accuracy & FLOPs: the FLOPs are relatively high, but since the convolutions are all 3x3, actual inference is faster