real-time semantic segmentation

Cityscapes leaderboard:

  • [PIDNet 2022] PIDNet: A Real-time Semantic Segmentation Network Inspired from PID Controller
  • [SFNet v1 2020] Semantic Flow for Fast and Accurate Scene Parsing
  • [SFNet v2 2022] SFNet: Faster, Accurate, and Domain Agnostic Semantic Segmentation via Semantic Flow
  • [PP-LiteSeg 2022] PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model
  • [DDRNet 2021] Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes
  • [STDC-Seg 2021] Rethinking BiSeNet For Real-time Semantic Segmentation

PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model

  1. main contributions

    • propose a Flexible and Lightweight Decoder (FLD): essentially a lighter channel configuration for the FPN-style decoder
    • propose a Unified Attention Fusion Module (UAFM): strengthens feature representation
    • propose a Simple Pyramid Pooling Module (SPPM): a simplified PPM with low computation cost
  2. design ideas behind real-time networks

    • Strengthening feature representations: how low-level and high-level features are fused in the decoder
    • Contextual aggregation: assorted PPM variants
  3. overview

  4. method

    • Flexible and Lightweight Decoder

      • in recent lightweight models, the decoder keeps the channel count unchanged while recovering resolution, which causes computation redundancy
      • FLD instead gradually reduces the channel count from the high-level to the low-level stages (see the sketch below)
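      As a reading aid, a minimal PyTorch sketch of the FLD idea. The channel schedule (128 -> 96 -> 64), the 1x1-projection-plus-add skip fusion, and the kernel sizes are illustrative assumptions, not the paper's exact configuration (the paper fuses skips with UAFM):

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class FLD(nn.Module):
          """Decoder whose channel count shrinks as its resolution grows."""

          def __init__(self, enc_chs=(512, 256, 128), dec_chs=(128, 96, 64)):
              super().__init__()
              ins = (enc_chs[0],) + dec_chs[:-1]
              self.stages = nn.ModuleList(
                  nn.Conv2d(ci, co, 3, padding=1) for ci, co in zip(ins, dec_chs))
              # 1x1 projections so skip features match the decoder channels;
              # plain add stands in for the paper's UAFM fusion
              self.skips = nn.ModuleList(
                  nn.Conv2d(ce, cd, 1) for ce, cd in zip(enc_chs[1:], dec_chs[:-1]))

          def forward(self, feats):
              # feats: encoder features ordered deepest (lowest resolution) first
              x = self.stages[0](feats[0])
              for i, skip in enumerate(feats[1:]):
                  x = F.interpolate(x, size=skip.shape[2:], mode="bilinear",
                                    align_corners=False)
                  x = x + self.skips[i](skip)
                  x = self.stages[i + 1](x)
              return x

      feats = [torch.randn(1, c, s, s) for c, s in zip((512, 256, 128), (16, 32, 64))]
      print(FLD()(feats).shape)  # torch.Size([1, 64, 64, 64])
      ```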
    • Unified Attention Fusion Module

      • use an attention module to produce a weight map
      • then take a weighted sum of the features from the different scales

      • Spatial Attention Module

        • this is the variant used in the released code
        • take the mean & max along the channel dimension; the two input features yield 4 maps of shape [1,h,w], concatenated into [4,h,w]
        • then conv + sigmoid gives the weight map [1,h,w]
        • then element-wise mul & add
      • Channel Attention Module
        • apply avg pooling and max pooling over the spatial map; the two input features yield 4 vectors of shape [c,1,1], concatenated into [4c,1,1]
        • then conv + sigmoid gives the channel-importance vector [c,1,1]
        • then channel-wise mul & add (both attention variants are sketched below)
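      A minimal sketch of the two UAFM attention variants described above, assuming F_high has already been upsampled to F_low's size and both carry the same channel count C; the conv kernel sizes are assumptions:

      ```python
      import torch
      import torch.nn as nn

      class SpatialAttentionFusion(nn.Module):
          def __init__(self):
              super().__init__()
              # 4 stacked [1,h,w] statistics -> 1-channel weight map
              self.conv = nn.Conv2d(4, 1, 3, padding=1)

          def forward(self, f_high, f_low):
              stats = []
              for f in (f_high, f_low):
                  stats.append(f.mean(dim=1, keepdim=True))        # [n,1,h,w]
                  stats.append(f.max(dim=1, keepdim=True).values)  # [n,1,h,w]
              alpha = torch.sigmoid(self.conv(torch.cat(stats, dim=1)))  # [n,1,h,w]
              # element-wise mul & add, i.e. a weighted sum of the two features
              return alpha * f_high + (1 - alpha) * f_low

      class ChannelAttentionFusion(nn.Module):
          def __init__(self, channels):
              super().__init__()
              # 4 stacked [c,1,1] statistics -> c-channel importance vector
              self.conv = nn.Conv2d(4 * channels, channels, 1)

          def forward(self, f_high, f_low):
              stats = []
              for f in (f_high, f_low):
                  stats.append(f.mean(dim=(2, 3), keepdim=True))   # avg pool [n,c,1,1]
                  stats.append(f.amax(dim=(2, 3), keepdim=True))   # max pool [n,c,1,1]
              alpha = torch.sigmoid(self.conv(torch.cat(stats, dim=1)))  # [n,c,1,1]
              return alpha * f_high + (1 - alpha) * f_low          # channel-wise
      ```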
    • Simple Pyramid Pooling Module

      • main changes
        • reduce intermediate and output channels
        • remove the short-cut
        • replace concat with add
      • just 3 global average poolings producing 1x1, 2x2, and 4x4 maps, each followed by conv-bn-relu, then resized back to the input scale
      • then add & conv (see the sketch below)
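      A minimal sketch of SPPM as summarized above; the channel counts are illustrative assumptions:

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class SPPM(nn.Module):
          def __init__(self, c_in=512, c_mid=128, c_out=128, bins=(1, 2, 4)):
              super().__init__()
              self.branches = nn.ModuleList(
                  nn.Sequential(
                      nn.AdaptiveAvgPool2d(b),                 # global pool to b x b
                      nn.Conv2d(c_in, c_mid, 1, bias=False),
                      nn.BatchNorm2d(c_mid),
                      nn.ReLU(inplace=True))
                  for b in bins)
              self.out_conv = nn.Sequential(
                  nn.Conv2d(c_mid, c_out, 3, padding=1, bias=False),
                  nn.BatchNorm2d(c_out),
                  nn.ReLU(inplace=True))

          def forward(self, x):
              size = x.shape[2:]
              # resize each branch back to the input scale, add (not concat), conv
              out = sum(F.interpolate(b(x), size=size, mode="bilinear",
                                      align_corners=False)
                        for b in self.branches)
              return self.out_conv(out)
      ```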

  5. experiments

    • datasets
      • Cityscapes: 19 classes (1+18), 5,000 images (2975/500/1525 for train/val/test), original size 2048x1024
      • CamVid: 11 classes, 701 images (367/101/233 for train/val/test), original size 960x720
    • training settings
      • SGD: momentum=0.9
      • lr: warmup + poly decay (a sketch of the schedule follows this list)
      • Cityscapes: 160k iterations, batch size=16, base lr=0.005, weight decay=5e-4
      • CamVid: 1k iterations, batch size=24, base lr=0.01, weight decay=1e-4
      • dataaug
        • random scale: scale range [0.125, 1.5] / [0.5, 2.5] for Cityscapes/CamVid
        • random crop: crop size [1024,512] / [960,720]
        • random horizontal flipping / color jittering / normalization
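      For reference, a minimal sketch of the warmup + poly schedule; the linear warmup, its length, and power=0.9 are common defaults assumed here, not values taken from the paper:

      ```python
      def poly_lr(step, base_lr, max_steps, warmup_steps=1000, power=0.9):
          """Learning rate at a given step under linear warmup + poly decay."""
          if step < warmup_steps:
              return base_lr * (step + 1) / warmup_steps       # linear warmup
          t = (step - warmup_steps) / (max_steps - warmup_steps)
          return base_lr * (1.0 - t) ** power                  # poly decay
      ```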

SFNet: Faster, Accurate, and Domain Agnostic Semantic Segmentation via Semantic Flow

  1. motivation

    • the widely used atrous convolution & feature pyramid fusion are either computation intensive or ineffective
    • we propose SFNet & SFNet-Lite
      • Flow Alignment Module (FAM) to learn semantic flow
      • Gated Dual Flow Alignment Module (GD-FAM): directly align the highest & lowest resolution feature maps
    • speed & accuracy
      • verified on 4 driving datasets (Cityscapes, Mapillary, IDD and BDD)
      • SFNet-Lite-R18 back: 80.1 mIoU / 60 FPS
      • SFNet-Lite-STDC back: 78.8 mIoU / 120 FPS
  2. overview

    • previous methods

      • FCN
        • pioneering work
        • lacks detailed object boundary information due to down-sampling
      • DeepLab
        • atrous convolutions (in the last stages, i.e. stages 4/5)
        • multi-scale feature representation (ASPP)
      • vision transformers
        • model segmentation as query-based per-segment prediction
        • strong performance, but real-time inference speed is unsatisfactory
    • trade-off

      • maintaining detailed resolution: atrous convolution keeps feature maps large, but large resolution means heavy computation, which makes real-time inference hard
      • obtaining features with strong semantic representation: BiFPN-style feature merging brings some improvement, but still trails methods that hold large feature maps
      • this paper's conjecture: semantics propagate ineffectively from deep layers to shallow layers; semantics across levels are not well aligned, and crude downsampling followed by upsampling cannot recover detailed boundary information
    • this paper

      • propose FAM & SFNet
        • between features of different stages, first perform alignment: warp the low-resolution (deep) feature, then merge
        • R18 back: 79.8% mIoU / 33 FPS on Cityscapes
        • DF2 back: 77.8% mIoU / 103 FPS
      • propose GD-FAM & SFNet-Lite

        • for speed, replace the dense per-stage FAMs with a single GD-FAM applied once
        • only merge the highest- and lowest-resolution features
        • ResNet-18 back: 80.1% mIoU / 49 FPS on Cityscapes
        • STDCv1 back: 78.7 mIoU / 120 FPS

  3. method

    • starting point: the misalignment between feature maps caused by residual connections and repeated downsampling and upsampling operations

    • inspiration: dynamic upsampling interpolation

    • FAM

      • built into the FPN framework

      • defines Semantic Flow: a flow field between different levels of a feature pyramid

      • pipeline (sketched below)

        • first align the channel counts
        • then upsample the deep feature to align the spatial sizes, $F \in R^{H\times W \times C}$
        • then concat the features of the two levels (but, unlike true optical-flow alignment as in FlowNet, grid coordinates are not concatenated in)
        • then predict the semantic flow field with two 3x3 conv layers, $\Delta \in R^{H\times W\times 2}$
        • then warp the low-resolution (deep) feature with the flow
        • then add & halve (i.e. average the two)
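      A minimal sketch of this FAM pipeline; the channel counts and the exact conv stack are assumptions, and flow_warp implements the warping via grid_sample with pixel-unit offsets:

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      def flow_warp(feat, flow):
          # sample `feat` at (integer grid + pixel-unit `flow` offsets)
          n, _, h, w = feat.shape
          ys, xs = torch.meshgrid(
              torch.arange(h, device=feat.device, dtype=feat.dtype),
              torch.arange(w, device=feat.device, dtype=feat.dtype),
              indexing="ij")
          grid = torch.stack((xs, ys), dim=-1).unsqueeze(0) + flow.permute(0, 2, 3, 1)
          # normalize sampling locations to [-1, 1] for grid_sample
          norm = torch.tensor([w - 1, h - 1], device=feat.device, dtype=feat.dtype)
          return F.grid_sample(feat, 2.0 * grid / norm - 1.0, align_corners=True)

      class FAM(nn.Module):
          def __init__(self, c_fine, c_deep, c=128):
              super().__init__()
              self.down_fine = nn.Conv2d(c_fine, c, 1, bias=False)  # channel alignment
              self.down_deep = nn.Conv2d(c_deep, c, 1, bias=False)
              self.flow = nn.Sequential(                            # two 3x3 convs
                  nn.Conv2d(2 * c, c, 3, padding=1, bias=False),
                  nn.Conv2d(c, 2, 3, padding=1, bias=False))

          def forward(self, f_fine, f_deep):
              # f_fine: shallow, high resolution; f_deep: deep, low resolution
              f_fine, f_deep = self.down_fine(f_fine), self.down_deep(f_deep)
              f_deep = F.interpolate(f_deep, size=f_fine.shape[2:], mode="bilinear",
                                     align_corners=False)           # scale alignment
              flow = self.flow(torch.cat([f_deep, f_fine], dim=1))  # [n,2,h,w]
              warped = flow_warp(f_deep, flow)   # align the coarse feature
              return 0.5 * (warped + f_fine)     # add & halve
      ```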

      • differences from deformable conv

        • first, this offset is predicted from the fusion of features at two scales, whereas DCN predicts offsets from the feature itself
        • second, DCN aims for a larger / freer receptive field, acting more like attention, while this work aims to align features
      • after warp & add, the feature map is visibly more structurally neat than with direct upsampling followed by add, and objects get a more consistent representation

    • the whole network

      • backbone
        • ImageNet pretrained ResNet / DF series
        • 4 stages
        • a stride-2 conv at the start of each stage
        • Pyramid Pooling Module (PPM); ASPP/NL were also tried, and the experiments section reports lower accuracy than PPM
      • Aligned FPN decoder
        • the low-level features from encoder stages 2/3/4 are aligned by FAMs and fused into the levels below them
        • the final x4 prediction feature concatenates features from all scales; since misalignment exists there too, a PPM was also added experimentally, but it is omitted in the real-time setting
    • Gated Dual Flow Alignment Module and SFNet-Lite

      • the SFNet version above is still slower than BiSeNet, thus a more compact decoder is explored

      • takes F1 & F4 as inputs

      • outputs a refined high-resolution map

      • pipeline

        • first upsample F4 to F1's resolution
        • then concat, then 3x3 convs to predict the offsets, $\Delta \in R^{H\times W\times 4}$
        • then split into $\Delta_{F1}$ and $\Delta_{F4}$, which align F1 and F4 respectively

        • then generate a gate map from F1 and F4 with an attention-style structure (pooling, 1x1 conv, and sigmoid), and take a gated sum of the two warped features; the idea is to make full use of the high-level semantic feature and let the low-level feature act as its supplement (a sketch follows)
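      A minimal sketch of GD-FAM along the pipeline above; the gate's pooling scheme and the channel counts are assumptions, and flow_warp is the same helper as in the FAM sketch, repeated so this block runs standalone:

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      def flow_warp(feat, flow):
          # sample `feat` at (integer grid + pixel-unit `flow` offsets)
          n, _, h, w = feat.shape
          ys, xs = torch.meshgrid(
              torch.arange(h, device=feat.device, dtype=feat.dtype),
              torch.arange(w, device=feat.device, dtype=feat.dtype),
              indexing="ij")
          grid = torch.stack((xs, ys), dim=-1).unsqueeze(0) + flow.permute(0, 2, 3, 1)
          norm = torch.tensor([w - 1, h - 1], device=feat.device, dtype=feat.dtype)
          return F.grid_sample(feat, 2.0 * grid / norm - 1.0, align_corners=True)

      class GDFAM(nn.Module):
          def __init__(self, c1, c4, c=128):
              super().__init__()
              self.proj1 = nn.Conv2d(c1, c, 1, bias=False)   # channel alignment
              self.proj4 = nn.Conv2d(c4, c, 1, bias=False)
              self.flow = nn.Sequential(                     # offsets for both maps
                  nn.Conv2d(2 * c, c, 3, padding=1, bias=False),
                  nn.Conv2d(c, 4, 3, padding=1, bias=False))
              self.gate = nn.Conv2d(2 * c, 1, 1)             # attention-style gate

          def forward(self, f1, f4):
              # f1: highest-resolution feature; f4: lowest-resolution feature
              f1, f4 = self.proj1(f1), self.proj4(f4)
              f4 = F.interpolate(f4, size=f1.shape[2:], mode="bilinear",
                                 align_corners=False)
              x = torch.cat([f1, f4], dim=1)
              d1, d4 = self.flow(x).chunk(2, dim=1)          # split into Δ_F1, Δ_F4
              w1, w4 = flow_warp(f1, d1), flow_warp(f4, d4)  # align both features
              g = torch.sigmoid(self.gate(F.avg_pool2d(x, 3, 1, 1)))
              # gated sum: high-level semantics dominate, low level supplements
              return g * w4 + (1 - g) * w1
      ```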

      • SFNet-Lite structure

  4. experiments

STDC-Seg: Rethinking BiSeNet For Real-time Semantic Segmentation

  1. motivation

    • BiSeNet
      • uses an extra path to encode (low-level) spatial information
      • time-consuming
      • not convenient for reusing pretrained backbones
      • the auxiliary path always lacks guidance from low-level information
    • this paper
      • return to a single-stream design at test time
      • use a Detail Guidance module to push the backbone's low-level stages to learn spatial features, with direct supervision and zero extra cost at test time
      • design the STDC backbone, built mainly from the STDC module (Short-Term Dense Concatenate), somewhat similar to a DenseNet block
    • verified on
      • ImageNet
      • Cityscapes: STDC1-Seg50 / 71.9% mIoU / 250.4 FPS, STDC2-Seg75 / 76.8% mIoU / 97.0 FPS
      • CamVid
  2. overview

    • single-stream

    • STDC backbone

  3. method

    • Short-Term Dense Concatenate Module

      • each module is separated into several blocks: block 1 is always 1x1, blocks 2/3/4 are 3x3
      • the output gathers multi-scale information
      • one variant uses stride 1 and the other stride 2; the receptive fields are illustrated in the paper (a sketch follows)
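      A minimal sketch of the module, assuming the common 4-block configuration in which block i keeps out_ch / 2^i channels (the last block matching the one before it, so the pieces sum to out_ch); pooling block 1's output in the stride-2 variant is my reading of the paper's figure:

      ```python
      import torch
      import torch.nn as nn

      class ConvBNReLU(nn.Module):
          def __init__(self, c_in, c_out, k, s=1):
              super().__init__()
              self.op = nn.Sequential(
                  nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                  nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

          def forward(self, x):
              return self.op(x)

      class STDCModule(nn.Module):
          def __init__(self, c_in, c_out, stride=1):
              super().__init__()
              chs = [c_out // 2, c_out // 4, c_out // 8, c_out // 8]  # sums to c_out
              blocks = [ConvBNReLU(c_in, chs[0], k=1)]                # block 1: 1x1
              for i in range(1, 4):                                   # blocks 2-4: 3x3
                  s = stride if i == 1 else 1    # only block 2 carries the stride
                  blocks.append(ConvBNReLU(chs[i - 1], chs[i], k=3, s=s))
              self.blocks = nn.ModuleList(blocks)
              self.pool = nn.AvgPool2d(3, 2, 1) if stride == 2 else nn.Identity()

          def forward(self, x):
              outs = []
              for i, blk in enumerate(self.blocks):
                  x = blk(x)
                  # in the stride-2 variant, block 1's map is pooled so every piece
                  # shares the same spatial size at concat time
                  outs.append(self.pool(x) if i == 0 else x)
              return torch.cat(outs, dim=1)  # the concat gathers multi-scale info
      ```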

    • Classification Architecture

      • stages 1/2 are each a single conv-bn-relu
      • stages 3/4/5 are stacks of STDC modules; the first module of each stage uses stride 2

    • Segmentation Architecture

      • STDC backbone: use the feature maps of stages 3/4/5 (x8/x16/x32)
      • stage 3's feature serves as the low-level feature
      • the features of stages 4/5, plus a globally pooled stage-5 feature, serve as high-level context features, organized FPN-style: stage 4/5 features pass through an Attention Refinement Module (SE-like), are added to the previous feature level, then upsampled, then conv
      • the above features are fused by a Feature Fusion Module (also SE-like); both modules are sketched below
      • SegHead: 3x3 conv followed by 1x1 conv
      • a DetailHead is also attached to stage 3's feature
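      Minimal sketches of the two SE-like blocks referenced above, following their BiSeNet-style definitions; hidden sizes and the reduction ratio are assumptions:

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class AttentionRefinementModule(nn.Module):
          def __init__(self, channels):
              super().__init__()
              self.conv = nn.Conv2d(channels, channels, 1, bias=False)
              self.bn = nn.BatchNorm2d(channels)

          def forward(self, x):
              # global context -> per-channel weights -> rescale the feature
              w = torch.sigmoid(self.bn(self.conv(F.adaptive_avg_pool2d(x, 1))))
              return x * w

      class FeatureFusionModule(nn.Module):
          def __init__(self, c_low, c_high, c_out):
              super().__init__()
              self.fuse = nn.Sequential(
                  nn.Conv2d(c_low + c_high, c_out, 1, bias=False),
                  nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
              self.attn = nn.Sequential(                     # SE-style bottleneck
                  nn.Conv2d(c_out, c_out // 4, 1), nn.ReLU(inplace=True),
                  nn.Conv2d(c_out // 4, c_out, 1))

          def forward(self, f_low, f_high):
              x = self.fuse(torch.cat([f_low, f_high], dim=1))
              w = torch.sigmoid(self.attn(F.adaptive_avg_pool2d(x, 1)))
              return x + x * w   # reweighted features plus a residual add
      ```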

    • Detail Guidance of Low-level Features

      • the detail path encodes spatial detail (boundaries/corners)
      • modeled as a binary segmentation task
      • first pass the ground-truth map through the Detail Aggregation module to obtain the detail map (sketched after this list)
        • a Laplacian operator (conv kernel) applied at strides 1/2/4
        • then upsampling
        • then fuse and 1x1 conv
        • finally threshold at 0.1 to convert it into a binary detail mask
      • detail loss: dice + BCE
      • with this detail guidance, the backbone's stage-3 feature is forced to retain more detailed low-level information
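      A minimal sketch of the Detail Aggregation step and the detail loss; the fixed equal-weight fusion (the paper's 1x1 fusion conv is trainable) and the exact dice formulation are assumptions:

      ```python
      import torch
      import torch.nn.functional as F

      # 3x3 Laplacian kernel used for edge extraction
      LAPLACIAN = torch.tensor([[-1., -1., -1.],
                                [-1.,  8., -1.],
                                [-1., -1., -1.]]).view(1, 1, 3, 3)

      def detail_ground_truth(gt, thresh=0.1):
          # gt: [n,1,h,w] float semantic label map -> [n,1,h,w] binary detail mask
          maps = []
          for s in (1, 2, 4):                 # Laplacian at strides 1/2/4
              d = F.conv2d(gt, LAPLACIAN.to(gt.device), stride=s, padding=1).abs()
              maps.append(F.interpolate(d, size=gt.shape[2:], mode="bilinear",
                                        align_corners=False))
          fused = sum(maps) / len(maps)       # stand-in for the trainable 1x1 conv
          return (fused > thresh).float()     # threshold at 0.1

      def detail_loss(logits, target, eps=1.0):
          # dice + bce on the DetailHead prediction
          prob = torch.sigmoid(logits)
          inter = (prob * target).sum(dim=(1, 2, 3))
          union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
          dice = 1.0 - (2.0 * inter + eps) / (union + eps)
          return dice.mean() + F.binary_cross_entropy_with_logits(logits, target)
      ```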

  4. experiments

    • backbone experiments

      • benchmarked against MobileNetV3 and EfficientNet-B0: better on both accuracy and speed

      • ImageNet accuracy & FLOPs: the FLOPs are relatively high, but since the convolutions are all 3x3, actual inference is faster