Cityscapes leaderboard:

- [PIDNet 2022] PIDNet: A Real-time Semantic Segmentation Network Inspired from PID Controller
- [SFNet v1 2020] Semantic Flow for Fast and Accurate Scene Parsing
- [SFNet v2 2022] SFNet: Faster, Accurate, and Domain Agnostic Semantic Segmentation via Semantic Flow
- [PP-LiteSeg 2022] PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model
- [DDRNet 2021] Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes
- [STDC-Seg 2021CVPR] Rethinking BiSeNet For Real-time Semantic Segmentation
PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model
- main contributions
  - propose a Flexible and Lightweight Decoder (FLD): mainly a matter of the FPN channel counts
  - propose a Unified Attention Fusion Module (UAFM): strengthens feature representations
  - propose a Simple Pyramid Pooling Module (SPPM): a simplified PPM with low computation cost
 
- design ideas behind real-time networks
  - Strengthening feature representations: how low-level and high-level features are fused in the decoder
  - Contextual aggregation: assorted PPM variants
 
- overview  
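- method
  - Flexible and Lightweight Decoder
    - decoders in recent lightweight models keep the channel count unchanged while recovering resolution, which causes computation redundancy
    - FLD gradually reduces the number of channels going from high-level to low-level features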
 
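  - Unified Attention Fusion Module
    - use an attention module to produce a weight map
    - then take a weighted sum of the features at different scales
    - Spatial Attention Module
      - this is the one used in the actual code
      - take the mean and max along the channel dimension; the two features yield four [1,h,w] maps in total, concatenated into [4,h,w]
      - then conv + sigmoid gives the weight map [1,h,w]
      - then element-wise mul & add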
 
    - Channel Attention Module
      - apply avg pooling and max pooling over the spatial map; the two features yield four [c,1,1] vectors in total, concatenated into [4c,1,1]
      - then conv + sigmoid gives the channel importance vector [c,1,1]
      - then channel-wise mul & add
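
The spatial-attention variant of UAFM described above can be sketched in NumPy. This is only an illustration under stated assumptions: the name `spatial_attention_fuse` is hypothetical, the learned conv is replaced by four scalar weights (effectively a 1x1 conv without training), and the "mul & add" step is assumed to be `alpha * up + (1 - alpha) * low`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention_fuse(up, low, weights, bias=0.0):
    """Fuse two (C, H, W) features with a spatial weight map.

    up      : upsampled high-level feature
    low     : low-level feature of the same shape
    weights : 4 scalars standing in for the learned conv (assumption)
    """
    # Mean and max over channels for both inputs -> four (H, W) maps.
    stats = np.stack([up.mean(0), up.max(0), low.mean(0), low.max(0)])  # (4, H, W)
    # "conv + sigmoid": here a per-pixel linear combination of the 4 maps.
    alpha = sigmoid(np.tensordot(weights, stats, axes=1) + bias)        # (H, W)
    # Element-wise mul & add; alpha broadcasts over the channel axis.
    fused = alpha * up + (1.0 - alpha) * low
    return fused, alpha

rng = np.random.default_rng(0)
up = rng.normal(size=(3, 4, 5))
low = rng.normal(size=(3, 4, 5))
fused, alpha = spatial_attention_fuse(up, low, np.zeros(4))
```

With all-zero weights the gate is 0.5 everywhere, so the result is the plain average of the two inputs; trained weights would shift the balance per pixel.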
 
 
  - Simple Pyramid Pooling Module
    - main changes
      - reduce intermediate and output channels
      - remove the short-cut
      - replace concat with add
    - three average-pooling branches produce 1x1, 2x2, and 4x4 maps; each goes through conv-bn-relu and is resized back to the original scale
    - then add & conv
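
A minimal NumPy sketch of the SPPM data flow, assuming the input height and width are divisible by 4. The conv-bn-relu in each branch and the final conv are left out (treated as identity), and nearest-neighbour resizing stands in for whatever interpolation the real model uses; the function names are hypothetical.

```python
import numpy as np

def adaptive_avg_pool(x, bins):
    """Average-pool a (C, H, W) array down to (C, bins, bins)."""
    c, h, w = x.shape
    return x.reshape(c, bins, h // bins, bins, w // bins).mean(axis=(2, 4))

def upsample_nearest(x, h, w):
    """Resize a (C, b, b) array back to (C, h, w) by repetition."""
    _, bh, bw = x.shape
    return np.repeat(np.repeat(x, h // bh, axis=1), w // bw, axis=2)

def sppm(x, bin_sizes=(1, 2, 4)):
    x = np.asarray(x, dtype=float)
    _, h, w = x.shape
    out = np.zeros_like(x)
    for bins in bin_sizes:
        pooled = adaptive_avg_pool(x, bins)
        # A conv-bn-relu would be applied to `pooled` here (omitted).
        out += upsample_nearest(pooled, h, w)   # add instead of concat
    return out                                  # a final conv would follow

x = np.arange(2 * 8 * 8, dtype=float).reshape(2, 8, 8)
y = sppm(x)
```

Because the branches are summed rather than concatenated, the output keeps the input channel count, which is where the saving over a standard PPM comes from.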
 
 
- experiments
  - datasets
    - Cityscapes: 1+18 classes, 5,000 images (2975/500/1525 train/val/test), original size 2048x1024
    - CamVid: 11 classes, 701 images (367/101/233 train/val/test), original size 960x720
 
  - training settings
    - SGD: momentum=0.9
    - lr: warmup, poly
    - Cityscapes: trained for 160k iterations, batch size=16, base lr=0.005, weight decay=5e-4
    - CamVid: trained for 1k iterations, batch size=24, base lr=0.01, weight decay=1e-4
    - data augmentation
      - random scale: scale range [0.125, 1.5] / [0.5, 2.5] for Cityscapes/CamVid
      - random crop: crop size=[1024,512] / [960,720]
      - random horizontal flipping / color jittering / normalization
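
The "warmup, poly" schedule can be written as a small function. The notes do not give the warmup length or the poly exponent, so `warmup_iters=1000` and `power=0.9` below are assumptions chosen only for illustration, as is the linear shape of the warmup.

```python
def poly_lr(it, total_iters, base_lr, warmup_iters=1000, power=0.9):
    """Learning rate at iteration `it`: linear warmup, then poly decay."""
    if it < warmup_iters:
        # Linear ramp from 0 up to base_lr.
        return base_lr * it / warmup_iters
    # Fraction of the post-warmup schedule already completed, in [0, 1].
    progress = (it - warmup_iters) / (total_iters - warmup_iters)
    return base_lr * (1.0 - progress) ** power

# Cityscapes setting from the notes: 160k iterations, base lr 0.005.
schedule = [poly_lr(i, 160_000, 0.005) for i in (0, 500, 1000, 80_000, 160_000)]
```

The rate peaks at `base_lr` exactly when warmup ends and decays to zero at the last iteration.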
 
 
 
SFNet: Faster, Accurate, and Domain Agnostic Semantic Segmentation via Semantic Flow
- motivation
  - widely used atrous convolution & feature pyramid: computation intensive or ineffective
  - we propose SFNet & SFNet-Lite
    - Flow Alignment Module (FAM) to learn semantic flow
    - Gated Dual Flow Alignment Module (GD-FAM): directly align the highest & lowest resolution feature maps
 
  - speed & accuracy
    - verified on 4 driving datasets (Cityscapes, Mapillary, IDD and BDD)
    - SFNet-Lite with ResNet-18 backbone: 80.1 mIoU / 60 FPS
    - SFNet-Lite with STDC backbone: 78.8 mIoU / 120 FPS
 
 
- method
  - previous methods
    - FCN
      - pioneering work
      - lack of detailed object boundary information due to down-sampling
 
    - DeepLab
      - atrous convolutions (last 4/5 stage)
      - multiscale feature representation (ASPP)
 
    - vision transformer
      - models segmentation as query-based per-segment prediction
      - strong performance, but real-time inference speed falls short
 
 
  - trade-off
    - maintain detailed resolution: atrous convolution; the larger the resolution the heavier the computation, so real-time is hard
    - get features that exhibit strong semantic representation: biFPN-style feature merging helps somewhat, but still falls short of methods that hold large feature maps
    - the paper's conjecture: semantics propagate ineffectively from deep layers to shallow layers and are not well aligned across levels; crude downsampling followed by upsampling cannot recover detailed boundary information
 
  - this paper
    - propose FAM & SFNet
      - align features from different stages first: warp the low-resolution (higher-level) feature, then merge
      - ResNet-18 backbone: 79.8% mIoU / 33 FPS on Cityscapes
      - DF2 backbone: 77.8% mIoU / 103 FPS
 
    - propose GD-FAM & SFNet-Lite
      - for more speed, replace the dense FAMs with a GD-FAM that runs only once
      - merge only the highest- and lowest-resolution features
      - ResNet-18 backbone: 80.1% mIoU / 49 FPS on Cityscapes
      - STDCv1 backbone: 78.7 mIoU / 120 FPS
  
 
 
- method
  - starting point: the misalignment between feature maps caused by residual connections and repeated downsampling and upsampling operations
  - inspiration: dynamic upsampling interpolation
  - FAM
    - built within the FPN framework
    - define Semantic Flow: between different levels in a feature pyramid
    - pipeline
      - first align the channels
      - then upsample to align the scale, $F \in R^{H\times W \times C}$
      - then concat the features of the two levels (without also concatenating grid coordinates, as true optical-flow alignment in FlowNet does)
      - then predict the semantic flow field with two 3x3 conv layers, $\Delta_{\text{low-level}} \in R^{H\times W\times 2}$
      - then warp the low-resolution (higher-level) feature
      - then add & halve
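
The warp and "add & halve" steps of the pipeline above can be sketched in NumPy. The flow field is taken as an input here because predicting it needs trained convs. Treating the flow as pixel offsets, clamping at the border, and warping the already-upsampled coarser feature are assumptions of this sketch; `flow_warp` and `fam_fuse` are hypothetical names.

```python
import numpy as np

def flow_warp(feat, flow):
    """Bilinearly resample feat (C, H, W) at positions shifted by flow (2, H, W).

    flow[0] is the x offset and flow[1] the y offset, both in pixels.
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + flow[0], 0, w - 1)      # clamp sampling points to the map
    y = np.clip(ys + flow[1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0                  # fractional parts = bilinear weights
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def fam_fuse(high_up, low, flow):
    """Warp the upsampled coarse feature, then add and halve."""
    return (flow_warp(high_up, flow) + low) / 2.0

feat = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
shift_right = np.zeros((2, 3, 4))
shift_right[0] = 1.0                          # sample one pixel to the right
warped = flow_warp(feat, shift_right)
```

A zero flow leaves the feature untouched, so the module falls back to plain upsample-and-average when no correction is predicted.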
 
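    - differences from deformable conv
      - first, the offset here is obtained by fusing features from two scales, whereas DCN predicts it from the feature itself
      - second, DCN aims at a larger / freer receptive field, more like attention, whereas this paper aims to align features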
 
    - after warp & add, the feature map is more structurally neat than with direct upsampling followed by add, and objects get a more consistent representation
 
  - the whole network
    - backbone
      - ImageNet pretrained ResNet / DF series
      - 4 stages
      - stride 2 at the start of each stage
      - Pyramid Pooling Module (PPM); ASPP/NL were also tried, and the experiments section reports lower accuracy than PPM
 
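    - Aligned FPN decoder
      - the stage 2/3/4 low-level features coming from the encoder are aligned by FAM and then fused into their bottom levels
      - the final x4 prediction feature concatenates features from all scales; since mis-alignment exists there too, the paper also experimentally added PPM, but left it out for real-time applications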
 
 
  - Gated Dual Flow Alignment Module and SFNet-Lite
    - the SFNet version above is slower than BiSeNet, thus we explore a more compact decoder
    - takes F1 & F4 as inputs
    - outputs a refined high-resolution map
    - pipeline
      - first upsample F4 to the resolution of F1
      - then concat, then 3x3 convs predict the offsets, $\Delta \in R^{H\times W\times 4}$
      - then split into $\Delta_{F1}$ and $\Delta_{F4}$, used to align F1 and F4 respectively
      - then generate a gate map from F1 and F4 with an attention-style structure (pooling, 1x1 conv, and sigmoid) and apply a gated sum to the two warped features; the idea is to make full use of the high-level semantic feature and let the low-level feature act as a supplement to it
 
- SFNet-Lite structure  
 
 
- experiments
STDC-Seg: Rethinking BiSeNet For Real-time Semantic Segmentation
- motivation
  - BiSeNet
    - uses an extra path to encode spatial information (low-level)
    - time consuming
    - not convenient to use pretrained backbones
    - the auxiliary path always lacks low-level information guidance
 
  - this paper
    - returns to a single-stream manner at test time
    - uses a Detail Guidance module to push the backbone's low-level stage to learn spatial features, with direct supervision and no test-time cost
    - designs the STDC backbone, built mainly from the STDC module (Short-Term Dense Concatenate), somewhat like a DenseNet block
 
  - verified on
    - ImageNet
    - Cityscapes: STDC1-Seg50 / 71.9% mIoU / 250.4 FPS, STDC2-Seg75 / 76.8% mIoU / 97.0 FPS
    - CamVid
 
 
- overview
  - single-stream
  - STDC backbone
 
- method
  - Short-Term Dense Concatenate Module
    - each module is separated into several blocks: block1 is always 1x1, block2/3/4 are 3x3
    - the output gathers multi-scale information
    - there is a stride-1 variant and a stride-2 variant, each with its own receptive fields
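
The receptive field after each block follows from the kernel sizes by the usual recurrence (each layer adds `(kernel - 1) * jump`, where `jump` is the product of earlier strides). The helper below is a hypothetical illustration; placing the stride-2 step in block2 of the downsampling variant is an assumption of this sketch.

```python
def receptive_fields(blocks):
    """Receptive field after each (kernel, stride) block, in input pixels."""
    rf, jump, fields = 1, 1, []
    for kernel, stride in blocks:
        rf += (kernel - 1) * jump   # growth contributed by this layer
        jump *= stride              # spacing between adjacent outputs
        fields.append(rf)
    return fields

# block1 is 1x1, block2/3/4 are 3x3 (from the notes).
stride1_module = [(1, 1), (3, 1), (3, 1), (3, 1)]
stride2_module = [(1, 1), (3, 2), (3, 1), (3, 1)]   # stride 2 in block2 (assumed)

rf_stride1 = receptive_fields(stride1_module)   # [1, 3, 5, 7]
rf_stride2 = receptive_fields(stride2_module)   # [1, 3, 7, 11]
```

Concatenating the block outputs therefore mixes features with several receptive-field sizes, which is what "gathers multi-scale information" refers to.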
 
  - Classification Architecture
    - stage1/2 are each a single conv-bn-relu
    - stage3/4/5 are built from STDC Modules; the first module of each stage uses stride 2
 
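  - Segmentation Architecture
    - STDC backbone: uses the stage3/4/5 feature maps (x8/16/32)
    - the stage3 feature serves as the low-level feature
    - the stage4/5 features plus the globally pooled stage5 feature serve as high-level context features and form an FPN: stage4/5 go through an Attention Refine Module (similar to SE), are added to the previous feature level, then upsampled, then passed through a conv
    - the features above are fused by a Feature Fusion Module (also similar to an SE block)
    - SegHead: 3x3 conv, then 1x1 conv
    - a DetailHead is also attached to the stage3 feature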
 
  - Detail Guidance of Low-level Features
    - the detail path is used to encode spatial detail (boundary/corner)
    - modeled as a binary segmentation task
    - first pass the ground truth map through the Detail Aggregation module to get the detail map
      - a Laplacian operator (conv kernel) at stride 1/2/4
      - then upsampling
      - then fuse and 1x1 conv
      - finally convert to a binary detail mask with threshold 0.1
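
The Detail Aggregation steps can be sketched in NumPy on a binary mask. Several parts are assumptions made for illustration: the 8-neighbour Laplacian kernel, zero padding, clamping negative responses to zero, nearest-neighbour upsampling, and a plain mean in place of the learned "fuse and 1x1 conv". All function names are hypothetical.

```python
import numpy as np

# 8-neighbour Laplacian kernel (assumed form of the "Laplacian operator").
LAPLACIAN = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)

def laplacian_response(mask, stride):
    """3x3 Laplacian with zero padding, subsampled by `stride`."""
    h, w = mask.shape
    padded = np.pad(mask.astype(float), 1)
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * padded[dy:dy + h, dx:dx + w]
    # Keep only positive responses (assumption), then apply the stride.
    return np.clip(out[::stride, ::stride], 0, None)

def upsample_nearest(x, stride, h, w):
    return np.repeat(np.repeat(x, stride, axis=0), stride, axis=1)[:h, :w]

def detail_mask(gt, strides=(1, 2, 4), thresh=0.1):
    h, w = gt.shape
    maps = [upsample_nearest(laplacian_response(gt, s), s, h, w) for s in strides]
    fused = np.mean(maps, axis=0)       # stand-in for the learned 1x1 conv
    return (fused > thresh).astype(np.uint8)

gt = np.zeros((8, 8))
gt[2:6, 2:6] = 1                        # a 4x4 square object
detail = detail_mask(gt)
```

The coarser strides thicken the boundary band, so the resulting mask marks a region around the object edge rather than a one-pixel contour.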
 
    - detail loss: dice + bce
    - with detail guidance, the backbone's stage3 feature is forced to retain more detailed low-level features
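
A minimal NumPy sketch of a "dice + bce" loss on a predicted detail probability map. The squared-sum form of the dice denominator, the smoothing constant `eps=1.0`, and the equal weighting of the two terms are assumptions; `detail_loss` is a hypothetical name.

```python
import numpy as np

def detail_loss(prob, target, eps=1.0):
    """Binary cross-entropy plus dice loss for a binary detail mask.

    prob   : predicted probabilities in [0, 1]
    target : binary ground-truth detail mask of the same shape
    """
    prob = np.clip(prob, 1e-7, 1 - 1e-7)    # avoid log(0)
    bce = -np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob))
    intersection = np.sum(prob * target)
    dice = 1.0 - (2.0 * intersection + eps) / (
        np.sum(prob ** 2) + np.sum(target ** 2) + eps
    )
    return bce + dice

target = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_perfect = detail_loss(target, target)        # close to zero
loss_inverted = detail_loss(1 - target, target)   # large
```

The dice term is there because detail pixels are a small minority of the image, a regime where cross-entropy alone tends to be dominated by the background.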
 
 
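- experiments
  - backbone experiments
    - benchmarked against MobileNetV3 and EfficientNet-B0; both accuracy and speed are better
    - ImageNet accuracy & FLOPs: FLOPs are on the higher side, but everything is 3x3 convolution, so inference is faster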