


  • [SETR] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, Fudan; underwhelming, it reads like FCN with the backbone swapped for a transformer

  • [UNETR 2021] UNETR: Transformers for 3D Medical Image Segmentation, NVIDIA; directly uses a transformer encoder as the U-Net encoder

  • [TransUNet 2021] TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation; adds transformer blocks to the encoder stream

  • [TransFuse 2021] TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation, university; fuses CNN features and Transformer features via BiFusion


  • [Swin-Unet 2021] Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation, TUM; a 2D Unet-like pure transformer, with Swin as the encoder and a symmetric decoder
  • [nnFormer 2021] nnFormer: Interleaved Transformer for Volumetric Segmentation, HKU; positioned against nnU-Net, a 3D version of Swin-Unet that closely follows the previous paper
  • [UPerNet 2018] Unified Perceptual Parsing for Scene Understanding, PKU & ByteDance; background for Swin segmentation, since Swin uses UperNet as the base framework for its downstream segmentation task
  • [SegFormer 2021] SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, HKU & NVIDIA; following the FCN paradigm (CNN + FPN + seg head), it designs an all-linear segmentation network with a Swin-like hierarchical transformer encoder + MLP decoder

Swin Transformer for Semantic Segmentation

Supplementary notes on the segmentation description in the appendix of the Swin paper:

nnFormer: Interleaved Transformer for Volumetric Segmentation

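  1. Motivation

    • use the transformer's ability to exploit long-term dependencies to compensate for the inherent spatial inductive bias of convolutional networks
    • recent transformer-based approaches
      • use the transformer as an auxiliary module for encoding global context
      • do not effectively integrate self-attention, the core of the transformer, into the CNN
    • nnFormer: Not-aNother transFormer
      • volume-based self-attention, which greatly reduces the computation
      • reported to beat Swin-Unet and nnUnet
  2. Arguments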

    • Transformers

      • self-attention
      • capture long-range dependencies
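      • give predictions more consistent with humans
    • previous approaches

      • TransUNet: U-Net-like structure; a CNN extracts features and a transformer is appended to help encode global information, but one or two transformer layers are not enough to capture such long-range dependencies
      • Swin-UNet: with an appropriate down-sampling scheme, the transformer can learn hierarchical object concepts at different scales; however, it is a pure-transformer design whose encoder and decoder are built from hierarchical transformer blocks (overall still a U-Net), and it does not explore how to combine convolution and self-attention
    • nnFormer contributions

      • hybrid stem: uses both convolution and self-attention and lets each play to its strengths; its encoder:

        • first a lightweight conv embedding layer; the benefit is that convolution provides more precise spatial information
        • then interleaved transformer blocks and convolutional down-sampling blocks, which capture long-term dependencies at various scales

      • V-MSA: volume-based multi-head self-attention

        • a computationally efficient way to capture inter-slice dependencies
        • computational complexity reduced by more than 90%
        • presumably similar to Swin's intra-window & inter-window (shifted-window) attention?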
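  3. Method

    • overview

      • U-Net structure:

        • embedding block + encoder + decoder + patch expanding block
        • three down-sampling & three up-sampling steps
        • long residual connections

    • encoder

      • input: 3D patch $X \in R^{H \times W \times D}$

      • embedding block

        • converts the 3D patch into patch tokens, $X_e \in R^{\frac{H}{4} \times \frac{W}{4} \times \frac{D}{2} \times C}$, which represent high-resolution spatial information
        • $\frac{H}{4} \times \frac{W}{4} \times \frac{D}{2}$ is the number of tokens
        • C is the channel dimension, 192/96
        • four consecutive 3x3 conv layers replace the big kernel used in Swin: the small kernels are justified by computation & receptive field (nothing special), and the conv embedding is justified as encoding local spatial information at pixel level, more precisely
        • each of the first three conv layers is followed by GELU + LN; strides are applied in layers 1 and 3, as shown in the paper's figure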

      • transformer block

        • hierarchical

        • compute self-attention within 3D local volumes (instead of 2D local windows)

        • input: token representation of the 3D patch, $X_t \in R^{L \times C}$

        • first reshape: partition the token sequence into local volumes, $\tilde X_t \in R^{N_V \times N_T \times C}$

          • each local volume contains a group of spatially adjacent tokens
          • $N_V$ is the number of volumes (like the number of windows in Swin)
          • $N_T=S_H \times S_W \times S_D$ is the number of tokens in each local volume, with $\{S_H, S_W, S_D\}$ = {4,4,4}/{5,5,3}
        • then, as in Swin, two consecutive transformer blocks, with 3D windows instead of 2D

          • V-MSA: volume-based multi-head self-attention
          • SV-MSA: shifted version

        • in short it is a 3D Swin; rereading Swin makes this clearer
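
The local-volume reshape described above can be sketched in NumPy. This is a minimal illustration, not the nnFormer code: the function name `partition_volumes`, the toy 8 x 8 x 8 token grid and the channel count are assumptions made for the example. It also counts query-key pairs to show why attending inside volumes is cheaper than global attention.

```python
import numpy as np

def partition_volumes(x, vol_size):
    """Split a (H, W, D, C) token grid into (N_V, N_T, C) local volumes."""
    H, W, D, C = x.shape
    sh, sw, sd = vol_size
    assert H % sh == 0 and W % sw == 0 and D % sd == 0
    x = x.reshape(H // sh, sh, W // sw, sw, D // sd, sd, C)
    # Volume-index axes first, then the position-inside-volume axes.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, sh * sw * sd, C)

# Hypothetical 8 x 8 x 8 token grid with 4 channels.
tokens = np.arange(8 * 8 * 8 * 4, dtype=np.float32).reshape(8, 8, 8, 4)
volumes = partition_volumes(tokens, (4, 4, 4))
n_v, n_t, _ = volumes.shape          # 8 volumes of 64 tokens each

# Number of query-key pairs: all tokens vs. tokens inside each volume.
global_pairs = (8 * 8 * 8) ** 2
local_pairs = n_v * n_t ** 2
print(volumes.shape, local_pairs / global_pairs)
```

The ratio printed here only applies to this toy grid; the saving quoted in the notes is the paper's figure for its own input sizes.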

      • down-sampling block

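        • simply a strided conv; the claim is that, compared with neighboring concatenation, it produces more hierarchical representations, which helps learning at multiple scales

    • decoder

      • highly symmetric to the encoder
      • the down-sampling blocks are replaced by strided deconvolutions
      • long-range connections to the encoder fuse semantic and fine-grained information
      • the final expanding block also uses deconv

  4. Experiments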

Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation

  1. Motivation
    • Unet-like pure Transformer
      • Swin transformer as the encoder
      • symmetric decoder, with patch expanding layers for up-sampling
    • outperforms full-convolution / combined methods
  2. Arguments

    • limitations of CNNs
      • hard to extract explicit global & long-range information
      • meanwhile Swin has reached SOTA on a range of tasks
    • Swin-Unet
      • the first pure Transformer-based Unet-shaped architecture
      • consists of encoder, bottleneck, decoder and skip connections
      • token: non-overlapping patches split from the input image
      • fed into encoder: yields context features
      • fed into decoder: up-samples the global features back to the input resolution
      • patch expanding layer: achieves the increase in spatial size and feature dim without conv/interpolation
      • skip connection: still effective for Transformer-based U-Net structures
  3. Method

    • overview

      • patch partition
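        • split the image into non-overlapping patches, patch size 4x4
        • the feature dimension of each patch is 4x4x3=48, i.e. a 48-dim vector
      • linear embedding
        • projects the fixed patch dimension to an arbitrary given dimension C
      • interleaved Swin Transformer blocks and Patch Merging
        • generate hierarchical feature representations
        • the Swin Transformer block is responsible for learning feature representations
        • Patch Merging is responsible for the dimension change (down-sampling; its up-sampling counterpart is Patch Expanding)
      • symmetric decoder: interleaved Swin Transformer blocks and Patch Expanding
        • Patch Expanding reshapes adjacent feature maps into a 2x larger feature map while reducing the feature dimension
      • the last Patch Expanding layer performs 4x up-sampling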
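
As a rough illustration of the patch partition and linear embedding steps, the sketch below cuts an image into 4x4 patches (48 values each) and projects them to C dimensions. The helper name `patch_partition`, the 224x224 input, the random weights and C = 96 are assumptions for the example only.

```python
import numpy as np

def patch_partition(img, p=4):
    """Cut an (H, W, 3) image into non-overlapping p x p patches, flattened."""
    H, W, C = img.shape
    x = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape((H // p) * (W // p), p * p * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))        # hypothetical input image
tokens = patch_partition(img)                   # one 48-dim vector per patch

C = 96                                          # assumed embedding dimension
w_embed = rng.standard_normal((48, C)) * 0.02   # stand-in for the learned layer
embedded = tokens @ w_embed                     # (H/4 * W/4, C) tokens
print(tokens.shape, embedded.shape)
```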
    • Swin Transformer block

      • based on shifted windows
      • two consecutive Transformer blocks form one group
      • each block is LN-MSA-LN-MLP, with residual connections and GELU
      • the MSA of the first block is W-MSA
      • the MSA of the second block is SW-MSA

    • encoder

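      • input: C-dim tokens, $\frac{H}{4} \times \frac{W}{4}$ tokens in total
      • patch merging layer
        • split the patches into 4 parts by 2x2 neighborhoods
        • concatenate the 4 parts along the feature dimension (giving 4C)
        • then a linear layer maps the feature dimension to 2C
        • so the spatial resolution is down-sampled by 2x
        • and the feature dimension is doubled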
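
The patch merging steps listed above can be written as a short NumPy sketch. It is a simplified illustration under assumptions: the function name, the 56x56x96 input and the random projection are made up, and the normalization that Swin applies before the linear layer is left out.

```python
import numpy as np

def patch_merging(x, weight):
    """x: (H, W, C) tokens on a grid, weight: (4C, 2C) linear projection."""
    # The four members of each 2x2 neighborhood, each of shape (H/2, W/2, C).
    parts = [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]]
    merged = np.concatenate(parts, axis=-1)   # (H/2, W/2, 4C)
    return merged @ weight                    # (H/2, W/2, 2C)

rng = np.random.default_rng(0)
x = rng.standard_normal((56, 56, 96))          # hypothetical stage-1 tokens
w = rng.standard_normal((4 * 96, 2 * 96)) * 0.02
y = patch_merging(x, w)
print(y.shape)  # (28, 28, 192)
```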
    • bottleneck

      • the part between the encoder and the decoder
      • uses two consecutive Swin transformer blocks
      • [QUESTION] are these also shifted-window blocks?
      • the feature dimension stays unchanged in this part
    • decoder

      • patch expanding layer
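        • given input features: $(\frac{W}{32} \times \frac{H}{32}\times 8C)$
        • first a linear layer doubles the feature dim: $(\frac{W}{32} \times \frac{H}{32}\times 16C)$
        • then a rearrange spreads each token's channels over 2x2 neighboring patch tokens, giving 2x the resolution and a quarter of the channels: $(\frac{W}{16} \times \frac{H}{16}\times 4C)$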
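
A minimal NumPy sketch of the patch expanding layer, following the shapes given above (8C to 16C by a linear layer, then a rearrange to 2x resolution with 4C channels). The function name, the base width C = 96, the 7x7 grid and the random weights are assumptions; the exact channel-to-position ordering in the real implementation may differ.

```python
import numpy as np

def patch_expanding(x, weight):
    """x: (H, W, C) tokens, weight: (C, 2C). Returns (2H, 2W, C/2)."""
    H, W, C = x.shape
    x = x @ weight                              # (H, W, 2C)
    x = x.reshape(H, W, 2, 2, C // 2)           # split channels into 2x2 groups
    x = x.transpose(0, 2, 1, 3, 4)              # (H, 2, W, 2, C/2)
    return x.reshape(2 * H, 2 * W, C // 2)

rng = np.random.default_rng(0)
C = 96                                          # assumed base width
x = rng.standard_normal((7, 7, 8 * C))          # W/32 x H/32 x 8C for a 224 input
w = rng.standard_normal((8 * C, 16 * C)) * 0.02
y = patch_expanding(x, w)
print(y.shape)  # (14, 14, 384)
```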
    • skip connection

      • after the concat, a linear layer keeps the feature dimension unchanged
  4. Experiments

UPerNet: Unified Perceptual Parsing for Scene Understanding

  1. Motivation

    • human recognition of images operates at multiple levels
      • scenes
      • objects inside
      • compositional parts
      • textures and surfaces
    • our work
      • study a new task called Unified Perceptual Parsing (UPP)
      • requires the model to recognize as many visual concepts as possible
      • propose a multi-task framework UPerNet & a training strategy
    • repo:
      • semantic segmentation
      • multi-task
  2. Arguments

    • various visual recognition tasks are mostly studied independently
      • past work studied visual information at different levels separately
      • is it possible for a neural network to solve several visual recognition tasks simultaneously?
    • thus we propose the Unified Perceptual Parsing (UPP) task
      • there are two data issues
      • no single dataset annotated with all levels of visual information
      • the annotation formats also differ across perceptual levels
    • thus we propose UPerNet
      • overcome the heterogeneity of different datasets
      • learns to detect various visual concepts jointly
      • mainly implemented by picking only one dataset per iteration and updating only the relevant layers
    • we further propose a training method
      • enable the network to predict pixel-wise texture labels using only image-level annotations
  3. Method

    • Defining Unified Perceptual Parsing

      • unified perceptual parsing: extract visual information at multiple levels from a single image

        • scene labels
        • objects
        • parts of objects
        • materials and textures of objects
      • datasets

        • uses the Broadly and Densely Labeled Dataset (Broden): it combines several datasets and contains various visual concepts

        • Objects, object parts and materials are segmented down to pixel level while textures and scenes are annotated at image level

        • standardization

          • data imbalance issue: drop part of the long-tail data
          • merge across dataset: merge the same classes from different datasets
          • merge under-sampled labels: merge sub-classes
        • our Broden+

          • 57,095 images in total: 51,617 training / 5,478 validation
          • 22,210 images from ADE20K, 10,103 images from Pascal-Context and Pascal-Part, 19,142 images from OpenSurfaces and 5,640 images from DTD

      • metrics

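        • Pixel Accuracy (P.A.): the proportion of correctly classified pixels
        • mean IoU (mIoU): average IoU over the foreground object classes; since background is left out, background segmentation quality can suffer
        • mIoU-bg: when the foreground occupies only a small fraction, the bg IoU is included as well; used for object parts
        • top-1 acc: image-level annotations are evaluated with top-1 accuracy; used for scene and texture classification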
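
To make the difference between these metrics concrete, here is a small NumPy sketch on a toy 2x3 label map. It is a simplification for illustration: label 0 stands in for background, the function names are made up, and the paper's handling of unlabeled pixels is not modeled.

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """Fraction of pixels whose predicted label equals the ground truth."""
    return float((pred == gt).mean())

def mean_iou(pred, gt, classes):
    """Average intersection-over-union across the given class ids."""
    ious = []
    for c in classes:
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                 # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.array([[0, 0, 1], [1, 2, 2]])    # label 0 plays the role of background
pred = np.array([[0, 1, 1], [1, 2, 0]])
pa = pixel_accuracy(pred, gt)
miou_fg = mean_iou(pred, gt, classes=[1, 2])      # foreground only
miou_bg = mean_iou(pred, gt, classes=[0, 1, 2])   # background included
print(pa, miou_fg, miou_bg)
```

On this toy example the foreground-only score is higher than the score that includes background, because the background class is the one segmented worst.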
    • Designing Networks for Unified Perceptual Parsing

      • overview

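        • since it covers high/low-level visual tasks, the network is also multi-level: FPN with a PPM
        • the scene head predicts an image-level classification label and is attached directly to the PPM output
        • the object and part heads are multi-scale and use the fused FPN output
        • the material head is a fine-grained task and uses the highest-resolution FPN feature map
        • the texture head is even more fine-grained; it is attached after the Res-2 block of the backbone and is fine-tuned after the network has been trained on the other tasks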
      • FPN

        • multi-level feature
        • use [top-down path + lateral connections] to fuse high-level semantic information into the middle & low levels
        • conv-BN-ReLU, channel = 512
      • PPM

        • from PSPNet
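        • addresses the problem that a CNN's theoretical receptive field is large enough, but its effective receptive field is much smaller
        • compared with enlarging the receptive field via dilated methods, the advantage is a larger down-sampling rate (still x32), which allows extracting high-level semantics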
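
A rough NumPy sketch of the pyramid pooling idea from PSPNet: average-pool the feature map into a few coarse grids, up-sample each back, and concatenate with the input so every position sees global context. The bin sizes (1, 2, 3, 6) are the ones PSPNet uses; everything else is an assumption for brevity, including nearest-neighbor up-sampling in place of bilinear and the omission of the per-branch conv layers.

```python
import numpy as np

def adaptive_avg_pool(x, bins):
    """Average-pool an (H, W, C) map into a (bins, bins, C) grid."""
    H, W, C = x.shape
    out = np.zeros((bins, bins, C), dtype=x.dtype)
    for i in range(bins):
        h0, h1 = (i * H) // bins, -((-(i + 1) * H) // bins)   # floor, ceil
        for j in range(bins):
            w0, w1 = (j * W) // bins, -((-(j + 1) * W) // bins)
            out[i, j] = x[h0:h1, w0:w1].mean(axis=(0, 1))
    return out

def upsample_nearest(x, H, W):
    """Nearest-neighbor resize of an (h, w, C) map to (H, W, C)."""
    rows = (np.arange(H) * x.shape[0]) // H
    cols = (np.arange(W) * x.shape[1]) // W
    return x[rows][:, cols]

def pyramid_pooling(x, bin_sizes=(1, 2, 3, 6)):
    H, W, _ = x.shape
    branches = [x]
    for b in bin_sizes:
        branches.append(upsample_nearest(adaptive_avg_pool(x, b), H, W))
    return np.concatenate(branches, axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((12, 12, 8))     # hypothetical x32 feature map
out = pyramid_pooling(x)
print(out.shape)  # (12, 12, 40)
```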
      • ResNet

        • uses the output of each stage as a level feature map, {C2, C3, C4, C5}, x4-x32
        • the FPN output feature maps are {P2, P3, P4, P5}; P5 is the PPM output
      • heads

        • scene head: a classifier on the PPM output, global average pooling + linear layer
        • object/parts head: experiments show the fusion map performs better than P2; the fusion map is obtained by bilinear interpolation & concat & conv
        • materials head: on top of P2 rather than fused features
        • texture head:
          • texture labels are image-level and come from non-natural images
          • directly fusing these images with other natural images is harmful to other tasks
          • at the same time we want pixel-level predictions
          • we attach it to C2 and append several convolutional layers, so the receptive field is small enough; no gradients are propagated to the backbone layers, only the head layers are updated
          • training images are 64x64, so the model focuses only on local detail
          • only fine-tune a few epochs
    • training settings

      • poly learning rate, initial=0.02, power=0.9
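      • weight decay=1e-4, momentum=0.9
      • training inputs: the rescaling scheme common in object detection, randomly resizing the shorter side to one of {300, 375, 450, 525, 600}
      • inference inputs: fixed shorter side of 450
      • longer side < 1200
      • to avoid noisy gradients, each mini-batch randomly picks one data source, sampled according to dataset size, and only the parameters on the relevant path are updated
      • object and material segmentation compute the loss on foreground only
      • part segmentation computes the loss on foreground + background
      • on each GPU a mini-batch involves 2 images
      • sync-SGD & sync BN across 8 GPUs
      • training iterations of ADE20k (20k images) is 100k; the other datasets are rescaled accordingly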
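
Two of the settings above are easy to pin down in code: the poly schedule multiplies the initial learning rate by (1 - iter/max_iter)^power, and the input rescaling fixes the shorter side while capping the longer one. The sketch below is illustrative only; the function names are made up and the cap is applied as "at most 1200".

```python
import random

def poly_factor(it, max_it, power=0.9):
    """Multiplier applied to the initial learning rate at iteration `it`."""
    return (1 - it / max_it) ** power

def rescale_size(h, w, short_side, max_long=1200):
    """Resize so the shorter side equals `short_side`, capping the longer side."""
    scale = short_side / min(h, w)
    if max(h, w) * scale > max_long:
        scale = max_long / max(h, w)
    return round(h * scale), round(w * scale)

# Training picks the shorter side at random; inference uses a fixed 450.
train_short = random.choice([300, 375, 450, 525, 600])
print(rescale_size(480, 640, train_short), rescale_size(480, 640, 450))
print(poly_factor(0, 100_000), poly_factor(50_000, 100_000))
```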
    • Design discussion

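      • previous segmentation networks are mainly FCNs that pair pretrained backbones with dilated convs, enlarging the receptive field while keeping a relatively large resolution
      • but the original backbones usually have many layers in stages 4/5; e.g. res4 & res5 of resnet101 account for 78 layers
        • first, switching to dilated convs makes computation and memory blow up
        • second, it departs from the original design, so it may no longer perform as originally intended
        • third, it is hard to also accommodate the classification tasks in this paper
  4. Experiments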

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

  1. repo:

  2. Motivation

    • propose a semantic segmentation framework SegFormer
      • simple, lightweight, efficient, powerful
      • hierarchical transformer + MLP decoder
    • features

      • does not need positional encoding: switching the image resolution at inference does not cause a performance drop
      • avoids complex decoders: the MLP decoder mainly just merges the multi-level features
    • scale up to obtain a family of models: SegFormer-B0 to SegFormer-B5

    • verified on

      • SegFormer-B4: 50.3% mIoU on ADE20K, SOTA
      • SegFormer-B5: 84.0% mIoU on Cityscapes

  3. Arguments

    • SETR
      • ViT backbone + several CNN decoders
      • ViT's main issues are computation cost & single-scale features
      • later methods such as PVT, Swin and Twins mainly focus on improving multi-scale backbones and neglect the decoder design
    • this paper (SegFormer)
      • redesigns both the encoder and the decoder
      • improved Transformer encoder: hierarchical & positional-encoding-free
      • newly designed all-MLP decoder: lightweight but powerful; the core idea is to take advantage of the Transformer-induced features where the attentions of lower layers tend to stay local, whereas the ones of the highest layers are highly non-local
  4. Method

    • overview

      • encoder:
        • uses a 4x4 patch size; compared with ViT's 16x16, fine-grained patches are preferred for semantic segmentation
        • multi-level features: x4, x8, x16, x32
      • decoder
        • takes the multi-level features above as input
        • outputs a segmentation mask at x4
    • Hierarchical Transformer Encoder

      • We design a series of Mix Transformer encoders (MiT):MiT-B0 to MiT-B5
      • based on PVT's efficient self-attention module
        • targets the quadratic time complexity of the original attention block
        • use a reduction ratio R to reduce the length of sequence K: note that the length of K is changed, not Q
        • given the original sequence length $N=HW$ and feature dimension $C$
          • first reshape: $\hat K = Reshape(\frac{N}{R},CR)(K)$
          • then reduce the dimension: $K=Linear(CR,C)(\hat K)$
        • complexity drops from $O(N^2)$ to $O(N^2/R)$
        • set R to [64, 16, 4, 1] from stage-1 to stage-4
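
The reshape-then-project reduction can be sketched for a single attention head in NumPy. This is a toy illustration, not the SegFormer code: the sizes, random weights and function names are made up, and it assumes V is shortened the same way as K so that the shapes line up.

```python
import numpy as np

def reduce_sequence(seq, R, weight):
    """Reshape (N, C) -> (N/R, C*R), then project back to C channels."""
    N, C = seq.shape
    return seq.reshape(N // R, C * R) @ weight   # (N/R, C)

def efficient_attention(Q, K, V, R, w_k, w_v):
    K_r = reduce_sequence(K, R, w_k)
    V_r = reduce_sequence(V, R, w_v)             # assumption: V reduced like K
    scores = Q @ K_r.T / np.sqrt(Q.shape[1])     # (N, N/R) instead of (N, N)
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return attn @ V_r, attn

rng = np.random.default_rng(0)
N, C, R = 64, 8, 4                               # toy sizes, not the paper's
Q, K, V = (rng.standard_normal((N, C)) for _ in range(3))
w_k = rng.standard_normal((C * R, C)) * 0.1
w_v = rng.standard_normal((C * R, C)) * 0.1
out, attn = efficient_attention(Q, K, V, R, w_k, w_v)
print(out.shape, attn.shape)  # (64, 8) (64, 16)
```

The attention matrix has N x N/R entries rather than N x N, which is where the O(N^2/R) cost comes from; the query sequence and therefore the output length stay at N.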
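      • also proposes several novel designs
        • overlapped patch merging
          • one argument of the paper: ViT-style patch merging uses non-overlapping patches, which does not preserve local continuity between adjacent patches, hence the need for positional encoding
          • so use an overlapping patch merging process
          • notations
            • patch size K=7/3
            • stride S=4/2
            • padding size P=3/1 (zero padding of K//2 per side, i.e. not "valid" padding)
            • the patch merging operation is still implemented with a convolution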
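
With these K/S/P values the standard convolution output-size formula, floor((size + 2P - K) / S) + 1, gives exactly the x4, x8, x16, x32 feature maps even though the patches overlap. A small check, with hypothetical helper names:

```python
def conv_out(size, k, s, p):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * p - k) // s + 1

def stage_sizes(size):
    """Feature-map side length after each of the four patch-merging stages."""
    configs = [(7, 4, 3), (3, 2, 1), (3, 2, 1), (3, 2, 1)]   # (K, S, P)
    sizes = []
    for k, s, p in configs:
        size = conv_out(size, k, s, p)
        sizes.append(size)
    return sizes

print(stage_sizes(224))  # [56, 28, 14, 7]
print(stage_sizes(512))  # [128, 64, 32, 16]
```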
        • positional-encoding-free design
          • when ViT changes resolution, the PE has to be interpolated accordingly, which still causes an accuracy drop
          • we introduce Mix-FFN
            • inserts a 3x3 conv into the FFN
            • sufficient to provide positional information
            • depth-wise convolutions can even be used to save parameters
          • we argue that adding PE is not necessary in semantic segmentation
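
A simplified NumPy sketch of Mix-FFN, following the form x_out = MLP(GELU(Conv3x3(MLP(x_in)))) + x_in with a depth-wise 3x3 conv. The sizes, random weights and helper names are assumptions. Feeding the same token at every position shows how the zero padding of the conv leaks location: border outputs differ from interior ones.

```python
import numpy as np

def depthwise_conv3x3(x, kernel):
    """x: (H, W, C), kernel: (3, 3, C); zero padding of 1 keeps the size."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += xp[di:di + H, dj:dj + W] * kernel[di, dj]
    return out

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def mix_ffn(x, w1, kernel, w2):
    h = x @ w1                        # expand channels
    h = depthwise_conv3x3(h, kernel)  # mixes in neighboring positions
    h = gelu(h)
    return x + h @ w2                 # project back, residual connection

rng = np.random.default_rng(0)
C, C_hidden = 4, 16                   # toy widths
w1 = rng.standard_normal((C, C_hidden)) * 0.5
kernel = rng.standard_normal((3, 3, C_hidden)) * 0.5
w2 = rng.standard_normal((C_hidden, C)) * 0.5
x = np.ones((6, 6, C))                # identical token at every position
y = mix_ffn(x, w1, kernel, w2)
print(np.allclose(y[2, 2], y[3, 3]), np.allclose(y[0, 0], y[2, 2]))
```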
    • Lightweight All-MLP Decoder

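      • the idea is that transformers have a much larger effective receptive field than traditional CNNs, so the decoder can be lightweight without stacking more blocks

      • 4 main steps

        • unify: each multi-level feature map goes through its own MLP layer to unify the channel dimension
        • upsample: all features are up-sampled to x4 with bilinear interpolation
        • fuse: concat + MLP (the actual code uses 1x1 conv-bn-relu)
        • seg head: MLP; the predicted mask stays at the x4 scale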
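
The four decoder steps can be sketched in NumPy as below. This is a shape-level illustration under assumptions: the channel widths, embedding size, class count and random weights are made up, nearest-neighbor up-sampling replaces bilinear interpolation, and batch norm is omitted.

```python
import numpy as np

def upsample_nearest(x, H, W):
    """Nearest-neighbor resize of an (h, w, C) map to (H, W, C)."""
    rows = (np.arange(H) * x.shape[0]) // H
    cols = (np.arange(W) * x.shape[1]) // W
    return x[rows][:, cols]

def mlp_decoder(features, unify_ws, fuse_w, cls_w):
    H, W, _ = features[0].shape                        # x4 resolution
    unified = [upsample_nearest(f @ w, H, W)           # unify, then upsample
               for f, w in zip(features, unify_ws)]
    fused = np.concatenate(unified, axis=-1) @ fuse_w  # fuse: concat + linear
    fused = np.maximum(fused, 0.0)                     # ReLU, as in conv-bn-relu
    return fused @ cls_w                               # per-pixel logits at x4

rng = np.random.default_rng(0)
dims, embed, n_cls = (32, 64, 160, 256), 64, 19        # illustrative sizes
sizes = (32, 16, 8, 4)                                 # x4..x32 of a 128x128 input
features = [rng.standard_normal((s, s, c)) for s, c in zip(sizes, dims)]
unify_ws = [rng.standard_normal((c, embed)) * 0.05 for c in dims]
fuse_w = rng.standard_normal((4 * embed, embed)) * 0.05
cls_w = rng.standard_normal((embed, n_cls)) * 0.05
logits = mlp_decoder(features, unify_ws, fuse_w, cls_w)
print(logits.shape)  # (32, 32, 19)
```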

  5. Experiments

    • training settings
      • AdamW: lr=2e-4, weight decay=1e-4
      • poly LR: power=0.9, by iteration