light-weight transformers

papers

[tinyViT ECCV2022] TinyViT: Fast Pretraining Distillation for Small Vision Transformers: Microsoft; a to-device version of Swin; uses CLIP for KD on unlabeled data

[topformer CVPR2022] TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation:

[mobileViT ICLR2022] MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer: Apple

[mobileformer CVPR2022] Mobile-Former: Bridging MobileNet and Transformer: Microsoft

TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation

  1. abstract

    • task:semantic segmentation
    • take advantage of CNNs & ViTs
      • CNN-based Token Pyramid Module:MobileNetV2 blocks
      • a ViT-based module Semantics Extractor
    • verified on
      • ADE20K: 5% higher mIoU than MobileNetV3, with lower latency
  2. method

    • overview

      • Token Pyramid Module: takes the image as input and outputs a token pyramid
      • Semantics Extractor: takes the token pyramid as input and outputs scale-aware semantics
      • Semantics Injection Module: fuses the global semantics into the local tokens of the corresponding level, yielding augmented representations
      • Segmentation Head: fuses the augmented tokens and predicts the segmentation map (see the pipeline sketch below)
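To make the data flow concrete, here is a minimal sketch of how the four modules might compose in a forward pass; the class and argument names are mine, the sub-modules are assumed to exist (per-module sketches follow), and the per-level splitting of the semantics is glossed over:

```python
import torch.nn as nn

class TopFormerPipeline(nn.Module):
    """Hypothetical composition of the four modules described above."""
    def __init__(self, token_pyramid, semantics_extractor, injection_modules, seg_head):
        super().__init__()
        self.tpm = token_pyramid                      # Token Pyramid Module (MobileNetV2 blocks)
        self.sase = semantics_extractor               # Scale-aware Semantics Extractor (ViT-style)
        self.sims = nn.ModuleList(injection_modules)  # one Semantics Injection Module per level
        self.head = seg_head                          # Segmentation Head

    def forward(self, img):
        local_tokens = self.tpm(img)                  # x4/x8/x16/x32 feature maps
        semantics = self.sase(local_tokens)           # scale-aware global semantics
        augmented = [sim(t, semantics) for sim, t in zip(self.sims, local_tokens)]
        return self.head(augmented)                   # fuse and predict
```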
    • Token Pyramid Module

      • uses stacked MobileNetV2 blocks to produce the token pyramid
      • the goal is not rich semantic information or a large receptive field, only multi-scale tokens, so relatively few layers are used

      • input: a 512x512 image

      • output: x4/x8/x16/x32 feature maps
      • the multi-scale features are then average-pooled to the 1/64 scale and concatenated along the channel dimension, giving tokens of shape [b, (16+32+64+96), H/64, W/64] (see the sketch below)
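A minimal sketch of the pooling-and-concat step, assuming the four feature maps already come out of the stacked MobileNetV2 blocks with the channel counts from the note:

```python
import torch
import torch.nn.functional as F

def build_tokens(feats, img_size=512, target_stride=64):
    """feats: x4/x8/x16/x32 feature maps with 16/32/64/96 channels (per the note).
    Average-pool each map to the 1/64 scale and concatenate along the channel dim."""
    target = img_size // target_stride                    # 512 / 64 = 8
    pooled = [F.adaptive_avg_pool2d(f, target) for f in feats]
    return torch.cat(pooled, dim=1)                       # [b, 16+32+64+96, 8, 8]

# e.g. with random stand-ins for the MobileNetV2 outputs:
feats = [torch.randn(1, c, 512 // s, 512 // s)
         for c, s in zip([16, 32, 64, 96], [4, 8, 16, 32])]
tokens = build_tokens(feats)   # torch.Size([1, 208, 8, 8])
```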
    • Scale-aware Semantics Extractor

      • this part uses a vision transformer to extract the semantic information
      • first, to keep the 4D feature-map shape, linear layers are replaced with 1x1 convs and LayerNorm with BatchNorm
      • all ViT-style GELUs are replaced with ReLU6
      • in MSA, the Q/K projection dimension is 16 and the V projection dimension is 32, which reduces the cost of the similarity computation
      • a depth-wise conv is inserted between the two 1x1 convs of the FFN (see the block sketch below)
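A minimal PyTorch sketch of the conv/BN-based attention and FFN described above; this is my paraphrase of the idea, not the official TopFormer code: the projection dims follow the note (Q/K dim 16, V dim 32 per head), the head count is my placeholder, and the residual connections around the two sub-blocks are omitted.

```python
import torch
import torch.nn as nn

def conv_bn(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 1, bias=False), nn.BatchNorm2d(c_out))

class ConvAttention(nn.Module):
    """MSA with 1x1 conv + BN projections; Q/K dim 16 and V dim 32 per head."""
    def __init__(self, dim, num_heads=4, key_dim=16, val_dim=32):
        super().__init__()
        self.h, self.dk, self.dv = num_heads, key_dim, val_dim
        self.q = conv_bn(dim, num_heads * key_dim)
        self.k = conv_bn(dim, num_heads * key_dim)
        self.v = conv_bn(dim, num_heads * val_dim)
        self.proj = nn.Sequential(nn.ReLU6(), conv_bn(num_heads * val_dim, dim))

    def forward(self, x):
        b, _, hh, ww = x.shape
        n = hh * ww
        q = self.q(x).view(b, self.h, self.dk, n).transpose(-2, -1)   # [b, h, n, dk]
        k = self.k(x).view(b, self.h, self.dk, n)                     # [b, h, dk, n]
        v = self.v(x).view(b, self.h, self.dv, n).transpose(-2, -1)   # [b, h, n, dv]
        attn = (q @ k * self.dk ** -0.5).softmax(dim=-1)              # [b, h, n, n]
        out = (attn @ v).transpose(-2, -1).reshape(b, self.h * self.dv, hh, ww)
        return self.proj(out)

class ConvFFN(nn.Module):
    """Two 1x1 convs with a 3x3 depth-wise conv in between; ReLU6 instead of GELU."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.net = nn.Sequential(
            conv_bn(dim, hidden), nn.ReLU6(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(),
            conv_bn(hidden, dim))

    def forward(self, x):
        return self.net(x)
```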
    • Semantics Injection Module

      • fuse the local tokens and the scale-aware semantics
      • to alleviate the semantic gap
        • the local tokens go through a 1x1 conv-BN and are weighted by a map produced from the semantic tokens via 1x1 conv-BN-sigmoid
        • a 1x1 conv-BN projection of the semantic tokens is then added on top
        • (the semantic tokens sit at a lower resolution, so they are upsampled to the local-token resolution before this fusion; see the sketch below)
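A minimal sketch of the injection step under the reading above; the class and argument names are mine, and `out_ch` assumes every level is projected to a common width:

```python
import torch.nn as nn
import torch.nn.functional as F

def conv_bn(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 1, bias=False), nn.BatchNorm2d(c_out))

class SemanticsInjection(nn.Module):
    """Gate the local tokens with a sigmoid map from the semantics, then add the
    projected semantics (all projections are 1x1 conv + BN, as in the note)."""
    def __init__(self, local_ch, sem_ch, out_ch):
        super().__init__()
        self.local_proj = conv_bn(local_ch, out_ch)
        self.gate = nn.Sequential(conv_bn(sem_ch, out_ch), nn.Sigmoid())
        self.sem_proj = conv_bn(sem_ch, out_ch)

    def forward(self, local_tok, sem_tok):
        # the semantics live at the 1/64 scale, so upsample them to the local resolution
        sem_tok = F.interpolate(sem_tok, size=local_tok.shape[-2:],
                                mode='bilinear', align_corners=False)
        return self.local_proj(local_tok) * self.gate(sem_tok) + self.sem_proj(sem_tok)
```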
    • Segmentation Head

      • all the low-resolution tokens are upsampled to the scale of the high-resolution tokens (x8)
      • element-wise sum
      • then two conv layers (see the sketch below)
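A minimal sketch of the head, assuming the augmented tokens from all levels share one channel count; the exact widths of the two convs are not given in the note, so they are placeholders:

```python
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    """Upsample every level to the x8 scale, sum element-wise, then two convs."""
    def __init__(self, ch, num_classes):
        super().__init__()
        self.cls = nn.Sequential(
            nn.Conv2d(ch, ch, 1, bias=False), nn.BatchNorm2d(ch), nn.ReLU6(),
            nn.Conv2d(ch, num_classes, 1))

    def forward(self, feats):                       # all levels have `ch` channels here
        target = feats[0].shape[-2:]                # feats[0] is the x8 (highest-res) level
        fused = sum(F.interpolate(f, size=target, mode='bilinear', align_corners=False)
                    for f in feats)
        return self.cls(fused)
```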
  3. Experiments

    • datasets
      • ADE20K: 150 classes + background, ~25k images (20k train / 2k val / 3k test)
      • PASCAL Context: 59 classes + background, 4998 train / 5105 val
      • COCO-Stuff-10K: 9k train / 1k test
    • training details
      • iterations: 160k on ADE20K, 80k on PASCAL Context and COCO-Stuff
      • syncBN
      • lr: base lr = 0.00012, poly schedule with power 1.0 (see the small example below)
      • weight decay: 0.01
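For reference, a small worked example of the poly schedule with power 1.0; `poly_lr` is a hypothetical helper, not tied to any particular framework:

```python
def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 1.0) -> float:
    """Polynomial decay: lr = base_lr * (1 - cur_iter / max_iter) ** power."""
    return base_lr * (1 - cur_iter / max_iter) ** power

# the ADE20K setting from the note: base lr 0.00012, 160k iterations
print(poly_lr(0.00012, cur_iter=80_000, max_iter=160_000))  # 6e-05 at the halfway point
```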

TinyViT: Fast Pretraining Distillation for Small Vision Transformers

  1. abstract

    • tiny and efficient
    • KD on large-scale datasets
    • offline distillation using the logits of a large teacher model
    • accuracy: the 21M model reaches 84.8% on ImageNet, and 86.5% with a larger input resolution
    • good transfer ability on downstream tasks
      • linear probe
      • few-shot learning
      • object detection:COCO AP 50.2%
  2. overview

    • offline distillation
    • save the data-augmentation parameters and the teacher logits
  3. Method

    • Fast Pretraining Distillation
      • directly pretraining small models on massive data does not bring much gain, especially when transferring to downstream tasks, whereas the boost from distillation is substantial
      • some methods distill only at the finetuning stage; this paper focuses on pretraining distillation, which done naively is inefficient and costly, motivating the fast offline framework below
      • given image x
        • save strong data augmentation A (RandAugment & CutMix)
        • save teacher prediction $\hat y=T(A(x))$
      • because RandAugment has built-in randomness, the same augmentation policy gives a different result every time, so A has to be stored for every iteration
      • distilling with the ground-truth one-hot labels actually works worse than using the unlabeled teacher logits, because one-hot targets overfit
      • Sparse soft labels: ImageNet-21k logits have 21,841 dimensions, so to save storage only the top-K logits and their indices are stored, and the remaining probability mass is filled in via label smoothing
      • Data augmentation encoding:
        • a set of data-augmentation parameters is encoded into a single scalar d
        • a decoder then recovers them: PCG takes the single parameter as input and outputs a sequence of parameters
        • in practice, a single d0 is stored and PCG decodes the augmentation parameters for both teacher and student (see the sketch below)
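Putting the pieces together, a minimal sketch of the fast pretraining distillation loop under these assumptions: `teacher` and `student` are hypothetical classifiers over the 21,841-class label space, `augment(x, generator)` stands in for RandAugment + CutMix driven by a seeded generator, and torch's seeded RNG replaces the paper's PCG-based encoder/decoder (the sketch draws d0 first and derives the parameters from it, which gives the same reproducibility).

```python
import torch
import torch.nn.functional as F

K, NUM_CLASSES = 10, 21841        # top-K sparse logits; ImageNet-21k label space

def save_step(teacher, x, augment):
    """Offline pass: encode the augmentation as one scalar and store sparse teacher labels."""
    d0 = int(torch.randint(0, 2**31 - 1, ()))     # scalar d0 encoding the augmentation
    g = torch.Generator().manual_seed(d0)         # stand-in for the paper's PCG decoder
    x_aug = augment(x, g)                         # RandAugment / CutMix driven by g
    with torch.no_grad():
        probs = teacher(x_aug).softmax(dim=-1)
    top_vals, top_idx = probs.topk(K, dim=-1)     # sparse soft labels (values + indices)
    return d0, top_vals, top_idx

def distill_step(student, x, augment, d0, top_vals, top_idx):
    """Training pass: replay the exact augmentation, rebuild dense soft labels, distill."""
    g = torch.Generator().manual_seed(d0)
    x_aug = augment(x, g)                         # identical view to what the teacher saw
    # label smoothing for the classes outside the stored top-K
    rest = (1.0 - top_vals.sum(dim=-1, keepdim=True)) / (NUM_CLASSES - K)
    soft = rest.expand(-1, NUM_CLASSES).clone()
    soft.scatter_(1, top_idx, top_vals)
    log_p = F.log_softmax(student(x_aug), dim=-1)
    return -(soft * log_p).sum(dim=-1).mean()     # cross-entropy with the soft targets
```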
    • Model Architectures
      • adopts a hierarchical vision transformer: start from a large model, define a set of contraction factors, and scale down
        • patch embedding: two 3x3 convs, stride 2, pad 1
        • stage 1: MBConvs & downsampling blocks
        • stages 2/3/4: transformer blocks with window attention
        • attention biases
        • a 3 × 3 depthwise convolution between attention and MLP
        • the usual residual connections and LayerNorm are kept; the convs use BatchNorm; all activations are GELU (see the rough block sketch below)
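A rough PyTorch sketch of one stage-2/3/4 block as described above; window partitioning and the learned attention biases are omitted, and the names are mine rather than the official TinyViT code:

```python
import torch
import torch.nn as nn

class TinyViTBlockSketch(nn.Module):
    """Rough layout: attention -> 3x3 depthwise conv -> MLP, with residuals around
    the attention and MLP branches, LayerNorm, and GELU activations."""
    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise, between attn and MLP
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x, hw):                      # x: [b, n, dim] tokens, hw = (h, w), n = h*w
        h, w = hw
        y = self.norm1(x)
        x = x + self.attn(y, y, y)[0]
        # reshape to a feature map for the local depthwise conv, then back to tokens
        x = self.local_conv(x.transpose(1, 2).reshape(x.shape[0], -1, h, w))
        x = x.flatten(2).transpose(1, 2)
        return x + self.mlp(self.norm2(x))
```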
      • Contraction factors (collected in the config sketch below)
        • embedding dimensions: 21M - [96,192,384,576], 11M - [64,128,256,448], 5M - [64,128,160,320]
        • number of blocks:[2,2,6,2]
        • window size:[7,14,7]
        • channel expansion ratio of the MBConv block:4
        • expansion ratio of MLP:4
        • head dimension:32
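The contraction factors can be collected into per-variant configs; a minimal sketch restating the numbers above (the dict and key names are mine):

```python
# Contraction-factor configs from the note (embedding dims per stage, depths, window sizes).
TINYVIT_VARIANTS = {
    "21M": dict(embed_dims=[96, 192, 384, 576], depths=[2, 2, 6, 2], window_sizes=[7, 14, 7]),
    "11M": dict(embed_dims=[64, 128, 256, 448], depths=[2, 2, 6, 2], window_sizes=[7, 14, 7]),
    "5M":  dict(embed_dims=[64, 128, 160, 320], depths=[2, 2, 6, 2], window_sizes=[7, 14, 7]),
}
# shared across variants: MBConv expansion ratio 4, MLP expansion ratio 4, head dim 32,
# so the number of heads per stage is embed_dim // 32
```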