MuST: Google's Multi-Task Self-Training

  • recollect

    [SimCLR]

    [MoCo]

Multi-Task Self-Training for Learning General Representations

  1. Motivation

    • learning general feature representations
    • expect a single general model
      • as opposed to training specialized models for various tasks
      • harnesses knowledge from independent specialized teacher models
      • with a multi-task pseudo dataset
      • trained with multi-task learning
    • evaluated on 6 vision tasks
      • image recognition (classification, detection, segmentation)
      • 3D geometry estimation
  2. Key arguments

    • pretraining & transfer learning

      • the usual recipe for large pre-trained models such as BiT & ViT
      • pretraining
        • supervised / unsupervised
        • learn feature representations
      • transfer learning
        • on downstream tasks
        • the learned features are not necessarily useful there
        • the classic example: ImageNet pre-training does not improve COCO segmentation, whereas Objects365 pre-training brings a large gain
      • pretraining tasks must align with the downstream task; otherwise the specialized features learned during pretraining are largely wasted
    • learning general features

      • a model simultaneously do well on multiple tasks
      • in NLP, BERT is the classic example of using multiple tasks to gain general ability
      • this is harder in CV because of label variety: no comparably large multi-task dataset exists
    • multi-task learning

      • shared backbone (e.g. ResNet-FPN)
      • small task-specific heads
    • self-training

      • use a supervised model to generate pseudo labels on unlabeled data
      • then a student model is trained on the pseudo labeled data
      • proven to bring gains across a variety of tasks
      • but work so far has focused on a single task
    • in this work

      • the lack of a large-scale multi-task dataset is addressed by self-training with pseudo labels

      • the specialized-vs-general issue is addressed by multi-task learning: the training objective is an all-rounder that absorbs the knowledge of the different tasks in the shared backbone

      • three steps (sketched in code at the end of this section)

        • train specialized teachers independently on labeled datasets (classification, segmentation, detection, depth estimation)
        • use the specialized teachers to label a larger unlabeled dataset (ImageNet) to create a multi-task pseudo labeled dataset
        • train a student model with multi-task learning

      • properties of MuST

        • improves with more unlabeled data: more unlabeled data yields better general features
        • can improve upon already strong checkpoints: fine-tuning from a high-accuracy model trained on massive supervised data still brings gains on downstream tasks
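
      • a minimal Python sketch of the three-step pipeline; train_on, predict and train_multitask are illustrative placeholders, not the paper's code:

```python
# Minimal sketch of the three MuST steps. All helpers are hypothetical
# stand-ins for the real training / inference code.

def train_on(task, dataset):
    """Stand-in for training a specialized teacher on a labeled dataset."""
    return {"task": task, "trained_on": dataset}

def predict(teacher, image):
    """Stand-in for running a teacher to produce a pseudo label."""
    return f"pseudo-{teacher['task']}({image})"

def train_multitask(pseudo_dataset):
    """Stand-in for multi-task training of the student on pseudo labels."""
    return {"student": True, "num_images": len(pseudo_dataset)}

# Step 1: train specialized teachers independently on labeled datasets.
teachers = {
    "classification": train_on("classification", "ImageNet"),
    "detection": train_on("detection", "Objects365"),
    "segmentation": train_on("segmentation", "COCO"),
    "depth": train_on("depth", "a depth dataset"),  # fine-tuned from a checkpoint in the paper
}

# Step 2: label a larger unlabeled dataset (e.g. ImageNet) with every teacher,
# so each image carries supervision for all tasks.
unlabeled_images = ["img_0", "img_1", "img_2"]
pseudo_dataset = [
    {task: predict(t, img) for task, t in teachers.items()}
    for img in unlabeled_images
]

# Step 3: train a single student with multi-task learning on the pseudo labels.
student = train_multitask(pseudo_dataset)
print(student)
```
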
  3. Method

    • Specialized Teacher Models

      • 4 teacher models
        • classification: trained from scratch on ImageNet
        • detection: trained from scratch on Objects365
        • segmentation: trained from scratch on COCO
        • depth estimation: fine-tuned from a pre-trained checkpoint
      • pseudo labeling
        • unlabeled / partially labeled datasets
        • for detection: hard score threshold of 0.5
        • for segmentation: hard score threshold of 0.5
        • for classification: soft labels, i.e. the predicted probability distribution
        • for depth: use the predictions directly
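
      • a rough Python sketch of these pseudo-labeling rules; function names and output formats are illustrative assumptions, not the paper's code:

```python
# Turn teacher predictions into pseudo labels: hard 0.5 score threshold for
# detection/segmentation, soft probability targets for classification, and
# depth predictions used as-is.
import numpy as np

SCORE_THRESHOLD = 0.5  # hard threshold for detection and segmentation

def pseudo_label_detection(boxes, scores):
    """Keep only boxes whose confidence clears the hard threshold."""
    return boxes[scores >= SCORE_THRESHOLD]

def pseudo_label_segmentation(masks, scores):
    """Same hard thresholding for predicted segmentation masks."""
    return masks[scores >= SCORE_THRESHOLD]

def pseudo_label_classification(probs):
    """Soft labels: keep the full predicted probability distribution."""
    return probs

def pseudo_label_depth(depth_map):
    """Depth predictions are used directly as regression targets."""
    return depth_map

# toy example: only the high-confidence box survives
boxes = np.array([[0, 0, 10, 10], [5, 5, 20, 20]])
scores = np.array([0.9, 0.3])
print(pseudo_label_detection(boxes, scores))
```
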
    • Multi-Task Student Model

      • model architecture

        • shared backbone
          • C5: for classification
          • feature pyramid {P3, P4, P5, P6, P7}: for detection
          • fused P2: for pixel-wise prediction; rescale the pyramid levels to level 2 and sum them
        • heads
          • classification head: ResNet design, GAP on C5 + a linear layer
          • object detection heads: Mask R-CNN design, RPN with 2 hidden convs, Fast R-CNN with 4 hidden convs + 1 fc
          • pixel-wise prediction heads: 3 hidden convs + 1 linear conv head; segmentation and depth estimation use independent heads rather than sharing one
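
      • a much-simplified PyTorch sketch of this shared-backbone + task-heads layout; the tiny backbone, channel sizes and fusion level are illustrative assumptions standing in for the paper's ResNet-FPN, and the Mask R-CNN-style detection heads are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackbone(nn.Module):
    """Produces a small feature pyramid; stands in for ResNet-FPN."""
    def __init__(self, ch=64):
        super().__init__()
        self.stem = nn.Conv2d(3, ch, 3, stride=4, padding=1)
        self.c3 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)  # stride 8
        self.c4 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)  # stride 16
        self.c5 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)  # stride 32

    def forward(self, x):
        x = F.relu(self.stem(x))
        p3 = F.relu(self.c3(x))
        p4 = F.relu(self.c4(p3))
        p5 = F.relu(self.c5(p4))
        return {"P3": p3, "P4": p4, "P5": p5, "C5": p5}

class MultiTaskStudent(nn.Module):
    def __init__(self, ch=64, num_classes=1000, num_seg_classes=21):
        super().__init__()
        self.backbone = TinyBackbone(ch)
        # classification head: global average pooling on C5 + a linear layer
        self.cls_head = nn.Linear(ch, num_classes)
        # pixel-wise heads: segmentation and depth do NOT share parameters
        self.seg_head = nn.Conv2d(ch, num_seg_classes, 1)
        self.depth_head = nn.Conv2d(ch, 1, 1)

    def fused_features(self, feats):
        """Rescale the pyramid levels to one resolution and sum them
        (the paper fuses at level P2; here we fuse at the coarser P3)."""
        target = feats["P3"].shape[-2:]
        return sum(F.interpolate(feats[k], size=target, mode="bilinear",
                                 align_corners=False)
                   for k in ("P3", "P4", "P5"))

    def forward(self, images):
        feats = self.backbone(images)
        pooled = feats["C5"].mean(dim=(2, 3))  # global average pooling
        fused = self.fused_features(feats)
        return {
            "classification": self.cls_head(pooled),
            "segmentation": self.seg_head(fused),
            "depth": self.depth_head(fused),
        }

model = MultiTaskStudent()
out = model(torch.randn(2, 3, 128, 128))
print({k: tuple(v.shape) for k, v in out.items()})
```
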
      • Teacher-student training

        • using the same architecture
        • same data augmentation
        • the main difference between teacher and student is the dataset and the labels
      • Learning From Multiple Teachers

        • every image has supervision for all tasks
        • labels may come from supervised or pseudo labels
        • e.g. with ImageNet images, classification uses the ground-truth labels while det/seg/depth supervision comes from pseudo labels
        • balance the loss contribution
          • weighted sum with task-specific weights
          • for ImageNet, use $w_i = \frac{b_s \cdot lr_{it}}{b_{it} \cdot lr_s}$, where $b_s, lr_s$ are the student's batch size and learning rate and $b_{it}, lr_{it}$ are those used to train teacher $i$
          • this follows the linear scaling rule: learning rate proportional to batch size
          • except for depth loss
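          • worked example (made-up hyperparameters, not from the paper): if the student uses $b_s = 256$, $lr_s = 0.1$ and teacher $i$ was trained with $b_{it} = 128$, $lr_{it} = 0.2$, then $w_i = \frac{256 \times 0.2}{128 \times 0.1} = 4$, i.e. task $i$'s loss is up-weighted by 4 in the total $\sum_i w_i L_i$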
      • Cross Dataset Training

        • training across ImageNet, Objects365 and COCO
        • use ground-truth labels where available and pseudo labels otherwise; supervised labels and pseudo labels are treated equally rather than being sampled and trained on separately
        • balance the datasets: merge them and sample uniformly
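
        • a toy Python sketch of this merge-and-sample-uniformly scheme; the dataset contents are made-up stand-ins:

```python
# Merge the datasets into one pool and sample images uniformly; ground-truth
# and pseudo labels are stored in the same fields and treated identically.
import random

random.seed(0)

datasets = {
    "ImageNet": [{"image": f"in_{i}", "classification": "gt",
                  "detection": "pseudo", "segmentation": "pseudo",
                  "depth": "pseudo"} for i in range(5)],
    "Objects365": [{"image": f"o365_{i}", "detection": "gt",
                    "classification": "pseudo", "segmentation": "pseudo",
                    "depth": "pseudo"} for i in range(5)],
    "COCO": [{"image": f"coco_{i}", "segmentation": "gt",
              "classification": "pseudo", "detection": "pseudo",
              "depth": "pseudo"} for i in range(5)],
}

# one merged pool; the source of each label is not distinguished during training
pool = [example for examples in datasets.values() for example in examples]

def sample_batch(pool, batch_size=4):
    """Uniform sampling over the merged pool balances the datasets."""
    return random.sample(pool, batch_size)

print([example["image"] for example in sample_batch(pool)])
```
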
      • Transfer Learning

        • after obtaining the general student model, fine-tune it on a series of downstream tasks
        • these downstream datasets are not aligned with the data the MuST model was trained on
        • the experiments show that a supervised model (e.g. a teacher) and the self-training model (the student trained on pseudo labels) transfer comparably well to downstream tasks [note ⚠️: this does not hold when the transfer datasets are aligned with the pretraining data; in that case supervised pretraining is clearly better!]
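
        • a minimal PyTorch sketch of the transfer step, with a toy backbone standing in for the MuST student's pretrained ResNet-FPN and a hypothetical classification downstream task:

```python
import torch
import torch.nn as nn

class DownstreamModel(nn.Module):
    """Pretrained backbone + a freshly initialized task head."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                      # weights come from the MuST student
        self.head = nn.Linear(feat_dim, num_classes)  # new, randomly initialized

    def forward(self, x):
        return self.head(self.backbone(x))

# toy stand-in; a real run would load the MuST-pretrained backbone weights here
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
model = DownstreamModel(backbone, feat_dim=128, num_classes=10)

# fine-tune all parameters end-to-end on the downstream dataset
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)        # toy downstream batch
labels = torch.randint(0, 10, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```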