[pre-training] Rethinking ImageNet Pre-training, Kaiming He: ImageNet pre-training does not really help final accuracy, it only speeds up convergence; random initialization can reach results that are no worse, provided the data is sufficient and the augmentation strong enough. For small-data setups this changes little, and the speedup is still what we want.
[pre-training & self-training] Rethinking Pre-training and Self-training, Google Brain: task-specific pseudo labels beat whatever labels come out of pre-training, but this again presumes piles of data, so it is of little use for small-data setups; in the low-data regime pre-training remains the safe choice.
Overall, both papers examine the value of ImageNet pre-training across tasks:
- for classification itself, pre-trained weights are still worth using
- Kaiming's paper mainly states a fact and offers little practical guidance
- Google's paper goes one step further and proposes self-training as something worth trying under realistic conditions
Rethinking Pre-training and Self-training
Motivation
- given fact:ImageNet pre-training has limited impact on COCO object detection
- investigate self-training to utilize the additional data
Arguments
- common pre-training practices
- supervised pre-training
- requires labeled data in the first place
- pre-train the backbone on ImageNet as a classification task
- weakly supervised pre-training
- with pseudo/noisy labels
- Kaiming: Exploring the Limits of Weakly Supervised Pretraining
- self-supervised pre-training
- huge amounts of unlabeled data
- construct a learning objective: autoencoder, contrastive, …
- https://zhuanlan.zhihu.com/p/108906502
- self-training paradigm on COCO
- train an object detection model on COCO
- generate pseudo labels on ImageNet
- the human-labeled and pseudo-labeled data are combined to train a new model
- essentially follows the Noisy Student approach (see the sketch after this list)
- observations
- with stronger data augmentation, pre-training hurts the accuracy, but helps in self-training
- both supervised and self-supervised pre-training methods fail
- the benefit of pre-training does not cancel out the gain by self-training
- flexible about unlabeled data sources, model architectures and computer vision tasks
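A minimal, framework-agnostic sketch of the paradigm above. `train_fn`, `predict_fn`, and the dataset objects are hypothetical stand-ins for the actual detector code; only the teacher → pseudo-label → student flow and the 0.5 score threshold come from the paper.

```python
# Sketch of the self-training (Noisy Student style) loop described above.
from typing import Callable, Iterable, List, Tuple

Example = Tuple[object, object]  # (image, annotation)

def self_training(
    labeled: List[Example],                 # e.g. COCO with human labels
    unlabeled_images: Iterable[object],     # e.g. ImageNet / OpenImages images
    train_fn: Callable[[List[Example]], object],                  # returns a trained detector
    predict_fn: Callable[[object, object], Tuple[object, float]], # returns (boxes, max score)
    score_thresh: float = 0.5,
) -> object:
    # 1) train a teacher on the human-labeled data
    teacher = train_fn(labeled)

    # 2) pseudo-label the unlabeled images, keeping only confident predictions
    pseudo_labeled: List[Example] = []
    for img in unlabeled_images:
        boxes, score = predict_fn(teacher, img)
        if score >= score_thresh:
            pseudo_labeled.append((img, boxes))

    # 3) retrain a student on human labels + pseudo labels combined
    return train_fn(labeled + pseudo_labeled)
```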
Method
- data augmentation
- vary the strength of data augmentation across 4 levels
pre-training
- EfficientNet-B7
- AutoAugment weights & Noisy Student weights
self-training
- Noisy Student scheme
- experiments found that self-training with the standard loss function can be unstable
- so a loss normalization technique is implemented (sketched below)
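The notes do not record the exact formula, so the snippet below is only a plausible sketch of a loss-normalization idea: rescale the pseudo-label loss by the ratio of running averages of the two losses so their magnitudes stay comparable. The class name, `alpha`, and the momentum value are assumptions, not the paper's exact choices.

```python
import torch

class NormalizedLoss:
    """Combine supervised and pseudo-label losses with a stabilizing rescaling."""

    def __init__(self, alpha: float = 2.0, momentum: float = 0.99):
        self.alpha = alpha
        self.momentum = momentum
        self.avg_human = None   # running average of the supervised loss
        self.avg_pseudo = None  # running average of the pseudo-label loss

    def _update(self, avg, value):
        value = value.detach()
        return value if avg is None else self.momentum * avg + (1 - self.momentum) * value

    def __call__(self, loss_human: torch.Tensor, loss_pseudo: torch.Tensor) -> torch.Tensor:
        self.avg_human = self._update(self.avg_human, loss_human)
        self.avg_pseudo = self._update(self.avg_pseudo, loss_pseudo)
        # Scale the pseudo loss so its magnitude tracks the supervised loss;
        # the averages are detached, so gradients still flow through both losses.
        scale = self.avg_human / (self.avg_pseudo + 1e-8)
        return loss_human + self.alpha * scale * loss_pseudo
```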
- experimental settings
- object detection
- COCO dataset for supervised learning
- unlabeled ImageNet and OpenImages datasets for self-training: score threshold 0.5 to generate pseudo labels
- RetinaNet & SpineNet
- batch composition: half supervised, half pseudo-labeled (see the batch-mixing sketch after this list)
- semantic segmentation
- PASCAL VOC 2012 for supervised learning
- augmented PASCAL & COCO & ImageNet for self-training: score threshold 0.5 to generate pseudo masks & multi-scale inference
- NAS-FPN
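A small sketch of the 50/50 batch composition used for self-training above; the round-robin cycling and the function name are assumptions, the notes only say that each batch is roughly half supervised and half pseudo-labeled.

```python
import itertools
from torch.utils.data import DataLoader

def mixed_batches(supervised_ds, pseudo_ds, batch_size: int, collate_fn=None):
    """Yield (supervised_batch, pseudo_batch) pairs, each half of the batch."""
    half = batch_size // 2
    sup = DataLoader(supervised_ds, batch_size=half, shuffle=True,
                     drop_last=True, collate_fn=collate_fn)
    pse = DataLoader(pseudo_ds, batch_size=half, shuffle=True,
                     drop_last=True, collate_fn=collate_fn)
    # Cycle the pseudo-labeled loader so one epoch is defined by the supervised set.
    for sup_batch, pse_batch in zip(sup, itertools.cycle(pse)):
        # Compute the supervised loss on sup_batch and the pseudo-label loss on
        # pse_batch, then combine them (e.g. with the normalized loss above).
        yield sup_batch, pse_batch
```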
Experiments
pre-training
- Pre-training hurts performance when stronger data augmentation is used: presumably because it sharpens the mismatch between the pre-training data and the target data?
- More labeled data diminishes the value of pre-training: our usual experimental datasets are small relative to ImageNet, so in theory pre-training should not hurt in that regime?
self-supervised pre-training hurts in the same way once the augmentation is strengthened
self-training
Self-training helps in high data/strong augmentation regimes, even when pre-training hurts: at every augmentation level, self-training still improves the final result
Self-training works across dataset sizes and is additive to pre-training: it adds gains at every data size, with the low-data regime enjoying the biggest gain
discussion
- the weak performance of pre-training comes from pre-training not being aware of the task of interest, so it can fail to adapt
- joint training also helps: it addresses the mismatch between the two datasets
- noisy labeling is worse than targeted pseudo labeling
Overall conclusion: with small sample sizes pre-training still helps, and self-training adds a further boost on top; with enough data, go straight to self-training
Rethinking ImageNet Pre-training
- Motivation
- comparing random initialization with pre-training
- ImageNet pre-training
- speed up
- but not necessarily improving
- random initialization
- can achieve no worse result
- robust to data size, models, tasks and metrics
- rethink the current paradigm of 'pre-training and fine-tuning'
- Arguments
- no fundamental obstacle preventing us from training from scratch
- if normalization techniques are used appropriately
- if trained sufficiently long
- pre-training
- speed up
- when fine-tuning on a small dataset, new hyper-parameters must be selected to avoid overfitting
- localization-sensitive tasks benefit only marginally from pre-training
- aimed at communities that don’t have enough data or computational resources
- Method
- normalization
- form
- normalized parameter initialization
- normalization layers
- BN layers make training from scratch difficult
- small batch sizes degrade the accuracy of BN
- fine-tuning can simply keep BN frozen, which sidesteps the small-batch problem
- alternatives
- GN: insensitive to batch size (a swap-in sketch follows this Method section)
- SyncBN
- with appropriately normalized initialization, even BN-free networks such as VGG can be trained from scratch
- convergence
- the pre-trained model has already learned low-level features that do not need to be re-learned during fine-tuning
- training from random initialization needs more iterations to learn both low-level and semantic features
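As a concrete illustration of the GN alternative above, a small PyTorch sketch that swaps every BatchNorm2d for GroupNorm; the group count of 32 is the common default, not a value prescribed by the paper.

```python
import torch.nn as nn

def replace_bn_with_gn(module: nn.Module, num_groups: int = 32) -> nn.Module:
    """Recursively replace BatchNorm2d with GroupNorm (batch-size independent)."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.GroupNorm(num_groups, child.num_features))
        else:
            replace_bn_with_gn(child, num_groups)
    return module

# Example: prepare a randomly initialized backbone for training from scratch.
# model = replace_bn_with_gn(torchvision.models.resnet50(weights=None))
```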
- Experiments
- investigate Mask R-CNN
- replace BN with GN / SyncBN
- learning rate:
- training longer for the first (large) learning rate is useful
- but training longer at small learning rates often leads to overfitting (see the schedule sketch after this list)
- above roughly 10k COCO images, train-from-scratch results can catch up with the pre-trained results, as long as training runs long enough
- with 1k and 3.5k COCO images, training from scratch converges to a training loss that is no worse, but validation results are worse: strong overfitting due to lack of data
- PASCAL results are also a bit worse, because it has fewer instances and categories and is not directly comparable to the same number of COCO images: fewer instances and categories have a similar negative impact to insufficient training data
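A hedged sketch of the "extend only the first (large) learning-rate stage" idea: keep the reduced-LR stages at a fixed length and pour all extra iterations into the initial stage. The concrete iteration counts and the per-iteration `MultiStepLR` usage are illustrative, not the paper's exact recipe.

```python
from torch.optim.lr_scheduler import MultiStepLR

def from_scratch_schedule(optimizer, total_iters: int,
                          mid_stage: int = 60_000, final_stage: int = 20_000):
    # The two 10x LR drops always cover the last (mid_stage + final_stage)
    # iterations; training longer only stretches the first, large-LR phase.
    milestones = [total_iters - mid_stage - final_stage, total_iters - final_stage]
    return MultiStepLR(optimizer, milestones=milestones, gamma=0.1)  # call .step() per iteration
```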