pre-training & self-training

[pre-training] Rethinking ImageNet Pre-training, Kaiming He: ImageNet pre-training does not really help final accuracy, it only speeds up convergence; random initialization can reach results that are no worse, provided the data is plentiful and the augmentation is strong enough. This is of little use to small teams with limited data, for whom the speedup is exactly what matters.

[pre-training & self-training] Rethinking Pre-training and Self-training, Google Brain: argues that task-specific pseudo labels beat the generic labels produced by pre-training, again provided you can pile on data; it is of little use to small teams either, and in the low-data regime pre-training is still the safe bet.

Overall, both papers examine what ImageNet pre-training is worth when transferring across tasks:

  • for classification, pre-training is still worth using
  • Kaiming's paper mainly states a fact and offers little practical guidance
  • Google's paper goes one step further and proposes self-training as something worth trying under realistic conditions

Rethinking Pre-training and Self-training

  1. Motivation

    • given fact: ImageNet pre-training has limited impact on COCO object detection
    • investigate self-training to utilize the additional data
  2. Arguments

    • common pre-training practices
      • supervised pre-training
        • requires labeled data in the first place
        • pre-train the backbone on ImageNet as a classification task
      • weakly supervised pre-training
        • with pseudo/noisy labels
        • Kaiming et al.: Exploring the Limits of Weakly Supervised Pretraining
      • self-supervised pre-training
    • self-training paradigm on COCO
      • train an object detection model on COCO
      • generate pseudo labels on ImageNet
      • the human-labeled and pseudo-labeled data are combined to train a new model
      • essentially the Noisy Student approach (see the sketch after this section)
    • observations
      • with stronger data augmentation, pre-training hurts accuracy, whereas self-training still helps
      • both supervised and self-supervised pre-training methods fail to help in this regime
      • the gain from self-training is not cancelled out by pre-training; the two are additive
      • self-training is flexible with respect to unlabeled data sources, model architectures and computer vision tasks
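
A minimal sketch of the self-training loop described above, as illustrative PyTorch/torchvision code rather than the paper's TensorFlow pipeline; `train_detector`, `coco_loader` and `unlabeled_loader` are hypothetical placeholders:

```python
# Illustrative Noisy-Student-style self-training for detection (not the
# paper's code). train_detector, coco_loader and unlabeled_loader are
# assumed to exist; only the overall flow is shown.
import torch
from torchvision.models.detection import retinanet_resnet50_fpn

SCORE_THRESH = 0.5  # keep only confident pseudo boxes (threshold from the notes)

@torch.no_grad()
def generate_pseudo_labels(teacher, unlabeled_loader, device="cuda"):
    """Run the teacher on unlabeled images and keep boxes with score > 0.5."""
    teacher.eval()
    pseudo = []
    for images in unlabeled_loader:                      # batches of image tensors
        images = [img.to(device) for img in images]
        for img, out in zip(images, teacher(images)):    # out: boxes/labels/scores
            keep = out["scores"] > SCORE_THRESH
            if keep.any():
                pseudo.append((img.cpu(), {"boxes": out["boxes"][keep].cpu(),
                                           "labels": out["labels"][keep].cpu()}))
    return pseudo

# 1) teacher: a detector trained on the human-labeled data (COCO)
teacher = retinanet_resnet50_fpn(num_classes=91).to("cuda")
train_detector(teacher, coco_loader)

# 2) pseudo-label the extra unlabeled data (ImageNet / OpenImages)
pseudo_data = generate_pseudo_labels(teacher, unlabeled_loader)

# 3) student: a fresh detector trained on human + pseudo labels, with strong
#    augmentation (the "noise" in Noisy Student is applied to the student)
student = retinanet_resnet50_fpn(num_classes=91).to("cuda")
train_detector(student, coco_loader, extra_data=pseudo_data, strong_aug=True)
```
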
  3. Method

    • data augmentation
      • vary the strength of data augmentation across 4 levels
    • pre-training

      • EfficientNet-B7
      • AutoAugment weights & Noisy Student weights

    • self-training

      • Noisy Student scheme
      • experiments find that self-training with the standard combined loss can be unstable
      • a loss normalization technique is implemented to stabilize it (see the sketch after this section)
    • experimental settings
      • object detection
        • COCO dataset for supervised learning
        • unlabeled ImageNet and OpenImages datasets for self-training: score threshold 0.5 to generate pseudo labels
        • RetinaNet & SpineNet
        • batches: half supervised, half pseudo-labeled
      • semantic segmentation
        • PASCAL VOC 2012 for supervised learning
        • augmented PASCAL & COCO & ImageNet for self-training: score threshold 0.5 (with multi-scale inference) to generate pseudo masks
        • NAS-FPN
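
For the loss normalization and half/half batch mixing above, a rough, assumption-laden sketch (the paper's exact formula may differ): the pseudo-label loss is rescaled by the running ratio of the supervised loss to the pseudo loss so the two terms stay on a comparable scale; `model`, `human_batch` and `pseudo_batch` are placeholders.

```python
# Hedged sketch of combining the supervised (COCO) loss with the pseudo-label
# loss. Each batch is half human-labeled, half pseudo-labeled, and the pseudo
# term is rescaled by a moving-average ratio; this is one plausible reading of
# the "loss normalization" mentioned above, not the paper's exact recipe.
import torch

class LossNormalizer:
    def __init__(self, alpha=1.0, momentum=0.99):
        self.alpha = alpha          # weight on the pseudo-label term
        self.momentum = momentum    # smoothing for the running loss averages
        self.avg_human = None
        self.avg_pseudo = None

    def _ema(self, avg, value):
        value = value.detach()
        return value if avg is None else self.momentum * avg + (1 - self.momentum) * value

    def __call__(self, loss_human, loss_pseudo):
        self.avg_human = self._ema(self.avg_human, loss_human)
        self.avg_pseudo = self._ema(self.avg_pseudo, loss_pseudo)
        scale = self.avg_human / (self.avg_pseudo + 1e-8)
        return loss_human + self.alpha * scale * loss_pseudo

# usage inside one training step (model / batches are placeholders):
normalizer = LossNormalizer(alpha=1.0)
loss_h = model(*human_batch)     # detection loss on the human-labeled half
loss_p = model(*pseudo_batch)    # same loss on the pseudo-labeled half
loss = normalizer(loss_h, loss_p)
loss.backward()
```
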
  4. Experiments

    • pre-training

      • Pre-training hurts performance when stronger data augmentation is used: perhaps because strong augmentation sharpens the differences between the two data distributions?
      • More labeled data diminishes the value of pre-training: in practice our labeled data is usually a small fraction of ImageNet, so in theory pre-training should not hurt us?
      • self-supervised pre-training hurts in the same way once the augmentation is strengthened

    • self-training

      • Self-training helps in high data/strong augmentation regimes, even when pre-training hurts: at every augmentation level, self-training adds to the final result

      • Self-training works across dataset sizes and is additive to pre-training: the gain holds at every data size, with the low-data regime enjoying the biggest gain

    • discussion

      • the weak performance of pre-training comes from pre-training not being aware of the task of interest, so it can fail to adapt
      • joint training also helps: it addresses the mismatch between the two datasets
      • noisy labeling is worse than targeted pseudo labeling
    • Overall takeaway: with small sample sizes, pre-training still adds value and self-training improves things further on top; with plenty of data, go straight to self-training

Rethinking ImageNet Pre-training

  1. Motivation
    • rethink random initialization vs. pre-training
    • ImageNet pre-training
      • speed up
      • but does not necessarily improve final accuracy
    • random initialization
      • can achieve results that are no worse
      • robust to data size, models, tasks and metrics
    • rethink current paradigm of ‘pre-training and fine-tuning’
  2. Arguments
    • no fundamental obstacle preventing us from training from scratch
      • if use normalization techniques appropriately
      • if train sufficiently long
    • pre-training
      • speed up
      • when fine-tuning on a small dataset, new hyper-parameters must be selected to avoid overfitting
      • localization-sensitive tasks benefit little from pre-training
      • pre-training is still aimed at communities that don’t have enough data or computational resources
  3. Method
    • normalization
      • form
        • normalized parameter initialization
        • normalization layers
      • BN layers make training detectors from scratch difficult
        • small batch sizes degrade the accuracy of BN statistics
        • fine-tuning can simply freeze BN
        • alternatives (see the sketch after this section)
          • GN: insensitive to batch size
          • SyncBN
      • with appropriately normalized parameter initialization, even networks without BN layers (e.g., VGG) can be trained from scratch
    • convergence
      • a pre-trained model has already learned low-level features that do not need to be re-learned during fine-tuning
      • training from random initialization needs more iterations to learn both low-level and semantic features
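
A minimal sketch of the normalization alternatives above, assuming a torchvision ResNet-50 backbone (illustrative only, not the paper's setup):

```python
# Swap BatchNorm for GroupNorm so that training from scratch is not hurt by
# the small per-GPU batches typical of detection; SyncBN is the alternative
# that keeps BN but computes statistics across all GPUs.
import torch.nn as nn
from torchvision.models import resnet50

def group_norm(num_channels):
    # 32 groups is the common default from the GroupNorm paper
    return nn.GroupNorm(32, num_channels)

# GroupNorm variant: per-sample statistics, insensitive to batch size
backbone_gn = resnet50(norm_layer=group_norm)

# SyncBN variant: convert the ordinary BN layers to synchronized BN
backbone_syncbn = nn.SyncBatchNorm.convert_sync_batchnorm(resnet50())
```
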
  4. Experiments
    • investigate Mask R-CNN
      • replace BN with GN / SyncBN
      • learning rate (see the schedule sketch at the end of this section):
        • training longer for the first (large) learning rate is useful
        • but training for longer on small learning rates often leads to overfitting
    • from about 10k COCO images upward, train-from-scratch results can catch up with pre-training results, as long as training runs long enough
    • with 1k and 3.5k COCO images, training converges no worse, but validation results are worse: strong overfitting due to lack of data
    • PASCAL results are also a bit worse; with fewer instances and categories per image it is not directly comparable to the same number of COCO images: fewer instances and categories have a similar negative impact as insufficient training data
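
A small sketch of the learning-rate observation above (hypothetical iteration counts, standard PyTorch scheduler): when training from scratch, stretch the first, large-LR stage and keep the later small-LR stages short, since training longer at small learning rates tends to overfit.

```python
# Illustrative step schedule for training from scratch: the large learning
# rate is kept for ~90% of the iterations, and the decayed rates only get a
# short tail. Iteration counts here are made up for illustration.
import torch
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(8, 8)    # stand-in for the actual detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)

total_iters = 270_000            # e.g. a long "from scratch" schedule
scheduler = MultiStepLR(optimizer, milestones=[240_000, 260_000], gamma=0.1)

for it in range(total_iters):
    # ... forward / backward on a batch, then:
    optimizer.step()
    scheduler.step()
```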