Self-training with Noisy Student improves ImageNet classification
Motivation
- semi-supervised learning(SSL)
- a semi-supervised approach that works even when labeled data is abundant
- use unlabeled images to improve SOTA model
- improve self-training and distillation
- accuracy and robustness
- better acc, mCE, mFR
Arguments
- supervised learning requires a large corpus of labeled images to work well
- robustness
- noisy data: unlabeled images that do not belong to any category in ImageNet
- large margins on much harder test sets
- training process
- teacher
- EfficientNet model on labeled images
- student
- an equal or larger student model
- on labeled & pseudo labeled images
- noise, stochastic depth, data augmentation
- generalizes better
- process iteration
- by putting back the student as the teacher
- compared with the teacher
- the student improves in two ways
- it makes the student larger: the student also trains on more data (labeled + pseudo-labeled)
- the noised student is forced to learn harder: its labels include pseudo labels, its inputs go through various augmentations, and the network itself uses dropout / stochastic depth
- main difference compared with Knowledge Distillation
- Noisy Student uses noise, which KD does not
- Noisy Student uses an equal or larger student, whereas KD uses a smaller student for faster inference
- can be thought of as Knowledge Expansion
- give the student model enough capacity and a difficult environment
- the goal is for the student to be better than the teacher
Method
- algorithm
- train the teacher on labeled images
- use the teacher to run inference on unlabeled images, generating pseudo labels (soft or one-hot)
- train the student model on labeled & unlabeled images
- make the student the new teacher and go back to the inference step (see the sketch below)
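A minimal, self-contained sketch of these four steps on random data. Tiny MLPs stand in for EfficientNet-B7/L2 and Gaussian input noise stands in for RandAugment; every helper name here is hypothetical, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_model(width, dropout=0.0):
    # dropout plays the role of the model noise (dropout / stochastic depth)
    return nn.Sequential(nn.Linear(32, width), nn.ReLU(),
                         nn.Dropout(dropout), nn.Linear(width, 10))

def train(model, batches, noised=False, lr=1e-2):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()                                  # enables dropout for the student
    for x, y in batches:
        if noised:
            x = x + 0.1 * torch.randn_like(x)      # stand-in for RandAugment
        # F.cross_entropy accepts hard labels (indices) or soft labels (probs)
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

labeled = [(torch.randn(64, 32), torch.randint(0, 10, (64,))) for _ in range(20)]
unlabeled_x = [torch.randn(64, 32) for _ in range(40)]

# step 1: train the teacher on labeled images
teacher = train(build_model(width=64), labeled)

for _ in range(3):                                 # step 4: iterate
    # step 2: the un-noised teacher generates soft pseudo labels
    # (eval mode: no dropout, no augmentation)
    teacher.eval()
    with torch.no_grad():
        pseudo = [(x, F.softmax(teacher(x), dim=-1)) for x in unlabeled_x]
    # step 3: an equal-or-larger, noised student trains on labeled + pseudo data
    student = train(build_model(width=128, dropout=0.5), labeled + pseudo, noised=True)
    teacher = student                              # the student becomes the new teacher
```

Note the asymmetry: the teacher is clean when it produces pseudo labels, while the student always trains under noise.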
- noise
- enforcing invariances: the student network is required to predict the same label for an image under its various augmentations, ensuring consistency
- required to mimic a more powerful ensemble model: with dropout and stochastic depth, the teacher behaves like an ensemble at inference time (when it generates pseudo labels without noise), whereas the noised student behaves like a single model; this pushes the student to learn a model as strong as the ensemble
- other techniques
- data filtering
- filter out images on which the teacher model has low confidence
- such low-confidence images are usually out-of-domain, i.e., outside the training data distribution
- data balancing
- duplicate images in classes where there are not enough images
- take the images with the highest confidence when there are too many
- soft/hard pseudo labels
- both work
- soft is slightly better (filtering, balancing, and soft labels are sketched below)
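A hedged sketch of the filtering + balancing logic, using the confidence threshold (0.3) and per-class cap (130K) reported in the experiments below; `filter_and_balance` and its inputs are hypothetical, not the paper's code.

```python
from collections import defaultdict

def filter_and_balance(images, probs, threshold=0.3, per_class=130_000):
    """images: image ids; probs: teacher softmax vectors (lists of floats)."""
    by_class = defaultdict(list)
    for img, p in zip(images, probs):
        conf = max(p)
        if conf >= threshold:                             # filtering: drop low-confidence,
            by_class[p.index(conf)].append((conf, img))   # likely out-of-domain images
    balanced = {}
    for cls, items in by_class.items():
        items.sort(key=lambda t: t[0], reverse=True)      # most confident first
        if len(items) >= per_class:
            items = items[:per_class]                     # too many: keep highest confidence
        else:                                             # too few: duplicate existing images
            items = [items[i % len(items)] for i in range(per_class)]
        balanced[cls] = [img for _, img in items]
    return balanced
```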
Experiments
- dataset
- benchmark dataset: ImageNet 2012 ILSVRC
- unlabeled dataset: JFT
- filter & balance:
- use EfficientNet-B0
- trained on ImageNet, then run inference over JFT
- take images with confidence over 0.3
- at most 130K images per class (about 130M pseudo-labeled images in total)
- models
- EfficientNet-L2
- further scale up EfficientNet-B7
- wider & deeper
- lower resolution
- train-test resolution discrepancy
- first perform normal training at a smaller resolution for 350 epochs
- then fine-tune the model at a larger resolution for 1.5 epochs on unaugmented labeled images
- shallow layers are frozen during fine-tuning (see the sketch below)
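A minimal sketch of this high-resolution fine-tuning step, assuming the network exposes its shallow layers as `stem` and early `blocks` (hypothetical attribute names; the paper only states that shallow layers are fixed).

```python
import torch
import torch.nn.functional as F

def finetune_at_high_res(model, highres_loader, num_frozen_blocks=3, lr=1e-4):
    for p in model.stem.parameters():               # freeze the stem
        p.requires_grad_(False)
    for block in model.blocks[:num_frozen_blocks]:  # freeze shallow blocks
        for p in block.parameters():
            p.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(trainable, lr=lr)
    model.train()
    for x, y in highres_loader:                     # larger-resolution, unaugmented
        loss = F.cross_entropy(model(x), y)         # labeled images only
        opt.zero_grad(); loss.backward(); opt.step()
    return model
```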
- noise
- stochastic depth: survival probability 0.8 for the final layer, following the linear decay rule for the other layers
- dropout: rate 0.5 on the final layer
- RandAugment: two random operations with magnitude set to 27 (noise settings sketched below)
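The three noise sources with these hyperparameters might look as follows; torchvision's `RandAugment` is used as a stand-in (it may differ in detail from the paper's implementation), and the linear decay rule for stochastic depth is written out explicitly.

```python
import torch
from torchvision import transforms

# RandAugment: two random operations, magnitude 27
rand_augment = transforms.RandAugment(num_ops=2, magnitude=27)

# dropout 0.5 on the final layer
final_dropout = torch.nn.Dropout(p=0.5)

# stochastic depth: survival probability 0.8 at the final layer; the linear
# decay rule for layer l of L is p_l = 1 - (l / L) * (1 - p_L)
def survival_probs(num_layers, final_survival=0.8):
    return [1.0 - (l / num_layers) * (1.0 - final_survival)
            for l in range(1, num_layers + 1)]

print(survival_probs(5))  # ≈ [0.96, 0.92, 0.88, 0.84, 0.8]
```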
- iterative training
- 【teacher】first trained an EfficientNet-B7 on ImageNet
- 【student】then trained an EfficientNet-L2 with the unlabeled batch size set to 14 times the labeled batch size
- 【new teacher】trained a new EfficientNet-L2
- 【new student】trained an EfficientNet-L2 with the unlabeled batch size set to 28 times the labeled batch size (batch ratio sketched below)
- 【iteration】…
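A toy version of one combined training step with the 14:1 unlabeled-to-labeled batch ratio of the first L2 student (28:1 in the next round). The tensors, the linear model, and the unweighted loss sum are simplifying assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

student = torch.nn.Linear(32, 10)                        # stand-in for EfficientNet-L2
opt = torch.optim.SGD(student.parameters(), lr=1e-2)

labeled_bs, ratio = 64, 14
xl = torch.randn(labeled_bs, 32)                         # labeled batch
yl = torch.randint(0, 10, (labeled_bs,))                 # hard ground-truth labels
xu = torch.randn(labeled_bs * ratio, 32)                 # 14x larger unlabeled batch
yu = F.softmax(torch.randn(labeled_bs * ratio, 10), -1)  # teacher's soft pseudo labels

# one step over both batches; F.cross_entropy accepts hard or soft targets
loss = F.cross_entropy(student(xl), yl) + F.cross_entropy(student(xu), yu)
opt.zero_grad(); loss.backward(); opt.step()
```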
- robustness test
- difficult images (ImageNet-A)
- common corruptions and perturbations (ImageNet-C, ImageNet-P)
- FGSM adversarial attack
- metrics
- improves the top-1 accuracy
- reduces mean corruption error (mCE)
- reduces mean flip rate (mFR); both metrics are defined below
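For reference, mCE and mFR are the standard ImageNet-C / ImageNet-P metrics from Hendrycks & Dietterich, not defined in this paper: per-corruption errors are summed over severities, normalized by AlexNet's errors, and averaged over corruption types:

$$
\mathrm{CE}^{f}_{c} = \frac{\sum_{s=1}^{5} E^{f}_{s,c}}{\sum_{s=1}^{5} E^{\mathrm{AlexNet}}_{s,c}},
\qquad
\mathrm{mCE} = \frac{1}{|C|} \sum_{c \in C} \mathrm{CE}^{f}_{c}
$$

where $E^{f}_{s,c}$ is model $f$'s top-1 error under corruption $c$ at severity $s$; mFR is built analogously from AlexNet-normalized prediction flip rates over ImageNet-P's perturbation sequences.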
- ablation study
- noise
- without noising the student, once the student's predictions on unlabeled data exactly match the teacher's, the loss is 0 and learning stops, so the student cannot outperform the teacher
- injecting noise to the student model enables the teacher and the student to make different predictions
- student performance consistently drops when the noise functions are removed
- removing noise leads to a smaller drop in training loss, which suggests that the role of noise is not to prevent overfitting but to enhance the model
- iteration
- iterative training is effective in producing increasingly better models
- a larger batch size ratio is used for later iterations