Noisy Student

Self-training with Noisy Student improves ImageNet classification

  1. Motivation

    • semi-supervised learning (SSL)
    • a semi-supervised approach that works even when labeled data is abundant
    • use unlabeled images to improve SOTA model
    • improve self-training and distillation
    • accuracy and robustness
    • better accuracy, mCE, and mFR
    • teacher
      • EfficientNet model on labeled images
    • student
      • equal or larger student model
      • on labeled & pseudo labeled images
      • noise, stochastic depth, data augmentation
      • generalizes better
    • process iteration
      • by putting back the student as the teacher
  2. Arguments

    • supervised learning requires a large corpus of labeled images to work well
    • robustness
      • noisy data: unlabeled images that do not belong to any category in ImageNet
      • large margins on much harder test sets
    • training process
      • teacher
        • EfficientNet model on labeled images
      • student
        • equal or larger student model
        • on labeled & pseudo labeled images
        • noise, stochastic depth, data augmentation
        • generalizes better
      • process iteration
        • by putting back the student as the teacher
    • improves the student in two ways
      • it makes the student larger: there is more data to train on
      • the noised student is forced to learn harder: the labels include pseudo labels, the inputs go through various augmentations, and the network uses dropout/stochastic depth
    • main differences compared with Knowledge Distillation
      • uses noise, which KD does not
      • uses an equal or larger student, whereas KD uses a smaller student for faster inference
    • can be thought of as Knowledge Expansion
      • give the student model enough capacity and difficult environments (noise) to learn through
      • we want the student to be better than the teacher
  3. Method

    • algorithm (see the sketch below)
      • train a teacher on labeled images
      • use the teacher to run inference on unlabeled images, generating pseudo labels (soft or one-hot)
      • train a student model on labeled & pseudo-labeled images
      • make the student the new teacher and jump back to the inference step
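A minimal sketch of this loop; `make_model`, `train`, and `predict` are hypothetical helpers standing in for the real training and inference code, not anything from the paper:

```python
# Hypothetical helpers: make_model(size), train(model, data, noised), predict(model, x).
def noisy_student(labeled, unlabeled, iterations=3):
    # Step 1: train the teacher on labeled images.
    teacher = train(make_model(size="base"), labeled, noised=False)
    for _ in range(iterations):
        # Step 2: teacher infers pseudo labels (soft or one-hot) on unlabeled images.
        pseudo_labeled = [(x, predict(teacher, x)) for x in unlabeled]
        # Step 3: train an equal-or-larger student on labeled + pseudo-labeled
        # images, with noise (RandAugment, dropout, stochastic depth) enabled.
        student = train(make_model(size="larger"), labeled + pseudo_labeled, noised=True)
        # Step 4: the student becomes the new teacher; loop back to step 2.
        teacher = student
    return teacher
```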
    • noise (see the training-step sketch below)
      • enforcing invariances: the student must predict the same label for the data under various augmentations, ensuring consistency
      • required to mimic a more powerful ensemble model: with dropout and stochastic depth, the teacher behaves like an ensemble at inference time when it generates pseudo labels, whereas the noised student behaves like a single model; this pushes the student to learn a more powerful model
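A self-contained PyTorch sketch of one student step illustrating both points; the tiny MLPs stand in for EfficientNets and the Gaussian input noise stands in for RandAugment:

```python
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 10))
student = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Dropout(0.5), nn.Linear(128, 10))
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)

unlabeled = torch.randn(16, 32)                    # a batch of unlabeled inputs

teacher.eval()                                     # dropout off: the dropout-trained teacher
with torch.no_grad():                              # behaves like an averaged ensemble
    pseudo = teacher(unlabeled).softmax(dim=-1)    # clean soft pseudo labels

student.train()                                    # dropout on: the student acts like a
noisy = unlabeled + 0.1 * torch.randn_like(unlabeled)  # single ensemble member; input noise
log_probs = student(noisy).log_softmax(dim=-1)         # here stands in for RandAugment

# Consistency: predict the teacher's clean pseudo label from a noised input.
loss = -(pseudo * log_probs).sum(dim=-1).mean()
loss.backward()
optimizer.step()
```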
    • other techniques (see the filtering & balancing sketch below)
      • data filtering
        • filter out images on which the teacher model has low confidence
        • the remaining data then stays within the distribution of the training data
      • data balancing
        • duplicate images in classes where there are not enough images
        • take the images with the highest confidence when there are too many
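A sketch of both steps, assuming the teacher's predictions come as `(image, class, confidence)` triples; the 0.3 threshold and the 130K per-class cap are the values from the experiments section below:

```python
from collections import defaultdict
import itertools

def filter_and_balance(predictions, threshold=0.3, per_class=130_000):
    # Filtering: drop images the teacher is not confident about; what remains
    # stays close to the training distribution.
    by_class = defaultdict(list)
    for image, cls, conf in predictions:
        if conf > threshold:
            by_class[cls].append((image, cls, conf))
    # Balancing: cap over-represented classes at their most confident images,
    # duplicate under-represented classes up to the same size.
    balanced = []
    for items in by_class.values():
        items.sort(key=lambda t: t[2], reverse=True)
        if len(items) >= per_class:
            balanced.extend(items[:per_class])
        else:
            balanced.extend(itertools.islice(itertools.cycle(items), per_class))
    return balanced
```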
    • soft vs. hard pseudo labels (see the loss sketch below)
      • both work
      • soft labels work slightly better
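A toy comparison of the two label types, with random tensors standing in for real model outputs:

```python
import torch
import torch.nn.functional as F

teacher_logits = torch.randn(4, 1000)                      # stand-in teacher outputs
student_logits = torch.randn(4, 1000, requires_grad=True)  # stand-in student outputs

# Hard (one-hot) pseudo labels: train on the teacher's argmax class.
hard_labels = teacher_logits.argmax(dim=-1)
hard_loss = F.cross_entropy(student_logits, hard_labels)

# Soft pseudo labels: train on the teacher's full softmax distribution
# (slightly better in the paper).
soft_labels = teacher_logits.softmax(dim=-1)
soft_loss = -(soft_labels * student_logits.log_softmax(dim=-1)).sum(dim=-1).mean()
```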
  4. Experiments

    • dataset
      • benchmark dataset: ImageNet 2012 ILSVRC
      • unlabeled dataset: JFT
      • filtering & balancing:
        • use EfficientNet-B0
        • trained on ImageNet, inference over JFT
        • keep images with confidence over 0.3
        • at most 130K images per class (about 130M images in total)
    • models
      • EfficientNet-L2
        • further scaled up from EfficientNet-B7
        • wider & deeper
        • lower training resolution
        • train-test resolution discrepancy (see the finetuning sketch after this list)
          • first perform normal training at a smaller resolution for 350 epochs
          • then finetune the model at a larger resolution for 1.5 epochs on unaugmented labeled images
          • shallow layers are frozen during finetuning
      • noise
        • stochastic depth: survival probability 0.8 for the final layer, following the linear decay rule for other layers
        • dropout: rate 0.5 for the final layer
        • RandAugment:two random operations with magnitude set to 27
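A sketch of the finetuning stage of the resolution fix; the `nn.Sequential` stand-in and the "first half" split are illustrative, not the paper's exact layer split:

```python
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(8, 8) for _ in range(8)])  # stand-in for EfficientNet-L2

# Stage 1 (not shown): normal training at the smaller resolution for 350 epochs.
# Stage 2: freeze the shallow layers, then finetune at the larger resolution
# for 1.5 epochs on unaugmented labeled images.
blocks = list(model.children())
for block in blocks[: len(blocks) // 2]:   # "shallow layers"; the exact split is assumed
    for p in block.parameters():
        p.requires_grad = False
```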
    • iterative training (see the batch-ratio sketch after this list)
      • 【teacher】first train an EfficientNet-B7 on ImageNet
      • 【student】then train an EfficientNet-L2 with the unlabeled batch size set to 14 times the labeled batch size
      • 【new teacher】the trained EfficientNet-L2 becomes the new teacher
      • 【new student】train a new EfficientNet-L2 with the unlabeled batch size set to 28 times the labeled batch size
      • 【iteration】…
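A toy sketch of the batch-size ratio: per optimization step the unlabeled (pseudo-labeled) batch is simply `ratio` times larger than the labeled one; the data, model, and losses are stand-ins:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ratio, labeled_bs = 14, 4                     # ratio 28 for the second student iteration
labeled = TensorDataset(torch.randn(64, 32), torch.randint(0, 10, (64,)))
pseudo = TensorDataset(torch.randn(64 * ratio, 32), torch.randint(0, 10, (64 * ratio,)))

student = torch.nn.Linear(32, 10)
opt = torch.optim.SGD(student.parameters(), lr=0.1)
ce = torch.nn.CrossEntropyLoss()

for (x_l, y_l), (x_u, y_u) in zip(DataLoader(labeled, batch_size=labeled_bs),
                                  DataLoader(pseudo, batch_size=ratio * labeled_bs)):
    loss = ce(student(x_l), y_l) + ce(student(x_u), y_u)  # labeled + pseudo-labeled terms
    opt.zero_grad()
    loss.backward()
    opt.step()
```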
    • robustness test
      • difficult images (ImageNet-A)
      • common corruptions and perturbations (ImageNet-C, ImageNet-P)
      • FGSM adversarial attack (see the sketch below)
      • metrics
        • improves the top-1 accuracy
        • reduces mean corruption error (mCE)
        • reduces mean flip rate (mFR)
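A standard one-step FGSM sketch for the adversarial part of the robustness test; the epsilon default and the `[0, 1]` clamp are conventional choices, not the paper's exact setup:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon=2.0 / 255):
    """Perturb each input by epsilon along the sign of the loss gradient."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adv = images + epsilon * images.grad.sign()  # one signed-gradient step up the loss
    return adv.clamp(0.0, 1.0).detach()
```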
    • ablation study
      • noise
        • without noising the student, once the student's predictions on the unlabeled data exactly match the teacher's, the loss becomes 0 and learning stops, so the student cannot outperform the teacher (see the sketch after this list)
        • injecting noise into the student model enables the teacher and the student to make different predictions
        • the student's performance consistently drops when the noise functions are removed
        • removing noise also leads to a smaller drop in training loss, suggesting that the role of noise is not to prevent overfitting but to enhance the model
      • iteration
        • iterative training is effective in producing increasingly better models
        • a larger unlabeled-to-labeled batch size ratio is used in later iterations
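A toy check of the zero-loss argument above: if the un-noised student already reproduces the teacher's hard pseudo label with full confidence, the cross-entropy loss and its gradient both vanish, so nothing pushes the student beyond the teacher:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[20.0, -20.0, -20.0]], requires_grad=True)  # student ~= one-hot
pseudo_label = torch.tensor([0])                                   # teacher's hard label

loss = F.cross_entropy(logits, pseudo_label)
loss.backward()
print(loss.item())                     # ~0.0: the objective is already satisfied
print(logits.grad.abs().max().item())  # ~0.0: no learning signal remains
```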