label smoothing

  1. Motivation

    • to understand label smoothing:
      • improves generalization
      • improves model calibration
      • changes the representations learned by the penultimate layer of the network
      • affects knowledge distillation into a student network
    • soft targets: a weighted mixture of the hard target and a uniform distribution over the other classes
  2. Claims

    • label smoothing implicitly calibrates the learned models
      • makes confidences more interpretable: more aligned with the accuracies of the model's predictions
      • label smoothing impairs distillation: if the teacher is trained with label smoothing, the student performs worse; this adverse effect results from loss of information in the logits
  3. Method

    • modeling

      • penultimate layer: activations $x$ fed into a fully connected layer with softmax
        • $p_k = \frac{e^{x^T w_k}}{\sum_{l=1}^{K} e^{x^T w_l}}$, where $w_k$ is the weight vector (template) of class $k$
      • output: cross-entropy loss
        • $H(y,p)=\sum_{k=1}^K -y_k\log(p_k)$
      • hard targets:$y_k$ is 1 for the correct class and 0 for the rest
      • label smoothing:$y_k^{LS} = y_k(1-\alpha)+ \alpha /K$
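
      A minimal sketch of the smoothed cross-entropy above, using NumPy; the function name and the α = 0.1 default are illustrative choices, not from the paper:

      ```python
      import numpy as np

      def smoothed_cross_entropy(logits, target, alpha=0.1):
          """H(y^LS, p) for one example: cross-entropy against smoothed targets.

          logits: (K,) array of final-layer scores x^T w_k
          target: index of the correct class
          """
          K = logits.shape[0]
          # softmax: p_k = exp(x^T w_k) / sum_l exp(x^T w_l)
          p = np.exp(logits - logits.max())
          p /= p.sum()
          # smoothed targets: y^LS_k = y_k (1 - alpha) + alpha / K
          y_ls = np.full(K, alpha / K)
          y_ls[target] += 1.0 - alpha
          return -np.sum(y_ls * np.log(p))
      ```
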
    • visualization scheme

      • project the high-dimensional penultimate-layer activation vector onto an orthonormal plane through the templates of three chosen classes, giving a 2-dimensional vector per example (see the sketch after this block)
      • clusters are much tighter because label smoothing encourages each training example to be equidistant from the templates of all the other classes

      • with 3 classes the projected clusters form a triangle, a consequence of the 'equidistant' property

      • without label smoothing the projected activations have much larger absolute values, reflecting over-confidence
      • semantically similar classes are harder to separate, but overall the cluster structure is still cleaner with label smoothing
      • when training without label smoothing there is a continuous degree of change between two semantically similar classes; with label smoothing this is no longer observed, i.e. the semantic similarity between the classes is lost ('erasure of information')
      • the models have similar accuracies despite qualitatively different clustering: the gain in classification accuracy is small, but the cluster structure is visibly cleaner
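
      A sketch of the projection step, assuming the final-layer weight rows serve as the class templates; the function name and the QR-based orthonormalization are my own choices:

      ```python
      import numpy as np

      def project_to_template_plane(activations, W, classes=(0, 1, 2)):
          """Project penultimate-layer activations onto the plane through the
          templates (final-layer weight rows) of three chosen classes.

          activations: (N, D) penultimate-layer features
          W:           (K, D) final-layer weights; row k is the template of class k
          returns:     (N, 2) plane coordinates, one 2-D point per example
          """
          t0, t1, t2 = (W[c] for c in classes)
          # two directions spanning the plane through the three templates
          directions = np.stack([t1 - t0, t2 - t0], axis=1)   # (D, 2)
          # orthonormal basis of that plane via QR
          q, _ = np.linalg.qr(directions)                      # (D, 2)
          return (activations - t0) @ q                        # (N, 2)
      ```
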
    • model calibration

      • calibration: making the confidence of the model's predictions more accurately represent their accuracy

      • metric: expected calibration error (ECE); see the sketch after this block

      • reliability diagram

      • label smoothing gives better calibration compared to the unscaled network (no temperature scaling)

      • Despite trying to collapse the training examples into tiny clusters, these networks generalize and are calibrated: on the training set the clusters are very tight, since label smoothing encourages every example to keep the same distance from the other classes' templates, but on the test set the examples spread out rather than staying in a tight blob, which shows the network is not over-confident and represents the full range of confidences for each prediction
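
      A sketch of ECE as referenced above: predictions are binned by confidence and the gap between accuracy and mean confidence is averaged over bins, weighted by bin size; the 15-bin default is a common convention, not a value from the paper:

      ```python
      import numpy as np

      def expected_calibration_error(confidences, correct, n_bins=15):
          """ECE over one evaluation set.

          confidences: (N,) max softmax probability per example
          correct:     (N,) bool, whether the top-1 prediction was right
          """
          edges = np.linspace(0.0, 1.0, n_bins + 1)
          ece = 0.0
          for lo, hi in zip(edges[:-1], edges[1:]):
              in_bin = (confidences > lo) & (confidences <= hi)
              if in_bin.any():
                  acc = correct[in_bin].mean()        # empirical accuracy in the bin
                  conf = confidences[in_bin].mean()   # mean confidence in the bin
                  ece += in_bin.mean() * abs(acc - conf)
          return ece
      ```
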

    • knowledge distillation

      • even when label smoothing improves the accuracy of the teacher network, teachers trained with label smoothing produce inferior student networks

      • As the representations collapse to small clusters of points, much of the information that could have helped distinguish examples is lost

      • looking at the training-set scatter, label smoothing tends to collapse each class's samples into nearly identical representations, so information about differences between samples is lost: therefore a teacher with better accuracy is not necessarily the one that distills better
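
      To make the 'information in the logits' point concrete, a sketch of a standard distillation objective; T and beta are illustrative hyperparameters, not values from the paper:

      ```python
      import numpy as np

      def softmax(z, T=1.0):
          """Softmax of logits z at temperature T."""
          z = np.asarray(z, dtype=float) / T
          e = np.exp(z - z.max())
          return e / e.sum()

      def distillation_loss(student_logits, teacher_logits, target, T=2.0, beta=0.5):
          """Hard-label cross-entropy mixed with cross-entropy against the teacher's
          temperature-softened distribution. A label-smoothed teacher pushes its
          soft targets toward uniform over the wrong classes, removing much of the
          inter-class similarity information the soft term relies on.
          """
          hard = -np.log(softmax(student_logits)[target])
          soft = -np.sum(softmax(teacher_logits, T) * np.log(softmax(student_logits, T)))
          return (1.0 - beta) * hard + beta * soft
      ```
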