Motivation
- to understand label smoothing
- improves generalization
- improves model calibration
- changes the representations learned by the penultimate layer of the network
- effect on knowledge distillation into a student network
- soft targets: a weighted average of the hard target and the uniform distribution over labels
Claims
- label smoothing implicitly calibrates the learned models
- makes confidences more interpretable: more aligned with the accuracies of the predictions
- label smoothing impairs distillation: if the teacher is trained with label smoothing, the student performs worse; this adverse effect results from loss of information in the logits
Method
modeling
- penultimate layer: fully-connected layer with activations $x$; the last-layer weights $w_k$ act as a template for each class
- $p_k = \frac{e^{x^T w_k}}{\sum_{l=1}^{K} e^{x^T w_l}}$
- outputs: trained with the cross-entropy loss between targets $y$ and predictions $p$
- $H(y,p)=\sum_{k=1}^K -y_k \log(p_k)$
- hard targets: $y_k$ is 1 for the correct class and 0 for the rest
- label smoothing: $y_k^{LS} = y_k(1-\alpha) + \alpha/K$
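A minimal numpy sketch of these definitions; the batch size, K = 5, and alpha = 0.1 below are illustrative values, not taken from the paper:

```python
import numpy as np

def softmax(logits):
    # p_k = e^{x^T w_k} / sum_l e^{x^T w_l}, computed from the logits x^T w
    z = logits - logits.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def smoothed_targets(labels, num_classes, alpha):
    # y_k^LS = y_k * (1 - alpha) + alpha / K
    y = np.eye(num_classes)[labels]                   # hard one-hot targets
    return y * (1.0 - alpha) + alpha / num_classes

def cross_entropy(targets, probs):
    # H(y, p) = -sum_k y_k log(p_k), averaged over the batch
    return -(targets * np.log(probs + 1e-12)).sum(axis=-1).mean()

# toy batch: 4 examples, K = 5 classes, alpha = 0.1
logits = np.random.randn(4, 5)
labels = np.array([0, 2, 1, 4])
p = softmax(logits)
print(cross_entropy(np.eye(5)[labels], p))                  # hard-target loss
print(cross_entropy(smoothed_targets(labels, 5, 0.1), p))   # label-smoothed loss
```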
visualization scheme
- project the penultimate-layer activation vector onto an orthonormal basis of the plane through the templates of three chosen classes, giving a 2-dim vector per example (see the projection sketch after this list)
- clusters are much tighter, because label smoothing encourages each training example to be equidistant from all the other classes' templates
- with 3 classes the projection shows a triangle structure, a direct consequence of this 'equidistant' constraint
- without label smoothing the projected values have much larger magnitudes, a sign of over-confidence
- semantically similar classes are harder to separate, but overall the cluster shapes are still tighter with label smoothing
- training without label smoothing, there is a continuous degree of change between two semantically similar classes; with label smoothing this is no longer observed: the semantic similarity between the classes is erased ('erasure of information')
- the two settings have similar accuracies despite qualitatively different clustering; the gain in classification accuracy is small, but the clusters look much cleaner
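One way the projection could be implemented, sketched in numpy: the plane is taken to be the one through the three chosen class templates, and the orthonormal basis comes from Gram-Schmidt (the basis construction is an assumption here; the paper only asks for an orthonormal basis of that plane).

```python
import numpy as np

def plane_basis(templates):
    # templates: (3, D) last-layer weight vectors of the three chosen classes;
    # Gram-Schmidt on the two difference vectors gives an orthonormal basis
    # of the plane through the three templates
    d1 = templates[1] - templates[0]
    d2 = templates[2] - templates[0]
    b1 = d1 / np.linalg.norm(d1)
    d2 = d2 - (d2 @ b1) * b1
    b2 = d2 / np.linalg.norm(d2)
    return np.stack([b1, b2])              # (2, D)

def project(activations, basis):
    # activations: (N, D) penultimate-layer activations -> (N, 2) points to scatter-plot
    return activations @ basis.T

# toy stand-ins for real weights / activations (D and N are arbitrary here)
D, N = 64, 100
templates = np.random.randn(3, D)          # rows of the last-layer weight matrix w
acts = np.random.randn(N, D)               # penultimate-layer activations x
xy = project(acts, plane_basis(templates))
print(xy.shape)                            # (100, 2)
```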
model calibration
calibration: making the confidence of a network's predictions accurately represent their accuracy
metric: expected calibration error (ECE); a computation sketch follows at the end of this subsection
reliability diagram
label smoothing gives better calibration compared to the unscaled hard-target network
Despite trying to collapse the training examples into tiny clusters, these networks generalize and are calibrated: on the training set the clusters are very tight, since each example is encouraged to stay equidistant from the other classes' clusters, but on the test set the examples spread out instead of being confined to a tiny blob, which shows the network is not over-confident and represents the full range of confidences for each prediction
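A small numpy sketch of how ECE can be computed from the top-class confidences and per-example correctness; the 15 equal-width bins are a common default and an assumption here, not necessarily the paper's exact binning.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    # ECE = sum_b (|B_b| / n) * |acc(B_b) - conf(B_b)| over equal-width confidence bins
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()           # accuracy inside the bin
            conf = confidences[in_bin].mean()      # mean confidence inside the bin
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece

# toy example: confidence of the predicted class and whether the prediction was correct
conf = np.array([0.95, 0.80, 0.90, 0.60, 0.70])
hit = np.array([1, 1, 0, 1, 0], dtype=float)
print(expected_calibration_error(conf, hit))
```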
knowledge distillation
even when label smoothing improves the accuracy of the teacher network, teachers trained with label smoothing produce inferior student networks
As the representations collapse to small clusters of points, much of the information that could have helped distinguish examples is lost
looking at the training-set scatter, label smoothing tends to collapse the samples of a class into nearly identical representations, so the information that distinguishes individual samples is lost; therefore a teacher with better accuracy is not necessarily the one that distills better
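For reference, a minimal numpy sketch of a standard distillation objective; the temperature T, mixing weight beta, and toy logits are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, beta=0.5):
    # soft term: cross-entropy between teacher and student distributions at temperature T;
    # this is where the information carried in the teacher's logits matters
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft = -(p_t * np.log(p_s + 1e-12)).sum(axis=-1).mean()
    # hard term: ordinary cross-entropy against the true labels
    p = softmax(student_logits)
    hard = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return (1.0 - beta) * hard + beta * soft

# toy logits: a teacher trained with label smoothing tends to produce soft targets with
# smaller relative differences across the incorrect classes, so p_t carries less
# example-specific information for the student to learn from
student = np.random.randn(4, 5)
teacher = np.random.randn(4, 5)
print(distillation_loss(student, teacher, np.array([0, 1, 2, 3])))
```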