GHM

families:

  • [class-imbalanced CE]
  • [focal loss]
  • [generalized focal loss] a continuous version of focal loss (CE)
  • [ohem]

keras implementation:

import tensorflow as tf
from keras import backend as K

def weightedCE_loss(y_true, y_pred):
    # class-imbalanced CE: fixed alpha reweighting of positives vs. negatives
    alpha = .8
    pt = K.abs(y_true - y_pred)
    # clip
    pt = K.clip(pt, K.epsilon(), 1 - K.epsilon())
    # ce
    ce = -K.log(1. - pt)
    # pos/neg reweight
    wce = tf.where(y_true > 0.5, alpha * ce, (1 - alpha) * ce)
    return wce

def focal_loss(y_true, y_pred):
    alpha = .25
    gamma = 2

    pt = K.abs(y_true - y_pred)
    # clip
    pt = K.clip(pt, K.epsilon(), 1 - K.epsilon())
    # easy/hard reweight
    fl = -K.pow(pt, gamma) * K.log(1. - pt)
    # pos/neg reweight
    fl = tf.where(y_true > 0.5, alpha * fl, (1 - alpha) * fl)
    return fl

def generalized_focal_loss(y_true, y_pred):
    # CE = -yt*log(yp) - (1-yt)*log(1-yp)
    # GFL = |yt-yp|^beta * CE
    beta = 2
    # clip y_pred
    y_pred = K.clip(y_pred, K.epsilon(), 1 - K.epsilon())
    # ce
    ce = -y_true * K.log(y_pred) - (1 - y_true) * K.log(1 - y_pred)  # [N,C]
    # easy/hard reweight
    gfl = K.pow(K.abs(y_true - y_pred), beta) * ce
    return gfl

def ce_ohem(y_true, y_pred):
    # assumes the resulting ce is a 1-D vector of per-sample losses
    pt = K.abs(y_true - y_pred)
    # clip
    pt = K.clip(pt, K.epsilon(), 1 - K.epsilon())
    # ce
    ce = -K.log(1. - pt)
    # keep only the k largest losses (online hard example mining)
    k = 50
    ohem_loss, indices = tf.nn.top_k(ce, k=k)  # top-k losses: [k,], top-k indices: [k,]
    mask = tf.where(ce >= ohem_loss[k - 1], tf.ones_like(ce), tf.zeros_like(ce))
    return mask * ce
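
A hedged usage sketch of how any of the losses above plugs into model.compile as a custom Keras loss; the toy single-layer sigmoid model below is made up purely for illustration.

from keras import models, layers

# toy binary classifier, purely illustrative
inputs = layers.Input(shape=(64,))
outputs = layers.Dense(1, activation='sigmoid')(inputs)
model = models.Model(inputs, outputs)

# any of the loss functions above can be passed directly as a custom loss
model.compile(optimizer='adam', loss=focal_loss)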

Gradient Harmonized Single-stage Detector

  1. Motivation

    • one-stage detector
      • the core challenge is the imbalance issue
      • imbalance between positives and negatives
      • imbalance between easy and hard examples
      • both can be attributed to their effect on the gradient: a term of the gradient
    • we propose a novel gradient harmonizing mechanism (GHM)
      • balance the gradient flow
      • easy to embed in cls/reg losses like CE/smoothL1
      • GHM-C for anchor classification
      • GHM-R for bounding box refinement
    • shows substantial improvement on COCO
      • 41.6 mAP
      • surpasses FL by 0.8
  2. Key arguments

    • imbalance issue

      • easy and hard:

        • OHEM
        • directly abandons examples
        • leads to insufficient training
      • positive and negative

        • focal loss
        • has two hyperparameters that are tied to the data distribution
        • not adaptive
      • positives are usually both the minority and the hard samples, and both imbalances can be reduced to an uneven gradient distribution

        • a huge number of samples each contribute only a tiny gradient, typically the large pool of easy negatives; in aggregate they can still steer the gradient (left figure)
        • hard samples outnumber medium ones; we usually regard them as outliers, since they persist even after the model has converged and can hurt its stability (left figure)
        • the goal of GHM is to keep the gradient contributions of different samples in harmony; compared with CE and FL, the total contribution of both easy samples and outliers is downweighted, which is more harmonious (right figure)
    • we propose gradient harmonizing mechanism (GHM)

      • keep the gradient contribution of different samples in harmony
      • first study the gradient density, group samples by their gradient norm, and reweight them accordingly
      • design the GHM-C loss for classification and the GHM-R loss for bounding-box regression
      • verified on COCO
        • GHM-C is much better than CE and slightly better than FL
        • GHM-R also outperforms smooth L1
        • attains SOTA
      • dynamic loss: adapts to each mini-batch
  3. Method

    • Problem Description

      • define gradient norm $g = |p - p^*|$

      • the distribution of g from a converged model

        • easy samples are extremely numerous, on a different order of magnitude from the rest, and would dominate the global gradient
        • even a converged model cannot handle some very hard samples; their gradients differ considerably from the others, they are not rare, and they can mislead the model (a short derivation of $g$ follows below)
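
For a sigmoid output $p = \sigma(x)$ with target $p^* \in \{0,1\}$, $g$ is exactly the magnitude of the cross-entropy gradient w.r.t. the logit $x$:

$L_{CE} = -p^* \log p - (1-p^*) \log(1-p)$
$\frac{\partial L_{CE}}{\partial x} = p - p^*, \qquad g = \left| \frac{\partial L_{CE}}{\partial x} \right| = |p - p^*|$
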
    • Gradient Density

      • define gradient density $GD(g) = \frac{1}{l_{\epsilon}(g)} \sum_{k=1}^{N} \delta_{\epsilon}(g_k, g)$
        • given a gradient norm value $g$
        • count the samples whose gradient norm falls in the window centered at $g$ with bandwidth $\epsilon$
        • then normalize by the valid length of that window, $l_{\epsilon}(g)$
      • define the gradient density harmonizing parameter $\beta_i = \frac{N}{GD(g_i)}$
        • $N$ is the total number of samples
        • it is simply inversely proportional to the density
        • samples in high-density regions get downweighted (see the sketch below)
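
A minimal NumPy sketch of the density and harmonizing parameter, assuming a flat array of per-example gradient norms and simplifying the valid length $l_{\epsilon}(g)$ to a constant $\epsilon$; the names harmony_weights, g, eps are mine, not from the paper.

import numpy as np

def harmony_weights(g, eps=0.1):
    # g: [N] per-example gradient norms in [0, 1)
    N = len(g)
    # GD(g_i): number of samples within eps/2 of g_i, normalized by the bandwidth
    gd = np.array([np.sum(np.abs(g - gi) <= eps / 2) for gi in g]) / eps
    # beta_i = N / GD(g_i): samples in dense regions get downweighted
    return N / gd

# toy usage: a dense cluster of easy samples plus a sparse tail of outliers
g = np.concatenate([np.random.uniform(0.0, 0.05, 1000),
                    np.random.uniform(0.9, 1.0, 20)])
beta = harmony_weights(g)  # small weights for the dense cluster, larger for the tail
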
    • GHM-C Loss

      • use the harmonizing parameter as a per-sample loss weight on top of an existing loss

        • note that FL mainly suppresses easy samples (based on the per-sample loss), while GHM suppresses both ends (based on the sample density)
        • it ultimately harmonizes the total gradient contribution of the different density groups
        • dynamic w.r.t. the mini-batch: makes training more efficient and robust
      • Unit Region Approximation

        • partition the gradient norm range [0,1] into M unit regions
        • each region has width $\epsilon = \frac{1}{M}$
        • the number of samples falling into a region is denoted $R_{ind(g)}$, where $ind(g)$ is the index of the region that $g$ falls into
        • the approximate gradient density: $\hat{GD}(g) = \frac{R_{ind(g)}}{\epsilon} = R_{ind(g)} M$
        • approximate harmonizing parameter & loss: $\hat{\beta}_i = \frac{N}{\hat{GD}(g_i)}$, $L_{GHM\text{-}C} = \frac{1}{N} \sum_{i=1}^{N} \hat{\beta}_i L_{CE}(p_i, p_i^*)$

          • we can attain good performance with a quite small M
          • samples within the same density region can be processed in parallel, so the overall complexity is O(MN) (see the sketch after the EMA notes below)

      • EMA

        • the statistics of a single mini-batch can be noisy
        • so stabilize the estimate by accumulating history: SGD with momentum and BN both use EMA
        • since samples in the same region share the same gradient density, we apply EMA to the sample count of each region
          • t-th iteration
          • j-th region
          • we have the count $R_j^t$
          • apply EMA: $S_j^t = \alpha S_j^{t-1} + (1-\alpha) R_j^t$
          • $\hat{GD}(g) = S_{ind(g)} M$
        • this makes the gradient density smoother and insensitive to extreme data
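
Putting the pieces together, a minimal sketch of a GHM-C style loss in the same Keras/TF style as the snippets above, assuming binary sigmoid outputs and M = 30 unit regions; the name ghm_c_loss is mine, and the EMA over the per-region counts is omitted here (it needs a stateful variable), so the raw per-batch counts R_j are used instead.

import tensorflow as tf
from keras import backend as K

def ghm_c_loss(y_true, y_pred, M=30):
    # gradient norm g = |p - p*| for sigmoid cross-entropy
    g = K.clip(K.abs(y_true - y_pred), 0., 1. - K.epsilon())
    N = tf.cast(tf.size(g), tf.float32)
    # unit region index ind(g) and per-region sample counts R_j (eps = 1/M)
    idx = tf.cast(tf.floor(g * M), tf.int32)
    R = tf.cast(tf.math.bincount(tf.reshape(idx, [-1]), minlength=M), tf.float32)
    # approximate density GD(g) = R_ind(g) * M, harmonizing weight beta = N / GD(g)
    beta = N / tf.maximum(tf.gather(R, idx) * M, 1.)
    beta = tf.stop_gradient(beta)  # the density weights are treated as constants
    # standard binary CE, reweighted per sample and averaged over N
    p = K.clip(y_pred, K.epsilon(), 1. - K.epsilon())
    ce = -y_true * K.log(p) - (1. - y_true) * K.log(1. - p)
    return K.mean(beta * ce)
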
    • GHM-R loss

      • smooth L1:

        • the transition point is usually set to $\frac{1}{9}$
        • in the linear part the derivative of SL1 is a constant, so examples cannot be distinguished by their gradient
        • and using $|d|$ as the gradient norm is problematic because its range is unbounded (up to infinity)
      • so first modify smooth L1 into the Authentic Smooth L1: $ASL1(d) = \sqrt{d^2+\mu^2} - \mu$

        • $\mu=0.02$
        • its gradient then falls exactly in $[0,1)$
      • define the gradient norm as $gr = |\frac{d}{\sqrt{d^2+\mu^2}}|$

        • inspecting a converged model's gradient norm distribution for ASL1, a considerable fraction of samples are outliers

        • reweight with the gradient density in the same way

        • at convergence, the gradient contribution of different kinds of samples:

          • regression is computed only over positive samples, so the reweighting mainly downweights the outliers
          • one insight here: in the regression task not all easy samples are unimportant; in classification most easy samples are trivial background, but an easy sample in the regression branch is a foreground box that still deviates from the ground truth and is therefore still worth optimizing
          • so GHM-R mainly upweights the important part of the easy samples and downweights the outliers (see the sketch below)
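
A matching sketch for the regression branch, assuming d = y_true - y_pred are the box offset errors; asl1 and ghm_r_loss are my names, with mu = 0.02 as above and the count EMA again omitted.

import tensorflow as tf
from keras import backend as K

def asl1(d, mu=0.02):
    # Authentic Smooth L1: sqrt(d^2 + mu^2) - mu, gradient d/sqrt(d^2+mu^2) in [0, 1)
    return K.sqrt(d * d + mu * mu) - mu

def ghm_r_loss(y_true, y_pred, M=30, mu=0.02):
    d = y_true - y_pred
    # gradient norm gr = |d / sqrt(d^2 + mu^2)|
    gr = K.clip(K.abs(d / K.sqrt(d * d + mu * mu)), 0., 1. - K.epsilon())
    N = tf.cast(tf.size(gr), tf.float32)
    # unit region counts and density-based reweighting, as in GHM-C
    idx = tf.cast(tf.floor(gr * M), tf.int32)
    R = tf.cast(tf.math.bincount(tf.reshape(idx, [-1]), minlength=M), tf.float32)
    beta = tf.stop_gradient(N / tf.maximum(tf.gather(R, idx) * M, 1.))
    return K.mean(beta * asl1(d, mu))
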
  4. Experiments