DA-WSOL: object localization

Weakly Supervised Object Localization as Domain Adaption

  1. Motivation

    • Weakly supervised object localization (WSOL)
      • localizing objects only with the supervision of image-level classification label
      • previous methods use a classification structure and CAM to generate the localization score: CAM usually does not focus on the whole object, so localization is weak
    • our method
      • models the task as a domain adaption (DA) task
      • the score estimator is trained with image-level information but tested with pixel-level information
      • a DA-WSOL pipeline
        • a target sampling strategy
        • domain adaption localization (DAL) loss
  2. Arguments

    • why CAM underperforms

      • the core issue is domain shift: a classification architecture trained on a classification task optimizes image-level features, yet the target is the localization score, a pixel-level output, and no connection is established between the two
      • so the estimator ends up overfitting the source domain (i.e. the image-level target)
      • an intuitive idea: introduce DA to align the distributions of these two domains and avoid the overfitting that activates only the discriminative parts of objects
    • mechanisms

      • B: Multi-instance learning (MIL) based WSOL
        • classification architecture
        • driven by the class objective
        • strengthened via various data augmentations / CutMix
        • as I recall, the original CAM paper first trains a pure classification network (CNN+fc), then removes the head, switches to CNN+GAP+fc, and fine-tunes to obtain a network that produces CAMs (weighting the feature maps by the corresponding class weights); since this needs two-stage training, Grad-CAM (averaged gradients as weights, no retraining required) is usually used instead when a CAM is needed, with reportedly equivalent performance
      • C: Separated-structure based WSOL
        • an object classification task
        • an object localization task: pseudo labels, multi-stage
      • A: Domain Adaption
        • introduces DA to better assist the WSOL task: aligns the feature distributions between the source and target domains
        • end-to-end
        • a target sampling strategy
          • target domain features have a much larger scale than source domain features: clearly, under an image-level task the learned feature extractor mostly retains foreground features, whereas the pixel level also contains background and the like
          • the sampling aims to select source-related target samples & source-unseen target samples
        • domain adaption localization (DAL) loss
          • the two types of samples above are fed into this loss
          • source-related target samples: to solve the sample imbalance between the two domains
          • source-unseen target samples: viewed as the Universum to perceive target cues
  3. Method

    • Revisiting WSOL

      • task description: given an image $X \in R^{3 \times N}$ (3 channels, N pixels), decide whether each pixel $X_{:,i}$ belongs to a certain one of the K classes
      • a feature extractor f(·): extracts pixel-level features $Z = f(X) \in R^{C \times N}$
      • a score estimator e(·): estimates the pixel-level localization scores $Y = e(Z) \in R^{K \times N}$
      • in the fully supervised setting the pixel-level target Y is given directly, but in the weakly supervised setting we only have the image-level label, i.e. $y = (\max(Y_{1,:}), \max(Y_{2,:}), \dots, \max(Y_{K,:})) \in R^{K \times 1}$, the vector formed by the global max/avg value of each class's score map
      • an additional aggregator g(·): aggregates the feature map, converting pixel-level features into an image-level feature $z = g(Z) \in R^{C \times 1}$, e.g. GAP
      • z is then fed into the score estimator above: $y^* = e(z) \in R^{K \times 1}$
      • a classification loss supervises $y^*$ against $y$, so training is just a classification task
      • but at test time: the estimator is projected back onto the pixel-level features Z to predict the localization scores, which is exactly how the CAM is obtained (a minimal sketch follows below)
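
A minimal PyTorch sketch of this pipeline (the class name `WSOLHead` and the tensor shapes are my assumptions, not the paper's code): f(·) produces Z, g(·) is GAP, e(·) is a linear score estimator, and at test time e(·) is applied pixel-wise to yield the CAM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSOLHead(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.f = backbone                                       # f(·): x -> Z, (B, C, H, W)
        self.e = nn.Linear(feat_dim, num_classes, bias=False)   # e(·): score estimator

    def forward(self, x):
        Z = self.f(x)                 # pixel-level features, N = H*W pixels
        z = Z.mean(dim=(2, 3))        # g(·): GAP -> image-level feature z, (B, C)
        y_star = self.e(z)            # image-level score y*, (B, K)
        return Z, z, y_star

    @torch.no_grad()
    def cam(self, x):
        # Test time: project the estimator back onto the pixel-level features Z.
        Z, _, y_star = self.forward(x)
        k = y_star.argmax(dim=1)                    # predicted class per image
        w = self.e.weight[k]                        # (B, C) class weight vectors
        return F.relu(torch.einsum('bc,bchw->bhw', w, Z))  # localization score map
```

Training then only needs a cross-entropy loss between `y_star` and the image-level label.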
    • Modeling WSOL as Domain Adaption

      • for each sample X, build two feature sets S & T (see the sketch at the end of this sub-section)
        • source domain: $s = z = (g \circ f)(X)$
        • target domain: $\{t_1, t_2, \dots, t_N\} = \{Z_{:,1}, Z_{:,2}, \dots, Z_{:,N}\} = f(X)$
      • aim at minimizing the target risk without accessing the target label set (pixel-level mask), which can be decomposed into:
        • minimizing source risk
        • minimizing domain discrepancy
        • $L(S,Y^S,T) = L_{cls}(S,Y^S) + L_a(S,T)$
      • loss
        • L_cls is the standard classification loss, trained at the image level
        • L_a is the proposed adaption loss, which minimizes the discrepancy between S and T and forces f(·) and g(·) to learn domain-invariant features
        • so that e(·) performs similarly on S and on T
      • properties to notice

        • some samples exist in set T but not in set S, e.g. background, so the two sets cannot be forcibly aligned
        • the scales of the two distributions are highly imbalanced (1:N), so the S set is to some degree insufficient
        • the difference between the two distributions is known: it is exactly the aggregator g(·), i.e. prior knowledge
      • mechanism as in figure

        • initially the two distributions do not coincide: squares 1/2 are image-level features, circles 1/2 are pixel-level features, and the question-mark circles are bg patches on the pixel map
        • supervising with the classification loss alone, as in the CAM method, separates squares 1/2 and to some extent circles 1/2, but cannot localize the object precisely, because of the discrepancy between S and T, namely the bg patches
        • introducing the domain adaption loss tightens the two sets and aligns the two distributions, so the class boundary works equally well on squares 1/2 and circles 1/2
        • introducing the Universum regularization pushes the decision boundary into the Universum samples, so the classification boundary is also meaningful for background
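
A small sketch of how S and T could be built from one batch of feature maps, following the definitions above ($s = (g \circ f)(X)$, T = the pixel columns of Z); the (B, C, H, W) shape convention is an assumption.

```python
import torch

def build_domains(Z: torch.Tensor):
    """Z: pixel-level features f(X), shape (B, C, H, W)."""
    B, C, H, W = Z.shape
    # target set T: every pixel feature Z[:, :, i, j] is one target sample
    T = Z.flatten(2).permute(0, 2, 1).reshape(B * H * W, C)
    # source set S: one sample per image, s = g(f(X)) with g = GAP
    S = Z.mean(dim=(2, 3))
    return S, T
```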

    • Domain Adaption Localization Loss $L_a(\cdot)$

      • first further partition the target set T
        • the Universum $T^u$: samples containing no foreground class, i.e. bg samples
        • the fake $T^f$: samples highly correlated with the source-domain sample (the ones preserved by GAP)
        • the real $T^t$: the samples aside from the above two
      • recall the properties
        • the fake targets are highly correlated with the source domain precisely because of the GAP prior (property 3): we know they are the target samples retained in the GAP stage
        • so they can serve as supplementary source-domain samples to alleviate the insufficiency issue (property 2)
        • as for the unmatched data space (property 1), T minus the Universum then shares the same label space as S
      • based on these subsets, the overall DAL loss contains two parts (a minimal sketch follows after this list)
        • domain adaption loss $L_d$: UDA, unsupervised, aligns the distributions
        • Universum regularization $L_u$: a feature-based L1 regularization, the sum of absolute localization scores over all bg pixels; if they all lie on the classification boundary and belong to no foreground class, their localization scores are all 0 and the loss is 0
        • $L_a(S,T) = \lambda_1 L_d(S \cup T^f, T^t) + \lambda_2 L_u(T^u)$
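
A minimal sketch of this combined loss; the `discrepancy` callable stands in for any UDA distance (e.g. the Gaussian-kernel MMD sketched at the end of this section), and the mean reduction and default weights are assumptions.

```python
import torch

def universum_reg(scores_u: torch.Tensor) -> torch.Tensor:
    # L_u: L1 regularization on the localization scores e(t) of Universum
    # (bg) pixels; it vanishes iff they sit on the classification boundary.
    return scores_u.abs().mean()

def dal_loss(S, T_fake, T_real, scores_u, discrepancy,
             lambda1: float = 1.0, lambda2: float = 1.0):
    # Fake targets supplement the source set to ease the 1:N imbalance.
    src = torch.cat([S, T_fake], dim=0)
    return lambda1 * discrepancy(src, T_real) + lambda2 * universum_reg(scores_u)
```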
    • Target Sampling Strategy (this part is a bit unclear to me)

      • a target sample assigner(TSA)
      • a cache matrix $M \in R^{C \times (K+1)}$
        • column 0: represents the anchor of $T^u$
        • the remaining columns: represent the anchor of each class of $T^t$
        • intuitively, each column records the cluster centroid of its class
      • init

        • column 0: zero-init
        • the rest: initialized with the source vector $z + \epsilon$ when the first sample of that class is encountered
      • update

        • first obtain the class id from the image-level label: $k = \arg\max(y)$; note this uses the ground truth, not the prediction vector
        • then fetch the corresponding anchors from the cache matrix: $a^u = M_{:,0}, \ \ a^t = M_{:,k+1}$
        • then add the image-level feature z as the third initial cluster center: $C_{init} = \{a^u, a^t, z\} \in R^{C \times 3}$
        • run K-means on the current target samples with these three centers to obtain the three kinds of samples, which then feed the adaption loss
        • use the new cluster centers C to update the cache matrix via a weighted average, where the weight $r_k$ is the reciprocal of the number of images of the corresponding class

      • overall flow: init → per-image K-means assignment → cache update; a minimal sketch follows below
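
A minimal sketch of the TSA as I read it; the 10-iteration K-means, the noise scale, and the exact moving-average update are my assumptions, not the official implementation.

```python
import torch

class TSA:
    """Target sample assigner with a cache matrix M in R^{C x (K+1)}."""
    def __init__(self, C: int, K: int):
        self.M = torch.zeros(C, K + 1)   # col 0: Universum anchor; cols 1..K: class anchors
        self.seen = torch.zeros(K, dtype=torch.bool)
        self.count = torch.zeros(K)

    def assign(self, T: torch.Tensor, z: torch.Tensor, k: int) -> torch.Tensor:
        """T: (N, C) pixel features of one image; z: (C,) its image-level
        feature; k: ground-truth class id. Returns a 0/1/2 label per pixel."""
        if not self.seen[k]:             # lazy init of the class anchor with z + eps
            self.M[:, k + 1] = z + 1e-3 * torch.randn_like(z)
            self.seen[k] = True
        # initial centers {a^u, a^t, z}; samples near z become the fake targets T^f
        centers = torch.stack([self.M[:, 0], self.M[:, k + 1], z])
        for _ in range(10):              # plain K-means with the 3 centers
            label = torch.cdist(T, centers).argmin(dim=1)  # 0 -> T^u, 1 -> T^t, 2 -> T^f
            for j in range(3):
                if (label == j).any():
                    centers[j] = T[label == j].mean(dim=0)
        self.count[k] += 1
        r = 1.0 / self.count[k]          # weight r_k: reciprocal of #images of class k
        self.M[:, 0] = (1 - r) * self.M[:, 0] + r * centers[0]
        self.M[:, k + 1] = (1 - r) * self.M[:, k + 1] + r * centers[1]
        return label
```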

    • pipeline summary

      • first obtain S and T: f(·) is a classification backbone (ResNet/Inception), g(·) is global average pooling, and e(·) is a fully-connected layer applied to the source-domain feature vector to generate the image-level classification score, supervised with cross-entropy

      • then obtain the 3 target subsets from S, T, and the ground-truth label id

        • $T^u$ is used to compute $L_u$

        • $S \cup T^f$ and $T^t$ are used to compute $L_d$: MMD (Maximum Mean Discrepancy), where h(·) is a Gaussian kernel (a minimal sketch follows below)
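
A minimal sketch of a Gaussian-kernel MMD usable as $L_d$; the single fixed bandwidth is an assumption (multi-kernel variants are common).

```python
import torch

def gaussian_kernel(a: torch.Tensor, b: torch.Tensor, sigma: float) -> torch.Tensor:
    # h(·): Gaussian kernel between all pairs of rows of a and b
    return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))

def mmd(X: torch.Tensor, Y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased squared-MMD estimate between samples X (n, C) and Y (m, C)."""
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean())
```

Plugged into the earlier sketch as `dal_loss(S, T_fake, T_real, scores_u, discrepancy=mmd)`, this completes the loss side of the pipeline.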