papers:
[2019 MoCo v1] Momentum Contrast for Unsupervised Visual Representation Learning, Kaiming He (FAIR)
[2020 SimCLR] A Simple Framework for Contrastive Learning of Visual Representations, Google Brain. Included here because it improves on MoCo v1, and MoCo v2/v3 in turn build on its improvements
[2020 MoCo v2] Improved Baselines with Momentum Contrastive Learning, Kaiming He (FAIR)
[2021 MoCo v3] An Empirical Study of Training Self-Supervised Visual Transformers, Kaiming He (FAIR)
preview: Self-supervised Learning
reference:https://ankeshanand.com/blog/2020/01/26/contrative-self-supervised-learning.html
overview
- essentially unsupervised learning
- pain points it targets (vs. supervised model training)
- high annotation cost
- poor transferability
- based on the characteristics of the data, designs pretext tasks (generation/reconstruction being the most common) and constructs pseudo labels to train the network
- the resulting model is typically used as a pretrained model for other learning tasks
- regarded as a way to learn general-purpose visual representations of images
methods
Structurally, there are two main families of methods
- generative: reconstruct the input with an encoder-decoder; the supervision signal is to make input and output as similar as possible
- reconstruction is expensive
- no direct semantic learning is established
- adding a GAN discriminator makes the task even harder to train
- discriminative: take two images, encode them with an encoder, and the supervision signal is whether the two images are similar; discriminative methods are also called Contrastive Learning
Divided by pretext task, there are three main categories
- context based: e.g. BERT's MLM; randomly mask out part of a sentence/image and push the model to predict the masked part / relative positions from contextual/semantic information
- temporal based: e.g. BERT's NSP; for video/speech, exploit the similarity of adjacent frames, build differently ordered sequences, and predict whether B is the next sentence of A / whether two frames are adjacent
- contrastive based: compare positive and negative samples; the similarity-maximizing loss used here is called InfoNCE
memory-bank
- the most common contrastive setup builds positive and negative samples within a batch for contrastive learning
- end-to-end
- the two augmented views of each image in the mini-batch are positives for each other
- the dictionary size equals the mini-batch size
- memory bank: holds the encoded features of every sample in the dataset (see the sketch below)
- a random subset is sampled as keys
- each iteration only updates the encodings of the sampled entries
- since the stored encodings come from different training steps, their consistency is poor
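A minimal single-process sketch of the memory-bank mechanism described above (instance-discrimination style). The encoder, the bank/negative sizes, the 0.07 temperature and the representation-level momentum value are illustrative assumptions, not taken from any specific paper.

```python
import torch
import torch.nn.functional as F

N, DIM, M = 50000, 128, 0.5          # dataset size, feature dim, representation momentum (assumed)

# The memory bank stores one L2-normalized feature per dataset sample.
bank = F.normalize(torch.randn(N, DIM), dim=1)

def memory_bank_step(encoder, images, indices, num_negatives=4096, t=0.07):
    """One step: queries come from the encoder, keys are sampled from the bank."""
    q = F.normalize(encoder(images), dim=1)            # (B, DIM)
    pos = bank[indices]                                # stored features of the same samples
    neg = bank[torch.randint(0, N, (num_negatives,))]  # random bank entries as negatives

    l_pos = (q * pos).sum(dim=1, keepdim=True)         # (B, 1)
    l_neg = q @ neg.t()                                # (B, num_negatives)
    logits = torch.cat([l_pos, l_neg], dim=1) / t
    loss = F.cross_entropy(logits, torch.zeros(len(q), dtype=torch.long))

    # Only the sampled entries are refreshed (some variants use a momentum on the
    # *representation*), so bank entries come from different training steps ->
    # the consistency issue noted above.
    with torch.no_grad():
        bank[indices] = F.normalize(M * bank[indices] + (1 - M) * q, dim=1)
    return loss
```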
MoCo
- dynamic dictionary (queue): out-of-date encodings are dequeued
- momentum update: improves key consistency
InfoNCE
Proposed by DeepMind in CPC (Contrastive Predictive Coding); the CPC paper itself is left for a future note
- unsupervised
- encoder: encodes x into latent-space representations z (ResNet blocks)
- autoregressive model: summarizes each time step's set of {z} into a context representation c (GRUs)
probabilistic contrastive loss
- Noise-Contrastive Estimation
- Importance Sampling
The training objective is the mutual information between the input data x and the context vector c
- at each step, one positive is sampled from $p(x_{t+k}|c_t)$: the positive is what this sequence actually predicts next, so its similarity to c should be higher than that of unrelated tokens
- N-1 negatives are sampled from the proposal distribution $p(x_{t+k})$: negatives are randomly drawn from other sequences
The goal is high similarity between positives and the context, and low similarity for negatives (see the sketch below)
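A minimal sketch of InfoNCE as described above: score the positive against N-1 negatives given the context and apply a softmax cross-entropy with the positive as class 0. The plain dot-product score (CPC itself uses a log-bilinear score) and all shapes here are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(context, positive, negatives, temperature=0.1):
    """InfoNCE: identify the positive among N-1 negatives given the context.
    context:   (B, D)      context vectors c_t
    positive:  (B, D)      the sample drawn from p(x_{t+k} | c_t)
    negatives: (B, N-1, D) samples drawn from the proposal p(x_{t+k})"""
    l_pos = (context * positive).sum(-1, keepdim=True)               # (B, 1)
    l_neg = torch.bmm(negatives, context.unsqueeze(-1)).squeeze(-1)  # (B, N-1)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(context.size(0), dtype=torch.long)          # the positive sits at index 0
    return F.cross_entropy(logits, labels)                           # = -log softmax score of the positive

# toy usage: batch of 8 contexts, 64-d features, 15 negatives each
loss = info_nce(torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 15, 64))
```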
MoCo v1: Momentum Contrast for Unsupervised Visual Representation Learning
Motivation
- unsupervised visual representation learning
- contrastive learning
dynamic dictionary
- large
- consistent
verified on
- 7 downstream tasks
- ImageNet classification
- VOC & COCO det/seg
Key arguments
Unsupervised representation learning
- highly successful in NLP; in CV, supervised pretraining is still the mainstream
- two core components
- pretext tasks
- loss functions
- loss functions
- generative methods compute the loss between the prediction and a fixed target
- for contrastive methods, the key targets vary on the fly during training
- adversarial losses are not expanded on here
- pretext tasks
- tasks involving recovery: auto-encoders
- tasks involving pseudo labels: usually there is an exemplar/anchor, and a contrastive loss is computed against it
- contrastive learning VS pretext tasks
- many pretext tasks can be realized by designing an appropriate contrastive loss
recent approaches using contrastive loss
- dynamic dictionaries
- composed of keys: sampled from the data & represented by an encoder
- train the encoder to perform dictionary look-up
- given an encoded query
- similar to its matching key and dissimilar to others
desirable dictionary
- large: better sampling of the underlying visual space
- consistent: keys are encoded consistently, so the training targets stay consistent
MoCo:Momentum Contrast
- queue
- the encoded keys of each iteration's mini-batch are enqueued
- the oldest are dequeued
EMA:
- a slowly progressing key encoder
- momentum-based moving average of the query encoder
Definition of similar: q & k come from the same image
Method
contrastive learning
- an encoded query $q$
- a set of encoded samples $\{k_0, k_1, …\}$
- assume:there is a single key $k_+$ in the dictionary that $q$ matches
- similarity measurement:dot product
- InfoNCE:
- $L_q = -\log \frac{\exp(q\cdot k_+/\tau)}{\sum_{i=0}^{K} \exp(q\cdot k_i/\tau)}$
- 1 positive & K negative samples
- essentially a softmax-based (K+1)-way classifier that tries to classify $q$ as $k_+$
- unsupervised workflow
- with encoder networks $f_q$ & $f_k$
- thus we have query & sample representation $q=f_q(x^q)$ & $k=f_k(x^k)$
- inputs $x$ can be images/patches/context(patches set)
- $f_q$ & $f_k$ can be identical/partially shared/different
momentum contrast
dictionary as a queue
- the dictionary always represents a sampled subset of all data
- the current mini-batch is enqueued
- the oldest mini-batch is dequeued
momentum update
with a large dictionary, back-propagating through the key encoder for all keys is intractable: there are too many samples
only $f_q$ is updated by back-propagation (on the current mini-batch)
naive solution: copy $f_q$'s parameters into $f_k$; this yields poor results, because the key encoder then changes too rapidly, causing a representation-inconsistency issue
momentum update: $\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q$, with $m=0.999$ (see the sketch below)
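A minimal sketch of the momentum (EMA) parameter update; `f_q`/`f_k` are assumed to be two architecturally identical nn.Modules.

```python
import torch

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q; no gradient ever flows into f_k."""
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
```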
Comparison of the three update mechanisms
- 1) end-to-end:
- use samples in current mini-batch as the dictionary
- keys are consistently encoded
- dictionary size is limited
- 2) memory bank
- A memory bank consists of the representations of all samples in the dataset
- the dictionary for each mini-batch is randomly sampled from the memory bank; no back-propagation through it, which enables a large dictionary
- each key's representation was last updated the last time its sample was seen: inconsistent
- some variants also apply a momentum update, but on the representations rather than on the encoder parameters
- 3) MoCo: queue-based dictionary + momentum-updated key encoder, i.e. both large and consistent
pretext task
- define a positive pair: the query and the key come from the same image
- take two random views of an image under random augmentation to form a positive pair
- encode them with the respective encoders into q & k
- compute the similarity within each pair: the positive logit
- then compute the similarity between the queries and the dictionary keys: the negative logits
- compute the cross-entropy loss and update $f_q$
- momentum-update $f_k$ from $f_q$
- enqueue k into the dictionary
- dequeue the oldest mini-batch (see the one-step sketch below)
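A minimal single-GPU sketch of one MoCo iteration following the steps above, in the spirit of the paper's pseudocode. Shuffling BN and distributed details are omitted; the pre-augmented views `x_q`, `x_k` and the `(dim, K)` queue layout are assumptions.

```python
import torch
import torch.nn.functional as F

def moco_step(f_q, f_k, queue, x_q, x_k, optimizer, t=0.07, m=0.999, K=65536):
    """One MoCo iteration.
    x_q, x_k: two randomly augmented views of the same mini-batch of images.
    queue:    (dim, K) tensor of L2-normalized keys from previous mini-batches."""
    q = F.normalize(f_q(x_q), dim=1)                     # queries (B, dim): gradients flow
    with torch.no_grad():
        k = F.normalize(f_k(x_k), dim=1)                 # keys (B, dim): no gradient

    l_pos = (q * k).sum(dim=1, keepdim=True)             # positive logits (B, 1)
    l_neg = q @ queue                                    # negative logits (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / t
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    loss = F.cross_entropy(logits, labels)               # the positive is class 0

    optimizer.zero_grad()
    loss.backward()                                      # back-propagation updates f_q only
    optimizer.step()

    with torch.no_grad():                                # momentum update of the key encoder
        for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
            p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
        queue = torch.cat([queue, k.t()], dim=1)[:, -K:] # enqueue new keys, dequeue the oldest
    return loss.item(), queue
```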
Implementation details
- ResNet backbone: last fc outputs a 128-d feature, L2-normalized
- temperature $\tau=0.07$
- augmentation
- randomly resized image + random 224×224 crop
- random color jittering
- random horizontal flip
- random grayscale conversion
- shuffling BN
- experiments show that using the vanilla BN in ResNet leads to poor results: the hypothesis is that intra-batch communication (through the BN statistics) lets the model find a cheating low-loss solution
- concretely, the mini-batch fed to $f_k$ is first shuffled in order (across GPUs), forward-propagated, then shuffled back, so that the BN statistics seen by $f_q$ and $f_k$ are computed over different sub-batches (see the sketch below)
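A single-process sketch of shuffling BN. In MoCo the shuffle happens across GPUs, each computing its own BN statistics; here the per-GPU sub-batches are simulated by running `f_k` chunk by chunk, which is an assumption for illustration only.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def forward_key_with_shuffle_bn(f_k, x_k, num_chunks=8):
    """Shuffle the key batch, encode it in independent chunks (stand-ins for GPUs,
    so each chunk gets its own BN statistics), then restore the original order."""
    B = x_k.size(0)
    idx_shuffle = torch.randperm(B)              # random permutation of the batch
    idx_unshuffle = torch.argsort(idx_shuffle)   # indices that undo the permutation

    x_shuffled = x_k[idx_shuffle]
    keys = torch.cat([f_k(chunk) for chunk in x_shuffled.chunk(num_chunks)], dim=0)
    return F.normalize(keys[idx_unshuffle], dim=1)   # shuffled back: keys align with the queries again
```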
Experiments
SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
Motivation
- simplify recently proposed contrastive self-supervised learning algorithms
- systematically study the major components
- data augmentations
- learnable nonlinear projection head
- larger batch size and more training steps
outperform previous self-supervised & semi-supervised learning methods on ImageNet
Key arguments
discriminative approaches based on contrastive learning
- maximizing agreement between differently augmented views of the same data sample
- via a contrastive loss in the latent space
major components & conclusions
- data augmentation matters a lot; unsupervised learning benefits more from it than supervised learning
- a learnable nonlinear transformation between the representation and the contrastive loss improves representation quality
- the contrastive cross-entropy loss benefits from normalized embeddings and a properly adjusted temperature parameter
- larger batch sizes and more training steps matter; unsupervised learning benefits more than supervised learning
Method
common framework
4 major components
- random data augmentation
- results in two views of the same sample, which form a positive pair
- crop + resize back + color distortions + gaussian blur
- base encoder
- any network works; the paper uses ResNet, taking the output after the GAP as the representation
- a projection head
- maps the representation into the space where the contrastive loss is applied
- previous methods used a direct linear projection
- SimCLR uses an MLP with one hidden layer: fc-BN-ReLU-fc
- a contrastive loss
overall workflow
- random sample a minibatch of N
- random augmentation results in 2N data points
- for each sample there is 1 positive pair; the other 2(N-1) data points are all treated as negative samples
- cosine similarity: $\mathrm{sim}(u,v)=u^\top v/(\|u\|\,\|v\|)$
- given a positive pair $(i,j)$, the loss is $\ell_{i,j} = -\log \frac{\exp(s_{i,j}/\tau)}{\sum_{k=1,\,k\neq i}^{2N} \exp(s_{i,k}/\tau)}$
- computed over every positive pair, including both $(i,j)$ and $(j,i)$: the so-called symmetrized (NT-Xent) loss (see the sketch after this workflow)
update encoder
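A minimal sketch of the symmetrized NT-Xent loss above; the 0.5 temperature default and the toy shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """SimCLR's symmetrized NT-Xent loss for a batch of N positive pairs.
    z1, z2: (N, D) projections of the two augmented views; row i of z1 and
    row i of z2 form a positive pair, the other 2(N-1) points are negatives."""
    N = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)     # (2N, D); dot products = cosine similarity
    sim = z @ z.t() / temperature                          # (2N, 2N) similarity matrix s_{i,k}
    mask = torch.eye(2 * N, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))             # drop self-similarity from the denominator
    targets = torch.cat([torch.arange(N, 2 * N), torch.arange(N)]).to(z.device)
    return F.cross_entropy(sim, targets)                   # averages l(i,j) and l(j,i)

# toy usage: a batch of 4 positive pairs with 128-d projections
loss = nt_xent(torch.randn(4, 128), torch.randn(4, 128))
```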
training with large batch size
- batch size 8192, i.e. 16382 negatives per positive pair
- with a large batch, SGD with linear learning-rate scaling can be unstable, so the LARS optimizer is used
- global BN,aggregate BN mean & variance over all devices
- TPU
MoCo v2: Improved Baselines with Momentum Contrastive Learning
Motivation
- still working on contrastive unsupervised learning
- simple modifications on MoCo
- introduces two effective designs from SimCLR:
- an MLP head
- more data augmentation
- requires a much smaller batch size than SimCLR, making it feasible to run on a typical GPU setup
- verified on
- ImageNet classification
- VOC detection
Key arguments
- MoCo & SimCLR
- contrastive unsupervised learning frameworks
- MoCo v1 showed promising results
- SimCLR further reduced the gap to supervised pretraining
- two design improvements from SimCLR are found to work in both frameworks, and used in MoCo they show even better transfer learning results
- an MLP projection head
- stronger data augmentation
- meanwhile, the MoCo framework, unlike SimCLR, does not need large training batches
- SimCLR is based on the end-to-end mechanism and needs a fairly large batch size to provide enough negative pairs
- MoCo instead uses a dynamic queue, so it is not constrained by the batch size
- SimCLR
- improves the end-to-end method
- larger batch:to provide more negative samples
- output layer: replace the fc with an MLP head
- stronger data augmentation
- MoCo
- a large number of negative samples are readily available
- so the latter two designs (MLP head, stronger augmentation) are brought into MoCo
Method
MLP head
- 2-layer MLP(hidden dim=2048, ReLU)
- only affects unsupervised pre-training; for supervised transfer learning the head is replaced
temperature parameter: adjusted from the default 0.07 to the optimal value 0.2
augmentation
- add blur
- SimCLR additionally uses stronger color distortion: the authors found that SimCLR's stronger color distortion hurts MoCo, so it is not adopted (see the augmentation sketch below)
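A sketch of a MoCo v2-style two-view augmentation pipeline using torchvision (assumed >= 0.8 for `GaussianBlur`); the probabilities and blur kernel size are approximations of the public MoCo code, not exact values from the paper.

```python
from torchvision import transforms

# Crop, (weaker-than-SimCLR) color jitter, grayscale, blur, flip.
moco_v2_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(23, sigma=(0.1, 2.0))], p=0.5),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

class TwoCrops:
    """Apply the same augmentation pipeline twice to obtain a positive pair."""
    def __init__(self, transform):
        self.transform = transform
    def __call__(self, img):
        return self.transform(img), self.transform(img)
```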
Experiments
ablation
- MLP head: larger gain on classification than on detection
- augmentation: larger gain on detection than on classification
comparison
- large batches are not necessary for good accuracy: the longer-training, small-batch MoCo v2 even surpasses SimCLR's large-batch setting
- end-to-end methods are necessarily more costly in memory and time: gradients have to be back-propagated through both encoders
MoCo v3: An Empirical Study of Training Self-Supervised Visual Transformers
Motivation
- self-supervised frameworks based on Siamese networks, including MoCo
- ViT:study the fundamental components for training self-supervised ViT
- MoCo v3: an incremental improvement of MoCo v1/2, striking a better balance between simplicity, accuracy and scalability
- instability is a major issue
- scaling up ViT models
- ViT-Large
- ViT-Huge
Key arguments
- we go back to the basics and investigate the fundamental components of training deep neural networks
- batch size
- learning rate
- optimizer
- instability
- instability is a major issue that impacts self-supervised ViT training
- but it may not result in catastrophic failure (divergence); it only degrades accuracy
- hence it is called a hidden degradation
- use a simple trick to improve stability: freeze the patch projection layer in ViT
- and observe an increase in accuracy
- in NLP, frameworks based on masked auto-encoding work better than contrastive ones; in vision it is currently the other way around
Method
MoCo v3
- take two crops for each image under random augmentation
- encoded by two encoders $f_q$ & $f_k$ into vectors $q$ & $k$
- we use the keys that naturally co-exist in the same batch
- abandons the memory queue: with a sufficiently large batch size (4096), the memory queue brings little accuracy gain
- back to in-batch sample pairs
- but encoder $f_k$ still receives no gradients; it is momentum-updated from encoder $f_q$
symmetrized loss:
- $\mathcal{L} = \mathrm{ctr}(q_1, k_2) + \mathrm{ctr}(q_2, k_1)$
- InfoNCE
- temperature
- the two crops each take a turn as the query in ctr (see the sketch below)
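A sketch of the MoCo v3 symmetrized loss in the spirit of the paper's pseudocode; where the L2 normalization is applied and the in-batch positive layout (key i is the positive of query i) are assumptions here.

```python
import torch
import torch.nn.functional as F

def ctr(q, k, tau=0.2):
    """Contrastive loss with in-batch keys: the positive of query i is key i,
    all other keys in the batch serve as negatives."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau                                  # (N, N) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)         # positives on the diagonal
    return F.cross_entropy(logits, labels) * 2 * tau          # the 2*tau scaling mentioned below

def moco_v3_loss(q1, q2, k1, k2, tau=0.2):
    """Symmetrized loss: each crop serves as the query once."""
    return ctr(q1, k2, tau) + ctr(q2, k1, tau)
```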
encoder
- encoder $f_q$
- a backbone
- a projection head
- an extra prediction head
- encoder $f_k$
- a backbone
- a projection head
- encoder $f_k$ is updated by the moving average of $f_q$,excluding the prediction head
baseline acc
- basic settings; the two main changes are:
- the dynamic queue is replaced by a large batch
- the extra prediction head on encoder $f_q$
use ViT
Directly replacing the ResNet backbone with ViT runs into instability issues
batch size
A common argument for ViT is that the model itself is heavy, so a large batch is desirable
Empirical findings
- a batch of 1k & 2k produces reasonably smooth curves: in this regime, a larger batch improves accuracy thanks to more negative samples
- a batch of 4k shows clearly unstable dips
- a batch of 6k has worse failure patterns: interpreted as training being partially restarted at each dip, jumping out of the current local optimum
learning rate
- a small lr makes training more stable but tends to under-fit
- too large an lr causes instability, which also hurts accuracy
Overall, accuracy is ultimately governed by training stability
optimizer
- default: AdamW with batch size 4096
- some methods use LARS & LAMB for large-batch training
LAMB
- sensitive to lr
- at its optimal lr it achieves slightly better accuracy than AdamW
- but once the lr is too large, accuracy drops rapidly
- yet the training curves remain smooth despite the intermediate drop: interpreted as LAMB avoiding sudden changes in the gradients, but not their negative impact, which still accumulates
a trick for improving stability
- we found a spike in gradient causes a dip in the training curve
- we also observe that gradient spikes happen earlier in the first layer (patch projection)
So they try freezing the patch projection layer during training, i.e. using a fixed random patch projection layer (see the sketch after this block)
- this stability benefits the final accuracy
- the improvement is bigger for a larger lr
- it is also effective in other ViT-based frameworks (SimCLR, BYOL)
they also tried BN, WN, and gradient clipping on the patch projection
- BN/WN does not improve
- gradient clipping helps when the threshold is sufficiently small; pushed to the limit it is equivalent to freezing
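A minimal sketch of the freezing trick. `PatchEmbed` here is a stand-in for the ViT patch projection module (e.g. `patch_embed.proj` in timm-style implementations), not the paper's code; the point is simply that its random weights are never updated.

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal patch projection: a strided conv that maps 16x16 patches to embeddings."""
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # (B, 3, 224, 224) -> (B, 196, 768)
        return self.proj(x).flatten(2).transpose(1, 2)

patch_embed = PatchEmbed()
for p in patch_embed.parameters():              # the trick: keep the random projection fixed
    p.requires_grad = False
```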
implementation details
- AdamW
- batch size 4096
- lr: 40-epoch warmup, then cosine decay
MLP heads
- projection head: 3 layers, 4096-BN-ReLU-4096-BN-ReLU-256
- prediction head: 2 layers, 4096-BN-ReLU-256 (see the sketch below)
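A sketch of the two heads with the layer layout written above; the input dim 768 (ViT-B) is an assumption, and whether the final layer also carries a BN differs between implementations.

```python
import torch.nn as nn

def mlp_head(in_dim, hidden_dim=4096, out_dim=256, num_layers=3):
    """fc-BN-ReLU blocks ending in a plain fc, following the layout in the notes above."""
    layers, dim = [], in_dim
    for _ in range(num_layers - 1):
        layers += [nn.Linear(dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(inplace=True)]
        dim = hidden_dim
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)

projection_head = mlp_head(in_dim=768, num_layers=3)   # 768 -> 4096-BN-ReLU -> 4096-BN-ReLU -> 256
prediction_head = mlp_head(in_dim=256, num_layers=2)   # 256 -> 4096-BN-ReLU -> 256, on f_q only
```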
loss
- the ctr loss carries a scale factor of $2\tau$
- makes it less sensitive to $\tau$ value
- $\tau=0.2$
ViT architecture
- consistent with the original ViT paper
- the input is a 224×224 image, split into 16×16 or 14×14 patches, giving a sequence of 196 or 256 patch embeddings after the linear projection
- plus 2-D sine-cosine positional embeddings
- a class token is then concatenated
- passed through a stack of Transformer blocks
- the class token after the last block (and after the final LayerNorm) is treated as the output of the backbone, and is the input to the MLP heads