MoCo Series

papers:

[2019 MoCo v1] Momentum Contrast for Unsupervised Visual Representation Learning, Kaiming He et al. (FAIR)

[2020 SimCLR] A Simple Framework for Contrastive Learning of Visual Representations, Google Brain. Included here because it improves on MoCo v1, and MoCo v2/v3 in turn build on its ideas

[2020 MoCo v2] Improved Baselines with Momentum Contrastive Learning, Kaiming He and colleagues (FAIR)

[2021 MoCo v3] An Empirical Study of Training Self-Supervised Vision Transformers, Kaiming He and colleagues (FAIR)

preview: Self-supervised Learning

  1. reference:https://ankeshanand.com/blog/2020/01/26/contrative-self-supervised-learning.html

  2. overview

    • essentially unsupervised learning
    • pain points of supervised training that it addresses
      • high annotation cost
      • poor transferability
    • designs pretext tasks based on the characteristics of the data (generation/reconstruction being the most common) and constructs pseudo labels to train the network
    • the resulting model is usually used as a pre-trained model for other learning tasks
    • regarded as a way to learn general-purpose visual representations of images
  3. methods

    • structurally there are two broad families of methods

      • generative: reconstruct the input through an encoder-decoder structure; the supervision signal is making the output as similar to the input as possible
        • the reconstruction task is computationally expensive
        • it does not directly induce semantic learning
        • adding a GAN discriminator makes the task even more complex and harder to train
      • discriminative: take two images, encode them with an encoder, and supervise by judging whether the two images are similar; this discriminative family is also called contrastive learning

    • by pretext task there are three main categories

      • context based: e.g. BERT's MLM; randomly mask out part of a sentence/image and push the model to predict the missing part / the relative positions from the context or semantic information
      • temporal based: e.g. BERT's NSP, or video/audio; exploit the similarity of adjacent frames, build differently ordered sequences, and judge whether B is the next sentence after A / whether two frames are adjacent
      • contrastive based: compare positive and negative samples; the loss that maximizes similarity to the positive is called InfoNCE in this setting
  4. memory-bank

    • the most common contrastive setup builds positive and negative samples within a single batch
      • end-to-end
      • the two augmented views of each image in the mini-batch are each other's positive sample
      • the dictionary size equals the mini-batch size
    • a memory bank stores the encoded features of every sample in the dataset (see the sketch below)
      • a random subset is sampled as the keys
      • each iteration only updates the encodings of the sampled entries
      • because the stored encodings come from different training steps, consistency is poor
    • MoCo

      • dynamic dictionary: out-of-date encodings are dequeued
      • momentum update: improves consistency
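
A minimal sketch of the memory-bank mechanism described above, for illustration only: the bank caches one feature vector per dataset sample, negatives are drawn from it at random, and only the rows belonging to the current batch are refreshed (which is why consistency suffers). All names and sizes here are assumptions, not taken from any of the papers.

```python
import torch
import torch.nn.functional as F

num_samples, feat_dim, num_negatives = 50_000, 128, 4096
# one cached (possibly stale) feature per dataset sample
bank = F.normalize(torch.randn(num_samples, feat_dim), dim=1)

def memory_bank_step(indices, features, momentum=0.5):
    """indices: dataset indices of the current batch; features: their fresh encodings."""
    neg_idx = torch.randint(num_samples, (num_negatives,))   # random keys used as negatives
    negatives = bank[neg_idx]
    # ... a contrastive loss would be computed here between `features` and `negatives` ...
    # refresh only the rows seen in this batch; all other rows keep encodings from older steps
    bank[indices] = F.normalize(
        momentum * bank[indices] + (1 - momentum) * features.detach(), dim=1)
    return negatives

negs = memory_bank_step(torch.arange(8), torch.randn(8, feat_dim))
```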

  5. InfoNCE

    • proposed by DeepMind in CPC (Contrastive Predictive Coding); the paper itself is left for another time

      • unsupervised
      • encoder: encodes x into latent-space representations z (ResNet blocks)
      • autoregressive model: summarizes the per-time-step set of {z} into a context representation c (GRUs)
      • probabilistic contrastive loss

        • Noise-Contrastive Estimation
        • Importance Sampling

    • the training objective maximizes the mutual information between the input data x and the context vector c (a loss sketch follows this list)

      • sample one positive from $p(x_{t+k}|c_t)$: the positive is what the sequence predicts next, so its similarity to c should be higher than that of unrelated tokens
      • sample N-1 negatives from $p(x_{t+k})$: the negatives are randomly drawn from other sequences
      • the goal is high similarity between the positive and the context, and low similarity for the negatives
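
A minimal sketch of the InfoNCE objective described above: each anchor gets one positive score and N-1 negative scores, and the loss is an N-way softmax classification that picks out the positive. The dot-product scoring and the toy tensors are illustrative assumptions, not CPC's exact model.

```python
import torch
import torch.nn.functional as F

def info_nce(context, positive, negatives, temperature=0.1):
    """context: (B, D); positive: (B, D); negatives: (B, N-1, D)."""
    pos_score = (context * positive).sum(dim=1, keepdim=True)            # (B, 1)
    neg_score = torch.bmm(negatives, context.unsqueeze(2)).squeeze(2)    # (B, N-1)
    logits = torch.cat([pos_score, neg_score], dim=1) / temperature
    labels = torch.zeros(context.size(0), dtype=torch.long)              # the positive is class 0
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 7, 128))
```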

MoCo v1: Momentum Contrast for Unsupervised Visual Representation Learning

  1. Motivation

    • unsupervised visual representation learning
    • contrastive learning
    • dynamic dictionary

      • large
      • consistent
    • verified on

      • 7 downstream tasks
      • ImageNet classification
      • VOC & COCO det/seg
  2. Arguments

    • Unsupervised representation learning

      • highly successful in NLP; in CV, supervised pre-training is still the mainstream
      • two core ingredients
        • pretext tasks
        • loss functions
      • loss functions
        • generative methods compute the loss between a prediction and a fixed target
        • for contrastive methods, the key targets vary on-the-fly during training
        • adversarial losses are not expanded on here
      • pretext tasks
        • tasks involving recovery: auto-encoders
        • tasks involving pseudo-labels: there is usually an exemplar/anchor against which a contrastive loss is computed
      • contrastive learning vs pretext tasks
        • many pretext tasks can be implemented by designing suitable contrastive losses
    • recent approaches using contrastive loss

      • dynamic dictionaries
        • composed of keys: sampled from the data & represented by an encoder
      • train the encoder to perform dictionary look-up
        • given an encoded query
        • it should be similar to its matching key and dissimilar to the others
    • desirable dictionary

      • large: samples the underlying data space better
      • consistent: the training targets stay consistent
    • MoCo: Momentum Contrast

      • queue
      • the encoded keys of the current mini-batch are enqueued at every iteration
      • the oldest are dequeued
      • EMA:

        • a slowly progressing key encoder
        • momentum-based moving average of the query encoder

      • definition of similar: q & k come from the same image

  3. Method

    • contrastive learning

      • an encoded query $q$
      • a set of encoded samples $\{k_0, k_1, \dots\}$
      • assume: there is a single key $k_+$ in the dictionary that $q$ matches
      • similarity measurement: dot product
      • InfoNCE:
        • $L_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}$
        • 1 positive & K negative samples
        • essentially a (K+1)-way softmax-based classifier that tries to classify $q$ as $k_+$
      • unsupervised workflow
        • with encoder networks $f_q$ & $f_k$
        • thus we have the query & sample representations $q=f_q(x^q)$ & $k=f_k(x^k)$
        • the inputs $x$ can be images / patches / context (sets of patches)
        • $f_q$ & $f_k$ can be identical / partially shared / different
    • momentum contrast

      • dictionary as a queue

        • the dictionary always represents a sampled subset of all data
        • the current mini-batch is enqueued
        • the oldest mini-batch is dequeued
      • momentum update

        • with a large dictionary, back-propagating through the keys is intractable: there are too many samples

        • only $f_q$ is updated by back-propagation (on the current mini-batch)

        • naive solution: copy the parameters of $f_q$ into $f_k$; this yields poor results because the key encoder then changes too rapidly, causing a representation-inconsistency issue

        • momentum update: $\theta_k \leftarrow m\theta_k + (1-m)\theta_q$ on the encoder parameters, with $m=0.999$

        • comparison of the three update mechanisms

          • 1) end-to-end:
            • use the samples in the current mini-batch as the dictionary
            • keys are consistently encoded
            • the dictionary size is limited by the batch size
          • 2) memory bank:
            • a memory bank consists of the representations of all samples in the dataset
            • the dictionary for each mini-batch is randomly sampled from the memory bank without back-propagation through it, which enables a large dictionary
            • each key representation was last updated the last time that sample was seen: inconsistent
            • some variants also apply a momentum update, but on the representations rather than on the encoder parameters
          • 3) MoCo: the queue-based dictionary plus the momentum-updated key encoder described above
      • pretext task

        • define a positive pair: the query and the key come from the same image
        • take two random views of the same image under random augmentation to form a positive pair
        • encode them with the respective encoders into q & k
        • compute the similarity of each (q, k) pair: the positive similarity
        • also compute the similarity between the queries and the dictionary keys: the negative similarities
        • compute the cross-entropy (InfoNCE) and update $f_q$
        • momentum-update $f_k$ from $f_q$
        • enqueue the current k's into the dictionary
        • dequeue the oldest mini-batch (a minimal training-loop sketch follows the implementation details below)

        • implementation details

          • ResNet backbone: the last fc outputs 128-d, L2-normalized
          • temperature $\tau=0.07$
          • augmentation
            • random resize + random 224x224 crop
            • random color jittering
            • random horizontal flip
            • random grayscale conversion
          • shuffling BN
            • experiments show that naively using the BN layers in ResNet gives poor results: presumably the intra-batch communication through BN statistics lets the model find a cheating low-loss solution
            • concretely, the mini-batch fed to $f_k$ is first shuffled in order, the forward pass is run, and the result is shuffled back, so the BN statistics computed in $f_q$ and $f_k$ come from different mini-batch compositions
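
A minimal sketch of one MoCo v1 training step, in the spirit of the paper's pseudocode: a queue of negative keys, InfoNCE as a cross-entropy with the positive as class 0, a momentum update of the key encoder, and enqueue/dequeue of the current keys. The tiny linear "encoders", toy inputs and hyperparameter values are illustrative assumptions; shuffling BN and multi-GPU details are omitted.

```python
import torch
import torch.nn.functional as F

feat_dim, K, m, tau = 128, 4096, 0.999, 0.07

f_q = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, feat_dim))
f_k = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, feat_dim))
f_k.load_state_dict(f_q.state_dict())            # initialize the key encoder from the query encoder
for p in f_k.parameters():
    p.requires_grad = False                      # keys receive no gradient

queue = F.normalize(torch.randn(K, feat_dim), dim=1)   # dictionary of negative keys
optimizer = torch.optim.SGD(f_q.parameters(), lr=0.03)

def train_step(x_q, x_k):
    """x_q, x_k: two augmented views of the same images, shape (N, 3, 32, 32)."""
    global queue
    q = F.normalize(f_q(x_q), dim=1)             # queries: (N, C)
    with torch.no_grad():
        # (shuffling BN across GPUs would happen around this forward pass)
        k = F.normalize(f_k(x_k), dim=1)         # keys: (N, C), no gradient

    l_pos = (q * k).sum(dim=1, keepdim=True)     # positive logits: (N, 1)
    l_neg = q @ queue.t()                        # negative logits: (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long)  # positives are class 0
    loss = F.cross_entropy(logits, labels)       # InfoNCE

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        for pk, pq in zip(f_k.parameters(), f_q.parameters()):
            pk.mul_(m).add_(pq, alpha=1 - m)     # momentum update of the key encoder
        queue = torch.cat([k, queue])[:K]        # enqueue new keys, dequeue the oldest
    return loss.item()

loss = train_step(torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32))  # toy usage
```
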
  4. Experiments

SimCLR: A Simple Framework for Contrastive Learning of Visual Representations

  1. Motivation

    • simplify recently proposed contrastive self-supervised learning algorithms
    • systematically study the major components
      • data augmentations
      • a learnable nonlinear projection head
      • larger batch sizes and more training steps
    • outperforms previous self-supervised & semi-supervised methods on ImageNet

  2. Arguments

    • discriminative approaches based on contrastive learning

      • maximize agreement between differently augmented views of the same data sample
      • via a contrastive loss in the latent space
    • major components & conclusions

      • data augmentation matters a lot; unsupervised learning benefits more from it than supervised learning
      • the introduced learnable nonlinear transformation improves representation quality
      • the contrastive cross-entropy loss benefits from normalized embeddings and a properly adjusted temperature parameter
      • larger batch sizes and more training steps matter a lot; unsupervised learning benefits more from them than supervised learning
  3. Method

    • common framework

      • 4 major components

        • random data augmentation
          • results in two views of the same sample, which form a positive pair
          • crop + resize back + color distortions + Gaussian blur
        • base encoder
          • any architecture works; the paper uses ResNet, taking the output after the GAP
        • a projection head
          • maps the representation into the space where the contrastive loss is applied (a low-dimensional embedding, 128-d in the paper)
          • earlier methods used a plain linear projection
          • SimCLR uses an MLP with one hidden layer: fc-BN-ReLU-fc
        • a contrastive loss
      • overall workflow (a loss sketch follows this subsection)

        • randomly sample a mini-batch of N examples
        • random augmentation results in 2N data points
        • for each augmented example there is 1 positive pair, and the other 2(N-1) data points serve as negative samples
        • define the cosine similarity $\mathrm{sim}(u,v)=u^\top v / (\lVert u\rVert\,\lVert v\rVert)$
        • given a positive pair $(i,j)$, the loss is $\ell_{i,j} = -\log \frac{\exp(s_{i,j}/\tau)}{\sum_{k=1, k\neq i}^{2N} \exp(s_{i,k}/\tau)}$
        • it is computed for every positive pair, both $(i,j)$ and $(j,i)$: the symmetrized (NT-Xent) loss
        • update the encoder and projection head

    • training with large batch size

      • batch size 8192, giving 16382 negatives per positive pair
      • with large batches, linear learning-rate scaling can be unstable, so the LARS optimizer is used
      • global BN: aggregate the BN mean & variance over all devices
      • trained on TPUs
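
A minimal sketch of the symmetrized NT-Xent loss described above, written for a batch of 2N projected embeddings (two augmented views per image). The masking trick and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    """z1, z2: (N, D) projections of the two augmented views of the same N images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit norm
    sim = (z @ z.t()) / tau                              # pairwise cosine similarities / temperature
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, -1e9)                    # exclude self-similarity (k != i)
    # the positive of sample i is its other view: i+N for the first half, i-N for the second
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)                 # averages l(i,j) and l(j,i) over the batch

loss = nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128))
```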

MoCo v2: Improved Baselines with Momentum Contrastive Learning

  1. Motivation

    • still working on contrastive unsupervised learning
    • simple modifications to MoCo
      • introduce two effective designs from SimCLR:
        • an MLP head
        • more data augmentation
      • requires a much smaller batch size than SimCLR, making it practical to run on typical GPU machines
    • verified on
      • ImageNet classification
      • VOC detection
  2. Arguments

    • MoCo & SimCLR
      • contrastive unsupervised learning frameworks
      • MoCo v1 shows promising results
      • SimCLR further reduces the gap to supervised pre-training
      • two design improvements from SimCLR work in both frameworks, and used in MoCo they show even better transfer learning results
        • an MLP projection head
        • stronger data augmentation
      • meanwhile, the MoCo framework, unlike SimCLR, does not need large training batches
        • SimCLR is based on the end-to-end mechanism and needs a fairly large batch size to provide enough negative pairs
        • MoCo uses the dynamic queue, so the number of negatives is not limited by the batch size
    • SimCLR
      • improves on the end-to-end method
      • larger batch: to provide more negative samples
      • output layer: replaces the fc with an MLP head
      • stronger data augmentation
    • MoCo

      • a large number of negative samples is already readily available
      • so MoCo v2 simply brings in the latter two designs

  3. Method

    • MLP head

      • 2-layer MLP (hidden dim = 2048, ReLU)
      • only affects unsupervised pre-training; the head is swapped out for supervised transfer learning
      • the temperature parameter is re-tuned: from the default 0.07 to the optimal value 0.2

    • augmentation (a sketch of both changes follows this list)

      • add Gaussian blur
      • SimCLR also uses stronger color distortion: the authors found that SimCLR's stronger color distortion hurts in their MoCo baselines, so it is not adopted
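
A minimal sketch of the two MoCo v2 changes, using torchvision transforms as a stand-in for the paper's augmentation code; the exact kernel size, jitter strengths and probabilities below are illustrative assumptions rather than the verbatim recipe.

```python
import torch.nn as nn
from torchvision import transforms

def mlp_head(in_dim=2048, hidden_dim=2048, out_dim=128):
    # replaces the single fc head of MoCo v1; used only during pre-training
    return nn.Sequential(nn.Linear(in_dim, hidden_dim),
                         nn.ReLU(inplace=True),
                         nn.Linear(hidden_dim, out_dim))

moco_v2_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(23, sigma=(0.1, 2.0))], p=0.5),  # new vs v1
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```
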
  4. Experiments

    • ablation

      • MLP head: the gain on classification is larger than on detection
      • augmentation: the gain on detection is larger than on classification

    • comparison

      • large batches are not necessary for good accuracy: MoCo v2 compares favorably even against the longer-training SimCLR runs
      • the end-to-end mechanism is necessarily more costly in memory and time, since it back-propagates through both encoders

MoCo v3: An Empirical Study of Training Self-Supervised Vision Transformers

  1. Motivation

    • self-supervised frameworks based on Siamese networks, including MoCo
    • ViT: study the fundamental components for training self-supervised ViT
    • MoCo v3: an incremental improvement of MoCo v1/2, striving for a better balance of simplicity, accuracy and scalability
    • instability is a major issue
    • scaling up ViT models
      • ViT-Large
      • ViT-Huge
  2. Arguments

    • go back to the basics and investigate the fundamental components of training deep neural networks
      • batch size
      • learning rate
      • optimizer
    • instability
      • instability is a major issue that impacts self-supervised ViT training
      • but it may not result in catastrophic failure; it only costs accuracy
      • hence it is called a hidden degradation
      • a simple trick improves stability: freeze the patch projection layer in ViT
      • and an accuracy increase is observed
    • in NLP, frameworks based on masked auto-encoding work better than contrastive ones; in vision it is currently the opposite
  3. Method

    • MoCo v3

      • take two crops of each image under random augmentation
      • encode them with the two encoders $f_q$ & $f_k$ into vectors $q$ & $k$
      • the keys are the ones that naturally co-exist in the same batch
        • the memory queue is abandoned: with a sufficiently large batch (4096), the queue brings no noticeable accuracy gain
        • so it reverts to batch-based sample pairs
      • the key encoder still receives no gradients and is still momentum-updated from the query encoder
      • symmetrized loss (see the sketch after this list):

        • $ctr(q_1, k_2) + ctr(q_2, k_1)$
        • InfoNCE
        • temperature
        • the contrastive term is computed for both crops, each acting once as the query
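
A minimal sketch of the symmetrized contrastive loss above, where the keys are simply the other samples in the same batch (no queue) and the loss is scaled by $2\tau$ as in the paper; tensor shapes and the toy inputs are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ctr(q, k, tau=0.2):
    """InfoNCE over the batch: the positive for q[i] is k[i]; the other k's are negatives."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    logits = q @ k.t() / tau                       # (N, N) similarity matrix
    labels = torch.arange(q.size(0))               # positives sit on the diagonal
    return 2 * tau * F.cross_entropy(logits, labels)

def moco_v3_loss(q1, q2, k1, k2, tau=0.2):
    # symmetrized: each crop serves once as the query and once as the key
    return ctr(q1, k2, tau) + ctr(q2, k1, tau)

loss = moco_v3_loss(*[torch.randn(8, 256) for _ in range(4)])
```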

    • encoder

      • encoder $f_q$
        • a backbone
        • a projection head
        • an extra prediction head
      • encoder $f_k$
        • a backbone
        • a projection head
    • encoder $f_k$ is updated as the moving average of $f_q$, excluding the prediction head
    • baseline acc

      • basic settings; the two main changes are:
        • the dynamic queue is replaced by a large batch
        • the extra prediction head on encoder $f_q$
    • use ViT

      • directly swapping the ResNet backbone for ViT runs into instability issues

      • batch size

        • one argument in the ViT line of work is that the model itself is heavy, so a large batch is desirable

        • empirically

          • a batch of 1k & 2k produces reasonably smooth curves: in this regime, the larger batch improves accuracy thanks to more negative samples
          • a batch of 4k shows clearly unstable dips in the training curve
          • a batch of 6k has worse failure patterns: the interpretation is that at each dip, training is partially restarted and jumps out of the current local optimum

      • learning rate

        • with a smaller lr, training is more stable but tends to under-fit
        • with too large an lr, training becomes unstable and accuracy also suffers
        • overall, accuracy is still governed by stability

      • optimizer

        • default is AdamW with batch size 4096
        • some methods use LARS & LAMB for large-batch training
        • LAMB

          • sensitive to the lr
          • at its optimal lr it achieves slightly better accuracy than AdamW
          • but once the lr is too large, accuracy drops sharply
          • yet the training curves stay smooth even while accuracy degrades: the interpretation is that LAMB can avoid sudden changes in the gradients, but cannot avoid their negative impact, which still accumulates

      • a trick for improving stability

        • a spike in the gradient causes a dip in the training curve
        • gradient spikes also happen earlier in the first layer (the patch projection)
        • so the trick is to freeze the patch projection layer during training, i.e. keep a random, fixed patch projection (see the sketch below)

          • this stability benefits the final accuracy
          • the improvement is bigger for a larger lr
          • it is also effective in other ViT-based frameworks (SimCLR, BYOL)
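
A minimal sketch of the freezing trick, assuming a timm ViT as the backbone; the attribute name `patch_embed` follows timm's implementation and is not from the paper.

```python
import timm

vit = timm.create_model('vit_base_patch16_224', num_classes=0)   # backbone only, no classifier
for p in vit.patch_embed.parameters():
    p.requires_grad = False   # the patch projection stays at its random init; the rest trains as usual
```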

      • BN, WN and gradient clipping were also tried

        • BN/WN do not improve stability
        • gradient clipping helps when the threshold is small enough; pushed to the extreme it amounts to freezing the layer
    • implementation details

      • AdamW
      • batch size 4096
      • lr: 40-epoch warmup, then cosine decay
    • MLP heads

      • projection head: 3 layers, 4096-BN-ReLU-4096-BN-ReLU-256
      • prediction head: 2 layers, 4096-BN-ReLU-256
    • loss

      • the contrastive loss is scaled by a factor of $2\tau$
      • which makes training less sensitive to the $\tau$ value
      • $\tau=0.2$
    • ViT architecture

      • consistent with the original ViT paper
      • the 224x224 input image is divided into patches of 16x16 or 14x14 pixels, giving a sequence of 196 or 256 patch embeddings obtained by linear projection
      • 2-D sine-cosine positional embeddings are added
      • a class token is then concatenated
      • the sequence goes through a stack of Transformer blocks
      • the class token after the last block (and after the final LayerNorm) is treated as the output of the backbone and is the input to the MLP heads (a closing sketch of this head structure follows)
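
To close, a minimal sketch of how the backbone output feeds the MoCo v3 MLP heads listed above (projection head on both encoders, prediction head only on $f_q$); the ViT-B embedding size of 768 and the helper below are assumptions for illustration.

```python
import torch
import torch.nn as nn

def mlp(dims):
    # dims like [in, hidden, ..., out]; BN + ReLU between layers, per the notes above
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers += [nn.BatchNorm1d(dims[i + 1]), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

embed_dim = 768                                        # ViT-B hidden size
projection_head = mlp([embed_dim, 4096, 4096, 256])    # 3 layers, on both f_q and f_k
prediction_head = mlp([256, 4096, 256])                # 2 layers, on f_q only

cls_output = torch.randn(8, embed_dim)                 # stand-in for the [CLS] token after the last block
q = prediction_head(projection_head(cls_output))       # query path; keys skip the prediction head
```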