self-supervised

  • Self-supervised learning papers

    • MoCo series: contrastive-based

      [2019 MoCo v1] Momentum Contrast for Unsupervised Visual Representation Learning, Kaiming He

      [2020 SimCLR] A Simple Framework for Contrastive Learning of Visual Representations, Google Brain; included here because it improves on MoCo v1, and MoCo v2/v3 in turn build on it

      [2020 MoCo v2] Improved Baselines with Momentum Contrastive Learning, Kaiming He

      [2021 MoCo v3] An Empirical Study of Training Self-Supervised Vision Transformers, Kaiming He

    • MAE: reconstruction-based

      [2021 MAE] Masked Autoencoders Are Scalable Vision Learners: Kaiming He; ports BERT's masked self-supervision paradigm to images, designing an image reconstruction task over masked patches

    • MIM: reconstruction-based

      [2021 SimMIM] SimMIM: A Simple Framework for Masked Image Modeling: Microsoft; the scaled-up Swin V2 models use this self-supervised method to alleviate the data-hungry issue

      [2022 MIM] Revealing the Dark Secrets of Masked Image Modeling: Microsoft; essentially an expanded experimental study following the previous paper

SimMIM: A Simple Framework for Masked Image Modeling

  1. Motivation

    • propose a simple framework for MIM
      • without the need for special designs
      • simple designs already deliver strong learning performance
    • major components
      • a fairly large patch size: random masking of the input image with a moderately large masked patch size (e.g., 32) makes a powerful pretext task
      • pixel-level regression: predicting RGB values of raw pixels by direct regression performs no worse than patch classification approaches with complex designs
      • a lightweight prediction head: the prediction head can be as light as a linear layer, with no worse performance than heavier ones
    • proved on ImageNet
      • standard model ViT-B: pre-training + fine-tuning on ImageNet-1K, 83.8% top-1
      • large model SwinV2-H: 87.1% top-1
      • huge model SwinV2-G: MIM alleviates the data-hungry issue, training a huge model to SOTA with far less data
  2. Arguments

    • numerous pretext tasks in self-supervised learning

      • gray-scale image colorization: predict the colors of a gray-scale image
      • jigsaw puzzle solving: reorder shuffled patches
      • split-brain auto-encoding: split the image into two parts; two branches each predict the other part
      • rotation prediction: given images X & Y before and after a transformation, predict the transformation parameters
      • learning to cluster: iterate over "cluster the features, assign each centroid's label to its members, train on that pseudo-label task, re-cluster the features"
    • Masked signal modeling

      • key differences between images and NLP
        • images exhibit stronger locality: neighboring pixels are highly correlated, while word order in language carries no inherent distance correlation
        • visual signals are raw and low-level: does a pixel-level reconstruction task then benefit high-level recognition tasks?
        • the visual signal is continuous, and the text token is discrete
      • bridge the modality gaps through several special designs
        • converting continuous signals into color clusters
        • patch tokenization using an additional network
        • block-wise masking strategy to break short-range connections
    • this paper

      • propose a simple framework without the complex designs listed above

      • Random masking:

        • patch-level random masking, fitting vision transformers
        • a larger patch size (32) works for a wide range of masking ratios
        • a small patch size (8) needs a masking ratio as high as 80% to perform well
        • masking ratios in NLP are usually small (e.g., 0.15); the paper attributes this to the different information density of the two modalities
      • 1-layer prediction head
        • extremely lightweight
        • meanwhile the target resolution need not be large (12 to 96 all beat 192x192)
        • achieves slightly better transfer performance than heavy heads
        • this self-supervised head is discarded after pre-training, so the smaller the better; it should not absorb too much of the model's capacity
      • pixel-level reconstruction uses a simple L1 loss
        • regression suits continuous signals
        • performs no worse than classification approaches
        • [QUESTION] how is the classification variant usually designed? In the later experiments, RGB intensities are discretized into 8/256 bins and then classified
  3. Method

    • A Masked Image Modeling framework: 4 major components

      • masking strategy
      • encoder architecture: ViT & Swin
      • prediction head
      • prediction target: either the raw pixels or a transformation of them
    • Masking Strategy

      • mask token
        • use a learnable mask token vector to replace each masked patch
        • the token dimension matches that of the visible patch embeddings
      • Patch-aligned random masking
        • random masking at patch granularity (sketched after this section)
        • Swin's equivalent patch size grows across stages from 4 to 32; we adopt 32
        • for ViT we also adopt 32
      • Other masking strategies (tried with patch sizes 16/32)

        • square: a large square region placed at random
        • block-wise: the more complexly designed strategy (as in BEiT)
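
      • A minimal sketch of patch-aligned random masking plus the learnable mask token, assuming a PyTorch setup (class/function names are hypothetical; mask ratio 0.6 and mask patch size 32 follow the paper's defaults):

        ```python
        import torch
        import torch.nn as nn

        def patch_aligned_random_mask(num_patches_h, num_patches_w, mask_ratio=0.6):
            """Sample a binary mask over the patch grid (True = masked)."""
            num_patches = num_patches_h * num_patches_w
            num_masked = int(mask_ratio * num_patches)
            perm = torch.randperm(num_patches)
            mask = torch.zeros(num_patches, dtype=torch.bool)
            mask[perm[:num_masked]] = True
            return mask  # (num_patches,)

        class MaskedPatchEmbed(nn.Module):
            """Patch embedding where each masked patch is replaced by a learnable
            token of the same dimension as the visible patch embeddings."""
            def __init__(self, patch_size=32, in_chans=3, embed_dim=128):
                super().__init__()
                self.proj = nn.Conv2d(in_chans, embed_dim,
                                      kernel_size=patch_size, stride=patch_size)
                self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

            def forward(self, x, mask):
                # x: (B, 3, H, W); mask: (num_patches,) bool
                tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, C)
                w = mask[None, :, None].to(tokens.dtype)          # (1, N, 1)
                return tokens * (1 - w) + self.mask_token * w
        ```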

    • Prediction Head

      • as light as a linear layer
      • experiments also tried progressively heavier heads: a 2-layer MLP, an inverse Swin-T, an inverse Swin-B
      • upsampling if required:
        • ViT encoding yields x16-downsampled feature maps
        • Swin encoding yields x32-downsampled feature maps
        • use a 1x1 conv / linear layer to expand the feature dim to patch_size x patch_size x 3, e.g., for Swin with RGB targets: 32*32*3 = 3072
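
      • A sketch of such a head, assuming a final feature dim of 1024 (Swin-B) and hypothetical names; PixelShuffle rearranges the 3072 channels back to full-resolution RGB:

        ```python
        import torch.nn as nn

        class LinearPredictionHead(nn.Module):
            """Maps each x32 feature vector to the 32x32x3 raw pixels of its patch."""
            def __init__(self, in_dim=1024, patch_size=32):
                super().__init__()
                self.proj = nn.Conv2d(in_dim, patch_size * patch_size * 3, kernel_size=1)
                self.shuffle = nn.PixelShuffle(patch_size)  # (B, 3*ps^2, h, w) -> (B, 3, h*ps, w*ps)

            def forward(self, feat):
                # feat: (B, C, H/32, W/32) -> (B, 3, H, W)
                return self.shuffle(self.proj(feat))
        ```
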
    • Prediction Targets

      • regression

        • alternatively the ground truth can be downsampled to the feature size
        • L1 loss: computed on the RGB pixels of the masked area only, then averaged over those pixels (see the sketch below)
        • experiments also tried L2 / smooth-L1
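
        • a minimal sketch of the masked-only L1 loss (function name hypothetical):

          ```python
          import torch

          def masked_l1_loss(pred, target, patch_mask, patch_size=32):
              """L1 loss on masked pixels only, averaged over those pixels.
              pred/target: (B, 3, H, W); patch_mask: (B, H/ps, W/ps) bool, True = masked."""
              pixel_mask = patch_mask.repeat_interleave(patch_size, 1)
              pixel_mask = pixel_mask.repeat_interleave(patch_size, 2)
              pixel_mask = pixel_mask.unsqueeze(1).to(pred.dtype)   # (B, 1, H, W)
              loss = (pred - target).abs() * pixel_mask
              return loss.sum() / (pixel_mask.sum() * pred.size(1) + 1e-8)
          ```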
      • Other prediction targets

        • most previous approaches convert the masked signals into clusters or classes and then perform a classification task
        • color clustering (iGPT): cluster the RGB values of a huge dataset into 512 clusters; each predicted pixel is assigned to the nearest cluster
        • vision tokenization (BEiT): a pre-trained discrete VAE network converts image patches into tokens, which serve as the classification targets
        • channel-wise bin color discretization: each color channel is classified independently, with intensities discretized into 8/256 bins (a sketch follows)
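
        • a minimal sketch of building such targets (my reading of the setup; name hypothetical):

          ```python
          import torch

          def channelwise_bin_targets(img, num_bins=8):
              """Discretize each channel's intensity in [0, 1] into num_bins bins,
              turning pixel regression into per-channel classification targets."""
              bins = (img * num_bins).long().clamp(max=num_bins - 1)  # (B, 3, H, W)
              return bins  # train with cross-entropy over num_bins classes per channel
          ```
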
    • Evaluation protocols

      • fine-tune the pre-trained model on ImageNet-1K and report classification accuracy
      • or evaluate with metrics on other downstream tasks
  4. Experiments

    • pre-training settings

      • Swin-B: input 192x192, window size = 6
      • dataset: ImageNet-1K, with light data augmentation (random resized cropping / random flipping / color normalization)
      • AdamW: weight decay = 0.05, betas = [0.9, 0.999]
      • cosine LR scheduler: 100 epochs (10-epoch warmup), base LR = 8e-4
      • batch size: 2048
      • random masking: mask ratio = 0.6, mask patch size = 32
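
      • A sketch wiring up these settings in PyTorch (the model is a stand-in; per-step warmup and other details omitted):

        ```python
        import math
        import torch
        from torch.optim.lr_scheduler import LambdaLR

        model = torch.nn.Linear(8, 8)  # stand-in for the Swin-B encoder + head

        optimizer = torch.optim.AdamW(model.parameters(), lr=8e-4,
                                      betas=(0.9, 0.999), weight_decay=0.05)

        warmup_epochs, total_epochs = 10, 100
        def lr_lambda(epoch):
            if epoch < warmup_epochs:  # linear warmup over the first 10 epochs
                return (epoch + 1) / warmup_epochs
            progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
            return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

        scheduler = LambdaLR(optimizer, lr_lambda)
        ```
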
    • fine-tuning settings

      • AdamW, batch size, and masking parameters match pre-training
      • cosine LR: base LR = 5e-3
      • stochastic depth rate: 0.1
      • layer-wise learning rate decay: 0.9
      • strong data augmentation: RandAug, Mixup, CutMix, label smoothing, random erasing
    • AvgDist

      • measures the averaged Euclidean distance of masked pixels to the nearest visible ones: a spatial (image-plane) distance from each masked pixel to its closest visible pixel, averaged over masked pixels (sketch below)
      • the larger the mask ratio, the larger the AvgDist
      • the larger the mask patch size, the larger the AvgDist
      • accuracy peaks when AvgDist falls in the [10, 20] range
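
      • AvgDist can be computed with a distance transform, as in this sketch (function name hypothetical):

        ```python
        import numpy as np
        from scipy.ndimage import distance_transform_edt

        def avg_dist(patch_mask, patch_size=32):
            """Mean Euclidean distance from each masked pixel to its nearest
            visible pixel. patch_mask: (H_p, W_p) bool array, True = masked."""
            pixel_mask = np.kron(patch_mask, np.ones((patch_size, patch_size))).astype(bool)
            dist = distance_transform_edt(pixel_mask)  # distance to nearest zero (visible) pixel
            return dist[pixel_mask].mean()
        ```
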
    • accuracy highlights

      • SwinV2-H achieves 87.1% top-1 accuracy, the best among methods using only ImageNet-1K data
      • SwinV2-G uses external data, but 40x less than what Google used, and still achieves strong performance
        • 84.0% top-1 on ImageNet-V2
        • 63.1/54.4 box/mask mAP on COCO object detection
        • 59.9 mIoU on ADE20K semantic segmentation
        • 86.8% top-1 acc on Kinetics-400 action recognition
  5. Visualization

    • [20220614] In my preliminary experiments so far, the pre-trained model's generations show an obvious checkerboard pattern, since each x32 feature pixel represents a 32x32 RGB patch; the figures in the official paper are also quite checkerboarded, and it is unclear how far training should go before it counts as finished
    • What capability is learned?

      • random masking: the shape and texture of masked parts can be well recovered; meanwhile a pronounced checkerboard effect appears in the unmasked area, since no gradients are back-propagated through it during training
      • masking most of a major object: the model can still predict the object's existence from the negligible remaining clues
      • masking the full major object: the masked area is inpainted with background textures
    • Prediction vs. reconstruction

      • compares two tasks: recovering only the masked region vs. recovering the full image
      • visually the latter's reconstructions look slightly better (the checkerboard is less obvious since prediction is global), but accuracy drops by a point: probably the model capacity is wasted on recovering the unmasked area, which may not be that useful for fine-tuning
      • auto-encoders and masked image modeling are both reconstruction tasks, but they are built on different philosophies:

        • the former reconstructs visible signals
        • the latter predicts invisible signals
      • MIM can also be designed to reconstruct the full image, but that amounts to fusing two tasks

        • prediction & reconstruction
        • judging from fine-tuning accuracy, the two tasks are fundamentally different in their internal mechanisms; combining them does not help
        • the prediction task seems to learn the more promising representation: it appears more useful for downstream tasks
      • [My take: reconstruction is more local, hence nicer-looking details, while prediction is more long-range; but then why is it claimed to be better for low-level / fine-grained downstream tasks?]

Revealing the Dark Secrets of Masked Image Modeling

  1. Motivation

    • Masked image modeling (MIM) as pre-training
      • proved effective
      • but how it works remains unclear
    • we compare MIM with mainstream supervised models
      • through visualizations & experiments
        • to uncover the key representational differences
    • visualizations
      • finding: MIM brings a locality inductive bias to all layers (so less information is lost along the way?)
      • in contrast, supervised models attend more locally in lower layers and more globally in higher layers
      • supervised models' attention heads in the last layers are nearly indistinguishable (all global semantic info), while MIM still keeps diversity across attention heads in the last layers
      • less diversity harms the fine-tuning performance
    • experiments
      • finding: MIM outperforms supervised models on tasks with weak semantics / fine-grained tasks
      • [guess] pixel-level training signals drive pixel-level understanding better than image-level labels do?
  2. Arguments

    • masked signal modeling

      • mask a portion of the input signals and try to predict them
      • a classic recovery-based self-supervised task design
      • used across language, vision, and speech
    • masked image modeling (MIM)

      • achieves very high fine-tuning accuracy
      • thus this paper seeks a deeper understanding
      • uses the SimMIM framework: a ViT/Swin backbone plus a lightweight head, reconstructing pixel-level images

        • random masking with large patch size
        • a linear-layer prediction head
        • predicting raw RGB pixels with an L1 loss

  3. Visualizations

    • attention weights have a clear meaning: how much weight each token is given

    • analyzed from three angles (a combined sketch follows this list)

      • averaged attention distance to measure whether it is local attention or global attention
      • entropy of the attention distribution to measure whether it is focused attention or broad attention (isn't this the same thing as the above?)
      • KL divergence between different attention heads to investigate whether heads attend to different tokens or to similar ones
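
      • A combined sketch of the three measures for one layer's attention maps (function name hypothetical; shapes assumed):

        ```python
        import torch

        def attention_metrics(attn, coords):
            """attn: (num_heads, N, N) attention map, rows sum to 1;
            coords: (N, 2) float tensor of token positions in pixels."""
            # 1) averaged attention distance: expected query-key distance under attn
            dist = torch.cdist(coords, coords)                     # (N, N)
            avg_distance = (attn * dist).sum(-1).mean(-1)          # (num_heads,)

            # 2) entropy of each attention distribution, averaged over queries
            p = attn.clamp_min(1e-8)
            entropy = -(p * p.log()).sum(-1).mean(-1)              # (num_heads,)

            # 3) mean pairwise KL divergence between heads (low = heads look alike)
            kl = (p[:, None] * (p[:, None].log() - p[None, :].log())).sum(-1)
            pairwise_kl = kl.mean(-1).mean()                       # scalar
            return avg_distance, entropy, pairwise_kl
        ```
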
    • Local Attention or Global Attention

      • image signals carry strong locality: neighboring pixels are naturally highly correlated, which is why designs with local priors such as convolution exist; whether the transformer architecture carries this inductive bias is worth examining

      • computing averaged attention distance in each attention head of each layer

        • contrastive models behave like supervised models: lower layers focus locally, higher layers focus more globally
        • MIM models' attention heads show diversity at every layer, always retaining both local & global attention
        • this suggests MIM brings a locality inductive bias [not fully understood]
    • Focused Attention or Broad Attention

      • averaging the entropy of each head’s attention distribution

        • contrastive models behave like supervised models: some lower-layer attention heads have very focused attention, while most higher-layer attention heads focus very broadly
        • MIM models are diverse at every layer, keeping both focused attention & broad attention throughout
    • Diversity on Attention Heads

      • check whether different attention heads attend to similar tokens

      • computing the KL divergence between different heads

        • contrastive models behave like supervised models: diversity gradually shrinks and is all but gone in the last few layers
        • losing diversity limits the capacity of the model: it hurts the model's expressive power
        • dropping the last few layers of a supervised model keeps or even improves downstream accuracy, showing those layers really do harm downstream tasks
    • Investigating the Representation Structures via CKA similarity

      • the above looked at different attention heads within the same layer; here we examine feature maps across layers

      • via the CKA similarity between feature representations of different layers

        • MIM and contrastive models behave similarly: the feature representation structures of different layers are highly similar
        • supervised models' layers differ substantially from one another
        • when pre-trained weights are loaded with some layers randomly swapped before downstream training, MIM drops only slightly, while supervised models suffer much more
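
      • A minimal linear-CKA sketch following Kornblith et al.'s formula (function name hypothetical):

        ```python
        import torch

        def linear_cka(x, y):
            """Linear CKA between feature matrices x: (n, d1) and y: (n, d2)
            taken from two different layers on the same n samples."""
            x = x - x.mean(0, keepdim=True)   # center each feature dimension
            y = y - y.mean(0, keepdim=True)
            hsic = (y.T @ x).norm() ** 2      # ||Y^T X||_F^2
            return (hsic / ((x.T @ x).norm() * (y.T @ y).norm())).item()
        ```
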
    • Experiments

      • on 3 types of downstream tasks

        • semantic understanding tasks: classification on Concept Generalization (CoG) & 12-dataset (K12) & iNaturalist-18 (iNat18)
        • geometric and motion tasks: pose/depth estimation and video tracking on COCO & CrowdPose & NYUv2
        • combined tasks: object detection on COCO
      • Semantic Understanding Tasks

        • three benchmarks, transferring from ImageNet pre-trained weights
        • settings
          • AdamW
          • cosine learning rate schedule
          • 100 epochs with 20 warm-up epochs
          • input 224x224
          • DropPath
        • finding: supervised models beat MIM on categories covered by ImageNet, while MIM is more accurate on uncovered / fine-grained categories, indicating MIM's representations transfer better

      • Geometric and Motion Tasks

        • mainly tests object localization ability, with little reliance on high-level semantics
        • MIM outperforms across the board
      • Combined Task of Object Detection

        • COCO object detection
        • Mask R-CNN framework
        • MIM also clearly outperforms
        • also observed: MIM models converge faster and better on the localization task, while supervised pre-training helps classification more, again showing MIM favors geometric and motion tasks