Self-supervised learning papers
MoCo series: contrastive-based
[2019 MoCo v1] Momentum Contrast for Unsupervised Visual Representation Learning, Kaiming He
[2020 SimCLR] A Simple Framework for Contrastive Learning of Visual Representations, Google Brain; listed here because it improves on MoCo v1, while MoCo v2/v3 in turn build on it
[2020 MoCo v2] Improved Baselines with Momentum Contrastive Learning, Kaiming He
[2021 MoCo v3] An Empirical Study of Training Self-Supervised Vision Transformers, Kaiming He
MAE: reconstruction-based
[2021 MAE] Masked Autoencoders Are Scalable Vision Learners: Kaiming He; brings BERT's masked self-supervised paradigm to images, designing an image reconstruction task over masked patches
MIM: reconstruction-based
[2021 SimMIM] SimMIM: A Simple Framework for Masked Image Modeling: Microsoft; the scaled-up Swin V2 models use this self-supervised method to alleviate the data-hungry issue
[2022 MIM] Revealing the Dark Secrets of Masked Image Modeling: Microsoft; essentially an expanded experimental study of the previous paper
SimMIM: A Simple Framework for Masked Image Modeling
Motivation
- proposes a simple framework for MIM
- without the need for special designs
- simple designs reveal strong learning performance
- major components
- relatively large patch size: random masking of the input image with a moderately large masked patch size (e.g., 32) makes a powerful pretext task
- pixel-level regression: predicting RGB values of raw pixels by direct regression performs no worse than patch classification approaches with complex designs
- lightweight prediction head: the prediction head can be as light as a linear layer, with no worse performance than heavier ones
- validated on ImageNet
- standard model ViT-B: pre-training + fine-tuning on ImageNet-1K, 83.8% top-1
- large model SwinV2-H: 87.1% top-1
- huge model SwinV2-G: MIM alleviates the data-hungry issue, training the huge model to SOTA with far less data
Arguments
Numerous pretext tasks in self-supervised learning
- gray-scale image colorization: colorize grayscale images
- jigsaw puzzle solving: reorder shuffled patches
- split-brain auto-encoding: split the image into two parts and cross-predict each part from the other with two branches
- rotation prediction: given the images X & Y before and after a transformation, predict the transformation parameters
- learning to cluster: iterate the loop of "cluster the features, assign each cluster's label to its members, train on this pseudo-label task, re-cluster the features"
Masked signal modeling
- main differences between images and NLP
- images exhibit stronger locality: neighboring pixels are highly correlated, whereas word order in language carries no inherent distance correlation
- visual signals are raw and low-level: so does a pixel-level reconstruction task actually benefit high-level recognition tasks?
- the visual signal is continuous, and the text token is discrete
- previous works bridge the modality gap through several special designs
- converting continuous signals into color clusters
- patch tokenization using an additional network
- block-wise masking strategy to break short-range connections
this paper
propose a simple framework, without the complex designs above
Random masking:
- random masking at patch level, matching vision transformers
- a larger patch size (32) works for a wide range of masking ratios
- a small patch size (8) needs a masking ratio as high as 80% to perform well
- masking ratios in NLP are usually much smaller (e.g., 0.15), presumably because the information density of the two modalities differs
- 1-layer prediction head
- extremely lightweight
- the target resolution does not need to be large either (12 to 96 all perform on par with or better than 192x192)
- achieves slightly better transfer performance than heavy heads
- the self-supervised head is discarded after pre-training, so the lighter the better; it should not absorb too much of the model's capacity
- pixel-level reconstruction with a simple L1 loss
- regression suits continuous signals
- performs no worse than classification approaches
- [QUESTION] how is the classification variant usually designed: in the later experiments, RGB values are discretized into 8/256 bins and then classified
Method
A Masked Image Modeling Framework:4 major components
- masking strategy
- encoder architecture:ViT & Swin
- prediction head
- prediction target:either the raw pixels or a transformation
Masking Strategy
- mask token
- use a learnable mask token vector to replace each masked patch
- the token dimension matches that of the visible patch embeddings
- Patch-aligned random masking
- i.e., random masking at patch granularity (see the sketch after this block)
- Swin's patch size grows with stages from 4 to 32; we adopt 32
- for ViT we adopt 32
Other masking strategies (compared with mask patch sizes 16/32)
- square: a large square region placed at random
- block-wise: a more elaborately designed strategy
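A minimal sketch of the patch-aligned random masking described above; the function name, shapes, and defaults are my assumptions rather than the official implementation:

```python
import torch

def random_patch_mask(img_size=192, mask_patch_size=32, mask_ratio=0.6):
    """Return a 0/1 mask over the grid of mask units; 1 = masked."""
    grid = img_size // mask_patch_size                 # 192 // 32 = 6
    num_units = grid * grid                            # 36 mask units
    num_masked = int(mask_ratio * num_units)           # ~21 of 36
    mask = torch.zeros(num_units)
    mask[torch.randperm(num_units)[:num_masked]] = 1.0
    return mask.reshape(grid, grid)

mask = random_patch_mask()                                            # (6, 6) grid of mask units
pixel_mask = mask.repeat_interleave(32, 0).repeat_interleave(32, 1)   # (192, 192), aligned to 32x32 patches
```
Each masked unit is then replaced by the learnable mask token before being fed to the encoder.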
Prediction Head
- as light as a linear layer
- experiments also tried progressively heavier heads: a 2-layer MLP, an inverse Swin-T, an inverse Swin-B
- upsampling if required:
- ViT encoding yields x16 feature maps
- Swin encoding yields x32 feature maps
- use a 1x1 conv / linear layer to expand the feature dim to patch_size x patch_size x 3, e.g., for Swin with RGB targets that is 32*32*3=3072 (see the sketch below)
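A minimal sketch of such a one-layer head for Swin-style x32 feature maps; the encoder dim and the use of PixelShuffle to rearrange the 3072 channels back into 32x32x3 pixel patches are my assumptions:

```python
import torch
import torch.nn as nn

encoder_dim, stride = 1024, 32                        # assumed final feature dim / x32 feature maps
head = nn.Sequential(
    nn.Conv2d(encoder_dim, stride * stride * 3, kernel_size=1),  # 1x1 conv -> 32*32*3 = 3072 channels
    nn.PixelShuffle(stride),                          # (B, 3072, H/32, W/32) -> (B, 3, H, W)
)

feat = torch.randn(2, encoder_dim, 6, 6)              # 192x192 input -> 6x6 feature map
recon = head(feat)                                    # (2, 3, 192, 192) reconstructed RGB image
```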
Prediction Targets
regression
- the ground truth can also be downsampled to the feature size
- L1 loss: compute the L1 loss on the RGB pixels of the masked region, then average over those pixels (see the sketch below)
- experiments also tried L2 / smooth-L1
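A minimal sketch of the masked L1 loss, assuming a per-pixel mask with 1 on masked pixels; tensor names are placeholders:

```python
import torch

def masked_l1_loss(recon, target, pixel_mask):
    """recon/target: (B, 3, H, W); pixel_mask: (B, 1, H, W), 1 = masked."""
    loss = (recon - target).abs() * pixel_mask         # visible pixels contribute nothing
    return loss.sum() / (3 * pixel_mask.sum() + 1e-8)  # mean over masked RGB values only
```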
Other prediction targets
- most previous approaches transform the masked signals into clusters or classes, then perform a classification task
- Color clustering (iGPT): cluster the RGB values of a huge dataset into 512 clusters; each predicted pixel is assigned to the nearest cluster
- Vision tokenization (BEiT): use a pretrained discrete VAE network to convert image patches into tokens, which serve as the classification targets
- Channel-wise bin color discretization: classify each color channel independently, discretizing intensity values into 8/256 bins (see the sketch below)
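To make the channel-wise bin alternative concrete, a sketch of turning RGB targets into per-channel bin indices with a masked cross-entropy; the bin count, shapes, and names are my assumptions:

```python
import torch
import torch.nn.functional as F

def channelwise_bin_loss(logits, target, pixel_mask, num_bins=8):
    """logits: (B, 3, num_bins, H, W); target: (B, 3, H, W) in [0, 1]; pixel_mask: (B, 1, H, W)."""
    bins = (target * num_bins).clamp(max=num_bins - 1).long()        # per-channel bin index
    ce = F.cross_entropy(logits.permute(0, 2, 1, 3, 4), bins,        # class dim moved to position 1
                         reduction="none")                           # (B, 3, H, W)
    return (ce * pixel_mask).sum() / (3 * pixel_mask.sum() + 1e-8)   # masked pixels only
```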
Evaluation protocols
- first fine-tune the model on ImageNet-1K and report classification accuracy
- or evaluate with metrics on other downstream tasks
Experiments
pre-training settings
- Swin-B: input 192x192, window size = 6
- dataset: ImageNet-1K, with light data augmentation (random resized cropping / random flipping / color normalization)
- AdamW: weight decay = 0.05, betas = [0.9, 0.999]
- cosine LR scheduler: 100 epochs (10-epoch warm-up), base LR = 8e-4 (see the sketch below)
- batch size: 2048
- random masking: mask ratio = 0.6, mask patch size = 32
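A sketch of this optimization setup in PyTorch; the model is a placeholder, and per-epoch stepping with a linear warm-up shape are my assumptions:

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(8, 8)                         # placeholder for the SimMIM encoder + head
optimizer = torch.optim.AdamW(model.parameters(), lr=8e-4,
                              betas=(0.9, 0.999), weight_decay=0.05)

warmup_epochs, total_epochs = 10, 100
def lr_lambda(epoch):
    if epoch < warmup_epochs:                         # linear warm-up
        return (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * t))        # cosine decay to 0

scheduler = LambdaLR(optimizer, lr_lambda)            # step once per epoch
```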
fine-tuning settings
- AdamW, batch size, and masking parameters are the same as in pre-training
- cosine LR: base LR = 5e-3
- stochastic depth rate: 0.1
- layer-wise learning rate decay: 0.9 (see the sketch below)
- strong data augmentation: RandAug, Mixup, CutMix, label smoothing, random erasing
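A sketch of layer-wise LR decay (decay = 0.9) via parameter groups; how the blocks are enumerated for an actual Swin/ViT backbone is an assumption here:

```python
import torch

def layerwise_lr_groups(blocks, base_lr=5e-3, decay=0.9):
    """blocks: modules ordered from the first (earliest) to the last layer."""
    n = len(blocks)
    return [{"params": b.parameters(), "lr": base_lr * decay ** (n - 1 - i)}
            for i, b in enumerate(blocks)]            # the last block keeps the base LR

blocks = [torch.nn.Linear(8, 8) for _ in range(4)]    # placeholder transformer blocks
optimizer = torch.optim.AdamW(layerwise_lr_groups(blocks), weight_decay=0.05)
```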
AvgDist
- measures the averaged Euclidean distance of masked pixels to the nearest visible ones: i.e., for each masked pixel, the spatial distance to its nearest visible pixel, averaged over all masked pixels (see the sketch below)
- the larger the mask ratio, the larger AvgDist
- the larger the mask patch size, the larger AvgDist
- accuracy is highest when AvgDist falls in the range [10, 20]
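A reference sketch of AvgDist on a binary pixel mask (brute-force nearest-visible search; the exact distance units follow my reading of the definition above):

```python
import torch

def avg_dist(pixel_mask):
    """pixel_mask: (H, W) with 1 = masked, 0 = visible."""
    h, w = pixel_mask.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float()
    m = pixel_mask.reshape(-1).bool()
    d = torch.cdist(coords[m], coords[~m])             # masked-to-visible pixel distances
    return d.min(dim=1).values.mean()                  # nearest visible pixel, then average
```
Consistent with the observations above, increasing either the mask ratio or the mask patch size pushes this value up.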
Some accuracy records
- SwinV2-H achieves 87.1% top-1 accuracy, the best among methods that use only ImageNet-1K data
- SwinV2-G relies on external data, but 40x less than what Google used, and still achieves strong performance
- 84.0% top-1 on ImageNet-V2
- 63.1/54.4 box/mask mAP on COCO object detection
- 59.9 mIoU on ADE20K semantic segmentation
- 86.8% top-1 acc on Kinetics-400 action recognition
Visualization
- [20220614] In my preliminary experiments, the pretrained reconstruction model generates images with obvious checkerboard artifacts, since each x32 feature pixel represents a 32x32 RGB patch; the figures in the official paper also look quite checkerboarded, and it is unclear how far training should go before it counts as done
What capability is learned?
- random masking: the shape and texture of masked parts can be well recovered, while a pronounced checkerboard effect appears in the unmasked regions, because no gradients are back-propagated there during training
- masking most parts of a major object: the model can still predict the existence of the object from the few remaining clues
- masking the full major object:the masked area will be inpainted with background textures
Prediction vs. reconstruction
- compares two tasks: recovering only the masked region vs. recovering the whole image
- judging by the reconstructions, the latter looks slightly better visually (the checkerboard is less obvious since it predicts globally), but fine-tuning accuracy drops by about one point: probably the model capacity is wasted on recovering the unmasked area, which may not be that useful for fine-tuning
auto-encoders and masked image modeling are both reconstruction tasks, but they are built on different philosophies:
- the former reconstructs visible signals
- the latter predicts invisible signals
MIM could also be designed to reconstruct the whole image, but that effectively merges the two tasks
- prediction & reconstruction
- judging by fine-tuning accuracy, the two tasks are fundamentally different in their internal mechanisms, and combining them does not help
- the prediction task might learn a more promising representation: the representation learned by prediction appears more useful for downstream tasks
[My take: reconstruction is more local, hence nicer-looking details, while prediction is more long-range; but then why is it claimed to be better for low-level / fine-grained downstream tasks?]
Revealing the Dark Secrets of Masked Image Modeling
Motivation
- Masked image modeling (MIM) as pre-training
- proved effective
- but how it works remains unclear
- we compare MIM with mainstream supervised models
- through visualizations & experiments
- to uncover the key representational differences
- visualizations
- finds that MIM brings a locality inductive bias to all layers: the information flow loses less along the way?
- in contrast, supervised models attend more locally in lower layers and more globally in higher layers
- in the last layers of supervised models the attention heads are nearly identical (all global semantic info), whereas MIM still keeps diversity across attention heads in the last layers
- less diversity harms the fine-tuning performance
- experiments
- finds that MIM handles tasks with weak semantics / fine-grained tasks better than supervised models
- [Guess] compared with image-level labels, pixel-level learning objectives work better for such tasks?
Arguments
masked signal modeling
- mask a portion of input signals and tries to predict them
- a fairly classic recovery-based self-supervised task design
- used in language, vision, and speech
masked image modeling (MIM)
- achieve very high fine-tuning accuracy
- thus this paper wants a deeper understanding
we use the SimMIM framework: i.e., a ViT/Swin backbone plus a lightweight head reconstructing the image at pixel level
- random masking with large patch size
- a one-linear-layer prediction head
- predict raw RGB pixels with an L1 loss
Visualizations
attention weights have a clear meaning: how much weight each token receives
analyzed from three aspects (a combined sketch follows this list)
- averaged attention distance to measure whether it is local attention or global attention
- entropy of attention distribution to measure whether it is focused attention or broad attention: isn't this the same thing as the above?
- KL divergence of different attention heads to investigate whether attention heads attend to different tokens or similar ones
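A combined sketch of the three diagnostics on a single layer's attention map; the shapes, i.e. attn of (num_heads, N, N) with rows summing to 1 and coords of (N, 2) patch centers, are my assumptions:

```python
import torch

def avg_attention_distance(attn, coords):
    """attn: (H, N, N); coords: (N, 2) patch centers in pixels."""
    d = torch.cdist(coords, coords)                    # (N, N) spatial distances between patches
    return (attn * d).sum(-1).mean(-1)                 # per-head expected attention distance

def attention_entropy(attn, eps=1e-8):
    return -(attn * (attn + eps).log()).sum(-1).mean(-1)     # per-head: low = focused, high = broad

def head_kl_divergence(attn, eps=1e-8):
    p, q = attn.unsqueeze(1) + eps, attn.unsqueeze(0) + eps  # (H, 1, N, N) vs (1, H, N, N)
    return (p * (p / q).log()).sum(-1).mean()                # pairwise KL between heads, averaged
```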
Local Attention or Global Attention
image signals carry strong locality: neighboring pixels are naturally highly correlated, which is exactly why designs with local priors such as conv exist; whether the transformer architecture has this inductive bias is worth examining
computing averaged attention distance in each attention head of each layer
- contrastive models behave like supervised models: lower layers focus locally, higher layers focus more globally
- in MIM models the attention heads of every layer stay diverse, always covering both local & global pixels
- this suggests MIM brings a locality inductive bias [I don't fully understand this]
Focused Attention or Broad Attention
averaging the entropy of each head’s attention distribution
- contrastive models behave like supervised models: some attention heads in lower layers are very focused, while most attention heads in higher layers focus very broadly
- MIM models are diverse at every layer, each layer covering both focused attention & broad attention
Diversity on Attention Heads
check whether different attention heads attend to similar tokens
computing the KL divergence between different heads
- contrastive models behave like supervised models: diversity gradually shrinks and almost vanishes in the last few layers
- losing diversity limits the capacity of the model: it hurts the model's expressiveness
- dropping the last few layers of a supervised model keeps or even improves downstream accuracy, indicating that the last layers of a supervised pretrained model indeed hurt downstream tasks
Investigating the Representation Structures via CKA similarity
the analyses above look at different attention heads within a layer; here we look at feature maps across layers
via the CKA similarity between feature representations of different layers (see the sketch below)
- MIM behaves similarly to contrastive models: the feature representation structures of different layers are highly similar
- supervised models differ considerably from layer to layer
- when loading the pretrained weights with some layers randomly swapped before downstream training, MIM only drops slightly, while supervised models are affected much more
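A sketch of linear CKA between two layers' features on the same batch of inputs; features are assumed to be flattened to (num_samples, dim), and extracting them from the network is omitted:

```python
import torch

def linear_cka(x, y):
    """x: (N, Dx), y: (N, Dy) feature matrices from two layers, same N samples."""
    x = x - x.mean(0, keepdim=True)                    # center each feature dimension
    y = y - y.mean(0, keepdim=True)
    hsic = (y.T @ x).norm() ** 2                       # ||Y^T X||_F^2
    return hsic / ((x.T @ x).norm() * (y.T @ y).norm())
```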
Experiments
on 3 types of downstream tasks
- semantic understanding tasks: classification, on Concept Generalization (CoG) & 12-dataset (K12) & iNaturalist-18 (iNat18)
- geometric and motion tasks: pose/depth estimation, video tracking, on COCO & CrowdPose & NYUv2
- combined tasks: object detection, on COCO
Semantic Understanding Tasks
- uses three benchmarks, transferring from the ImageNet-pretrained models
- settings
- AdamW
- cosine learning rate schedule
- 100 epochs with 20 warm-up
- input 224x224
- DropPath
- for classes covered by ImageNet, supervised models beat MIM models; for uncovered classes and fine-grained classes MIM is more accurate, suggesting MIM's representations transfer better
Geometric and Motion Tasks
- mainly tests object localization ability, with little reliance on high-level semantics
- MIM outperforms across the board
Combined Task of Object Detection
- object detection on COCO
- Mask-RCNN framework
- again clearly outperforms
- it is also observed that the MIM model converges faster and better on the localization task, while the supervised model helps classification more, again suggesting MIM is better suited to geometric and motion tasks