MAE

papers

[MAE] Masked Autoencoders Are Scalable Vision Learners: Kaiming He et al.; brings BERT-style masked self-supervised pretraining to the image domain by designing an image reconstruction task over masked patches

[VideoMAE] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training: Tencent AI Lab; extends the same idea to the video domain

Masked Autoencoders Are Scalable Vision Learners

  1. Motivation

    • A self-supervised pretraining scheme aimed at improving model generalization
    • Technical approach:
      • mask & reconstruct
      • encoder-decoder architecture
        • encoder operates only on visible patches: the input patches are first randomly sampled, and only a small subset is fed to the encoder
        • lightweight decoder runs reconstruction on (masked) tokens: the encoded patches are combined with mask tokens and passed to the decoder to reconstruct the original image
    • Accuracy
      • with MAE pretraining, ViT-Huge on ImageNet-1K: 87.8%
  2. Arguments

    • The self-supervised route feeds the model far more data, as in NLP; masked autoencoding is also the classic BERT pretraining recipe, yet in practice autoencoding methods in vision lag behind NLP

      • information density: in NLP, filling in a blank requires inferring high-level semantics, whereas an image can be reconstructed fairly well from abundant low-level spatial information in neighboring patches, so the model converges without learning much high-level semantics; the paper's fix is to randomly mask a very large fraction of patches, which largely reduces redundancy
      • the decoder plays a different role between reconstructing text and images: echoing the point above, a visual decoder reconstructs pixels (low-level information) while an NLP decoder predicts missing words (high-level representations), so BERT can get away with a trivial decoder (an MLP), whereas on the image side the decoder design matters much more: it plays a key role in determining the semantic level of the learned latent representations
    • our MAE

      • asymmetric encoder-decoder
      • high-ratio masking: improves accuracy while cutting compute, so it is easy to scale up (see the back-of-envelope sketch after this list)
      • workflow

        • MAE pretraining: encode the randomly sampled visible patches, decode the encoded tokens together with mask tokens
        • downstream task: keep only the encoder for recognition tasks
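
A quick back-of-envelope check on the compute saving from high-ratio masking, assuming the common ViT setup of 224×224 inputs and 16×16 patches with 75% masking:

```python
num_patches = (224 // 16) ** 2        # 196 tokens for the full image
visible = int(num_patches * 0.25)     # 49 tokens actually seen by the encoder
# Self-attention cost grows roughly with N^2, so the encoder's attention
# FLOPs drop by about (196 / 49)^2 = 16x during pretraining.
print(num_patches, visible, (num_patches / visible) ** 2)
```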

  3. Method

    • masking
      • split the image into non-overlapping patches
      • random sampling: sample patches following a uniform distribution
      • high masking ratio: the task has to be built so the gaps cannot be filled by simply extrapolating low-level neighboring information; the model must dig into high-level semantics to infer what is missing
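
A minimal PyTorch sketch of the random masking step, following the shuffle-by-noise trick used in the official MAE repo (function and variable names here are my own):

```python
import torch

def random_masking(x, mask_ratio=0.75):
    """Per-sample uniform random masking.
    x: [B, N, D] patch embeddings.
    Returns visible tokens, a binary mask (1 = masked), and restore indices."""
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))             # e.g. 196 -> 49 visible tokens

    noise = torch.rand(B, N, device=x.device)        # uniform noise per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # ascending noise = random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # undoes the shuffle (used by the decoder)

    ids_keep = ids_shuffle[:, :len_keep]             # first len_keep shuffled positions survive
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))

    mask = torch.ones(B, N, device=x.device)         # 1 = masked, 0 = kept
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # back to the original patch order
    return x_visible, mask, ids_restore
```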
    • MAE encoder
      • ViT
      • given patches: linear projection + positional embedding (PE)
      • operates on a small visible subset (25%) of the full set
      • serves the recognition task downstream
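
A toy encoder sketch (not ViT-Huge; the class token and ViT's pre-norm details are omitted) showing the order of operations: linear patch projection, positional embedding, then Transformer blocks over the visible tokens only:

```python
import torch
import torch.nn as nn

class TinyMAEEncoder(nn.Module):
    """Hypothetical, scaled-down MAE-style encoder."""
    def __init__(self, img_size=224, patch=16, dim=192, depth=2, heads=3):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)    # linear proj per patch
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, imgs, ids_keep):
        # ids_keep: [B, len_keep] indices of visible patches from the random-masking step
        x = self.proj(imgs).flatten(2).transpose(1, 2)   # [B, N, dim] patch tokens
        x = x + self.pos_embed                           # add PE before dropping patches
        x = torch.gather(x, 1, ids_keep.unsqueeze(-1).repeat(1, 1, x.shape[-1]))
        return self.blocks(x)                            # attention over visible tokens only

# usage sketch
enc = TinyMAEEncoder()
ids_keep = torch.randperm(196)[:49].unsqueeze(0).repeat(2, 1)
latent = enc(torch.randn(2, 3, 224, 224), ids_keep)      # -> [2, 49, 192]
```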
    • MAE decoder
      • a series of Transformer blocks: much smaller than the encoder, narrower and shallower; per-token compute is <10% of the encoder's
      • given the full set of tokens
        • mask tokens: a shared, learned vector representing each missing patch
        • add PE: so different mask tokens can be distinguished by position
      • serves the reconstruction task
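
A matching lightweight decoder sketch (toy sizes, hypothetical names): project the encoder output to a narrower width, pad with copies of the shared learned mask token, unshuffle back to the original patch order, add positional embeddings, and predict per-patch pixels:

```python
import torch
import torch.nn as nn

class TinyMAEDecoder(nn.Module):
    """Hypothetical, scaled-down MAE-style decoder (much narrower/shallower than the encoder)."""
    def __init__(self, num_patches=196, enc_dim=192, dim=96, depth=1, heads=3, patch=16):
        super().__init__()
        self.embed = nn.Linear(enc_dim, dim)                    # narrower than the encoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # shared, learned vector
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.pred = nn.Linear(dim, patch * patch * 3)           # back to per-patch pixel values

    def forward(self, latent, ids_restore):
        # latent: [B, len_keep, enc_dim] from the encoder; ids_restore: [B, N] from masking
        x = self.embed(latent)
        B, N = ids_restore.shape
        mask_tokens = self.mask_token.repeat(B, N - x.shape[1], 1)
        x = torch.cat([x, mask_tokens], dim=1)                  # full-length sequence, still shuffled
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).repeat(1, 1, x.shape[-1]))  # unshuffle
        x = x + self.pos_embed                                  # PE tells mask tokens apart by position
        return self.pred(self.blocks(x))                        # [B, N, patch*patch*3]
```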
    • Reconstruction target
      • the decoder outputs the full set of reconstructed tokens: [b, N, D]
        • N: patch sequence length
        • D: number of pixel values per patch
      • reshape: [b, H, W, C]
      • reconstruction loss, per-pixel MSE: computed only on the masked patches
      • [QUESTION, not fully understood yet] there is also a variant, called norm_pix_loss in the official code, claimed to be better for representation learning; it uses each patch's normalized pixels as the target:
        • for each masked patch, compute its mean & std,
        • then normalize the patch with them,
        • and use this normalized patch as the reconstruction target
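
A sketch of the loss in the spirit of the official forward_loss (the name mae_loss and the epsilon are my choices): per-pixel MSE averaged only over masked patches; with norm_pix_loss each target patch is first normalized by its own mean and std, so the model reconstructs locally normalized pixels rather than raw intensities:

```python
import torch

def mae_loss(pred, target_patches, mask, norm_pix_loss=True):
    """pred, target_patches: [B, N, patch*patch*3]; mask: [B, N] with 1 = masked."""
    target = target_patches
    if norm_pix_loss:
        mean = target.mean(dim=-1, keepdim=True)     # per-patch statistics
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + 1e-6) ** 0.5
    loss = ((pred - target) ** 2).mean(dim=-1)       # mean over pixels -> [B, N]
    return (loss * mask).sum() / mask.sum()          # average over masked patches only
```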