FAN: Fully Attentional Networks

paper: https://arxiv.org/abs/2204.12451

repo: https://github.com/NVlabs/FAN

Understanding The Robustness in Vision Transformers

  1. Motivation
    • ViTs exhibit strong robustness (they retain their feature-extraction ability when images suffer various corruptions), but this behavior lacks a systematic understanding
    • this paper
      • examines the role of self-attention
      • further proposes a family of fully attentional networks (FANs) that strengthen this capability
    • verifies state-of-the-art results on
      • ImageNet-1k: 87.1% accuracy
      • downstream tasks: semantic segmentation and object detection
  2. Arguments
    • ViT's modeling of non-local relations is widely regarded as a key factor in its robustness against various corruptions
    • however, ConvNeXt reaches comparable accuracy with a purely convolutional network, raising interest in the actual role of self-attention in robust generalization
    • our approach
      • first, the observations:
        • while performing classification, the network naturally segments the target object: self-attention promotes mid-level representations
        • applying spectral clustering to each ViT layer's output tokens reveals a set of significant eigenvalues (see the first sketch after this list)
        • the number of significant eigenvalues correlates with input perturbation: both drop markedly in the mid-level layers, which preserves robustness and indicates a symbiosis between grouping and robustness
      • next, this grouping phenomenon is explored further
        • self-attention is found to behave like an information bottleneck (IB); I do not fully understand this part yet
      • finally, the modification to the transformer architecture
        • in the original ViT block, MSA (multi-head self-attention) extracts diverse features, which an MLP then aggregates
        • this paper adds channel attention at the aggregation step (see the second sketch below)
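
A minimal sketch of the spectral probe from the observations above: count the significant eigenvalues of a token-affinity matrix for one layer. The cosine-similarity affinity, the relative threshold, and the name `significant_eigenvalues` are my assumptions for illustration, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def significant_eigenvalues(tokens: torch.Tensor, thresh: float = 0.1) -> int:
    """Count the large eigenvalues of a token-affinity matrix.

    tokens: (N, D) output tokens of one transformer layer.
    Assumption: affinity = cosine similarity between tokens, and an
    eigenvalue is "significant" if it exceeds thresh * largest eigenvalue.
    """
    t = F.normalize(tokens, dim=-1)          # unit-norm tokens
    affinity = t @ t.T                       # (N, N) symmetric Gram matrix
    eigvals = torch.linalg.eigvalsh(affinity).flip(0)  # real, descending
    return int((eigvals > thresh * eigvals[0]).sum())
```

Running this per layer on clean versus corrupted inputs is one way to check the correlation noted above: fewer significant eigenvalues in mid-level layers would indicate stronger token grouping.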
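
And a sketch of the architectural change: a transformer block where a channel-attention gate modulates the features before the MLP aggregation. This is not the exact FAN block from NVlabs/FAN; the squeeze-excitation-style gate is a simplified stand-in for the paper's channel self-attention, and the module names are hypothetical.

```python
import torch
import torch.nn as nn

class FANBlockSketch(nn.Module):
    """ViT-style block with channel attention on the fusion path (sketch)."""

    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Channel attention: per-channel gates from pooled token statistics
        # (squeeze-excitation style stand-in for FAN's channel self-attention).
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(),
            nn.Linear(dim // 4, dim), nn.Sigmoid(),
        )
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) tokens
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]      # spatial MSA
        h = self.norm2(x)
        gate = self.channel_gate(h.mean(dim=1, keepdim=True))  # (B, 1, dim)
        x = x + self.mlp(h * gate)          # channel-attended MLP fusion
        return x
```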