paper: https://arxiv.org/abs/2204.12451
repo: https://github.com/NVlabs/FAN
Understanding The Robustness in Vision Transformers
- Motivation
- ViTs exhibit strong robustness (they keep extracting useful features even when the input image suffers various corruptions), but a systematic understanding of this is still missing
- this paper
- examines the role of self-attention in this robustness
- further proposes a family of fully attentional networks (FANs) that strengthen this capability
- verified state-of-the-art results on
- ImageNet-1k: 87.1% accuracy
- downstream tasks: semantic segmentation and object detection
- Key points
- ViT's modeling of non-local relations is widely regarded as a key ingredient of its robustness against various corruptions
- however, ConvNeXt reaches comparable accuracy with a purely convolutional network, raising interest in the actual role of self-attention in robust generalization
- our approach
- First, the observations:
- while trained only for classification, the network naturally produces a segmentation of the object: self-attention promotes mid-level representations
- applying spectral clustering to the output tokens of each ViT layer reveals a small set of significant (large) eigenvalues
- the number of significant eigenvalues is correlated with the input perturbation: both drop markedly in the mid-level layers, which preserves robustness and indicates a symbiosis between grouping and robustness (see the spectral-analysis sketch after this list)
- Then, digging further into this grouping phenomenon
- self-attention is found to behave like an information bottleneck (IB); I don't fully understand this part yet
- Finally, the modification to the transformer architecture
- in the original ViT block, multi-head self-attention (MSA) extracts different features and an MLP then fuses them
- this paper adds a channel attention to that fusion step (see the FAN-block sketch at the end of these notes)
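
The sketch below illustrates the kind of spectral analysis described above: take one layer's output tokens, build a token affinity matrix, and count its significant (large) eigenvalues as a rough proxy for the number of emergent groups. This is my own minimal version, not the paper's exact protocol; the cosine affinity and `rel_thresh` are illustrative choices.

```python
# Minimal sketch (not the paper's exact protocol): count "significant"
# eigenvalues of a token affinity matrix for one ViT layer's output tokens.
import torch
import torch.nn.functional as F

def count_significant_eigenvalues(tokens: torch.Tensor, rel_thresh: float = 0.1) -> int:
    """tokens: (N, D) patch tokens from one layer (CLS token excluded)."""
    x = F.normalize(tokens, dim=-1)                # unit-norm token features
    affinity = (x @ x.t()).clamp(min=0.0)          # (N, N) cosine affinity
    eigvals = torch.linalg.eigvalsh(affinity)      # real eigenvalues, ascending
    # Count eigenvalues that are large relative to the dominant one.
    return int((eigvals > rel_thresh * eigvals[-1]).sum())

# Usage idea: compare clean vs. corrupted inputs, layer by layer, and watch
# how the count changes in the mid-level layers.
# counts = [count_significant_eigenvalues(t) for t in per_layer_tokens]
```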
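And a minimal sketch of the architectural change: a transformer block where the MSA output is fused through an MLP that is re-weighted per channel. The SE-style channel gate below is only an illustration of the "add channel attention to the fusion" idea; the official FAN block uses its own channel self-attention module (see the repo).

```python
# Hypothetical FAN-style block: MSA for token mixing, then an MLP fusion
# modulated by a simple channel gate. Illustrative only, not the FAN code.
import torch
import torch.nn as nn

class FANStyleBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )
        # Channel attention: per-channel weights computed from pooled tokens.
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) tokens. Standard MSA with residual connection.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Fusion: MLP output re-weighted channel-wise before the residual add.
        h = self.norm2(x)
        gate = self.channel_gate(h.mean(dim=1, keepdim=True))  # (B, 1, D)
        x = x + gate * self.mlp(h)
        return x
```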