paper: https://arxiv.org/abs/2204.12451
repo: https://github.com/NVlabs/FAN
Understanding The Robustness in Vision Transformers
- Motivation
- ViTs exhibit strong robustness (they keep extracting useful features even when the input image suffers various corruptions), but a systematic understanding of this is still missing
- this paper
- examines the role of self-attention in this robustness
- further proposes a family of fully attentional networks (FANs) that strengthen this capability
- verified state-of-the-art results on
- ImageNet-1k: 87.1% accuracy
- downstream tasks: semantic segmentation and object detection
- Key points
- ViT's modeling of non-local relations is widely regarded as a key ingredient of its robustness against various corruptions
- however, ConvNeXt reaches comparable accuracy with a purely convolutional network, raising interest in the actual role of self-attention in robust generalization
- our approach
- First, the observations:
- while trained only for classification, the network naturally produces a segmentation of the object: self-attention promotes mid-level representations
- applying spectral clustering to the output tokens of each ViT layer reveals a small set of significant (large) eigenvalues
- the number of significant eigenvalues is correlated with the input perturbation: both drop markedly in the mid-level layers, which preserves robustness and indicates a symbiosis between grouping and robustness (see the spectral-analysis sketch after this list)
- Then, digging further into this grouping phenomenon
- self-attention is found to behave like an information bottleneck (IB); I don't fully understand this part yet
- Finally, the modification to the transformer architecture
- in the original ViT block, multi-head self-attention (MSA) extracts different features and an MLP then fuses them
- this paper adds a channel attention to that fusion step (see the FAN-block sketch at the end of these notes)
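
The sketch below illustrates the kind of spectral analysis described above: take one layer's output tokens, build a token affinity matrix, and count its significant (large) eigenvalues as a rough proxy for the number of emergent groups. This is my own minimal version, not the paper's exact protocol; the cosine affinity and `rel_thresh` are illustrative choices.

```python
# Minimal sketch (not the paper's exact protocol): count "significant"
# eigenvalues of a token affinity matrix for one ViT layer's output tokens.
import torch
import torch.nn.functional as F

def count_significant_eigenvalues(tokens: torch.Tensor, rel_thresh: float = 0.1) -> int:
    """tokens: (N, D) patch tokens from one layer (CLS token excluded)."""
    x = F.normalize(tokens, dim=-1)                # unit-norm token features
    affinity = (x @ x.t()).clamp(min=0.0)          # (N, N) cosine affinity
    eigvals = torch.linalg.eigvalsh(affinity)      # real eigenvalues, ascending
    # Count eigenvalues that are large relative to the dominant one.
    return int((eigvals > rel_thresh * eigvals[-1]).sum())

# Usage idea: compare clean vs. corrupted inputs, layer by layer, and watch
# how the count changes in the mid-level layers.
# counts = [count_significant_eigenvalues(t) for t in per_layer_tokens]
```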
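And a minimal sketch of the architectural change: a transformer block where the MSA output is fused through an MLP that is re-weighted per channel. The SE-style channel gate below is only an illustration of the "add channel attention to the fusion" idea; the official FAN block uses its own channel self-attention module (see the repo).

```python
# Hypothetical FAN-style block: MSA for token mixing, then an MLP fusion
# modulated by a simple channel gate. Illustrative only, not the FAN code.
import torch
import torch.nn as nn

class FANStyleBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )
        # Channel attention: per-channel weights computed from pooled tokens.
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) tokens. Standard MSA with residual connection.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Fusion: MLP output re-weighted channel-wise before the residual add.
        h = self.norm2(x)
        gate = self.channel_gate(h.mean(dim=1, keepdim=True))  # (B, 1, D)
        x = x + gate * self.mlp(h)
        return x
```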