startup
reference1:https://mp.weixin.qq.com/s/Rm899vLhmZ5eCjuy6mW_HA
reference2:https://zhuanlan.zhihu.com/p/308301901
NLP & RNN
- text involves contextual relationships
- RNN processes the sequence serially in time, building relations between earlier and later words
- drawbacks: fails on very long-range dependencies; hard to parallelize
 
NLP & CNN
- text is a 1-D time series
- 1D CNN allows parallel computation
- drawback: CNN excels at local information; kernel size has to be balanced against long-distance dependencies
 
NLP & transformer
- for each incoming word, build a weight mapping onto the vocabulary; the weights represent attention
- self-attention mechanism
- builds long-distance dependencies

put in CV
- insert similar self-attention layers into CNNs
- or abandon convolution layers entirely and use Transformers
 
RNN & LSTM & GRU cell
Standard elements: input x, output y, hidden state h
RNN
- an RNN cell receives the current input $x_t$ and the previous hidden state $h_{t-1}$, and produces a new hidden state $h_t$, which is also the current output $y_t$
- formulation: $y_t, h_t = f(x_t, h_{t-1})$
- same parameters for each time step: one cell whose weights are shared across all time steps
One problem: vanishing/exploding gradients (see the sketch below)
- consider the simplified hidden-state chain $h_t = \theta^t h_0$: a forward pass over a sequence multiplies the same weights over and over again
- the tanh nonlinearity also saturates, further shrinking (or, with large weights, blowing up) the gradients
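A minimal NumPy sketch of unrolling a vanilla RNN cell (sizes and names are purely illustrative, not from these notes); it makes the "same weights multiplied over and over again" point concrete:

```python
import numpy as np

d = 8                               # hidden size (illustrative)
W_x = np.random.randn(d, d) * 0.1   # input-to-hidden weights
W_h = np.random.randn(d, d) * 0.1   # hidden-to-hidden weights, reused every step

def rnn_cell(x_t, h_prev):
    # h_t = tanh(W_x x_t + W_h h_{t-1}); y_t is simply h_t here
    return np.tanh(W_x @ x_t + W_h @ h_prev)

h = np.zeros(d)
for x_t in np.random.randn(20, d):  # a 20-step sequence
    h = rnn_cell(x_t, h)

# Backprop through time multiplies by W_h^T (times tanh' factors) once per step,
# so gradients scale roughly like the spectral radius of W_h raised to T:
# radius < 1 -> vanishing, radius > 1 -> exploding.
print(np.linalg.norm(np.linalg.matrix_power(W_h, 20)))
```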
 

LSTM
key ingredients
- cell: adds a separate cell-state pathway, which improves gradient flow
- gate: gate structures filter the information being carried, improving long-distance association
 

So the LSTM has two recurrent states, the cell state $c_t$ and the hidden state $h_t$; the output $y_t$ is still $h_t$.
GRU
A variant of the LSTM; still gate-based, with a simpler structure and fewer parameters than LSTM, and reportedly easier to train (see the comparison sketch below).
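A small sketch, assuming TensorFlow/Keras is available (the layer names are standard Keras; the sizes are illustrative), that makes the parameter-count difference between the three cells visible:

```python
import tensorflow as tf

inp = tf.keras.Input(shape=(None, 128))   # (time, features)
for layer in (tf.keras.layers.SimpleRNN(256),
              tf.keras.layers.LSTM(256),
              tf.keras.layers.GRU(256)):
    model = tf.keras.Model(inp, layer(inp))
    # LSTM carries 4 gate weight sets, GRU 3, the plain RNN 1,
    # so the counts scale roughly 4 : 3 : 1
    print(f"{layer.__class__.__name__:10s} {model.count_params()}")
```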

papers
[a homepage listing many papers] https://github.com/dk-liang/Awesome-Visual-Transformer
[classics]
 * [Seq2Seq 2014] Sequence to Sequence Learning with Neural Networks, Google; the earliest encoder-decoder of stacked LSTMs for machine translation
 * [self-attention/Transformer 2017] Transformer: Attention Is All You Need, Google
 * [bert 2019] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Google, NLP; input is a single sentence / packed sentences, a Transformer encoder extracts bidirectional cross-sentence representations, and the first output logit is used for classification
[surveys]
 * [survey 2020] Efficient Transformers: A Survey, Google
 * [survey 2021] Transformers in Vision: A Survey, Dubai
 * [survey 2021] A Survey on Visual Transformer, Huawei
[classification]
 * [ViT 2020] AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE, Google; classification; replaces the CNN with a transformer encoder plus a classification head, each feature patch is one input embedding and the channel dim is the vector dim; essentially BERT with the input sequence replaced by patches; later improvements include DeiT and LV-ViT
 * [BotNet 2021] Bottleneck Transformers for Visual Recognition, Google; replaces the last few stages of a CNN backbone with MSA
 * [CvT 2021] CvT: Introducing Convolutions to Vision Transformers, Microsoft
 * [Swin 2021] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, Microsoft
 * [PVT 2021] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions; multi-scale features, like Swin
[detection]
 * [DETR 2020] DETR: End-to-End Object Detection with Transformers, Facebook; object detection; CNN + transformer (encoder-decoder) + prediction heads, each feature pixel is one input embedding and the channel dim is the vector dim
 * [Deformable DETR]
 * [Anchor DETR]
 * see the separate note "det-transformers"
[segmentation]
 * [SETR] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, Fudan; thin: essentially an FCN with the backbone swapped for a transformer
[Unet+Transformer]:
 * [UNETR 2021] UNETR: Transformers for 3D Medical Image Segmentation, NVIDIA; uses a transformer encoder directly as the UNet encoder
 * [TransUNet 2021] TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation; adds transformer blocks inside the encoder stream
 * [TransFuse 2021] TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation; bifusion of CNN features and Transformer features
 * see the separate note "seg-transformers"
Sequence to Sequence
[a keras tutorial](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)
general case
- extract the information of the entire input sequence
- then start generating the output sequence
 
seq2seq model workflow
- a (stacking of) RNN layer acts as the encoder
  - processes the input sequence
  - returns its own internal state: the RNN outputs are discarded, only the internal states are kept
  - what the encoder produces is called the Context Vector
- a (stacking of) RNN layer acts as the decoder
  - given previous characters of the target sequence
  - it is trained to predict the next characters of the target sequence
  - teacher forcing:
    - the input is the target sequence, and the training target is the same target sequence offset by one timestep
    - teacher forcing can also be dropped: feed the prediction back in as the next step's input
  - homogeneity of the Context Vector: at every step the decoder reads the same Context Vector as its initial_state
  - at inference time
    - first obtain the input sequence's state vectors
    - repeat
      - feed the decoder the input states and the output sequence so far (beginning with a start token)
      - take the next character from the prediction
      - append the character to the output sequence
    - until: an end character is produced / the character limit is hit
 
 
 implementation
https://github.com/AmberzzZZ/transformer/blob/master/seq2seq.py
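A minimal sketch of the training-time wiring described above, assuming Keras (layer sizes, vocab size and variable names are illustrative, not taken from the linked repo):

```python
from tensorflow import keras
from tensorflow.keras import layers

num_tokens, latent_dim = 100, 256          # illustrative sizes

# encoder: keep only the internal states (the Context Vector)
enc_inputs = keras.Input(shape=(None, num_tokens))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_inputs)

# decoder: teacher forcing -- it consumes the target sequence and is trained
# to emit the same sequence offset by one timestep
dec_inputs = keras.Input(shape=(None, num_tokens))
dec_outputs, _, _ = layers.LSTM(latent_dim, return_sequences=True,
                                return_state=True)(dec_inputs,
                                                   initial_state=[state_h, state_c])
dec_outputs = layers.Dense(num_tokens, activation="softmax")(dec_outputs)

model = keras.Model([enc_inputs, dec_inputs], dec_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
# model.fit([encoder_in, decoder_in], decoder_target_shifted_by_one, ...)
```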
one step further
- directions for improvement
  - bi-directional RNN: crudely reversing the sequence is already an effective gain
  - attention: essentially a weighting of the encoder outputs (the Context Vector)
  - ConvS2S: not read yet
- all of these mainly target RNN's weaknesses
 
Motivation
- present a general end-to-end sequence learning approach
  - multi-layered LSTMs
  - encode the input seq into a fixed-dim vector
  - decode the target seq from the fixed-dim vector
- the LSTM did not have difficulty on long sentences
- reversing the order of the words in the source sentence improved performance

Method
standard RNN
- given a sequence $(x_1, x_2, …, x_T)$
- iterating $h_t = f(x_t, h_{t-1})$, $y_t = g(h_t)$
- if the input and output lengths are known and fixed in advance, a single RNN can already model a seq2seq problem
- if the lengths differ and follow a more complex relation, two RNNs are needed: one maps the input seq to a fixed-size vector, the other maps that vector to the output seq, but this runs into the long-term-dependency issue
LSTM
- the LSTM carries information about the entire sequence all the way through, as in the figure above
 
our actual model
- use two LSTMs: the encoder-decoder split also increases model capacity
- an LSTM with four layers: deeper
- reverse the input sequence: the true start of the source sentence ends up closer to the start of the translation, which "makes it easy for SGD to establish communication"
 
training details
- LSTM: 4 layers, 1000 cells
- word embeddings: 1000-dim (input vocab 160,000, output vocab 80,000)
- naive softmax over the output vocabulary
- uniform initialization in (-0.08, 0.08)
- SGD, lr = 0.7, halved every half epoch later in training, 7.5 epochs total
- hard constraint on the gradient norm: the gradient is rescaled whenever its norm exceeds a threshold
- all sentences in a minibatch are roughly of the same length
 
Transformer: Attention Is All You Need
Motivation
- sequence-to-sequence models
  - encoder + decoder
  - RNN / CNN + an attention path
- we propose the Transformer
  - based solely on attention mechanisms
  - more parallelizable, less training time
 
 
Key points
- sequence modeling
  - mainstream: RNN, LSTM, gated variants
  - they align the positions to computation time steps
  - the sequential nature inherently prevents parallelization
- attention mechanisms act as an integral part
  - in previous work they were used in conjunction with an RNN
- for parallelization
  - some methods use CNNs as the basic building block
  - but these find it difficult to learn dependencies between distant positions
- we propose the Transformer
  - relies entirely on an attention mechanism
  - draws global dependencies
- self-attention
  - relates different positions of a single sequence
  - to generate an overall representation of the sequence
 
 
Method
encoder-decoder
- encoder: doc2emb
  - given an input sequence of symbol representations $(x_1, x_2, …, x_n)$
  - map it to a sequence of continuous representations $(z_1, z_2, …, z_n)$ (embeddings)
- decoder: hidden layers
  - given the embeddings z
  - generate an output sequence $(y_1, y_2, …, y_m)$ one element at a time
  - the previously generated symbols serve as additional input when computing the current time step
 
 
 Transformer Architecture
The Transformer uses, for both the encoder and the decoder, stacked self-attention and point-wise fully-connected layers.

encoder
- N = 6 identical layers
- each layer has 2 sub-layers
  - multi-head self-attention mechanism
  - position-wise fully connected feed-forward layer
- residual connections
  - around each of the two sub-layers independently
  - add & layer norm
- $d_{model} = 512$
 
decoder
- N = 6 identical layers
- 3 sub-layers
  - [new] masked multi-head self-attention: bakes in the prior that an output embedding may only be computed from embeddings at earlier time steps
  - multi-head attention over the encoder output (encoder-decoder attention)
  - position-wise fully connected feed-forward layer
- residual connections as in the encoder
 
attention
- reference: https://bbs.cvmart.net/articles/4032
- step 1: project the embeddings into query-key-value pairs (with $A \in R^{d\times N}$ the embedding matrix, one column per token)
  - $Q = W_Q A$, $W_Q \in R^{d\times d}$
  - $K = W_K A$, $W_K \in R^{d\times d}$
  - $V = W_V A$, $W_V \in R^{d\times d}$
- step 2: scaled dot-product attention
  - $A_{att} = softmax(K^T Q/\sqrt{d}) \in R^{N\times N}$
  - $B = V A_{att} \in R^{d\times N}$
- multi-head attention (see the sketch after this list)
  - steps 1 & 2 above perform a single attention function
  - in practice we use multiple projections to get h sets $\{Q,K,V\}^h$, run the attention computations in parallel, and obtain h outputs $\{B \in R^{d_v\times N}\}^h$
  - concat & project
    - concat along the feature dim: $B\in R^{(h d_v)\times N}$
    - linear projection: $out = W^O B$, $W^O \in R^{d\times (h d_v)}$
  - h = 8
  - per-head dims $d_k = d_v = d_{model}/h = 64$
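A compact NumPy sketch of the two steps above (a single head, shapes following the column-per-token convention used in these notes; all names are illustrative):

```python
import numpy as np

d, N = 512, 10                        # model dim, sequence length (illustrative)
A = np.random.randn(d, N)             # input embeddings, one column per token
W_Q, W_K, W_V = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))

def softmax(x, axis=0):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# step 1: project embeddings to queries / keys / values
Q, K, V = W_Q @ A, W_K @ A, W_V @ A

# step 2: scaled dot-product attention
A_att = softmax(K.T @ Q / np.sqrt(d), axis=0)   # (N, N): column j holds weights for query j
B = V @ A_att                                    # (d, N) attended output

print(B.shape)   # (512, 10)
```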
 
 
positional encoding (see the sketch below)
Mathematically it amounts to a hand-crafted mapping matrix $W^P$ applied to a one-hot position vector $p$; denoting the result PE:
$PE(pos, 2i) = \sin(pos/10000^{2i/d_{model}})$
$PE(pos, 2i+1) = \cos(pos/10000^{2i/d_{model}})$
- pos is the position in the sequence x
- 2i and 2i+1 are indices into the embedding a
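A short NumPy sketch of the sinusoidal encoding above (purely illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i/d_model))
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

print(positional_encoding(50, 512).shape)   # (50, 512)
```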
 
point-wise feed-forward network
- fc-ReLU-fc
 - dim_fc=2048
 - dim_in & dim_out = 512
 
How it runs
The encoder can be computed in parallel:
- the input is the sequence embedding plus the positional embedding: $A\in R^{d\times N}$
- it passes through the repeated blocks
- the output is another sequence: $B\in R^{d\times N}$
- self-attention: Q, K and V all come from the same thing
- the encoder essentially resolves self-attention:
- a parallel, global, pairwise comparison done in one shot
  - an RNN has to go step by step
  - a CNN has to stack layers

The decoder is parallel at training time and runs step by step at inference time.
Its inputs are the encoder output and the decoder's own output embedding from the previous time step; its output is the probability of the word at the current position.
The first attention layer is self-attention over the output embeddings: to decode sequentially like an RNN, each time step may only use the outputs at earlier positions as input, hence masking (see the sketch after this block).
- given the target sequence "<start> I have a cat", 5 elements
- the mask is a $5\times 5$ lower-triangular matrix
The output embeddings are transformed into the Q, K, V matrices,
attention is still computed as $A_{att}=K^TQ$,
but some of these attention entries are illegal: a query may only use keys at positions no later than its own, so the mask is applied (illegal entries are set to $-\infty$, i.e. zero weight after the softmax),
then softmax: $A_{att}=softmax(A_{att})$,
then the weighted sum of values: $B = VA_{att}$,
then concat & projection.

- the second attention layer attends between the input and output sequences: its keys and values come from the encoder, its query comes from the previous decoder block's output
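A tiny NumPy sketch of the causal mask for a 5-token target sequence (illustrative only; here rows index queries and columns index keys, so the legal region is the lower triangle, as stated above):

```python
import numpy as np

N = 5                                   # e.g. "<start> I have a cat"
scores = np.random.randn(N, N)          # row = query position, column = key position

# lower-triangular mask: the query at position i may only attend to keys at positions <= i
mask = np.tril(np.ones((N, N), dtype=bool))
scores = np.where(mask, scores, -np.inf)            # illegal entries -> -inf

weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)       # softmax over keys
print(np.round(weights, 2))                         # strict upper triangle is exactly 0
```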
 

why self-attention
- dimensions of comparison
  - total computational complexity per layer
  - amount of computation that can be parallelized
  - path length between long-range dependencies
- given an input sequence of length N and dim $d_{in}$, and an output sequence of dim $d_{out}$
  - an RNN needs N sequential applications of $W\in R^{d_{in}\times d_{out}}$
  - a CNN needs a stack of N/k layers of $d_{in}\times d_{out}$ sequence operations with kernels $W\in R^{k\times k}$, generally k times the cost of the RNN
 
 
 training
- optimizer: $Adam(lr, \beta_1=0.9, \beta_2=0.98, \epsilon=10^{-9})$
- lr schedule: warm up for 4000 steps, then decay (see the sketch below)
dropout
- residual dropout: dropout applied to each sub-layer's output before it is added to the residual and normalized (not stochastic depth)
- dropout applied to the sum of the embeddings & PE, for both the encoder and the decoder
- drop_rate = 0.1
label smoothing: smooth_factor = 0.1
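The warmup-then-decay rule from the paper as a small sketch (the formula is the one given in "Attention Is All You Need"):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (100, 4000, 40000):
    print(s, round(transformer_lr(s), 6))
```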
Experiments
- A: vary the number of attention heads; both too many and too few hurt
- B: reduce the dim of the attention keys; hurts
- C & D: bigger models + dropout help
- E: learnable vs. sinusoidal PE: nearly identical
The last row is the big model's configuration.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Motivation
- BERT: Bidirectional Encoder Representations from Transformers
  - Bidirectional
  - Encoder
  - Representations
  - Transformers
- workflow
  - pretrain bidirectional representations from unlabeled text
  - then tune with one additional output layer to obtain the task model
- SOTA
  - GLUE score 80.5%
 
 
Key points
- pretraining is effective for NLP tasks
  - feature-based method: use task-specific architectures and only the pretrained model's features
  - fine-tuning method: fine-tune the pretrained model directly
  - both approaches share the same pre-training objective: use unidirectional language models to learn general language representations
  - they reduce the need for heavily-engineered task-specific architectures
- limitations of current methods
  - unidirectional:
    - limits the choice of architectures
    - in fact both the left and right context of a token matter, not just the left context
  - naively concatenating two independent L2R and R2L models (a biRNN) is
    - independent
    - only a shallow concatenation
 
 
- BERT
  - masked language model: predict masked-out words within a sequence
  - next sentence prediction: trains text-pair representations
 
 
Method
two steps
- pre-training
  - unlabeled data
  - different pre-training tasks
- fine-tuning
  - labeled data of the downstream tasks
  - fine-tune all the params
The models of the two stages differ only in the output layer.
- e.g. a question-answering model
- in the pre-training stage the input is two sentences, starting with a CLS symbol and with a SEP symbol separating the two sentences
- in the fine-tuning stage the inputs are the question and the answer respectively [what exactly is the output?]
 

 architecture
a multi-layer bidirectional Transformer encoder
- number of transformer blocks L
- hidden size H
- number of self-attention heads A
- FFN dim 4H

BERT base: L=12, H=768, A=12
BERT large: L=24, H=1024, A=16
input/output representations
- a single sentence / two packed-up sentences:
  - packed sentences are joined with the special token SEP
  - segment embedding: additionally, a learned embedding is added to every token indicating which sentence it belongs to
- use WordPiece embeddings with a 30,000-token vocabulary
- the first token of the input sequence is always the special symbol CLS; its final-state output serves as the representation of the whole sequence, used for classification tasks
Overall, the network's input representation is obtained by taking the token embeddings (with the special symbols inserted) and adding the segment embeddings (SE) and position embeddings (PE).

 
pre-training
- two unsupervised tasks
  - Masked LM (MLM), see the masking sketch after this list
    - mask some percentage of the input tokens at random: 15%
      - 80% of the time replace with the MASK token
      - 10% of the time replace with a random token
      - 10% of the time leave unchanged
    - then predict those masked tokens
    - the final hidden states corresponding to the masked tokens are fed into a softmax
    - compared with traditional left-to-right / right-to-left / concat models
      - both the preceding and the following context are used
      - only the masked tokens are predicted, rather than reconstructing the whole sentence
  - Next Sentence Prediction (NSP)
    - for relationships between sentences:
      - e.g. question & answer, sentence entailment
      - not directly captured by language modeling, which intuitively learns token-level relationships
    - a binarized next-sentence prediction task
      - choose sentences A & B:
        - 50% of the time B really follows A (IsNext)
        - 50% of the time B is random (NotNext)
      - this forms a binary classification problem, again predicted from the hidden state C of the CLS token
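A minimal sketch of the 15% / 80-10-10 masking rule described above (token ids, MASK_ID and the vocabulary size are made up for illustration):

```python
import random

MASK_ID = 103          # illustrative special-token id
VOCAB_SIZE = 30000

def mask_for_mlm(token_ids, mask_prob=0.15):
    """Return (possibly corrupted) input ids and the prediction targets (-1 = ignore)."""
    inputs, targets = list(token_ids), [-1] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue
        targets[i] = tok                    # this position will be predicted
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_ID                        # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(VOCAB_SIZE)   # 10%: random token
        # else: 10% keep the original token unchanged
    return inputs, targets

print(mask_for_mlm([2023, 2003, 1037, 4937, 1012]))
```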
fine-tuning
- BERT accommodates many downstream tasks: single texts or text pairs
- just assemble the input accordingly and fine-tune end to end
- the output is still predicted from the hidden state C of the CLS token, followed by a classification head
 
A Survey on Visual Transformer
Motivation
- provide a comprehensive overview of the recent advances in visual transformers
- discuss potential directions for further improvement
development timeline (figure)

Categorized by application scenario
- backbone: classification
- high/mid-level vision: usually semantics-related; detection / segmentation / pose estimation
- low-level vision: operates on the image itself; super-resolution / image generation; few applications so far
- video processing

revisiting transformer
key concepts: sentence, embedding, positional encoding, encoder, decoder, self-attention layer, encoder-decoder attention layer, multi-head attention, feed-forward neural network

self-attention layer
- the input vector is transformed into 3 vectors
  - the input vector is embedding + PE(pos, i): pos is the word's position in the sequence, i is the element's index within the embedding vector
  - query vec q
  - key vec k
  - value vec v
  - $d_q = d_k = d_v = d_{model} = 512$
- then calculate: $Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V$
- encoder-decoder attention layer
  - K and V are taken from the encoder
  - Q comes from the previous layer
  - the computation is the same
 
 
- multi-head attention
  - one attention softmax corresponds to one strong pairwise correlation, while suppressing the relevance of other words
  - a word is often strongly related to several words, which requires multiple attentions
  - multi-head: different QKV matrices are used for different heads
  - given an input vector and the number of heads h
    - first produce h $\{Q,K,V\}$ pairs
    - $d_q=d_k=d_v=d_{model}/h=64$
    - each of the h pairs computes an attention vector, giving h [b, d] context vectors
    - concat along the d-axis and linearly project to the final [b, d] vector
 
- residual & layer-norm: the layer-norm comes after the residual add
- feed-forward network
  - fc-GeLU-fc
  - $d_h=2048$
- final layer in the decoder
  - dense + softmax
  - $d_{words}=$ number of words in the vocabulary
- when applied to CV tasks
  - most transformers adopt the original transformer's encoder module
  - used as a feature selector
  - compared with CNNs: can capture long-distance characteristics and derive global information
  - compared with RNNs: can be computed in parallel
- computational cost
  - first the three linear projections: linear in the sequence length n, with cost proportional to $d_{model}^2$
  - then the self-attention layer: the QKV matrix products are quadratic in n, $O(n^2)$
  - with multi-head attention there is one more linear projection over the concatenated heads, again linear in n
 
 
revisiting transformers for NLP
- the earliest RNN + attention models: the sequential nature of the RNN limits long-range modeling, parallelization and model size
- the transformer's attention-only structure solves these problems and enabled large pre-trained models (PTMs) for NLP
BERT and its variants
- are a series of PTMs built on the multi-layer transformer encoder architecture
- pre-trained with
  - masked language modeling
  - next sentence prediction
- fine-tuned by
  - adding an output layer
- Generative Pre-trained Transformer models (GPT)
  - are another type of PTM, based on the transformer decoder architecture
  - masked self-attention mechanisms
  - pre-trained
    - the biggest difference from BERT is directionality (GPT is unidirectional/autoregressive)
 
 
 
visual transformer
[category 1]: backbone for image classification
- the transformer's input is a sequence of tokens; in NLP these are embedded sub-word tokens, in CV they are visual tokens representing a certain semantic concept
  - a visual token can come from a CNN feature map
  - or directly from a small image patch
- models that use a pure transformer for image classification include iGPT, ViT and DeiT
iGPT
- pretraining stage + finetuning stage
- pre-training stage
  - self-supervised, hence weaker results
  - given an unlabeled dataset
  - train the model by minimizing -log(density); it feels like forcing the raster-scan ordering to be modeled correctly
- fine-tuning stage
  - average pool + fc + softmax
  - jointly train with L_gen & L_CE
- ViT
  - pre-trained on large datasets
    - standard transformer encoder + MLP head
    - treats all patches equally
    - has a token similar to BERT's class token
      - during training it gathers knowledge of the entire class
      - at inference time only this first logit is used for the prediction
  - fine-tuning
    - swap in a zero-initialized MLP head
    - use higher resolution & interpolate the PE
- DeiT
  - Data-efficient image transformer
  - better performance with
    - a more careful training strategy
    - and a token-based distillation
 
 
 
[category 2]: High/Mid-level Vision
[category 3]: Low-level Vision
[category 4]: Video Processing
efficient transformers: slimming & acceleration
- Pruning and Decomposition
- Knowledge Distillation
- Quantization
- Compact Architecture Design

ViT: AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
Motivation
- attention in vision so far
  - either used in conjunction with a CNN
  - or replacing certain parts of a CNN
  - overall still CNN-based
- we apply a pure transformer to sequences of image patches
- verified on image classification tasks in a supervised fashion
 
Key points
- transformers lack some of the inductive biases inherent to CNNs, so they do not generalize well when trained on insufficient data
- however, large-scale training trumps inductive bias: on large datasets ViT is better
- naive application of self-attention
  - building pairwise relations between pixels: far too expensive
  - needs approximations: local attention / changing the size
- we use a transformer
  - with global self-attention
  - on full-sized images
 
 
Method

input 1D embedding sequence (see the sketch after this list)
- flatten the image $x\in R^{H\times W\times C}$ into patches $\{x_p \in R^{P^2C}\}$
- thus the sequence length is $N=HW/P^2$
- patch embedding:
  - a trainable linear projection
  - a fixed dimension size throughout the network
- position embedding:
  - added to the patch embedding
  - standard learnable 1D position embedding
- prepended embedding:
  - a prepended learnable embedding $x_{class}$
  - similar to BERT's class token
- these three embeddings combined form the input sequence
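A NumPy sketch of building the ViT input sequence from an image (all sizes illustrative; the projection, class token and PE would be learned parameters in practice):

```python
import numpy as np

H = W = 224; C = 3; P = 16; D = 768          # image size, patch size, embed dim
N = (H // P) * (W // P)                      # sequence length, 14*14 = 196

img = np.random.rand(H, W, C)

# split into non-overlapping P x P patches and flatten each one
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)                       # (196, 768)

W_proj = np.random.randn(P * P * C, D) * 0.02                 # learnable in reality
cls_token = np.zeros((1, D))                                  # learnable in reality
pos_embed = np.random.randn(N + 1, D) * 0.02                  # learnable 1D PE

tokens = np.concatenate([cls_token, patches @ W_proj], axis=0) + pos_embed
print(tokens.shape)     # (197, 768) -> fed to the transformer encoder
```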
 
transformer encoder
- follows the original Transformer
- alternating MSA and MLP blocks
- layer norm (LN)
- residual connections
- GELU

hybrid architecture
- the input sequence can also come from a CNN's feature maps
- in that case the patch size can be 1x1

classification head
- attached to $z_L^0$: the class token is what makes the prediction
- an MLP during pre-training
- swapped for a zero-initialized single linear layer during fine-tuning
 
workflow
- typically pre-train on a large dataset first
- then fine-tune to downstream tasks
- when fine-tuning, replace the classification head with a new zero-initialized linear head
- when feeding images at a higher resolution
  - keep the patch size
  - this results in a larger sequence length
  - the pre-trained PE is then no longer meaningful
  - so a 2D interpolation of the PE is performed, based on each patch's location in the original image
 
 
training details
- Adam: $\beta_1=0.9, \beta_2=0.999$
- batch size 4096
- high weight decay 0.1
- linear lr warmup & decay

fine-tuning details
- SGD with momentum
- cosine LR
- no weight decay
- Polyak (EMA) weight averaging with factor 0.9999
 
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Motivation
- use a Transformer as the backbone for visual tasks
- challenges of the Transformer in the vision domain
  - large variations in the scale of visual entities
  - the high resolution of pixels
- we propose a hierarchical Transformer
  - shifted windows
  - self-attention within local windows
  - cross-window connections
- verified on
  - classification: ImageNet top-1 acc 86.4
  - detection: COCO box mAP 58.7
  - segmentation: ADE20K
  - this paper mainly covers classification; detection uses Swin as the backbone trained with two-stage frameworks such as Mask R-CNN, segmentation uses Swin as the backbone trained with UperNet; the official repo's README lists the detailed model configurations
 
 
Key points
- when transferring the Transformer's strong NLP performance to the CV domain
  - differences between the two modalities
    - scale: in NLP, word tokens serve as the basic element; in CV, the shape and size of a meaningful patch vary, yet previous methods all use patch tokens of one fixed size
    - resolution: the main issue is the computational complexity of self-attention, which is quadratic in image size
  - we propose the Swin Transformer
    - hierarchical feature maps
    - computational complexity linear in image size
- hierarchical
  - start from small patches
  - merge them in deeper layers
  - so feature patches of different scales are fused
- linear complexity
  - compute self-attention locally within each window
  - the number of patches per window is fixed, and the number of windows is proportional to the image size
  - hence linear

shifted window approach
- window shifts across layers build bridges between neighboring windows
- [QUESTION] all query patches within a window share the same key set

previous attempts at Transformers in vision
- self-attention based backbone architectures
  - replace some/all conv layers with self-attention
  - the overall architecture is still ResNet-like
  - slightly better acc
  - but larger latency caused by the self-attention
- self-attention as a complement to CNNs
  - used as additional blocks in the backbone/head to provide long-range information
  - some detection/segmentation networks have also started using the transformer's encoder-decoder structure
- transformer-based vision backbones
  - mainly ViT and its derivatives
  - ViT requires large-scale training sets
  - DeiT introduces training strategies that help
  - but the computational cost at high resolution remains a problem
 
Method
overview

- Swin-T: tiny version
- step 1 is patch partition:
  - split the RGB image into non-overlapping patches
  - a patch is a token, the basic element
  - input feature dim: with patch size 4x4, dim = 4x4x3 = 48
- then a linear embedding layer
  - re-projects the raw features to a specified dimension
  - the specified dimension C: default 96
- next come the Swin Transformer blocks
  - the number of tokens is maintained
- patch merging layers reduce the number of tokens
  - the first patch merging layer concatenates every 2x2 group of neighboring patches: a 4C-dim vector each
  - followed by a linear re-projection layer
  - number of tokens (resolution): (H/4 * W/4)/4 = (H/8 * W/8), changing just like in a regular CNN
  - token dims: 2C
  - followed by another set of Transformer blocks
  - together this is called stage 2 (likewise stage 3, stage 4)
 
 
Swin Transformer blocks
Compared with the original Transformer block, the original MSA is replaced by window-based MSA.
- original attention: global computation leads to quadratic complexity
- window-based attention:
  - attention is computed only inside each window
  - non-overlapping partition
  - obviously lacks connections across windows

shifted window partitioning in successive blocks (see the sketch below)
- two consecutive attention blocks
- the first uses the regular window partitioning strategy: starting from the top-left corner, take M=4, window size 4x4 (one window contains 4x4 patches)
- the second layer's windows are shifted by M/2 relative to the previous layer
- this introduces connections between neighboring non-overlapping windows of the previous layer
efficient computation
- shifted windows produce windows of unequal sizes, which is bad for parallel computation (the paper handles this with a cyclic shift plus attention masks)
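A NumPy sketch of window partitioning and of the cyclic shift used for the second block (shapes illustrative; a real implementation also builds an attention mask for the wrapped-around windows):

```python
import numpy as np

H = W = 8; C = 4; M = 4                 # feature map 8x8, window size 4x4
x = np.random.rand(H, W, C)

def window_partition(x, M):
    H, W, C = x.shape
    # -> (num_windows, M*M, C): each window becomes an independent attention sequence
    windows = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return windows.reshape(-1, M * M, C)

# block 1: regular partition starting at the top-left corner
print(window_partition(x, M).shape)                 # (4, 16, 4)

# block 2: cyclically shift the feature map by (-M//2, -M//2) and partition again;
# every window stays exactly M x M (parallel-friendly), at the cost of some windows
# containing wrapped-around patches that must be masked out in the attention
x_shifted = np.roll(x, shift=(-M // 2, -M // 2), axis=(0, 1))
print(window_partition(x_shifted, M).shape)         # (4, 16, 4)
```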

relative position bias
- local attention is computed inside each MxM window: the input sequence length is $M^2$
- Q, K, V $\in R^{M^2\times d}$
- $Attention(Q,K,V)=Softmax(QK^T/\sqrt{d}+B)V$
- B is the local position bias; in 2D, the relative position along each axis ranges over [-M+1, M-1]
- so a smaller bias matrix $\hat B\in R^{(2M-1)\times(2M-1)}$ is parameterized
- the values in $B \in R^{M^2\times M^2}$ are taken from $\hat B$
- the learnt relative position bias can be used to initialize a fine-tuned model

Architecture variants
- base model: Swin-B, parameter count comparable to ViT-B
- Swin-T: 0.25x, comparable to ResNet-50 (DeiT-S)
- Swin-S: 0.5x, comparable to ResNet-101
- Swin-L: 2x
- window size: M=7
- query dim: d=32 (the input sequence dim and the number of heads both double at each stage)
- MLP expansion ratio: 4
- channel number C: the embedding dim of the first stage (doubled in later stages)
- hypers:
  - drop_rate: 0.0
  - drop_path_rate: 0.1
 

acc

official repo: https://github.com/microsoft/Swin-Transformer/blob/main/get_started.md
the Keras team has also released a version: https://github.com/keras-team/keras-io/blob/master/examples/vision/swin_transformers.py
model zoo
model | resolution | C | num_layers | num_heads | window_size
--- | --- | --- | --- | --- | ---
Swin-T | 224 | 96 | {2,2,6,2} | {3,6,12,24} | 7
Swin-S | 224 | 96 | {2,2,18,2} | {3,6,12,24} | 7
Swin-B | 224/384 | 128 | {2,2,18,2} | {4,8,16,32} | 7/12
Swin-L | 224/384 | 192 | {2,2,18,2} | {6,12,24,48} | 7/12

models/build.py
- SwinTransformer & SwinMLP: the former is the model from the paper, whose basic block is the transformer MSA plus MLP layers; the latter drops MSA and instead models the global relationship between neighboring windows with MLPs, implemented with conv1d.
 
DETR: End-to-End Object Detection with Transformers
Motivation
- a new task formulation: a direct set prediction problem
- main ingredients
  - a set-based global loss
  - a transformer encoder-decoder architecture
  - removes hand-designed components like NMS & anchors
- acc & run-time on par with Faster R-CNN on COCO
  - significantly better performance on large objects
  - lower performance on small objects
 
 
Key points
Modern detectors run object detection in an indirect way
- regression and classification based on grid cells / anchors / proposals
- performance is constrained by the NMS mechanism, the anchor design, and the target-anchor matching rule

end-to-end approach

- the transformer's self-attention mechanism explicitly models all pairwise interactions between elements: it has deduplication (NMS) built in
- bipartite matching: a set loss function that matches predictions and ground-truth boxes one to one, run in parallel (see the sketch after this list)
- DETR does not require any customized layers, and thus can be reproduced easily
- extends to the segmentation task: a simple segmentation head trained on top of a pre-trained DETR

set prediction: predict a set of bounding boxes and the category of each
- basic form: multilabel classification
- the detection task has a near-duplicates issue
- set prediction is postprocessing-free; its global inference scheme avoids redundancy
- the usual loss: bipartite matching
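A small sketch of one-to-one bipartite matching with the Hungarian algorithm, assuming scipy is available (the toy cost below just mixes class probability and an L1 box distance; DETR's actual matching cost also includes a GIoU term):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

N, num_gt, num_classes = 5, 2, 3                     # N object queries, 2 GT boxes
pred_prob = np.random.dirichlet(np.ones(num_classes + 1), size=N)  # +1 for "no object"
pred_box = np.random.rand(N, 4)                      # normalized cx, cy, w, h
gt_cls = np.array([0, 2])
gt_box = np.random.rand(num_gt, 4)

# cost[i, j]: cost of assigning prediction i to ground-truth j
cost_cls = -pred_prob[:, gt_cls]                                  # (N, num_gt)
cost_box = np.abs(pred_box[:, None, :] - gt_box[None]).sum(-1)    # L1 distance
cost = cost_cls + 5.0 * cost_box

pred_idx, gt_idx = linear_sum_assignment(cost)       # optimal one-to-one matching
print(list(zip(pred_idx, gt_idx)))   # unmatched predictions are supervised as "no object"
```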
 
object detection
- set-based loss
  - modern detectors use non-unique assignment rules together with NMS
  - bipartite matching makes targets and predictions correspond one to one
 
 
 
Method
overall

- three main components
  - a CNN backbone
  - an encoder-decoder transformer
  - a simple FFN
 
 
 backbone
- a conventional ResNet-50
- input: $[H_0, W_0, 3]$
- output: $[H,W,C]$, $H=\frac{H_0}{32}, W=\frac{W_0}{32}, C=2048$

transformer encoder
- reduce the channel dim to $d$: 1x1 conv, $d=512$
- collapse the spatial dimensions: a feature sequence [d, HW], each spatial pixel is one feature
- fixed positional encodings:
  - added to the input of each attention layer
  - [QUESTION] added to K and Q, or to the embeddings?
 
 
transformer decoder
- input: N embeddings of dim d
  - called object queries: we predict a fixed number N of objects
  - since the decoder is also permutation-invariant (everything is shared), the N input embeddings must differ from one another
  - they are learnt positional encodings
  - they are added to the input of each attention layer
- the decoder decodes the N objects in parallel
 
 prediction FFN
- 3 layers, ReLU
- box prediction: normalized center coords & height & width
- class prediction:
  - with an additional class label $\varnothing$ meaning "no object"
 
 
auxiliary losses
- each decoder layer is followed by an FFN prediction head and a Hungarian loss
- the FFN is shared
- an additional shared LN normalizes the inputs of the FFNs
- three components of the loss
  - class loss: CE loss
  - box loss
    - GIoU loss
    - L1 loss
 
 
 
technical details
- AdamW:
  - initial transformer lr $10^{-4}$
  - initial backbone lr $10^{-5}$
  - weight decay $10^{-4}$
- Xavier init
- ImageNet-pretrained ResNet weights with frozen batch-norm layers: R50 & R101, giving DETR & DETR-R101
- a variant:
  - an increased-feature-resolution version
  - removes the stride of stage 5 and adds a dilation
  - DETR-DC5 & DETR-DC5-R101
  - improves performance on small objects
  - at an overall 2x computation increase
- augmentation
  - resize the input
  - random crop: with prob 0.5, crop then resize
- transformer default dropout 0.1
- lr schedule
  - 300 epochs
  - drop by a factor of 10 after 200 epochs
- 4 images per GPU, total batch 64
 
for the segmentation task: panoptic segmentation
- add a mask head on top of the decoder outputs
- compute multi-head attention between
  - the decoder box predictions
  - and the encoder outputs
- generate M attention heatmaps per object
- add an FPN-style CNN to recover resolution
- pixel-wise argmax

UNETR: Transformers for 3D Medical Image Segmentation
Motivation
- the UNet structure for medical segmentation
  - the encoder learns global context
  - the decoder uses those representations to predict the semantic outputs
  - but the locality of CNNs limits long-range spatial dependencies
- our method
  - use a pure transformer as the encoder
  - learn sequence representations of the input volume
  - global
  - multi-scale
  - the encoder connects directly to the decoder via skip connections
 
 
Key points
- the UNet structure
  - the encoder extracts whole-image features
  - the decoder recovers resolution
  - skip connections supply the spatial information that is lost during downsampling
  - localized receptive fields:
    - a disadvantage when capturing multi-scale contextual information
    - e.g. brain tumors of varying sizes
    - mitigations such as atrous convs are still limited
- transformers
  - the self-attention mechanism in NLP
    - highlights the important features of word sequences
    - learns their long-range dependencies
  - in ViT
    - an image is represented as a patch-embedding sequence
- our method
  - formulation
    - a 1D seq2seq problem
    - using embedded patches
  - the first completely transformer-based encoder
- other UNet + transformer methods
  - 2D (ours is 3D)
  - employ the transformer only in the bottleneck (ours is a pure transformer encoder)
  - or run CNN & transformer in separate streams and fuse them
 
 
Method
overview

transformer encoder
- input: a 1D sequence of input embeddings
- given a 3D volume $x \in R^{H\times W\times D\times C}$
- divide it into flattened, uniform, non-overlapping patches $x_v \in R^{L\times (N^3 C)}$
  - $L=HWD/N^3$: the sequence length
  - $N^3$: the patch size
- linear projection into a K-dim embedding space, $E \in R^{L\times K}$: kept constant through the transformer
- 1D learnable positional embedding $E_{pos} \in R^{L\times K}$
- 12 self-attention blocks: MSA + MLP

- decoder & skip connections
  - take the outputs of encoder blocks {3, 6, 9, 12}
  - reshape them back to 3D volumes $[\frac{H}{N},\frac{W}{N},\frac{D}{N},K]$
  - consecutive 3x3x3 conv + BN + ReLU
  - from the bottleneck
    - deconv by 2 to increase resolution
    - then concat with the previous resized feature
    - then jointly apply consecutive convs
    - then upsample with deconv again, and so on
  - after reaching the original resolution and applying consecutive convs, a final 1x1x1 conv + softmax
- loss (see the sketch below)
  - dice loss
    - dice: compute the dice for each class channel, then average over classes
    - use 1 - dice
  - ce loss
    - per-pixel cross-entropy, averaged over all pixels
 
 
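A NumPy sketch of the class-averaged dice term plus per-voxel cross-entropy described above (shapes and the unweighted sum are illustrative, not the paper's exact formulation):

```python
import numpy as np

def soft_dice_loss(prob, onehot, eps=1e-6):
    # prob, onehot: (num_classes, num_voxels); dice per class, then class average
    inter = (prob * onehot).sum(axis=1)
    denom = prob.sum(axis=1) + onehot.sum(axis=1)
    dice = (2 * inter + eps) / (denom + eps)
    return 1.0 - dice.mean()

def ce_loss(prob, onehot, eps=1e-12):
    # per-voxel cross-entropy, averaged over all voxels
    return -(onehot * np.log(prob + eps)).sum(axis=0).mean()

C, V = 4, 1000                                   # 4 classes, 1000 voxels (flattened)
logits = np.random.randn(C, V)
prob = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
onehot = np.eye(C)[:, np.random.randint(0, C, V)]

print(soft_dice_loss(prob, onehot) + ce_loss(prob, onehot))
```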