startup
reference1:https://mp.weixin.qq.com/s/Rm899vLhmZ5eCjuy6mW_HA
reference2:https://zhuanlan.zhihu.com/p/308301901
NLP & RNN
- Text involves contextual relationships
- RNN processes the sequence serially in time, building relations between earlier and later tokens
- Drawbacks: fails on very long-range dependencies, hard to parallelize
NLP & CNN
- Text is a 1-D temporal sequence
- 1D CNN, parallel computation
- Drawback: CNNs excel at local information; there is a balance between kernel size and long-range dependency
NLP & transformer
- For each incoming word, build a weight mapping over the vocabulary; the weights represent attention
- self-attention mechanism
builds long-range dependencies
put in CV
- insert similar self-attention layers
- or drop convolution layers entirely and use Transformers
RNN & LSTM & GRU cell
Standard components: input x, output y, hidden state h
RNN
- At each step the RNN cell takes the current input $x_t$ and the previous hidden state $h_{t-1}$, and produces a new hidden state $h_t$, which is also the current output $y_t$
- formulation: $y_t, h_t = f(x_t, h_{t-1})$ (sketch after this block)
- same parameters for each time step: the same cell shares its weights across all time steps
One issue: vanishing/exploding gradients
- Consider the simplified form of the hidden-state chain $h_t = \theta^t h_0$: a forward pass over the sequence multiplies the same weights over and over again
- tanh can also make neuron gradients vanish/explode
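A minimal NumPy sketch of the recurrence above (illustrative only; the function name and random init are my own), showing that the same weights are reused at every time step:

```python
import numpy as np

def rnn_forward(x_seq, h0, W_x, W_h, b):
    """x_seq: [T, d_in], h0: [d_h]; returns the hidden states [T, d_h]."""
    h, hs = h0, []
    for x_t in x_seq:                         # sequential: step t depends on step t-1
        h = np.tanh(W_x @ x_t + W_h @ h + b)  # y_t == h_t for a plain RNN cell
        hs.append(h)
    return np.stack(hs)

d_in, d_h, T = 8, 16, 5
rng = np.random.default_rng(0)
h_seq = rnn_forward(rng.normal(size=(T, d_in)), np.zeros(d_h),
                    rng.normal(size=(d_h, d_in)) * 0.1,
                    rng.normal(size=(d_h, d_h)) * 0.1, np.zeros(d_h))
print(h_seq.shape)  # (5, 16)
```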
LSTM
key ingredients
- cell: adds a cell-state workflow that improves gradient flow
- gate: gate structures filter what information is carried, improving long-range dependencies
LSTM therefore has two recurrent states: the cell state $c_t$ and the hidden state $h_t$; the output $y_t$ is still $h_t$ (cell-step sketch below)
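A minimal sketch (assumed layout, gate order i/f/o/g) of one LSTM cell step, showing the gates and the two recurrent states:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """W: [4*d_h, d_in], U: [4*d_h, d_h], b: [4*d_h]."""
    z = W @ x_t + U @ h_prev + b
    d_h = h_prev.shape[0]
    i = sigmoid(z[0*d_h:1*d_h])   # input gate
    f = sigmoid(z[1*d_h:2*d_h])   # forget gate
    o = sigmoid(z[2*d_h:3*d_h])   # output gate
    g = np.tanh(z[3*d_h:4*d_h])   # candidate cell update
    c_t = f * c_prev + i * g      # cell state: mostly additive, better gradient flow
    h_t = o * np.tanh(c_t)        # hidden state, also the output y_t
    return h_t, c_t
```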
GRU
A variant of the LSTM; still gate-based, structurally simpler than LSTM with fewer parameters, and reportedly easier to train
papers
[a page listing many papers] https://github.com/dk-liang/Awesome-Visual-Transformer
[classics]
* [Seq2Seq 2014] Sequence to Sequence Learning with Neural Networks, Google, the earliest encoder-decoder of stacked LSTMs for machine translation
* [self-attention/Transformer 2017] Transformer: Attention Is All You Need, Google
* [bert 2019] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Google, NLP; input is a single sentence / paired sentences, a Transformer encoder extracts bidirectional cross-sentence representations, and the first output logit is used for classification
[surveys]
* [survey 2020] Efficient Transformers: A Survey, Google
* [survey 2021] Transformers in Vision: A Survey, Dubai
* [survey 2021] A Survey on Visual Transformer, Huawei
[classification]
* [ViT 2020] AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE, Google; classification; replaces the CNN with a transformer encoder plus a classification head; each feature patch is one input embedding and the channel dim is the vector dim; essentially the same as BERT with the input sequence replaced by patches; later improvements include DeiT and LV-ViT
* [BotNet 2021] Bottleneck Transformers for Visual Recognition, Google; replaces the last few stages of the CNN backbone with MSA
* [CvT 2021] CvT: Introducing Convolutions to Vision Transformers, Microsoft
* [Swin 2021] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, Microsoft
* [PVT 2021] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions; like Swin, also produces multi-scale features
[detection]
* [DETR 2020] DETR: End-to-End Object Detection with Transformers, Facebook; object detection; CNN + transformer (encoder-decoder) + prediction heads; each feature pixel is one input embedding and the channel dim is the vector dim
* [Deformable DETR]
* [Anchor DETR]
* see "det-transformers" for details
[segmentation]
* [SETR] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, Fudan; thin contribution, essentially replaces the FCN backbone with a transformer
[Unet+Transformer]:
* [UNETR 2021] UNETR: Transformers for 3D Medical Image Segmentation, NVIDIA; directly uses a transformer encoder as the UNet encoder
* [TransUNet 2021] TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation; adds transformer blocks inside the encoder stream
* [TransFuse 2021] TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation; CNN features and Transformer features are combined via BiFusion
* see "seg-transformers" for details
Sequence to Sequence
[a keras tutorial](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)
general case
- extract the information of the entire input sequence
- then start generate the output sequence
seq2seq model workflow
- a (stacking of) RNN layer acts as encoder
- processes the input sequence
- returns its own internal state: discard the RNN outputs, keep only the internal states
- what the encoder produces is called the context vector
- a (stacking of) RNN layer acts as decoder
- given previous characters of the target sequence
- it is trained to predict the next characters of the target sequence
- teacher forcing:
- the input is the target sequence, and the training objective is the same target sequence offset by one timestep
- teacher forcing can also be skipped: feed each prediction directly as the next step's input
- homogeneity of the context vector: at every step, the decoder reads the same context vector as its initial_state
- when inference
- first obtain the state vectors of the input sequence
- repeat
- feed the decoder the input states and the output sequence so far (beginning with a start token)
- take the next character from the prediction
- append the character to the output sequence
- until: an end character is produced / the character limit is hit
implementation
https://github.com/AmberzzZZ/transformer/blob/master/seq2seq.py
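Alongside the repo above, a minimal Keras sketch of the teacher-forcing seq2seq model described in the workflow (dims are placeholders, following the Keras blog tutorial rather than the repo itself):

```python
from tensorflow import keras
from tensorflow.keras import layers

num_encoder_tokens, num_decoder_tokens, latent_dim = 71, 93, 256

# encoder: keep only the internal states, discard the outputs
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]           # the "context vector"

# decoder: conditioned on the encoder states, trained to predict the target
# sequence shifted by one time step (teacher forcing)
decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_outputs = layers.Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```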
one step further
- directions for improvement
- bi-directional RNN: simply reversing the input sequence gives a reliable gain
- attention: essentially a weighting of the encoder outputs / context vectors
- ConvS2S: not read yet
- these are mainly proposed to address the shortcomings of RNNs
Motivation
- present a general end-to-end sequence learning approach
- multi-layered LSTMs
- encode the input seq to a fix-dim vector
- decode the target seq from the fix-dim vector
- LSTM did not have difficulty on long sentences
reversing the order of the words improved performance
Method
standard RNN
given a sequence $(x_1, x_2, …, x_T)$
iterating:
If the input and output lengths are known and fixed in advance, a single RNN can model the seq2seq task
If the input and output lengths differ and follow a more complicated relation, two RNNs are needed: one maps the input seq to a fixed-sized vector, the other maps that vector to the output seq, but then the long-term-dependency issue arises
LSTM
- the LSTM carries information about the entire sequence all along, as in the figure above
our actual model
- use two LSTMs: the encoder-decoder split increases model capacity
- an LSTM with four layers: deeper
- the input sequence is reversed: the true start of the source sentence ends up closer to the start of the translation, which makes it easy for SGD to establish communication
training details
- LSTM:4 layers,1000 cells
- word-embedding:1000-dim,(input vocab 160,000, output vocab 80,000)
- naive softmax
- uniform initialization:(-0.08, 0.08)
- SGD, lr=0.7, halved every half epoch, 7.5 epochs in total
- gradient norm [10, 25]
- all sentences in a minibatch are roughly of the same length
Transformer: Attention Is All You Need
Motivation
- sequence2sequence models
- encoder + decoder
- RNN / CNN + an attention path
- we propose Transformer
- base solely on attention mechanisms
- more parallelizable and less training time
Arguments
- sequence modeling
- mainstream: RNN, LSTM, gated RNNs
- align the positions to computing time steps
- the sequential nature inherently precludes parallelization
- attention mechanisms act as an integral part
- in previous work used in conjunction with an RNN
- for parallelization
- some methods use CNNs as basic building blocks
- difficult to learn dependencies between distant positions
- we propose Transformer
- rely entirely on an attention mechanism
- draw global dependencies
- self-attention
- relating different positions of a single sequence
- to generate an overall representation of the sequence
Method
encoder-decoder
- encoder:doc2emb
- given an input sequence of symbol representation $(x_1, x_2, …, x_n)$
- map to a sequence of continuous representations $(z_1, z_2, …, z_n)$,(embeddings)
- decoder:hidden layers
- given embeddings z
- generate an output sequence $(y_1, y_2, …, y_m)$ one element at a time
- the previous generated symbols are served as additional input when computing the current time step
Transformer Architecture
Transformer use
- for both encoder and decoder
stacked self-attention and point-wise fully-connected layers
encoder
- N=6 identical layers
- each layer has 2 sub-layers
- multi-head self-attention mechanism
- position-wise fully connected feed-forward layer
- residual
- for two sub-layers independently
- add & layer norm
- d=512
decoder
- N=6 identical layers
- 3 sub-layers
- [new] masked multi-head self-attention: incorporates the prior that each output embedding may only be computed from embeddings at earlier time steps
- multi-head self-attention mechanism
- position-wise fully connected feed-forward layer
- residual
attention
- reference:https://bbs.cvmart.net/articles/4032
- step1: project the embeddings to query-key-value pairs
- $Q = W_Q^{d\times d} A^{d\times N}$
- $K = W_K^{d\times d} A^{d\times N}$
- $V = W_V^{d\times d} A^{d\times N}$
- step2: scaled dot-product attention
- $A^{N\times N}=softmax(K^TQ/\sqrt{d})$
- $B^{d\times N} = V^{d\times N}A^{N\times N}$
- multi-head attention
- the step1 & step2 operations above perform a single attention function
- in fact we can use multiple projections to get h sets $\{Q,K,V\}^h$, run the attention computation in parallel, and obtain h outputs $\{B^{d\times N}\}^h$ (sketch after this list)
- concat & project
- concat along the d-dim: $B\in R^{(dh)\times N}$
- linear projection: $out = W^{d\times (dh)} B$
- h=8
- $d_{in}/h=64$: the per-head embedding dim
- $d_{out}=64$: the query/key/value dim
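A minimal NumPy sketch of step1/step2 above (function names and random init are my own), keeping the notes' column convention $A\in R^{d\times N}$ so that column j of the attention matrix holds the weights for query j:

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(A, W_q, W_k, W_v):
    Q, K, V = W_q @ A, W_k @ A, W_v @ A            # step1: each [d_k, N]
    att = softmax(K.T @ Q / np.sqrt(K.shape[0]))   # step2: [N, N], softmax over keys
    return V @ att                                  # [d_k, N]

def multi_head_attention(A, W_qs, W_ks, W_vs, W_o):
    heads = [single_head_attention(A, wq, wk, wv)
             for wq, wk, wv in zip(W_qs, W_ks, W_vs)]
    return W_o @ np.concatenate(heads, axis=0)      # concat along d, project back to [d, N]

d, N, h, d_k = 512, 10, 8, 64
rng = np.random.default_rng(0)
A = rng.normal(size=(d, N))
W_qs = [rng.normal(size=(d_k, d)) * 0.02 for _ in range(h)]
W_ks = [rng.normal(size=(d_k, d)) * 0.02 for _ in range(h)]
W_vs = [rng.normal(size=(d_k, d)) * 0.02 for _ in range(h)]
W_o = rng.normal(size=(d, h * d_k)) * 0.02
print(multi_head_attention(A, W_qs, W_ks, W_vs, W_o).shape)  # (512, 10)
```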
positional encoding
Mathematically it is a hand-crafted projection matrix $W^P$ applied to a one-hot position vector $p$.
Denote the encoding as $PE(pos, 2i)=\sin(pos/10000^{2i/d})$ and $PE(pos, 2i+1)=\cos(pos/10000^{2i/d})$ (sketch below), where
- pos is the position within the sequence x
- 2i and 2i+1 are indices along the embedding a
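A short sketch of the sinusoidal encoding above (shape conventions are my own):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]              # position in the sequence
    i = np.arange(d_model // 2)[None, :]                # index pair along the embedding dim
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                         # even dims: sin
    pe[:, 1::2] = np.cos(angle)                         # odd dims: cos
    return pe                                           # [N, d], added to the word embeddings

print(positional_encoding(50, 512).shape)  # (50, 512)
```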
point-wise feed-forward network
- fc-ReLU-fc
- dim_fc=2048
- dim_in & dim_out = 512
Execution flow
The encoder can be computed fully in parallel
- input is the sequence embedding plus the positional embedding: $A\in R^{d\times N}$
- passes through repeated blocks
- output is another sequence: $B\in R^{d\times N}$
- self-attention: Q, K, V all come from the same thing
- the encoder essentially resolves self-attention:
- parallel global pairwise comparison, in one shot
- an RNN has to go step by step
- a CNN has to stack layers
The decoder can be parallelized during training, but runs step by step at inference
Its input is the encoder output plus the decoder output embedding of the previous time step
Its output is the probability of the output word at the current time-step position
The first attention layer is self-attention over the output embeddings: to decode sequentially like an RNN, each time step uses the previous position's output as input, i.e. masking
- given the input sequence <start> I have a cat, 5 elements
- the mask is then a lower-triangular matrix in $R^{5\times 5}$
The input embeddings are transformed into the Q, K, V matrices
Attention is still computed as $A=K^TQ$
Some of these attention entries are illegal: a query may only use keys at earlier positions, so the mask matrix is applied: $A=M \odot A$ (mask sketch below)
softmax: $A=softmax(A)$
weighted values: $B = VA$
concat & projection
- the second attention layer attends between the input & output sequences: its keys and values come from the encoder, its query comes from the previous decoder block's output
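A sketch of the decoder-side masking described above (row convention here: row i = query i; in practice the illegal entries are usually set to -inf before the softmax so they get zero weight):

```python
import numpy as np

N = 5                                            # e.g. "<start> I have a cat"
mask = np.tril(np.ones((N, N), dtype=bool))      # [N, N] lower-triangular

scores = np.random.default_rng(0).normal(size=(N, N))   # raw attention scores (query x key)
scores = np.where(mask, scores, -np.inf)                 # mask out future positions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)  # each row sums to 1 over allowed keys
print(np.round(weights, 2))
```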
why self-attention
- evaluation dimensions
- total computational complexity per layer
- amount of computation that can be parallelized
- path-length between long-range dependencies
- given an input sequence with length N & dim $d_{in}$, and an output sequence with dim $d_{out}$
- an RNN needs N sequential applications of $W\in R^{d_{in} \times d_{out}}$
- a CNN needs a stack of N/k layers of $d_{in}\times d_{out}$ convolutions with kernels $W\in R^{k\times k}$, in general k times the cost of an RNN
training
- optimizer:$Adam(lr, \beta_1=0.9, \beta_2=0.98, \epsilon=10^{-9})$
- lr schedule: warm up over 4000 steps, then decay (sketch after this block)
dropout
- residual dropout: dropout applied to each sub-layer's output before it is added to the residual (related in spirit to, but not the same as, stochastic depth)
- dropout to the sum of embeddings & PE for both encoder and decoder
- drop_rate = 0.1
label smoothing:smooth_factor = 0.1
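The warmup-then-decay schedule from the paper, as a tiny sketch:

```python
# lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
def transformer_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# linear warmup for the first 4000 steps, then decay ~ 1/sqrt(step)
print(transformer_lr(100), transformer_lr(4000), transformer_lr(100000))
```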
Experiments
- A: vary the number of attention heads; both too many and too few hurt
- B: reducing the attention key dim hurts
- C & D: bigger models + dropout help
- E: learnable vs. sin-cos PE: nearly identical
Finally the big model's hyper-parameters
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Motivation
- BERT:Bidirectional Encoder Representations from Transformers
- Bidirectional
- Encoder
- Representations
- Transformers
- workflow
- pretrain bidirectional representations from unlabeled text
- tune with one additional output layer to obtain the model
- SOTA
- GLUE score 80.5%
Arguments
- pretraining is effective in NLP tasks
- feature-based method: use task-specific architectures, only consuming the pretrained model's features
- fine-tuning method: directly fine-tune the pretrained model
- both approaches share the same pre-training objective: use unidirectional language models to learn general language representations
- reduce the need for many heavily-engineered task-specific architectures
- current methods' limitations
- unidirectional:
- limits the choice of architectures
- in fact both left and right context of a token matter; looking only at the preceding text is not enough
- simply concatenating two independent L2R and R2L models (biRNN):
- independent
- shallow concat
- BERT
- masked language model: predict the masked words within a sequence
- next sentence prediction: trains text-pair representations
Method
two steps
- pre-training
- unlabeled data
- different pretraining tasks
- fine-tuning
- labeled data of the downstream tasks
- fine-tune all the params
The two stages use the same model except for the output layer
- e.g. a question-answering model
- in the pretraining stage the input is two sentences; the input starts with a CLS symbol and the two sentences are separated by a SEP symbol
- in the fine-tuning stage the inputs are the question and the answer respectively; 【what exactly is the output?】
architecture
multi-layer bidirectional Transformer encoder
- number of transformer blocks L
- hidden size H
- number of self-attention heads A
- FFN dim 4H
Bert base:L=12,H=768,A=12
Bert large:L=24,H=1024,A=16
input/output representations
- a single sentence / two packed-up sentences:
- packed sentences are joined with the special token SEP
- segment embedding: additionally, a learned embedding is added to every token indicating which sentence it belongs to
- use WordPiece embeddings with a 30,000 token vocabulary
- the first token of an input sequence is always the special symbol CLS; its final-state output serves as the representation of the whole sentence, used for classification tasks
Overall, the network's input representation is obtained by concatenating the tokens with the special symbols and summing the token embeddings with the segment embeddings (SE) and positional embeddings (PE)
pre-training
- two unsupervised tasks
- Masked LM (MLM)
- mask some percentage of the input tokens at random: 15% (sketch after this list)
- with 80% probability replace with the MASK token
- with 10% probability replace with a random token
- with 10% probability keep unchanged
- then predict those masked tokens
- the final hidden states corresponding to the masked tokens are fed into a softmax
- compared with traditional left-to-right / right-to-left / concatenated models
- uses both the preceding and the following context
- only the masked tokens are predicted, rather than reconstructing the whole sentence
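A sketch (helper names are my own) of the 15% / 80-10-10 masking rule described above:

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """Returns (corrupted tokens, labels); labels are None for unmasked positions."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:                 # select ~15% as prediction targets
            labels.append(tok)
            r = random.random()
            if r < 0.8:
                corrupted.append(mask_token)            # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                   # 10%: keep unchanged
        else:
            corrupted.append(tok)
            labels.append(None)                         # not predicted
    return corrupted, labels

print(mask_tokens("my dog is hairy".split(), vocab=["cat", "runs", "blue"]))
```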
- Next Sentence Prediction (NSP)
- for relationships between sentences:
- e.g. question & answering, sentence-pair inference
- not directly captured by language modeling, which intuitively learns token-level relationships
- binarized next sentence prediction task
- pick sentences A & B:
- 50% of the time B really is the next sentence (IsNext)
- 50% of the time B is random (NotNext)
- this forms a binary classification problem: again predicted from the hidden state C corresponding to the CLS token
fine-tuning
- BERT accommodates many downstream tasks: single text or text pairs
- just assemble the input accordingly and fine-tune end-to-end
- the output again uses the hidden state C of the CLS token, followed by a classification head
A Survey on Visual Transformer
Motivation
- provide a comprehensive overview of the recent advances in visual transformers
- discuss the potential directions for further improvement
development timeline
Categorized by application scenario
- backbone: classification
- high/mid-level vision: usually semantic tasks such as detection / segmentation / pose estimation
- low-level vision: operates on the image itself, e.g. super-resolution / image generation; fewer applications so far
video processing
revisiting transformer
key-concepts:sentence、embedding、positional encoding、encoder、decoder、self-attention layer、encoder-decoder attention layer、multi-head attention、feed-forward neural network
self-attention layer
- input vector is transformed into 3 vectors
- the input vector is embedding + PE(pos, i): pos is the word's position in the sequence, i is the element index within the embedding vector
- query vec q
- key vec k
- value vec v
- $d_q = d_k = d_v = d_{model} = 512$
- then calculate: $Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V$
- encoder-decoder attention layer
- K and V come from the encoder
- Q comes from the previous layer
- the computation itself is the same
- multi-head attention
- one attention is one softmax, which captures one strongly correlated pair while suppressing correlations with other words
- since a word is often strongly related to several words, multiple attentions are needed
- multi-head: different QKV matrices are used for different heads
- given an input vector and the number of heads h
- first produce h {q, k, v} sets
- $d_q=d_k=d_v=d_{model}/h=64$
- compute an attention vector for each of the h sets, obtaining h [b, d] context vectors
- concat along the d-axis and linearly project to the final [b, d] vector
- residual & layer-norm: the layer-norm comes after the residual add
- feed-forward network
- fc-GeLU-fc
- $d_h=2048$
- final-layer in decoder
- dense+softmax
- $d_{words}=$ number of words in the vocabulary
- when applied in CV tasks
- most transformers adopt the original transformer’s encoder module
- used as a feature selector
- compared with a CNN: can capture long-distance characteristics and derive global information
- compared with an RNN: can be computed in parallel
- computation cost
- first the three linear projections: linear time complexity O(n), with cost proportional to $d_{model}$
- then the self-attention layer: QKV matrix multiplications, quadratic time complexity O(n^2)
- with multi-head, there is an extra linear layer: quadratic time complexity O(n^2)
revisiting transformers for NLP
- earliest: RNN + attention: the sequential nature of RNNs limits long-range modeling / parallelization / model scale
The transformer's attention-only structure solves these problems and promoted large pre-trained models (PTMs) for NLP
BERT and its variants
- are a series of PTMs built on the multi-layer transformer encoder architecture
- pre-trained
- Masked language modeling
- Next sentence prediction
- fine-tuned
- add an output layer
- Generative Pre-trained Transformer models (GPT)
- are another type of PTMs based on the transformer decoder architecture
- masked self-attention mechanisms
- pre-trained
- the biggest difference from BERT is the directionality (unidirectional)
visual transformer
【category1】: backbone for image classification
- the transformer's input is tokens; in NLP these are the embedded word-piece sequence, and in CV they are visual tokens representing a certain semantic concept
- visual tokens can come from CNN features
- or directly from small image patches
Models that purely use transformers for image classification include iGPT, ViT and DeiT
iGPT
- pretraining stage + finetuning stage
- pre-training stage
- self-supervised, hence weaker results
- given an unlabeled dataset
- train the model by minimizing -log(density), which seems to force the raster ordering to be predicted correctly
- fine-tuning stage
- average pool + fc + softmax
- jointly train with L_gen & L_CE
- ViT
- pre-trained on large datasets
- standard transformer’s encoder + MLP head
- treats all patches equally
- has a class token similar to BERT's
- during training it gathers knowledge of the entire class
- at inference, only this first logit is used for prediction
- fine-tuning
- swap in a zero-initialized MLP head
- use higher resolution & interpolate the PE
- DeiT
- Data-efficient image transformer
- better performance with
- a more cautious training strategy
- and a token-based distillation
【category2】: High/Mid-level Vision
【category3】: Low-level Vision
【category4】: Video Processing
efficient transformer: slimming & acceleration
- Pruning and Decomposition
- Knowledge Distillation
- Quantization
Compact Architecture Design
ViT: AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
Motivation
- attention in vision
- either in conjunction with CNNs
- or replacing certain parts of a CNN
- overall still CNN-based
- use a pure transformer on a sequence of image patches
- verified on image classification tasks in a supervised fashion
Arguments
- transformers lack some of the inductive biases inherent to CNNs, so they do not generalize well when trained on insufficient data
- however, large-scale training trumps inductive bias; on large datasets ViT is better
- naive application of self-attention
- building pairwise relations between all pixels: far too expensive
- requires approximations: local attention / reduced sizes
- we use transformer
- with global self-attention
- to full-sized images
Method
input 1D-embedding sequence
- unfold the image $x\in R^{H\times W\times C}$ into patches $\{x_p \in R^{P^2 C}\}$
- thus sequence length $N=HW/P^2$
- patch embedding:
- use a trainable linear projection
- fixed dimension size throughout
- position embedding:
- added to the patch embedding
- standard learnable 1D position embedding
- prepended embedding:
- a prepended learnable embedding $x_{class}$
- similar to BERT's class token
- these three embeddings combine to form the input sequence, as in the sketch below
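A NumPy sketch of building the ViT input sequence (parameter names are my own): patchify, linearly project, prepend the class token, add position embeddings.

```python
import numpy as np

def vit_input_sequence(img, P, W_proj, cls_token, pos_emb):
    """img: [H, W, C]; W_proj: [P*P*C, D]; cls_token: [D]; pos_emb: [N+1, D]."""
    H, W, C = img.shape
    patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, P * P * C)          # N = HW/P^2 flattened patches
    tokens = patches @ W_proj                         # patch embeddings, [N, D]
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)  # prepend x_class
    return tokens + pos_emb                           # add learnable 1D position embeddings

H = W = 224; P = 16; C = 3; D = 768
N = (H // P) * (W // P)
rng = np.random.default_rng(0)
seq = vit_input_sequence(rng.normal(size=(H, W, C)), P,
                         rng.normal(size=(P * P * C, D)) * 0.02,
                         np.zeros(D), np.zeros((N + 1, D)))
print(seq.shape)  # (197, 768)
```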
transformer encoder
- follow the original Transformer
- alternating MSA and MLP blocks
- layer norm LN
- residual
GELU
hybrid architecture
- the input sequence can also come from CNN feature maps
- in that case the patch size can be 1x1
classification head
- attached to $z_L^0$: the class token is what makes the prediction
- during pre-training it is an MLP
- for fine-tuning it is replaced with a zero-initialized single linear layer
workflow
- typically pre-train on large datasets first
- then fine-tune to downstream tasks
- when fine-tuning, swap in a new zero-initialized linear classification head
- when feeding images with higher resolution
- keep the patch size
- which results in a larger sequence length
- the pre-trained PE is then no longer meaningful
- we therefore perform 2D interpolation based on each patch's location in the original image (sketch below)
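A sketch of that 2D position-embedding interpolation, assuming a learnable PE of shape [1 + N, D] with the class token first and square patch grids (my own layout, not the paper's code):

```python
import numpy as np
from scipy.ndimage import zoom

def resize_pos_embed(pos_emb, grid_old, grid_new):
    """pos_emb: [1 + grid_old**2, D] -> [1 + grid_new**2, D]."""
    cls_pe, patch_pe = pos_emb[:1], pos_emb[1:]
    D = patch_pe.shape[1]
    patch_pe = patch_pe.reshape(grid_old, grid_old, D)
    scale = grid_new / grid_old
    patch_pe = zoom(patch_pe, (scale, scale, 1), order=1)   # bilinear resize on the 2D grid
    return np.concatenate([cls_pe, patch_pe.reshape(-1, D)], axis=0)

pe = np.random.default_rng(0).normal(size=(1 + 14 * 14, 768))   # 224/16 = 14
print(resize_pos_embed(pe, 14, 24).shape)                        # 384/16 = 24 -> (577, 768)
```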
training details
- Adam:$\beta_1=0.9,\beta_2=0.999$
- batch size 4096
- high weight decay 0.1
- linear lr warmup & decay
fine-tuning details
- SGDM
- cosine LR
- no weight decay
- weight averaging with factor 0.9999 (Polyak / EMA averaging)
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Motivation
- use Transformer as visual tasks’ backbone
- challenges of Transformer in vision domain
- large variations of scales of the visual entities
- high resolution of pixels
- we propose hierarchical Transformer
- shifted windows
- self-attention in local windows
- cross-window connection
- verified on
- classification:ImageNet top1 acc 86.4
- detection:COCO box-MAP 58.7
- segmentation:ADE20K
- this paper mainly covers classification; for detection, Swin is used as the backbone trained with two-stage frameworks such as Mask R-CNN, and for segmentation, Swin is used as the backbone trained with UperNet; detailed model configs are listed in the official repo's readme
Arguments
- when transferring the Transformer's high performance in the NLP domain to the CV domain
- differences between the two modalities
- scale: in NLP, word tokens serve as the basic element, but in CV the shapes and sizes of visual entities vary; previous methods all set a fixed patch-token size
- resolution: the main problem is the computational complexity of self-attention, which is quadratic in image size
- we propose Swin Transformer
- hierarchical feature maps
- linear computational complexity with respect to image size
- hierarchical
- start from small patches
- merge in deeper layers
- so feature patches at different scales are merged
linear complexity
- compute self-attention locally in each window
- the number of patches per window is fixed, and the number of windows is proportional to image size
- hence linear
shifted window approach
- shifting the windows across layers builds bridges between neighboring windows
【QUESTION】all query patches within a window share the same key set
previous attempts at Transformers
- self-attention based backbone architectures
- replace some / all conv layers with self-attention
- the overall architecture is still ResNet
- slightly better acc
- larger latency caused by self-att
- self-attention complements CNNs
- as an additional block added to the backbone/head to provide long-range information
- some detection/segmentation networks have also started using the transformer encoder-decoder structure
- transformer-based vision backbones
- mainly ViT and its derivatives
- ViT requires large-scale training sets
- DeiT introduces training strategies
- but the high-resolution computation problem remains
Method
overview
- Swin-T: tiny version
- step 1 is patch partition:
- split the RGB image into non-overlapping patches
- patches: the tokens, i.e. the basic elements
- input feature dim: with patch size 4x4, dim = 4x4x3 = 48
- next is a linear embedding layer
- re-projects the raw features to a specified dimension
- specified dimension C: default = 96
- then come the Swin Transformer blocks
- the number of tokens is maintained
- patch merging layers are responsible for reducing the number of tokens
- the first patch merging layer concatenates every 2x2 group of neighboring patches: a 4C-dim vec each
- followed by a linear layer for re-projection
- number of tokens (resolution): (H/4*W/4)/4 = (H/8*W/8), changing just like a conventional CNN
- token dim: 2C
- followed by another set of Transformer blocks
- together these form stage 2 (likewise stage 3, stage 4)
Swin Transformer blocks
Compared with the original Transformer block, the original MSA is simply replaced with window-based MSA
Original attention: global computation leads to quadratic complexity
window-based attention:
- attention is computed only inside each window
- non-overlapping partition
- obviously lacks connections across windows
shifted window partitioning in successive blocks
Two attention blocks
The first uses the regular window partitioning strategy: starting from the top-left corner, take M=4, window size 4x4 (each window contains 4x4 patches)
The windows of the second layer are each shifted by M/2 relative to the previous layer
This introduces connections between neighboring non-overlapping windows of the previous layer
efficient computation
Shifted windows produce windows of unequal size, which is bad for parallel computation; the cyclic-shift trick used in the official implementation is sketched below
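A NumPy sketch (my own helper, not the official code) of window partitioning and the cyclic shift that keeps all shifted windows the same size for batched computation:

```python
import numpy as np

def window_partition(x, M):
    """x: [H, W, C] -> windows: [num_windows, M*M, C]."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)          # attention is computed inside each window

H = W = 8; C = 96; M = 4
x = np.random.default_rng(0).normal(size=(H, W, C))

regular = window_partition(x, M)                                         # layer l
shifted = window_partition(np.roll(x, shift=(-M // 2, -M // 2), axis=(0, 1)), M)
# layer l+1: cyclically shift the feature map by (M/2, M/2), partition again,
# and mask attention between patches that were not neighbors before the roll.
print(regular.shape, shifted.shape)  # (4, 16, 96) (4, 16, 96)
```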
relative position bias
- we compute local attention inside an MxM window: i.e. the input sequence has $M^2$ time-steps
- $Q, K, V \in R^{M^2 \times d}$
- $Attention(Q,K,V)=Softmax(QK^T/\sqrt{d}+B)V$
- this B is the local position bias; in 2D, the relative position along each axis lies in the range [-M+1, M-1]
- we parameterize a smaller-sized bias matrix $\hat B\in R^{(2M-1)\times(2M-1)}$
- values in $B \in R^{M^2\times M^2}$ are taken from $\hat B$ (gather sketched below)
- the learnt relative position bias can be used to initialize a fine-tuned model
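A sketch of how the $M^2 \times M^2$ bias B is gathered from the $(2M-1)\times(2M-1)$ table $\hat B$, following the index construction used in the official implementation (variable names are my own):

```python
import numpy as np

M = 7
coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))  # [2, M, M]
coords = coords.reshape(2, -1)                                             # [2, M*M]
rel = coords[:, :, None] - coords[:, None, :]            # [2, M*M, M*M], range [-M+1, M-1]
rel = rel.transpose(1, 2, 0) + (M - 1)                   # shift each axis to [0, 2M-2]
rel_index = rel[..., 0] * (2 * M - 1) + rel[..., 1]      # [M*M, M*M] indices into hat_B

hat_B = np.random.default_rng(0).normal(size=((2 * M - 1) ** 2,))  # learnable, (2M-1)^2 entries
B = hat_B[rel_index]                                               # [M*M, M*M], added to QK^T/sqrt(d)
print(B.shape)  # (49, 49)
```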
Architecture variants
base model: Swin-B, parameter count on par with ViT-B
Swin-T: 0.25x, comparable to ResNet-50 (DeiT-S)
Swin-S: 0.5x, comparable to ResNet-101
Swin-L: 2x
window size: M=7
query dim: d=32 (the input sequence dim doubles at each stage, and the number of heads doubles as well)
MLP: expansion ratio = 4
channel number C: the embedding dim of the first stage (doubles in later stages)
hypers:
- drop_rate:0.0
- drop_path_rate:0.1
acc
official repo: https://github.com/microsoft/Swin-Transformer/blob/main/get_started.md
Keras has also released an official example: https://github.com/keras-team/keras-io/blob/master/examples/vision/swin_transformers.py
model zoo
model | resolution | C | num_layers | num_heads | window_size
Swin-T | 224 | 96 | {2,2,6,2} | {3,6,12,24} | 7
Swin-S | 224 | 96 | {2,2,18,2} | {3,6,12,24} | 7
Swin-B | 224/384 | 128 | {2,2,18,2} | {4,8,16,32} | 7/12
Swin-L | 224/384 | 192 | {2,2,18,2} | {6,12,24,48} | 7/12
models/build.py
- SwinTransformer & SwinMLP: the former is the model from the paper, whose basic block is the transformer MSA plus MLP layers; the latter drops MSA and instead models the global relationship between neighboring windows with MLPs, implemented with conv1d.
DETR: End-to-End Object Detection with Transformers
Motivation
- new task formulation: a direct set prediction problem
- main ingredients
- a set-based global loss
- a transformer encoder-decoder architecture
- removes hand-designed components like NMS & anchors
- acc & run-time on par with Faster R-CNN on COCO
- significantly better performance on large objects
- lower performance on small objects
Arguments
Modern detectors run object detection in an indirect way
- regress and classify based on grid cells / anchors / proposals
- performance is constrained by the NMS mechanism, anchor design, and the target-anchor matching rules
end-to-end approach
- the transformer's self-attention mechanism explicitly models all pairwise interactions between elements: it has a built-in de-duplication (NMS) capability
- bipartite matching: a set loss function that matches predictions to GT boxes one-to-one; runs in parallel
- DETR does not require any customized layers, thus can be reproduced easily
- extends to the segmentation task: a simple segmentation head trained on top of a pre-trained DETR
set prediction: predict a set of bounding boxes and the category of each
- basic: multilabel classification
- the detection task has a near-duplicates issue
- set prediction is postprocessing-free; its global inference scheme avoids redundancy
- usual loss: bipartite matching (sketch below)
object detection
- set-based loss
- modern detectors use non-unique assignment rules together with NMS
- bipartite matching puts targets and predictions in one-to-one correspondence
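A sketch of the bipartite-matching idea with scipy's Hungarian solver; DETR's real matching cost mixes class probability, L1 box distance and GIoU, here only a toy L1 cost is used:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(pred_boxes, gt_boxes):
    """pred_boxes: [N, 4], gt_boxes: [M, 4] (N >= M); returns one-to-one pairs."""
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # L1 cost [N, M]
    pred_idx, gt_idx = linear_sum_assignment(cost)   # minimal-cost one-to-one assignment
    return list(zip(pred_idx, gt_idx))               # unmatched predictions -> "no object"

preds = np.array([[0.5, 0.5, 0.2, 0.2], [0.1, 0.1, 0.3, 0.3], [0.8, 0.8, 0.1, 0.1]])
gts = np.array([[0.12, 0.1, 0.3, 0.3], [0.79, 0.8, 0.1, 0.1]])
print(match(preds, gts))  # [(1, 0), (2, 1)]
```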
Method
overall
- three main components
- a CNN backbone
- an encoder-decoder transformer
- a simple FFN
backbone
- conventional r50
- input:$[H_0, W_0, 3]$
- output:$[H,W,C], H=\frac{H_0}{32}, W=\frac{W_0}{32}, C=2048$
transformer encoder
- reduce the channel dim to $d$: 1x1 conv, $d=256$
- collapse the spatial dimensions: feature sequence [d, HW], each spatial pixel is one feature token
- fixed positional encodings:
- added to the input of each attention layer
- 【QUESTION】added to K and Q, or to the embedding?
transformer decoder
- input: N embeddings of dim d
- called object queries: we predict a fixed number N of objects
- since the decoder is also permutation-invariant (everything is shared), the N input embeddings must all be different
- learnt positional encodings
- added to the input of each attention layer
- decodes the N objects in parallel
prediction FFN
- 3 layers, ReLU
- box prediction: normalized center coords & height & width
- class prediction:
- an additional class label $\varnothing$ denotes "no object"
auxiliary losses
- each decoder layer is followed by an FFN prediction head and a Hungarian loss
- shared FFN
- an additional shared LN to norm the inputs of FFN
- three components of the loss
- class loss:CE loss
- box loss
- GIOU loss
- L1 loss
technical details
- AdamW:
- initial transformer lr=10e-4
- initial backbone lr=10e-5
- weight decay=10e-4
- Xavier init
- imagenet-pretrained resnet weights with frozen batchnorm layers:r50 & r101,DETR & DETR-R101
- a variant:
- increase feature resolution version
- remove stage5’s stride and add a dilation
- DETR-DC5 & DETR-DC5-R101
- improve performance for small objects
- overall 2x computation increase
- augmentation
- resize input
- random crop: with prob 0.5, then resize
- transformer default dropout 0.1
- lr schedule
- 300 epochs
- drop by a factor of 10 after 200 epochs
- 4 images per GPU, total batch 64
For the segmentation task: panoptic segmentation
- add a mask head on the decoder outputs
- compute multi-head attention among
- decoder box predictions
- encoder outputs
- generate M attention heatmaps per object
- add a FPN styled CNN to recover resolution
pixel-wise argmax
UNETR: Transformers for 3D Medical Image Segmentation
Motivation
- UNet-style architecture for medical segmentation
- the encoder learns global context
- the decoder utilizes the representations to predict the semantic outputs
- the locality of CNNs limits long-range spatial dependencies
- our method
- use a pure transformer as the encoder
- learn sequence representations of the input volume
- global
- multi-scale
- the encoder connects directly to the decoder via skip connections
Arguments
- UNet structure
- the encoder extracts whole-image features
- the decoder recovers resolution
- skip connections supply the spatial information lost during downsampling
- localized receptive fields:
- disadvantage in capturing multi-scale contextual information
- e.g. brain tumors of varying sizes
- mitigation: atrous convs, still limited
- transformer
- self-attention mechanism in NLP
- highlight the important features of word sequences
- learn its long-range dependencies
- in ViT
- an image is represented as a patch embedding sequence
- our method
- formulation
- 1D seq2seq problem
- use embedded patches
- the first completely transformer-based encoder
- other UNet-transformer methods
- 2D (ours 3D)
- employ transformers only in the bottleneck (ours: pure transformer encoder)
- CNN & transformer in separate streams and fuse
Method
overview
transformer encoder
- input: a 1D sequence of input embeddings
- given a 3D volume $x \in R^{H\times W\times D\times C}$
- divide it into flattened uniform non-overlapping patches $x_v \in R^{L\times (N^3 C)}$
- $L=HWD/N^3$: the sequence length
- $N^3$: the patch resolution
- linear projection to K dims, giving $E \in R^{L\times K}$: remains constant through the transformer
- 1D learnable positional embedding $E_{pos} \in R^{L\times K}$
- 12 self-att blocks:MSA + MLP
- decoder & skip connections
- take the outputs of encoder blocks {3, 6, 9, 12}
- reshape them back to 3D volumes $[\frac{H}{N},\frac{W}{N},\frac{D}{N},K]$
- consecutive 3x3x3 conv + BN + ReLU
- bottleneck
- deconv by 2 to increase resolution
- then concat with the previously resized feature
- then jointly run consecutive convs
- then upsample with deconv…
- after concatenating back to the original resolution and running consecutive convs, a final 1x1x1 conv + softmax
- loss (sketch after this list)
- dice loss
- dice: compute the dice for each class channel, then average over classes
- 1 - dice
- ce loss
- for each pixel, compute the BCE, then average over all pixels
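A minimal sketch of the loss described above, assuming flattened voxels and one-hot targets (helper names are my own): per-class soft dice averaged over classes, plus pixel-wise cross-entropy averaged over voxels.

```python
import numpy as np

def dice_loss(prob, target, eps=1e-6):
    """prob, target: [num_voxels, num_classes]; returns 1 - mean per-class dice."""
    inter = (prob * target).sum(axis=0)
    union = prob.sum(axis=0) + target.sum(axis=0)
    dice = (2 * inter + eps) / (union + eps)       # dice per class channel
    return 1.0 - dice.mean()                       # average over classes, then 1 - dice

def ce_loss(prob, target, eps=1e-9):
    return -(target * np.log(prob + eps)).sum(axis=1).mean()   # mean over voxels

prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
target = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
print(dice_loss(prob, target), ce_loss(prob, target))
```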