transformers

startup

reference 1: https://mp.weixin.qq.com/s/Rm899vLhmZ5eCjuy6mW_HA

reference 2: https://zhuanlan.zhihu.com/p/308301901

  1. NLP & RNN

    • text involves contextual relationships
    • an RNN processes the sequence step by step, building up dependencies between earlier and later positions
    • drawbacks: fails on very long-range dependencies, hard to parallelize
  2. NLP & CNN

    • text is a 1-D time series
    • 1D CNN, parallel computation
    • drawback: CNNs excel at local information; there is a trade-off between kernel size and long-range dependencies
  3. NLP & transformer

    • for each incoming word, build a weight mapping over the whole vocabulary; the weights represent attention
    • self-attention mechanism
    • captures long-range dependencies

  4. applied to CV

    • insert similar self-attention layers into CNNs
    • or drop convolution layers entirely and use Transformers
  5. RNN & LSTM & GRU cell

    • standard components: input x, output y, hidden state h

    • RNN

      • at each step the RNN cell takes the current input $x_t$ and the previous hidden state $h_{t-1}$, and produces a new hidden state $h_t$, which is also the current output $y_t$ (see the NumPy sketch after this list)
      • formulation: $y_t, h_t = f(x_t, h_{t-1})$
      • same parameters for each time step: the same cell's weights are shared across all time steps
      • one problem: vanishing/exploding gradients

        • consider a simplified form of the hidden-state chain: $h_t = \theta^t h_0$; a forward pass over a sequence multiplies the same weights over and over again
        • the tanh activation can also drive neuron gradients toward vanishing/exploding

    • LSTM

      • key ingredients

        • cell: adds a separate cell-state pathway, which improves gradient flow
        • gates: gate structures filter the information carried forward, improving long-range dependencies

      • note that the LSTM carries two recurrent states: the cell state $c_t$ and the hidden state $h_t$; the output $y_t$ is still $h_t$

    • GRU

      • a variant of LSTM; still gate-based, with a simpler structure and fewer parameters than LSTM, and reportedly easier to train
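A minimal NumPy sketch of the RNN cell recurrence $y_t, h_t = f(x_t, h_{t-1})$ described in the RNN bullet above. The tanh cell and the shapes are illustrative assumptions; the point is that the same two weight matrices are reused at every time step, which is also why gradients vanish or explode over long sequences:

```python
import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, b):
    """One vanilla RNN step: h_t = tanh(Wxh @ x_t + Whh @ h_prev + b); y_t is just h_t."""
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + b)
    return h_t, h_t  # (y_t, h_t)

d_in, d_hid, T = 8, 16, 5
rng = np.random.default_rng(0)
Wxh = rng.normal(0.0, 0.1, (d_hid, d_in))
Whh = rng.normal(0.0, 0.1, (d_hid, d_hid))
b, h = np.zeros(d_hid), np.zeros(d_hid)

for t in range(T):   # the same cell (same weights) is applied at every time step
    y_t, h = rnn_step(rng.normal(size=d_in), h, Wxh, Whh, b)
```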

  6. papers

    [A page listing many papers] https://github.com/dk-liang/Awesome-Visual-Transformer

    [classics]

    * [Seq2Seq 2014] Sequence to Sequence Learning with Neural Networks, Google; the earliest encoder-decoder stacking-LSTM model for machine translation

    * [self-attention/Transformer 2017] Transformer: Attention Is All You Need, Google

    * [BERT 2019] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Google; NLP; input is a single sentence / packed sentence pair, a Transformer encoder extracts bidirectional cross-sentence representations, and the first output logit is used for classification

    [surveys]

    * [survey 2020] Efficient Transformers: A Survey, Google

    * [survey 2021] Transformers in Vision: A Survey, UAE

    * [survey 2021] A Survey on Visual Transformer, Huawei

    [classification]

    * [ViT 2020] AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE, Google; classification; replaces the CNN with a transformer encoder plus a classification head, each feature patch is one input embedding and the channel dim is the vector dim; essentially the same as BERT with the input sequence replaced by patches; follow-ups include DeiT and LV-ViT

    * [BoTNet 2021] Bottleneck Transformers for Visual Recognition, Google; replaces the last few stages of a CNN backbone with MSA

    * [CvT 2021] CvT: Introducing Convolutions to Vision Transformers, Microsoft

    * [Swin 2021] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, Microsoft

    * [PVT 2021] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions; multi-scale features, like Swin

    [detection]

    * [DETR 2020] End-to-End Object Detection with Transformers, Facebook; object detection; CNN + transformer (encoder-decoder) + prediction heads, each feature pixel is one input embedding and the channel dim is the vector dim

    * [Deformable DETR]

    * [Anchor DETR]

    * see the separate notes "det-transformers"

    [segmentation]

    * [SETR] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, Fudan; underwhelming; essentially an FCN with the backbone swapped for a transformer

    [UNet + Transformer]:

    * [UNETR 2021] UNETR: Transformers for 3D Medical Image Segmentation, NVIDIA; uses a transformer encoder directly as the UNet encoder

    * [TransUNet 2021] TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation; adds transformer blocks inside the encoder stream

    * [TransFuse 2021] TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation; fuses CNN features and Transformer features with a BiFusion module

    * see the separate notes "seg-transformers"

Sequence to Sequence

  1. [a keras tutorial](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)

    • general case

      • extract the information of the entire input sequence
      • then start generating the output sequence
    • seq2seq model workflow

      • a (stack of) RNN layer(s) acts as encoder
        • processes the input sequence
        • returns its own internal state: discard the RNN outputs, keep only the internal states
        • what the encoder produces is called the Context Vector
      • a (stack of) RNN layer(s) acts as decoder
        • given previous characters of the target sequence
        • it is trained to predict the next characters of the target sequence
        • teacher forcing:
          • the input is the target sequence, and the training objective is to output the same target sequence offset by one timestep
          • teacher forcing can also be dropped: feed the model's own prediction as the next step's input
        • homogeneity of the Context Vector: the decoder conditions on one fixed Context Vector (as initial_state) for the whole output sequence
      • when inference (see the sketch after this list)
        • step 1: obtain the input sequence's state vectors
        • repeat
          • feed the decoder the input states and the output sequence so far (beginning with a start token)
          • take the next character from the prediction
        • append the character to the output sequence
        • until: the end character is produced / the character limit is hit
    • implementation

      https://github.com/AmberzzZZ/transformer/blob/master/seq2seq.py
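A sketch of the inference loop described above. `encode` and `decoder_step` are hypothetical stand-ins for a trained encoder/decoder (for a concrete Keras version, see the seq2seq.py linked above); only the loop structure matters here:

```python
import numpy as np

START, END, MAX_LEN = "\t", "\n", 50   # start/end tokens and a character limit (assumed values)

def greedy_decode(input_seq, encode, decoder_step, id_to_char):
    states = encode(input_seq)                 # 1. get the input sequence's state vectors
    prev_char, decoded = START, ""
    while True:                                # 2. repeat
        probs, states = decoder_step(prev_char, states)   # feed states + previous character
        next_char = id_to_char[int(np.argmax(probs))]     # take the most likely next character
        if next_char == END or len(decoded) >= MAX_LEN:   # 3. stop on end token / length limit
            break
        decoded += next_char                   # 4. append it to the output sequence
        prev_char = next_char
    return decoded
```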

  2. one step further

    • directions for improvement
      • bi-directional RNN: crudely reversing the sequence already gives a solid gain
      • attention: essentially a weighting over the encoder's output Context Vectors
      • ConvS2S: not read yet
    • these are mostly proposed to address RNN weaknesses
  3. Motivation

    • present a general end-to-end sequence learning approach
      • multi-layered LSTMs
      • encode the input seq to a fix-dim vector
      • decode the target seq from the fix-dim vector
    • LSTM did not have difficulty on long sentences
    • reversing the order of the words improved performance

  4. Method

    • standard RNN

      • given a sequence $(x_1, x_2, …, x_T)$

      • iterating: $h_t = \mathrm{sigm}(W^{hx}x_t + W^{hh}h_{t-1})$, $y_t = W^{yh}h_t$

      • if the input and output lengths are known and fixed in advance, a single RNN can already model the seq2seq problem

      • if the input and output lengths differ and follow some more complicated relationship, two RNNs are needed: one maps the input seq to a fixed-size vector, the other maps that vector to the output seq; but this runs into the long-term-dependency issue

    • LSTM

      • the LSTM carries information about the entire sequence throughout (as illustrated in the paper's figure)
    • our actual model

      • use two LSTMs: the encoder-decoder split increases model capacity
      • an LSTM with four layers: deeper
      • reversed input sequence: the actual start of the source sentence ends up closer to the start of the translation, which "makes it easy for SGD to establish communication"
    • training details (a gradient-clipping sketch follows this list)

      • LSTM: 4 layers, 1000 cells
      • word embedding: 1000-dim (input vocab 160,000, output vocab 80,000)
      • naive softmax
      • uniform initialization: (-0.08, 0.08)
      • SGD, lr=0.7, halved every half epoch after the first 5 epochs, 7.5 epochs in total
      • gradient norm clipping: rescale the gradient when its L2 norm exceeds a threshold
      • all sentences in a minibatch are roughly of the same length
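A minimal sketch of the gradient-norm clipping mentioned in the training details: rescale the whole gradient when its global L2 norm exceeds a threshold (the threshold value and the NumPy gradient lists are assumptions for illustration):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = max_norm / total_norm if total_norm > max_norm else 1.0
    return [g * scale for g in grads]
```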

Transformer: Attention Is All You Need

  1. Motivation

    • sequence2sequence models
      • encoder + decoder
      • RNN / CNN + an attention path
    • we propose Transformer
      • based solely on attention mechanisms
      • more parallelizable and less training time
  2. Main arguments

    • sequence modeling
      • mainstream: RNN, LSTM, gated RNNs
        • align the positions to computation time steps
        • the sequential nature inherently blocks parallelization
      • attention mechanisms act as an integral part
        • in previous work, used in conjunction with an RNN
      • for parallelization
        • some methods use CNNs as basic building blocks
        • but find it difficult to learn dependencies between distant positions
    • we propose Transformer
      • relies entirely on an attention mechanism
      • draws global dependencies
    • self-attention
      • relating different positions of a single sequence
      • to generate an overall representation of the sequence
  3. Method

    • encoder-decoder

      • encoder: doc2emb
        • given an input sequence of symbol representations $(x_1, x_2, …, x_n)$
        • map it to a sequence of continuous representations $(z_1, z_2, …, z_n)$ (the embeddings)
      • decoder: embeddings to output sequence
        • given the embeddings z
        • generate an output sequence $(y_1, y_2, …, y_m)$ one element at a time
        • the previously generated symbols serve as additional input when computing the current time step
    • Transformer Architecture

      • Transformer use

        • for both encoder and decoder
        • stacked self-attention and point-wise fully-connected layers

      • encoder

        • N=6 identical layers
        • each layer has 2 sub-layers
          • multi-head self-attention mechanism
          • position-wise fully connected layer
        • residual
          • applied around each of the two sub-layers
          • add & layer norm
        • d=512
      • decoder

        • N=6 identical layers
        • 3 sub-layers
          • [new] masked multi-head self-attention: builds in the prior that an output embedding may only be computed from embeddings at earlier time steps
          • multi-head (encoder-decoder) attention mechanism
          • position-wise fully connected layer
        • residual
      • attention (see the NumPy sketch at the end of this Method section)

        • reference: https://bbs.cvmart.net/articles/4032
        • step 1: project the embeddings (input $A\in R^{d\times N}$, one column per token) to query-key-value triplets
          • $Q = W_Q^{d\times d} A^{d\times N}$
          • $K = W_K^{d\times d} A^{d\times N}$
          • $V = W_V^{d\times d} A^{d\times N}$
        • step 2: scaled dot-product attention
          • $Att^{N\times N}=softmax(K^TQ/\sqrt{d})$
          • $B^{d\times N} = V^{d\times N}\,Att^{N\times N}$
        • multi-head attention
          • the step-1 & step-2 operations above perform a single attention function
          • in fact we can use h sets of projections to get h triplets $\{Q,K,V\}^h$, run the attention computation in parallel, and obtain h outputs $\{B^{d_h\times N}\}^h$
          • concat & project
            • concat along the feature dim: $B\in R^{(d_h h)\times N}$
            • linear projection: $out = W^{d\times(d_h h)} B$
          • h = 8
          • $d_{model}/h = 64$: the per-head embedding dim
          • $d_q = d_k = d_v = 64$: the per-head query/key/value dim
      • positional encoding (also covered in the sketch at the end of this Method section)

        • mathematically it amounts to a hand-crafted mapping matrix $W^P$ applied to a one-hot position vector $p$:

        • the encoding is $PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}})$, $PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}})$

          • pos is the position within the sequence x
          • 2i and 2i+1 index the dimensions of the embedding
      • point-wise feed-forward network

        • fc-ReLU-fc
        • dim_fc = 2048
        • dim_in & dim_out = 512
    • how it runs

      • the encoder can be computed in parallel

        • the input is the sequence embedding plus the positional embedding: $A\in R^{d\times N}$
        • it passes through the repeated blocks
        • the output is another sequence: $B\in R^{d\times N}$
        • self-attention: Q, K and V all come from the same source
        • the essence of the encoder is resolving self-attention:
          • parallel global pairwise comparison, done in one shot
          • an RNN has to go step by step
          • a CNN has to stack layers
      • the decoder can be parallelized during training, but runs step by step at inference

        • its inputs are the encoder's output and the decoder's output embedding from the previous time step

        • its output is the probability of the output word at the current time-step's position

        • the first attention layer is self-attention over the output embeddings: to decode sequentially like an RNN, each time step may only use the outputs of earlier positions as input, which is enforced by masking

          • given the target sequence "<start> I have a cat", 5 elements
          • the mask is then a lower-triangular matrix in $R^{5\times 5}$
          • the input embeddings are transformed into the three matrices Q, K and V

          • attention is still computed as $Att=K^TQ$

          • some of these attention entries are illegal: a query may only attend to positions at or before its own, so the attention matrix is multiplied element-wise by the mask: $Att=M \odot Att$

          • softmax: $Att=softmax(Att)$

          • weight the values: $B = V\,Att$

          • concat & projection

        • the second attention layer is attention between the input and output sequences: its keys and values come from the encoder, its queries from the previous decoder block's output
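A NumPy sketch tying together the pieces of this Method section: the sinusoidal positional encoding, scaled dot-product attention with an optional mask (the decoder's lower-triangular mask), and multi-head attention with the final concat-and-project step. It is written row-major (N tokens of dim d_model) rather than in the column notation above, and all weights/shapes are illustrative assumptions:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal PE: PE[pos,2i]=sin(pos/10000^(2i/d)), PE[pos,2i+1]=cos(...); assumes even d_model."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                              # added to the token embeddings

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q,K,V: (..., N, d_head). Masked-out entries are set to -1e9 before the softmax,
    which plays the same role as the element-wise mask M described above."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)         # (..., N_q, N_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ V

def multi_head_attention(x_q, x_kv, Wq, Wk, Wv, Wo, h, mask=None):
    """x_q: (N_q, d_model), x_kv: (N_k, d_model); W*: (d_model, d_model); h heads of dim d_model/h."""
    N_q, d_model = x_q.shape
    d_head = d_model // h
    split = lambda x, W: (x @ W).reshape(-1, h, d_head).transpose(1, 0, 2)   # (h, N, d_head)
    Q, K, V = split(x_q, Wq), split(x_kv, Wk), split(x_kv, Wv)
    out = scaled_dot_product_attention(Q, K, V, mask)      # (h, N_q, d_head)
    out = out.transpose(1, 0, 2).reshape(N_q, d_model)     # concat the heads
    return out @ Wo                                        # final linear projection

# decoder-style usage: a lower-triangular (causal) mask over a 5-token target sequence
N, d_model, h = 5, 512, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(N, d_model)) + positional_encoding(N, d_model)
Wq, Wk, Wv, Wo = (rng.normal(0.0, 0.02, (d_model, d_model)) for _ in range(4))
causal_mask = np.tril(np.ones((N, N), dtype=bool))
y = multi_head_attention(x, x, Wq, Wk, Wv, Wo, h, mask=causal_mask)   # (5, 512)
```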

  4. why self-attention

    • evaluation dimensions
      • total computational complexity per layer
      • amount of computation that can be parallelized
      • path length between long-range dependencies
    • given an input sequence with length N & dim $d_{in}$, and an output sequence with dim $d_{out}$
      • an RNN needs N sequential applications of $W\in R^{d_{in} \times d_{out}}$
      • a CNN needs a stack of N/k layers (kernel width k), each applying kernels $W\in R^{k\times d_{in}\times d_{out}}$, so per layer it costs roughly k times an RNN step
  5. training

    • optimizer: $Adam(lr, \beta_1=0.9, \beta_2=0.98, \epsilon=10^{-9})$
    • lr schedule: warmup over 4000 steps, then decay (see the schedule sketch after this list)
    • dropout

      • residual dropout: dropout applied to each sub-layer's output before the residual add (not stochastic depth)
      • dropout applied to the sum of the embeddings & PE, for both encoder and decoder
      • drop_rate = 0.1
    • label smoothing: smooth_factor = 0.1
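The warmup-then-decay schedule from the paper, $lrate = d_{model}^{-0.5}\cdot\min(step^{-0.5},\ step\cdot warmup^{-1.5})$, as a small Python helper:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for `warmup_steps`, then inverse-square-root decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```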

  6. Experiments

    • A: varying the number of attention heads; both too many and too few hurt
    • B: reducing the attention key dim hurts
    • C & D: bigger models + dropout help
    • E: learnable vs. sinusoidal PE: nearly identical
    • finally, the big-model hyperparameters

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  1. Motivation

    • BERT: Bidirectional Encoder Representations from Transformers
      • Bidirectional
      • Encoder
      • Representations
      • Transformers
    • workflow
      • pretrain bidirectional representations from unlabeled text
      • tune with one additional output layer to obtain the model
    • SOTA
      • GLUE score 80.5%
  2. Main arguments

    • pretraining is effective in NLP tasks
      • feature-based method: uses task-specific architectures and only consumes the pretrained model's features
      • fine-tuning method: directly fine-tunes the pretrained model
      • both share the same pre-training objective: use unidirectional language models to learn general language representations
      • reduce the need for many heavily-engineered task-specific architectures
    • current methods' limitations
      • unidirectional:
        • limits the choice of architectures
        • in fact both the left and right context of a token matter; looking only at the left context is not enough
      • simply concatenating two independent L2R and R2L models (biRNN-style)
        • independent
        • shallow concat
    • BERT
      • masked language model: predict masked-out words within a sequence
      • next sentence prediction: trains text-pair representations
  3. Method

    • two steps

      • pre-training
        • unlabeled data
        • different pre-training tasks
      • fine-tuning
        • labeled data of the downstream tasks
        • fine-tune all the params
      • the models of the two stages differ only in their output layers

        • e.g. a question-answering model
        • in the pre-training stage the input is two sentences, with a CLS symbol at the start of the input and a SEP symbol separating the two sentences
        • in the fine-tuning stage the inputs are the question and the answer respectively [what exactly is the output here?]

    • architecture

      • multi-layer bidirectional Transformer encoder

        • number of transformer blocks L
        • hidden size H
        • number of self-attention heads A
        • FFN dim 4H
      • BERT base: L=12, H=768, A=12

      • BERT large: L=24, H=1024, A=16

      • input/output representations

        • a single sentence / two packed sentences:
          • packed sentences are joined with the special token SEP
          • segment embedding: additionally, a learned embedding is added to every token indicating which sentence it belongs to
        • use WordPiece embeddings with a 30,000-token vocabulary
        • the first token of the input sequence is always the special symbol CLS; its final hidden state is used as the aggregate representation of the whole sequence for classification tasks
        • overall, the network's input representation is obtained by summing the token embeddings (with the special symbols inserted), the segment embeddings (SE) and the position embeddings (PE)

    • pre-training (see the masking sketch after this Method section)

      • two unsupervised tasks
        • Masked LM (MLM)
          • mask some percentage of the input tokens at random: 15%
            • 80% of the time replace with the MASK token
            • 10% of the time replace with a random token
            • 10% of the time leave it unchanged
          • then predict those masked tokens
          • the final hidden states corresponding to the masked tokens are fed into a softmax
          • compared with traditional left2right / right2left / concat models
            • it uses both the left and the right context
            • it only predicts the masked tokens instead of reconstructing the whole sentence
        • Next Sentence Prediction (NSP)
          • for relationships between sentences:
            • e.g. question & answer, sentence entailment
            • not directly captured by language modeling, which intuitively learns token-level relationships
          • binarized next sentence prediction task
            • pick sentences A & B:
              • 50% of the time B is the actual next sentence (IsNext)
              • 50% of the time B is random (NotNext)
            • this forms a binary classification problem: again predicted from the CLS token's hidden state C
    • fine-tuning

      • BERT accommodates many downstream tasks: single text or text pairs
      • simply assemble the inputs accordingly and fine-tune end-to-end
      • the output again uses the CLS token's hidden state C for prediction, followed by a classification head
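A NumPy sketch of the MLM corruption rule above: 15% of tokens are selected; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged. The -100 ignore-label convention and the random token ids are illustrative assumptions, not part of the paper:

```python
import numpy as np

def mlm_corrupt(token_ids, vocab_size, mask_id, rng, p_select=0.15):
    token_ids = np.asarray(token_ids)
    inputs = token_ids.copy()
    labels = np.full_like(token_ids, -100)          # -100: position not predicted
    selected = rng.random(token_ids.shape) < p_select
    labels[selected] = token_ids[selected]          # loss is computed only on selected positions
    r = rng.random(token_ids.shape)
    inputs[selected & (r < 0.8)] = mask_id          # 80%: replace with [MASK]
    rand = selected & (r >= 0.8) & (r < 0.9)        # 10%: replace with a random token
    inputs[rand] = rng.integers(0, vocab_size, rand.sum())
    return inputs, labels                           # remaining 10%: left unchanged
```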

A Survey on Visual Transformer

  1. Motivation

    • provide a comprehensive overview of the recent advances in visual transformers
    • discuss the potential directions for further improvement
    • lay out a development timeline of visual transformers

  2. Categorized by application scenario

    • backbone: classification
    • high/mid-level vision: usually semantics-related, e.g. detection / segmentation / pose estimation
    • low-level vision: operates on the image itself, e.g. super-resolution / image generation; fewer applications so far
    • video processing

  3. revisiting transformer

    • key concepts: sentence, embedding, positional encoding, encoder, decoder, self-attention layer, encoder-decoder attention layer, multi-head attention, feed-forward neural network

    • self-attention layer

      • the input vector is transformed into 3 vectors
        • the input vector is embedding + PE(pos, i): pos is the word's position in the sequence, i is the index of the PE element within the embedding vector
        • query vec q
        • key vec k
        • value vec v
        • $d_q = d_k = d_v = d_{model} = 512$
      • then calculate: $Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V$
      • encoder-decoder attention layer
        • K and V come from the encoder
        • Q comes from the previous layer
        • the computation is analogous
    • multi-head attention
      • a single attention head is one softmax: it tends to capture one strong pairwise correlation while suppressing correlations with other words
      • since a word is often strongly related to several words, multiple attention heads are needed
      • multi-head: different QKV matrices are used for different heads
      • given an input vector and the number of heads h
        • first produce h {q, k, v} triplets
        • $d_q=d_k=d_v=d_{model}/h=64$
        • compute an attention vector for each of the h triplets, giving h context vectors of shape [b, d]
        • concat along the d-axis and linearly project to the final [b, d] vector
    • residual & layer-norm: layer norm is applied after the residual add
    • feed-forward network
      • fc-GELU-fc
      • $d_h=2048$
    • final layer in the decoder
      • dense + softmax
      • $d_{words}=$ number of words in the vocabulary
    • when applied to CV tasks
      • most transformers adopt the original transformer's encoder module
      • used as a feature selector
      • compared with CNNs: can capture long-distance characteristics and derive global information
      • compared with RNNs: can be computed in parallel
    • computational cost
      • first, the three linear projection layers: linear in the sequence length n, with per-token cost proportional to $d_{model}^2$
      • then the self-attention layer: the QKV matrix multiplications are quadratic in n, O(n^2)
      • multi-head attention adds one more output linear projection layer, which is again linear in n
  4. revisiting transformers for NLP

    • the earliest approach was RNN + attention: the RNN's sequential nature limits long-range modeling, parallelization, and model scale
    • the transformer's attention-only structure solves these issues and enabled large pre-trained models (PTMs) for NLP

    • BERT and its variants

      • are a series of PTMs built on the multi-layer transformer encoder architecture
      • pre-trained
        • Masked language modeling
        • Next sentence prediction
      • fine-tuned
        • add an output layer
    • Generative Pre-trained Transformer models (GPT)
      • are another type of PTMs based on the transformer decoder architecture
      • masked self-attention mechanisms
      • pre-trained
        • the biggest difference from BERT is directionality: GPT is unidirectional
  5. visual transformer

    • [category 1]: backbone for image classification

      • the transformer's inputs are tokens: in NLP, a tokenized word sequence in embedding form; in CV, a visual token representing a certain semantic concept
        • a visual token can come from a CNN feature map
        • or directly from a small image patch
      • models that purely use a transformer for image classification include iGPT, ViT and DeiT

      • iGPT

        • pretraining stage + finetuning stage
        • pre-training stage
          • self-supervised, hence weaker results
          • given an unlabeled dataset
          • train the model by minimizing -log(density), which effectively forces correct raster-order prediction
        • fine-tuning stage
          • average pool + fc + softmax
          • jointly train with L_gen & L_CE
      • ViT
        • pre-trained on large datasets
          • standard transformer encoder + MLP head
          • treats all patches equally
          • has something analogous to BERT's class token
            • during training it gathers knowledge of the entire sequence for the class prediction
            • at inference, only this first logit is used for the prediction
        • fine-tuning
          • swap in a zero-initialized MLP head
          • use higher resolution & interpolate the PE
      • DeiT
        • Data-efficient image transformer
        • better performance with
          • a more cautious training strategy
          • and a token-based distillation
    • [category 2]: High/Mid-level Vision

    • [category 3]: Low-level Vision

    • [category 4]: Video Processing

    • efficient transformers: slimming & acceleration

      • Pruning and Decomposition
      • Knowledge Distillation
      • Quantization
      • Compact Architecture Design

ViT: AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

  1. Motivation

    • attention in vision
      • either in conjunction with a CNN
      • or replacing certain parts of a CNN
      • overall still CNN-based
    • use a pure transformer on a sequence of image patches
    • verified on image classification tasks in a supervised fashion
  2. Main arguments

    • transformers lack some of the inductive biases inherent to CNNs, so they do not generalize well when trained on insufficient data
    • however, large-scale training trumps inductive bias; on large datasets ViT does better
    • naive application of self-attention
      • pairwise relations between all pixels: far too expensive
      • approximations are needed: local attention / reduced resolution
    • we apply a transformer
      • with global self-attention
      • to full-sized images
  3. Method

    • input 1D embedding sequence (see the patch-embedding sketch after this Method section)

      • unfold the image $x\in R^{H\times W\times C}$ into patches $\{x_p \in R^{P^2\cdot C}\}$
      • thus the sequence length is $N=HW/P^2$
      • patch embedding:
        • use a trainable linear projection
        • a fixed dimension size throughout
      • position embedding:
        • added to the patch embedding
        • standard learnable 1D position embedding
      • prepended embedding:
        • a prepended learnable embedding $x_{class}$
        • similar to BERT's class token
      • the combination of these three embeddings forms the input sequence
    • transformer encoder

      • follows the original Transformer
      • alternating MSA and MLP blocks
      • layer norm LN
      • residual
      • GELU

    • hybrid architecture

      • the input sequence can also come from CNN feature maps
      • in that case the patch size can be 1x1
    • classification head

      • attached to $z_L^0$: the class token is the one used for prediction
      • an MLP during pre-training
      • swapped for a zero-initialized single linear layer during fine-tuning
    • workflow

      • typically pre-train on large datasets first
      • then fine-tune on downstream tasks
      • when fine-tuning, replace the head with a new zero-initialized linear classification head
      • when feeding images with higher resolution
        • keep the patch size
        • which results in a larger sequence length
        • the pre-trained PE is then no longer meaningful
        • we therefore perform 2D interpolation of the PE based on each patch's location in the original image
    • training details

      • Adam: $\beta_1=0.9, \beta_2=0.999$
      • batch size 4096
      • high weight decay 0.1
      • linear lr warmup & decay
    • fine-tuning details

      • SGDM
      • cosine LR
      • no weight decay
      • weight averaging (Polyak/EMA averaging) with factor 0.9999
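A NumPy sketch of ViT's input pipeline described in this Method section: split the image into P x P patches, flatten and linearly project them, prepend a class token, and add positional embeddings. All weights here are random or zero stand-ins with illustrative shapes:

```python
import numpy as np

def vit_embed(img, P, W_proj, cls_token, pos_emb):
    """img: (H, W, C) -> input sequence (N+1, D) with N = HW/P^2 patch tokens + 1 class token."""
    H, W, C = img.shape
    patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, P * P * C)                      # (N, P^2*C) flattened patches
    tokens = patches @ W_proj                                     # trainable linear projection -> (N, D)
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0) # prepend the class token
    return tokens + pos_emb                                       # add (learnable) position embeddings

H = W = 224; P = 16; C = 3; D = 768
rng = np.random.default_rng(0)
n_tokens = (H // P) * (W // P) + 1
seq = vit_embed(rng.random((H, W, C)), P,
                rng.normal(0.0, 0.02, (P * P * C, D)),   # patch projection weights
                np.zeros(D),                             # class token
                np.zeros((n_tokens, D)))                 # position embeddings
```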

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

  1. Motivation

    • use Transformer as visual tasks’ backbone
    • challenges of Transformer in vision domain
      • large variations of scales of the visual entities
      • high resolution of pixels
    • we propose hierarchical Transformer
      • shifted windows
      • self-attention in local windows
      • cross-window connection
    • verified on
      • classification: ImageNet top-1 acc 86.4
      • detection: COCO box AP 58.7
      • segmentation: ADE20K
      • this paper mainly covers classification; for detection, Swin is used as the backbone and trained with two-stage frameworks such as Mask R-CNN; for segmentation, Swin is the backbone trained with UperNet; detailed model configurations are listed in the official repo's readme
  2. Main arguments

    • when transferring the Transformer's strong NLP performance to the CV domain
      • differences between the two modalities
        • scale: in NLP, word tokens serve as the basic element; in CV, the shape and size of a meaningful patch vary, yet previous methods all use patch tokens of a single fixed size
        • resolution: the main issue is the computational complexity of self-attention, which is quadratic in image size
      • we propose Swin Transformer
        • hierarchical feature maps
        • linear computational complexity with respect to image size
    • hierarchical
      • start from small patches
      • merge them in deeper layers
      • so feature patches at different scales get fused
    • linear complexity

      • compute self-attention locally within each window
      • the number of patches per window is fixed, and the number of windows is proportional to image size
      • hence the complexity is linear

    • shifted window approach

      • shifting the windows across layers builds bridges between neighboring windows
      • [QUESTION] all query patches within a window share the same key set

    • previous attempts at Transformers in vision

      • self-attention based backbone architectures
        • replace some/all conv layers with self-attention
        • the overall architecture is still ResNet-like
        • slightly better acc
        • larger latency caused by self-attention
      • self-attention as a complement to CNNs
        • added as additional blocks to the backbone/head to provide long-range information
        • some detection/segmentation networks have also started using the transformer's encoder-decoder structure
      • transformer-based vision backbones
        • mainly ViT and its derivatives
        • ViT requires large-scale training sets
        • DeiT introduces training strategies
        • but the high-resolution compute problem remains
  3. Method

    • overview

      • Swin-T: the tiny version
      • step 1 is patch partition:
        • split the RGB image into non-overlapping patches
        • patches: the tokens, the basic elements
        • raw feature input dim: with patch size 4x4, dim = 4x4x3 = 48
      • then a linear embedding layer
        • re-projects the raw features to a specified dimension
        • the specified dimension C: default = 96
      • next come the Swin Transformer blocks
        • the number of tokens is maintained
      • patch merging layers are responsible for reducing the number of tokens
        • the first patch merging layer concatenates every 2x2 group of neighboring patches: a 4C-dim vector each
        • followed by a linear re-projection layer
        • number of tokens (resolution): (H/4 * W/4)/4 = (H/8 * W/8), changing just like in a conventional CNN
        • token dims: 2C
        • followed by another stack of Transformer blocks
        • together this is called stage 2 (likewise stage 3 and stage 4)
    • Swin Transformer blocks (see the window-partition sketch after this Method section)

      • compared with the original Transformer block, the original MSA is simply replaced with window-based MSA

      • the original attention: global computation leads to quadratic complexity

      • window-based attention:

        • attention is computed only inside each window
        • non-overlapping partition
        • which obviously lacks connections across windows
      • shifted window partitioning in successive blocks

        • two consecutive attention blocks

        • the first uses the regular window partitioning strategy: starting from the top-left corner, e.g. M=4, window size 4x4 (each window contains 4x4 patches)

        • the second layer's windows are shifted by (M/2, M/2) relative to the previous layer

        • this introduces connections between neighboring non-overlapping windows of the previous layer

        • efficient computation

          • the shift produces windows of unequal sizes, which is bad for parallel computation; the paper handles this with a cyclic shift of the feature map plus attention masking, so the number of windows stays the same

      • relative position bias

        • local attention is computed inside each MxM window, i.e. the input sequence length is $M^2$
        • Q, K, V $\in R^{M^2\times d}$
        • $Attention(Q,K,V)=Softmax(QK^T/\sqrt{d}+B)V$
        • B is a local position bias; in 2D the relative offset along each axis lies in the range [-M+1, M-1]
        • we parameterize a smaller bias matrix $\hat B\in R^{(2M-1)\times(2M-1)}$
        • values in $B \in R^{M^2\times M^2}$ are taken from $\hat B$
        • the learnt relative position bias can also be used (via interpolation) to initialize a model fine-tuned with a different window size
    • Architecture variants

      • base model: Swin-B, parameter count comparable to ViT-B

      • Swin-T: 0.25x, comparable to ResNet-50 (DeiT-S)

      • Swin-S: 0.5x, comparable to ResNet-101

      • Swin-L: 2x

      • window size: M=7

      • query dim: d=32 (the input sequence dim and the number of heads both double from stage to stage)

      • MLP: expansion ratio = 4

      • channel number C: the embedding dim of the first stage (doubled in each later stage)

      • hypers:

        • drop_rate: 0.0
        • drop_path_rate: 0.1

      • accuracy: see the results tables in the paper (figure omitted here)
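A NumPy sketch of the (shifted) window partitioning described above: regular windows start from the top-left corner, and the shifted variant cyclically rolls the feature map by (-M//2, -M//2) so that the same windowing and attention kernel can be reused; the paper additionally masks attention across the rolled boundaries, which is omitted here:

```python
import numpy as np

def window_partition(x, M):
    """x: (H, W, C) feature map -> (num_windows, M*M, C) token groups for local attention."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

def shifted_window_partition(x, M):
    """Cyclic-shift the map by (-M//2, -M//2), then apply the same regular partition."""
    shifted = np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
    return window_partition(shifted, M)

x = np.random.default_rng(0).random((56, 56, 96))   # e.g. a stage-1 feature map, C=96
wins = window_partition(x, 7)                        # (64, 49, 96): 8x8 windows of 7x7 tokens
shifted_wins = shifted_window_partition(x, 7)        # same shape, windows offset by (3, 3)
```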

  4. official repo: https://github.com/microsoft/Swin-Transformer/blob/main/get_started.md

    • Keras also published an official example: https://github.com/keras-team/keras-io/blob/master/examples/vision/swin_transformers.py

    • model zoo

      model  | resolution | C   | num_layers | num_heads    | window_size
      Swin-T | 224        | 96  | {2,2,6,2}  | {3,6,12,24}  | 7
      Swin-S | 224        | 96  | {2,2,18,2} | {3,6,12,24}  | 7
      Swin-B | 224/384    | 128 | {2,2,18,2} | {4,8,16,32}  | 7/12
      Swin-L | 224/384    | 192 | {2,2,18,2} | {6,12,24,48} | 7/12

    • models/build.py

      • SwinTransformer & SwinMLP: the former is the model from the paper, whose basic block is the transformer MSA plus MLP layers; the latter drops MSA and instead models the global relationship between neighboring windows purely with MLPs, implemented with conv1d.

DETR: End-to-End Object Detection with Transformers

  1. Motivation

    • new task formulation: a direct set prediction problem
    • main ingredients
      • a set-based global loss
      • a transformer encoder-decoder architecture
      • removal of hand-designed components like NMS & anchors
    • acc & run-time on par with Faster R-CNN on COCO
      • significantly better performance on large objects
      • lower performance on small objects
  2. Main arguments

    • modern detectors run object detection in an indirect way

      • regression and classification on grids / anchors / proposals
      • performance is constrained by the NMS procedure, the anchor design, and the target-anchor matching scheme
    • end-to-end approach

      • the transformer's self-attention explicitly models all pairwise interactions between elements, which builds in the ability to remove duplicates (NMS)
      • bipartite matching: a set loss function that matches predictions one-to-one with gt boxes, run in parallel
      • DETR does not require any customized layers, thus can be reproduced easily
      • extends to segmentation: a simple segmentation head trained on top of a pre-trained DETR
    • set prediction: predict a set of bounding boxes with a category for each

      • basic form: multilabel classification
      • detection has a near-duplicates issue
      • set prediction is postprocessing-free; its global inference scheme avoids redundancy
      • usual loss: bipartite matching
    • object detection

      • set-based loss
        • modern detectors use non-unique assignment rules together with NMS
        • bipartite matching puts targets and predictions in one-to-one correspondence
  3. Method

    • overall

      • three main components
        • a CNN backbone
        • an encoder-decoder transformer
        • a simple FFN
    • backbone

      • conventional r50
      • input: $[H_0, W_0, 3]$
      • output: $[H,W,C], H=\frac{H_0}{32}, W=\frac{W_0}{32}, C=2048$
    • transformer encoder

      • reduce the channel dim to a smaller $d$ with a 1x1 conv ($d=256$ in the paper's base config)
      • collapse the spatial dimensions: a feature sequence [d, HW], each spatial pixel acting as one feature token
      • fixed positional encodings:
        • added to the input of each attention layer
        • [QUESTION] added to K and Q, or to the embeddings? (the paper adds the spatial encodings to the queries and keys at every attention layer)
    • transformer decoder

      • 输入N个dim=d的embedding
        • 叫object queries:表示我们预测固定值N个目标
        • 因为decoder也是permutation-invariant的(因为all shared),所以要输入N个不一样的embedding
        • learnt positional encodings
        • add them to the input of each attention layer
      • decodes the N objects in parallel
    • prediction FFN

      • 3 layer,ReLU,
      • box prediction:normalized center coords & height & width
      • class prediction:
        • an additional class label $\varnothing$ 表示no object
    • auxiliary losses

      • each decoder layer后面都接一个FFN prediction和Hungarian loss
      • shared FFN
      • an additional shared LN to norm the inputs of FFN
      • three components of the loss
        • class loss:CE loss
        • box loss
          • GIOU loss
          • L1 loss
    • technical details

      • AdamW:
        • initial transformer lr = 1e-4
        • initial backbone lr = 1e-5
        • weight decay = 1e-4
      • Xavier init
      • imagenet-pretrained resnet weights with frozen batchnorm layers: r50 & r101, i.e. DETR & DETR-R101
      • a variant:
        • an increased-feature-resolution version
        • remove stage 5's stride and add a dilation
        • DETR-DC5 & DETR-DC5-R101
        • improves performance on small objects
        • at an overall 2x computation increase
      • augmentation
        • resize input
        • random crop: with prob 0.5, then resize
      • transformer default dropout 0.1
      • lr schedule
        • 300 epochs
        • drop by a factor of 10 after 200 epochs
      • 4 images per GPU, total batch 64
    • for the segmentation task: panoptic segmentation

      • add a mask head on top of the decoder outputs
      • compute multi-head attention between
        • the decoder box predictions
        • the encoder outputs
      • generate M attention heatmaps per object
      • add an FPN-style CNN to recover resolution
      • pixel-wise argmax
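A sketch of the bipartite (Hungarian) matching behind DETR's set-based loss, using scipy's linear_sum_assignment. The cost below combines class probability and L1 box distance with assumed weights; the paper's full matching cost additionally includes a GIoU term:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes, w_cls=1.0, w_l1=5.0):
    """pred_probs: (N, num_classes), pred_boxes: (N, 4), gt_labels: (G,), gt_boxes: (G, 4) in cxcywh."""
    cost_cls = -pred_probs[:, gt_labels]                                     # (N, G) classification cost
    cost_l1 = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # (N, G) L1 box cost
    cost = w_cls * cost_cls + w_l1 * cost_l1
    pred_idx, gt_idx = linear_sum_assignment(cost)        # one-to-one assignment minimizing total cost
    return pred_idx, gt_idx   # predictions left unmatched are trained toward the "no object" class
```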

UNETR: Transformers for 3D Medical Image Segmentation

  1. Motivation

    • the UNet structure for medical segmentation
      • the encoder learns global context
      • the decoder utilizes the representations to predict the semantic outputs
      • the locality of CNNs limits long-range spatial dependency
    • our method
      • use a pure transformer as the encoder
      • learn sequence representations of the input volume
      • global
      • multi-scale
      • the encoder connects directly to the decoder with skip connections
  2. Main arguments

    • the UNet structure
      • the encoder extracts whole-image features
      • the decoder recovers resolution
      • skip connections re-inject spatial information that is lost during downsampling
      • localized receptive fields:
        • a disadvantage in capturing multi-scale contextual information
        • e.g. brain tumors of very different sizes
        • mitigation: atrous convs, still limited
    • transformer
      • the self-attention mechanism in NLP
        • highlights the important features of word sequences
        • learns their long-range dependencies
      • in ViT
        • an image is represented as a patch embedding sequence
    • our method
      • formulation
        • a 1D seq2seq problem
        • using embedded patches
      • the first completely transformer-based encoder
    • other unet + transformer methods
      • 2D (ours: 3D)
      • employ the transformer only in the bottleneck (ours: a pure transformer encoder)
      • CNN & transformer in separate streams, then fused
  3. Method

    • overview

    • transformer encoder

      • input: a 1D sequence of input embeddings
      • given a 3D volume $x \in R^{H\times W\times D\times C}$
      • divide it into flattened uniform non-overlapping patches $x_v\in R^{L\times (N^3 C)}$
        • $L=HWD/N^3$: the sequence length
        • $N^3$: the patch resolution
      • a linear projection maps each patch to a K-dim embedding, which remains constant through the transformer
      • 1D learnable positional embedding $E_{pos} \in R^{L\times K}$
      • 12 self-attention blocks: MSA + MLP
    • decoder & skip connections
      • take the outputs of encoder blocks {3, 6, 9, 12}
      • reshape them back to 3D volumes $[\frac{H}{N},\frac{W}{N},\frac{D}{N},K]$
      • consecutive 3x3x3 conv + BN + ReLU
      • bottleneck
        • deconv by 2 to increase resolution
        • then concat with the previous resized feature
        • then jointly run consecutive convs
        • then upsample with deconv, and so on
      • after concatenating up to the original resolution and applying consecutive convs, a final 1x1x1 conv + softmax
    • loss (a Dice-loss sketch follows this section)
      • dice loss
        • dice: compute the Dice score for each class channel, then average over classes
        • loss = 1 - dice
      • ce loss
        • per-pixel cross-entropy, averaged over all pixels
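A NumPy sketch of the Dice loss described above: a soft Dice score per class channel, averaged over classes, then 1 - dice. The epsilon smoothing and shapes are illustrative assumptions:

```python
import numpy as np

def dice_loss(probs, one_hot_target, eps=1e-6):
    """probs and one_hot_target: (num_classes, D, H, W); probs are softmax outputs."""
    axes = tuple(range(1, probs.ndim))                       # reduce over the spatial dims
    intersection = (probs * one_hot_target).sum(axis=axes)
    denom = probs.sum(axis=axes) + one_hot_target.sum(axis=axes)
    dice_per_class = (2.0 * intersection + eps) / (denom + eps)
    return 1.0 - dice_per_class.mean()                       # 1 - mean per-class Dice
```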