Less is More



diffusion

Published on 2023-05-17 |

papers

[survey 2023] Diffusion Models: A Comprehensive Survey of Methods and Applications

[VAE]

[DDPM 2020] Denoising Diffusion Probabilistic Models

[LDMs 2022] High-Resolution Image Synthesis with Latent Diffusion Models

AIGC text-to-image foundation models

  • Stable Diffusion

  • DALLE

  • GLIDE

Fine-tuning methods

  • ControlNet
  • Textual Inversion
  • Hypernetworks
  • Lora

Other generative models:

  • GAN

  • VAE

  • Autoregressive model

  • Normalizing flow

  • Energy-based model

Vision Tasks

  • RePaint: image inpainting / completion
  • GLIDE: text-to-image generation

The basic idea of generative diffusion modeling: systematically perturb the data distribution with a forward diffusion process, then learn a reverse diffusion process that recovers the data distribution.

  • Supervised generative vs. discriminative models

    • Supervised learning fits a model $Y=f(X)$ / $p(Y|X)$ from samples and labels {X, Y}; depending on what is modeled, supervised methods split into discriminative and generative models
    • Discriminative models: model the conditional distribution directly, supervising the quality of the Y predicted from a given X so as to fit the true distribution; all supervised regression and classification models are discriminative
    • Generative models: model the joint distribution $p(X,Y)$ and compute the posterior indirectly; common examples are Naive Bayes, Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), and the generator of a GAN
  • Unsupervised generative models

    • Model the distribution $p(X)$ of the input samples X, aiming to produce new samples that follow the same distribution as the training set
    • Once the model's output distribution is close to the true distribution of X, new samples can be "generated" simply by sampling from the model
  • Maximum likelihood estimation

    • Maximum likelihood estimation is a method for estimating the parameters of a probabilistic model

    • Assume the training data follows $P_{data}(X)$ and the generative model outputs the distribution $P_g(X;\theta)$; this defines a function of the model parameters $\theta$: $L(\theta) = \prod_{i=1}^N p_g(x_i;\theta)$

      • The best $\theta$ is the one under which producing all samples of the dataset has maximal probability: $\theta = \arg\max L(\theta)$
      • To avoid numerical underflow from multiplying many probabilities, the log-likelihood is used instead: $\theta = \arg\max\ \log L(\theta)$
      • Under maximum likelihood, every sample tries to raise the model density at its own location, but the density integrates to 1, so raising it at one point inevitably lowers it at others, until a relative equilibrium is reached

    • Equivalently, the objective can be written as minimizing the KL divergence between the data distribution and the model distribution

      • $\theta = \arg\min_\theta D_{KL}(p_{data} \,\|\, p_g) = \arg\min_\theta \mathbb{E}_{p_{data}}[\log p_{data} - \log p_g] = \arg\max_\theta \mathbb{E}_{p_{data}}[\log p_g]$
      • which is the same objective as above
  • VAE: Variational Auto-Encoder

    • V: Variational, i.e. variational inference

    • AE: Auto-Encoder

      • an encoder-decoder reconstruction model
        • encode: map the input x to a latent variable z
        • decode: reconstruct x from the latent variable z
        • the reconstruction loss is a per-pixel reconstruction error, e.g. MSE
      • The limitation of an AE is generalization: it only reconstructs what lies on the distribution it has seen, the latent space is too narrow, and a randomly sampled latent code decodes to garbage; this property actually suits anomaly detection well (GANomaly)
      • The idea is therefore to add noise to the latent code and widen the latent space; an intuitive way is to expand the latent code into a Gaussian distribution, with high probability near the true encoding and decreasing probability farther away, turning a single point into a whole region of the space

    • VAE

      • the encoder no longer outputs a latent vector z but a distribution, modeled by a mean m and a (log-)variance v; an extra Gaussian noise e is added, and the latent code is obtained indirectly as $z=\exp(v) \cdot e+m$
      • the loss is a reconstruction loss plus an auxiliary loss

        • auxiliary loss = $e^v-(1+v) + m^2$; its derivative w.r.t. v is $e^v-1$, so the minimum is attained at v=0
        • this prevents v from collapsing to $-\infty$, which would degrade the VAE back into an AE

      • In the end we obtain a decoder: sampling randomly around the latent distribution of a given input produces samples similar to that input (see the sketch below)
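
A minimal sketch of the reparameterization and the two loss terms above, following the $z=\exp(v)\cdot e+m$ formulation of these notes (the toy encoder/decoder and dimensions are illustrative, not any particular implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    # hypothetical flat encoder/decoder just to show the loss structure
    def __init__(self, dim=784, latent=16):
        super().__init__()
        self.enc = nn.Linear(dim, latent * 2)   # outputs [m, v]
        self.dec = nn.Linear(latent, dim)

    def forward(self, x):
        m, v = self.enc(x).chunk(2, dim=-1)
        e = torch.randn_like(m)                 # extra Gaussian noise
        z = torch.exp(v) * e + m                # reparameterized latent code
        recon = self.dec(z)
        recon_loss = F.mse_loss(recon, x)       # pixel-wise reconstruction loss
        aux_loss = (torch.exp(v) - (1 + v) + m ** 2).mean()  # e^v - (1+v) + m^2, minimum at v = 0
        return recon_loss + aux_loss
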

  • DDPM

    • A diffusion model consists of a forward diffusion process and a reverse denoising (sampling) process: the forward stage gradually adds noise to the image until it is destroyed into pure Gaussian noise, and the reverse stage learns to recover the original image from Gaussian noise; in the end we obtain a model that generates images from pure noise

    • Forward process (diffusion)

      • gradually add noise to a real image until it becomes pure noise
      • $x_t = \sqrt {\alpha_t}\, x_{t-1} + \sqrt {1-\alpha_t}\, \varepsilon_t$
        • $\varepsilon_t$ is random noise drawn from a standard normal distribution
        • $\sqrt {\alpha_t}$ is the image weight, $\sqrt{1-\alpha_t}$ is the noise weight
        • $\alpha_t = 1-\beta_t$ follows a fixed, known schedule; $\beta_t$ is usually small, and the accumulated $\overline\alpha_t$ gradually decreases
        • $\overline \alpha_t = \prod_{i=1}^t \alpha_i$ is the cumulative product of all previous $\alpha$
      • $x_t = \sqrt {\overline \alpha_t}\, x_{0} + \sqrt {1-\overline \alpha_t}\, \varepsilon$ (see the sketch below)
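
A minimal sketch of this closed-form forward step, assuming a simple linear beta schedule (names and values are illustrative):

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t: small, fixed schedule
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alpha_bar = torch.cumprod(alphas, dim=0)      # cumulative product of alpha

def q_sample(x0, t, noise=None):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    if noise is None:
        noise = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)        # t: [B] integer timesteps, x0: [B, C, H, W]
    return a.sqrt() * x0 + (1 - a).sqrt() * noise
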
    • Reverse process (denoising)

      • train a network (a UNet) to predict the noise added at each step, giving $\varepsilon_{\theta}(x_t,t)$
      • the training target is to make the predicted noise close to the true noise
    • Using the model

      • start from random noise and use the trained UNet noise model to progressively restore it into an image (see the sampling sketch below)
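
A minimal sketch of the sampling loop, reusing betas/alphas/alpha_bar from the previous sketch and assuming a trained noise predictor eps_model(x_t, t); sigma_t is taken as sqrt(beta_t) for simplicity:

import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, T=1000, device="cpu"):
    # start from pure Gaussian noise and denoise step by step
    x = torch.randn(shape, device=device)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = eps_model(x, t_batch)
        a, ab, b = alphas[t], alpha_bar[t], betas[t]
        mean = (x - b / (1 - ab).sqrt() * eps) / a.sqrt()
        x = mean + b.sqrt() * torch.randn_like(x) if t > 0 else mean
    return x
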

    • Drawbacks of diffusion models

      • the denoising process is very time-consuming and memory-hungry
      • because it trains on raw pixels, training is extremely slow; LDM's fix is to move training into a latent space, using a pretrained model to project the image into that latent space
  • Latent Diffusion Models

    • moves training into the latent space

    • supports general conditioning inputs: text / bbox, with cross-attention used to inject the multi-modal information

    • overview

      • VAE: $E$ is the encoder that maps the image into the latent space, $D$ is the decoder that maps the generated z back to pixel space
      • Condition encoder: encodes the multi-modal information; for text, BERT/CLIP is used
      • UNet with cross-attn: cross-attention injects the condition information to guide image generation
        • ResBlock
          • inputs are the latent feature and the timestep_embedding
          • the timestep_embedding encodes the timestep with a positional-embedding-like scheme
        • SpatialTransformer
          • a standard transformer block: self-attn - cross-attn - FFN (see the sketch below)
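
A minimal sketch of that self-attn → cross-attn → FFN ordering, with the condition (e.g. text) embedding passed in as context (illustrative only, not the reference implementation):

import torch
import torch.nn as nn

class SpatialTransformerBlock(nn.Module):
    def __init__(self, dim, ctx_dim, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x, context):
        # x: [b, hw, dim] flattened latent feature, context: [b, n_tokens, ctx_dim] condition tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]                 # self-attn over the latent tokens
        h = self.norm2(x)
        x = x + self.cross_attn(h, context, context)[0]    # inject the condition via cross-attn
        return x + self.ffn(self.norm3(x))
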
  • Stable Diffusion:https://jalammar.github.io/illustrated-stable-diffusion/
  • deeplearning.AI:https://learn.deeplearning.ai/diffusion-models
  • DDIM
    • sampling is slow; DDIM speeds up the process by skipping timesteps
    • sampling in latent space z

SSIS methods

Published on 2023-05-16 |

papers

[M2F] Masked-attention Mask Transformer for Universal Image Segmentation

[VITA 2022] VITA: Video Instance Segmentation via Object Token Association

[MobileViT v2] Separable Self-attention for Mobile Vision Transformers

VITA: Video Instance Segmentation via Object Token Association

  1. introduction

    • explicit object-oriented information can be a strong clue for understanding the context of the entire sequence

    • build video-level understanding through frame-level object tokens

      • it does not use a full spatio-temporal structure, for efficiency
      • nor does it rely on frame-to-frame attention, so it stays adaptive over long sequences
    • built on top of an off-the-shelf Transformer-based frame-level detector

      • a detector can be pretrained on images and then frozen while training on video
      • it can handle long and high-resolution videos
    • VIS methods

      • online VIS: usually associates a reference frame and the current frame, measuring similarities/correspondence
      • offline VIS: involves the whole video for prediction; with many frames, full spatio-temporal attention is too heavy, so it is usually applied in the decoder as instance-level attention

  2. method

    • Frame-level Detector

      • m2f
      • frame-level predictions
        • dynamic 1×1 convolutional weight from the frame queries $f \in R^{NC}$
        • per-pixel embeddings from the pixel decoder $M \in R^{CHW}$
      • the frame queries decode the encoder features, classifying and segmenting their matched objects
    • VITA Head

      • input:
        • collect the object-level frame queries throughout the whole video,$\{f_t \}^T_{t=1} \in R^{C(TN)}$
        • frame features $\{M_t\}_{t=1}^T \in R^{CTHW}$
      • Object Encoder
        • build video-level information
        • window-shifted self-attention along the temporal axis
        • this step exchanges information along the temporal axis, aggregating query-level information into object-level information
      • Object Decoder and Output heads
        • assumes the object tokens can provide sufficient instance-specific information
        • introduces trainable video queries $v\in R^{CN_v}$ as Q
        • the earlier T frame queries $tf \in R^{C(TN_f)}$ serve as KV
        • the same stack of cross+self attention blocks as before
        • final predictions
          • class embedding: $p \in R^{N_v (K+1)}$
          • mask embedding: $m \in R^{N_v C}$
          • each query represents the tracklet of an instance over all frames
          • the final predicted mask logits are obtained by a matmul between the mask embeddings and the frame masks
    • Clip-wise losses

      • similarity loss

        • the query id of a frame-level matching changes from frame to frame
        • the query id of a video-level matching is bound to a fixed instance
        • collect all frame queries and video queries matched to a gt, then pair the queries with each other (see the sketch below)
          • similarity pred: embed the collected queries with a linear layer, then measure similarity by matmul
          • similarity target: 1 if two queries are matched to the same gt instance, 0 otherwise, forming a similarity matrix
        • the similarity loss is computed with BCE
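
A minimal sketch of this clip-wise similarity loss, assuming frame_q / video_q are the matched frame / video queries and ids_frame / ids_video are the gt instance ids they were matched to (all names are illustrative):

import torch
import torch.nn.functional as F

def query_similarity_loss(frame_q, video_q, ids_frame, ids_video, embed):
    # frame_q: [Nf, C] matched frame queries, video_q: [Nv, C] matched video queries
    # ids_*: gt instance id each query was matched to; embed: a shared nn.Linear layer
    e_f = embed(frame_q)                                   # embed the collected queries
    e_v = embed(video_q)
    sim_pred = e_v @ e_f.t()                               # [Nv, Nf] similarity by matmul
    sim_target = (ids_video[:, None] == ids_frame[None, :]).float()  # 1 if same gt instance
    return F.binary_cross_entropy_with_logits(sim_pred, sim_target)
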

    • total loss
      • first the single-frame loss: the original M2F losses
      • then the video mask loss: the loss of the per-frame masks predicted from the video queries
      • finally the similarity loss: used to match video queries with frame queries

Separable Self-attention for Mobile Vision Transformers

  1. introduction

    • the efficiency bottleneck of ViT is MHA
      • multi-head: costly bmm & mm & softmax
      • self-attention: requires O(n^2) computation cost
    • this paper propose separable self-attention
      • O(n)
      • uses element-wise operations for computing self-attention
  2. method

    • comparison of attention schemes

      • dense matmul: take pairwise inner products, then normalize the distances from each query to all keys; the attention map is [b,q,q]
      • separable method: first compute a context score [b,q] from the query [b,q,c], then take a weighted sum of the key tokens [b,q,c] to get [b,q,c]; this reweights and sums all key tokens by the context score to fuse q & k, and the final attention output is [b,q,c]
      • this makes the computation cost linear in the number of tokens q

    • comparison of attention block structures

      • standard MHA's attention has h dense k*k dot products
      • Linformer compresses the key/value token length down to a fixed value, lowering the matmul compute, but the serial bmm pipeline remains
      • Separable self-attention

        • query: [b,k,d] is first projected by an fc to [b,k,1] and then softmax-normalized into context scores, which can be read as how strongly each query token responds over the whole image
        • key: [b,k,d] is projected by an fc to [b,k,d] and multiplied (broadcast) with the context scores to get [b,k,d], i.e. the keys are rescaled by the query response to fuse the q/k information; all key vectors are then summed into [b,1,d], a global context vector
        • value: [b,k,d] goes through fc-relu to [b,k,d], is multiplied (broadcast) with the context vector to get [b,k,d], and finally passes through a projection layer

  3. code

# Simplified from:
# https://github.com/apple/ml-cvnets/blob/main/cvnets/layers/linear_attention.py#L16
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableSelfAttn(nn.Module):
    # q = k = v: the flattened feature map [b, c, hw]
    def __init__(self, dim, attn_dropout=0.0):
        super().__init__()
        self.proj_q = nn.Conv1d(dim, 1, 1)      # [b,c,hw] -> [b,1,hw]
        self.proj_k = nn.Conv1d(dim, dim, 1)    # [b,c,hw] -> [b,c,hw]
        self.proj_v = nn.Conv1d(dim, dim, 1)    # [b,c,hw] -> [b,c,hw]
        self.out_proj = nn.Conv1d(dim, dim, 1)  # fc (bn omitted here)
        self.attn_dropout = nn.Dropout(attn_dropout)

    def forward(self, x):
        q, k, v = self.proj_q(x), self.proj_k(x), self.proj_v(x)

        context_scores = F.softmax(q, dim=-1)   # [b,1,hw]
        context_scores = self.attn_dropout(context_scores)

        context_vector = k * context_scores                               # [b,c,hw]
        context_vector = torch.sum(context_vector, dim=-1, keepdim=True)  # [b,c,1]

        out = F.relu(v) * context_vector        # [b,c,hw]
        return self.out_proj(out)               # fc-bn

class SeparableCrossAttn(nn.Module):
    # q != k: q is [b, c, Nq] query tokens, k/v are the flattened image features [b, c, hw]
    def __init__(self, dim, attn_dropout=0.0):
        super().__init__()
        self.proj_q = nn.Conv1d(dim, dim, 1)    # [b,c,q] -> [b,c,q]
        self.proj_k = nn.Conv1d(dim, 1, 1)      # [b,c,hw] -> [b,1,hw]
        self.proj_v = nn.Conv1d(dim, dim, 1)    # [b,c,hw] -> [b,c,hw]
        self.out_proj = nn.Conv1d(dim, dim, 1)  # fc (bn omitted here)
        self.attn_dropout = nn.Dropout(attn_dropout)

    def forward(self, q, k, v):
        q, k, v = self.proj_q(q), self.proj_k(k), self.proj_v(v)

        context_scores = F.softmax(k, dim=-1)   # [b,1,hw]
        context_scores = self.attn_dropout(context_scores)

        context_vector = v * context_scores                               # [b,c,hw]
        context_vector = torch.sum(context_vector, dim=-1, keepdim=True)  # [b,c,1]

        out = F.relu(q) * context_vector        # [b,c,q]
        return self.out_proj(out)               # fc-bn

VIS methods

发表于 2023-05-15 |

papers

[RVM 2021]: Robust High-Resolution Video Matting with Temporal Guidance

[SparseInst 2022]: Sparse Instance Activation for Real-Time Instance Segmentation

[GMA 2021]: Learning to Estimate Hidden Motions with Global Motion Aggregation

Robust High-Resolution Video Matting with Temporal Guidance

  1. overview

    • robust, real-time, high-resolution human video matting
    • uses a recurrent architecture
    • propose a new training strategy: multi-task training on both matting and semantic segmentation
  2. LSTM & GRU & ConvLSTM & ConvGRU

  3. Model Architecture

    • a single frame encoder

      • MobileNetV3-Large:the last block of MobileNetV3 uses dilated convolutions,extract features [x2,x4,x8,x16]
      • LR-ASPP
    • a recurrent decoder

      • Bottleneck block
        • applied to the x16 feature after LR-ASPP
        • ConvGRU runs on only half of the channels (split & concat), because ConvGRU is computationally expensive; this is both efficient and separates the current-frame features from the long-term temporal information
        • finally upsample
      • Upsampling block
        • first concat the feature of the previous stage, the encoder feature from the skip connection, and the input image downsampled by repeated 2x2 avg pooling
        • then conv-bn-relu: performs feature merging and channel reduction
        • then a ConvGRU block
        • finally upsample
      • Output block
        • already at the original image scale, so no GRU here; only regular convolutions are used to refine the results
        • first concat the input and the feature of the previous stage
        • then 2x conv-bn-relu
        • finally the prediction head
          • 1-channel alpha prediction
          • 3-channel foreground prediction
          • 1-channel segmentation prediction
    • Deep Guided Filter Module for high-resolution upsampling
Sparse Instance Activation for Real-Time Instance Segmentation

  1. introduction

    • Previous instance segmentation methods heavily rely on object detection and perform mask prediction based on bounding boxes or dense centers

    • this paper

      • proposes a sparse set of instance activation maps (IAM) to learn instance-level features
      • the bipartite matching scheme suppresses dense predictions, so no NMS is needed

    • object representation

      (figure: object representation comparison)

      • a center-based point may not hit the instance, leading to inaccurate features
      • a region-based box contains too much irrelevant content
      • IAM is similar to CAM: instance-aware weighted maps that better capture the main characteristics of an instance
  2. SparseInst

    • encoder
      • backbone: ResNet
      • Instance Context Encoder: an FPN whose features are all aggregated into a single x8 output
    • IAM-based Segmentation Decoder
      • mask branch
      • instance branch
      • both are built from a stack of 3x3 convolutions with 256 channels
      • the normalized absolute xy coordinates are also concatenated onto the encoder feature
    • Instance Activation Maps (see the sketch at the end of this list)
      • aim to highlight the informative regions for each object
      • given image features $X \in R^{DHW}$
      • the IAM module is a simple 3x3 conv + a sigmoid non-linearity, producing instance activation maps $A \in R^{NHW}$, where N is the size of the sparse set; group conv can also be used (Group-IAM)
      • the instance features are the instance activation maps multiplied with the image features, i.e. all image features aggregated according to the attended regions: $z=\overline A \cdot X^T \in R^{ND}$
      • 3 linear layers are applied for classification, objectness score, and mask kernels $\{w_i\}^N$
    • IoU-aware Objectness
      • the foreground/background ratio is imbalanced and most predictions are pushed towards background, so the overall classification confidence is low, causing a misalignment between classification scores and segmentation mask quality
      • an IoU prediction is introduced to alleviate this
      • at inference time, the classification probability is rescored as $\sqrt {probability \cdot objectness}$
    • Mask Head
      • each mask kernel is multiplied with the mask features
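
A minimal sketch of the IAM step plus the IoU-aware rescoring described above (shapes, channel sizes and layer names are illustrative):

import torch
import torch.nn as nn

class IAMHead(nn.Module):
    def __init__(self, dim=256, num_inst=100, num_cls=80):
        super().__init__()
        self.iam_conv = nn.Conv2d(dim, num_inst, 3, padding=1)   # 3x3 conv (+ sigmoid) -> activation maps
        self.cls = nn.Linear(dim, num_cls)
        self.obj = nn.Linear(dim, 1)                             # IoU-aware objectness
        self.kernel = nn.Linear(dim, dim)                        # mask kernels w_i

    def forward(self, x):
        # x: [B, D, H, W] instance-branch features
        B, D, H, W = x.shape
        A = self.iam_conv(x).sigmoid().flatten(2)                # [B, N, HW] instance activation maps
        A = A / A.sum(dim=-1, keepdim=True).clamp(min=1e-6)      # normalize before aggregation
        z = A @ x.flatten(2).transpose(1, 2)                     # [B, N, D] aggregated instance features
        scores = self.cls(z).sigmoid()
        objectness = self.obj(z).sigmoid()
        scores = (scores * objectness).sqrt()                    # rescore: sqrt(probability * objectness)
        return scores, self.kernel(z)                            # kernels are later matmul'ed with mask features
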

grid sample

  1. optical flow

    Optical flow is an HxWx2 per-pixel vector (u, v) giving the offset from img1 to img2: img1(x, y) = img2(x+u, y+v)

  2. warp

    import torch
    import torch.nn.functional as F

    def warp(x, flow):
        # x: [b,c,h,w]
        # flow: [b,2,h,w]
        h, w = x.shape[-2:]
        grid_y, grid_x = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
        grid_xy = torch.stack([grid_x, grid_y], dim=-1).unsqueeze(0) + flow.permute(0, 2, 3, 1)
        grid_xy[..., 0] = 2 * grid_xy[..., 0] / (w - 1) - 1
        grid_xy[..., 1] = 2 * grid_xy[..., 1] / (h - 1) - 1  # norm to [-1,1]
        warp_x = F.grid_sample(x, grid=grid_xy, mode='bilinear', padding_mode='border', align_corners=True)
        return warp_x
  3. grid sample

    torch.nn.functional.grid_sample(input, grid, mode='bilinear', padding_mode='zeros', align_corners=None)
    • input: BCHW
    • grid: BHW2, values in the range [-1, 1]
    • return: BCHW
    • mode: ['nearest', 'bilinear']
    • padding_mode: ['zeros', 'border', 'reflection']
  4. deformable conv

    • the values at the sampling locations can also be computed with grid_sample

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class DeformConv2d(nn.Module):

          def __init__(self, in_channels, out_channels, kernel_size):
              super().__init__()
              # conv1 predicts a 2-d (x, y) offset for each of the k*k sampling locations
              self.conv1 = nn.Conv2d(in_channels, kernel_size*kernel_size*2, kernel_size, padding=kernel_size//2)
              # conv2 aggregates the k*k sampled maps (stride-k conv over the unfolded layout)
              self.conv2 = nn.Conv2d(in_channels, out_channels, kernel_size, stride=kernel_size)
              self.k = kernel_size

          def forward(self, x):
              b, c, h, w = x.shape
              offsets = self.conv1(x)  # [b, 2*k*k, h, w]

              grid_y, grid_x = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
              grid_xy = torch.stack([grid_x, grid_y], dim=0).unsqueeze(0).float().to(x.device)  # [1,2,h,w]
              kernel_feats = []
              for i in range(self.k*self.k):
                  loc = grid_xy + offsets[:, 2*i:2*i+2]        # sampling coords for the i-th kernel position
                  loc_x = 2 * loc[:, 0] / (w-1) - 1            # norm to [-1,1]
                  loc_y = 2 * loc[:, 1] / (h-1) - 1
                  grid = torch.stack([loc_x, loc_y], dim=-1)   # [b,h,w,2]
                  kernel_feats.append(F.grid_sample(x, grid=grid, mode='bilinear', padding_mode='border', align_corners=True))  # bchw
              # lay the k*k sampled maps out as a (h*k, w*k) grid, then aggregate with the stride-k conv
              feats = torch.stack(kernel_feats, dim=-1).reshape(b, c, h, w, self.k, self.k)
              feats = feats.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h*self.k, w*self.k)
              feats = self.conv2(feats)  # bchw
              return feats
    • a 1x1 deformable conv can be understood as learning a learnable flow over the feature map

  5. feature align

SegmentAnything

Published on 2023-05-15 |
  1. overview

    • task: proposes the promptable segmentation task
    • model: builds the SAM model, promptable, enabling zero-shot transfer
    • dataset: a 1B-mask dataset, built with a bootstrapped, semi-automatic annotation pipeline

  2. introduction

    • foundation model
      • models that can generalize to unseen/untrained tasks and data
      • often implemented with prompt engineering
      • the best-known examples in vision are CLIP/ALIGN: they align paired text and images from the web using contrastive learning
      • SAM is also built as a foundation model
        • pre-trained on a broad dataset, enabling powerful generalization
        • aims to solve a range of downstream segmentation problems on new data distributions using prompt engineering
    • model constrains
      • flexible prompt:point, box, and mask / initial results of free-form text prompts
      • real-time: the embedding of an image can be reused by many prompts, with a lightweight prompt encoder & mask decoder
      • ambiguity-aware: one prompt can predict multiple masks, allowing SAM to naturally handle ambiguity
    • data engine
      • stage 1: assisted-manual, conventional annotate-train-annotate bootstrapping
      • stage 2: semi-automatic, part of the objects are annotated automatically via prompts and the rest manually
      • stage 3: fully automatic, with the prompt set to a regular grid of foreground points for automatic whole-image annotation
  3. Segment Anything Task

    • promptable segmentation task
      • prompt: a set of foreground/background points, a rough box or mask, free-form text…
      • return a valid segmentation mask: multiple objects when the prompt is ambiguous
    • pre-training
      • input:for each training sample,simulates a sequence of prompts (e.g., points, boxes, masks)
      • supervision:compares the model’s mask predictions against the ground truth
    • zero-shot transfer
      • at inference time, the model can already respond to any prompt
      • thus downstream tasks can be solved by engineering appropriate prompts, i.e. they can be cast as prompt-engineering problems
  4. Segment Anything Model

    • image encoder

      • MAE pretrained ViT:ViT-H/16 with 14x14 windowed attention,1024×1024
      • minimally adapted to process high resolution inputs
      • outputs x16 image features:64x64
      • into image embeddings:conv1,dim256 - LN - conv3-dim256 - LN
    • prompt encoder

      • sparse prompts:
        • point: a positional embedding + learned bg/fg embedding
        • box: an embedding pair, (1) a top-left corner positional embedding + a learned top-left corner embedding, (2) a bottom-right corner positional embedding + a learned bottom-right corner embedding
        • points/box: positional encodings + learned embeddings for each prompt type
        • text: any text encoder from CLIP
        • dim256
      • dense prompts:
        • input masks (already x4 downsampled): downscaled by an additional x4
          • 2x2,s2,dim4 conv - LN - GeLU
          • 2x2,s2,dim16 conv - LN - GeLU
          • 1x1,s1,dim256 conv - LN - GeLU
          • elewise-add on image embeddings
        • if there is no mask prompt: add a learned ‘no mask’ mask embedding
    • mask decoder

      • insert learned output token embeddings: similar to a cls token, Nx256
      • use a two-layer decoder
      • 4 steps inside each decoder
        • self-attention on the tokens
        • cross-attention from tokens to images:QKVdim128
        • a point-wise MLP updates each token
        • cross-attention from the image to tokens:QKVdim128
      • MLP:drop0.1,with residual,use LN,dim2048
      • strong geometric and task-type constraints
        • the image embedding has the positional encodings added every time it takes part in attention
        • the tokens have the original prompt tokens added every time they take part in attention
      • dynamic prediction head
        • upsample the updated image embedding by 4× with two transposed convs
          • 2×2, s2, dim64 TransConv - GeLU - LN - 2×2, s2, dim32 TransConv - GeLU
          • giving the upscaled mask embedding, 64x64x32
        • the updated image embedding and the updated tokens go through one more attention; the output tokens are extracted and passed through a 3-layer MLP to get a vector: Nx32
        • the final Nx64x64 prediction masks are obtained via a dot product (see the sketch after this list)
      • ambiguity-aware
        • use 3 output tokens to predict 3 masks (whole, part, and subpart)
        • compute 3 mask losses, but back-propagate only the gradient of the lowest one
        • a small head on top of the output tokens estimates the IoU; at inference time this IoU prediction ranks the 3 masks
      • loss
        • mask loss: focal loss*20 + dice loss*1
        • iou loss: MSE between the predicted IoU and the actual mask IoU
      • Training algorithm for one sample
        • initial prompt: randomly pick points/boxes as the input prompt
          • points are sampled randomly from the gt masks
          • boxes are the bounding boxes of the gt masks with random noisy deviations
        • subsequent points are sampled from the error region between the predicted mask and the gt mask
        • the predicted mask is then also fed back to the model as a mask prompt, using the unthresholded mask logits rather than the binary mask
        • this is iterated for 8+2 iterations
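
A minimal sketch of the dot-product mask prediction and the "back-prop only the lowest loss" rule described above (illustrative; the real SAM code differs in details):

import torch

def predict_masks(mask_tokens, upscaled_embed):
    # mask_tokens: [B, 3, 32] output tokens after the 3-layer MLP
    # upscaled_embed: [B, 32, 64, 64] upscaled mask embedding
    B, N, C = mask_tokens.shape
    masks = mask_tokens @ upscaled_embed.flatten(2)          # [B, 3, 64*64] dot product
    return masks.view(B, N, 64, 64)

def ambiguity_aware_loss(masks, gt, mask_loss_fn):
    # mask_loss_fn returns a per-sample loss of shape [B]; keep only the minimum over the 3 masks
    losses = torch.stack([mask_loss_fn(masks[:, i], gt) for i in range(masks.shape[1])], dim=0)  # [3, B]
    return losses.min(dim=0).values.mean()
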

knowledge distillation

Published on 2022-11-05 |

paper collection

[Similarity KD 2019] Similarity-Preserving Knowledge Distillation

[CWD 2020] Channel-wise Knowledge Distillation for Dense Prediction

Similarity-Preserving Knowledge Distillation

  1. Differences from other KD methods:

    • the student is not required to mimic the representation space of the teacher, but rather to preserve the pairwise similarities in its own representation space
    • supervise the similarity structures of student and teacher
    • applicable scenarios:
      • feature shapes that do not match (channel/stride); a 1x1 conv could also be used to align them
      • KD between a CNN and a transformer, whose representations are inherently different
  2. method

    • given an activation map (feature) $F$

      • compute the pairwise similarity matrix: $G = F F^T$ (with F flattened to [b, chw]), then row-wise L2-normalize
      • compute the KD loss: $\frac{1}{b^2}\sum (G_t-G_s)^2$
    • code

      import torch
      import torch.nn.functional as F


      def similarity_loss(f_s, f_t):
          # f_s / f_t: student / teacher activation maps [bs, c, h, w]
          bs = f_s.shape[0]
          f_s = f_s.view(bs, -1)
          f_t = f_t.view(bs, -1)

          G_s = torch.mm(f_s, f_s.t())        # [bs, bs] pairwise similarities within the batch
          G_s = F.normalize(G_s, p=2, dim=1)  # row-wise L2 normalization
          G_t = torch.mm(f_t, f_t.t())
          G_t = F.normalize(G_t, p=2, dim=1)

          G_diff = G_t - G_s
          loss = (G_diff * G_diff).view(-1).sum() / (bs * bs)
          return loss
  3. Other KD methods

    • response based:KL divergence loss

      • applied to the probabilities after softmax (optionally with a temperature factor)

      • $L_{RD} (p_t, p_s) = L_R (p_t, p_s)$

      • $L_R()$通常是KL divergence loss

      • KL divergence loss

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        # take the log of pred before calling KLDivLoss, to avoid the precision loss of
        # dividing two normalized probabilities directly
        pred = F.log_softmax(torch.randn(3, 5, requires_grad=True), dim=-1)
        target = F.softmax(torch.rand(3, 5), dim=-1)
        loss = nn.KLDivLoss(reduction="batchmean")(pred, target)
        # the target can also be passed to KLDivLoss in log space
        log_target = F.log_softmax(torch.rand(3, 5), dim=-1)
        output = nn.KLDivLoss(reduction="batchmean", log_target=True)(pred, log_target)

        def kl_categorical(p_logit, q_logit):
            p = F.softmax(p_logit, dim=-1)
            _kl = torch.sum(p * (F.log_softmax(p_logit, dim=-1) - F.log_softmax(q_logit, dim=-1)), -1)
            return torch.mean(_kl)
    • feature based

      • $L_{FD} (f_t, f_s) = L_F (\Phi_s(f_t), \Phi_s(f_s))$
      • $\Phi(\cdot)$ is a transform function used to align the feature dimensions (see the sketch after this list)
      • $L_F(\cdot)$ is a similarity function, typically L1, L2, CE, or MMD
      • MMD:maximum mean discrepancy
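
A minimal sketch of feature-based KD with a 1x1 conv as the transform $\Phi$ (aligning the student's channels to the teacher's) and L2 as $L_F$ (names are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureKD(nn.Module):
    def __init__(self, s_dim, t_dim):
        super().__init__()
        self.align = nn.Conv2d(s_dim, t_dim, kernel_size=1)   # Phi: align student channels to teacher

    def forward(self, f_s, f_t):
        # f_s: [B, Cs, H, W] student feature, f_t: [B, Ct, H, W] teacher feature (same H, W)
        return F.mse_loss(self.align(f_s), f_t.detach())      # L_F = L2 on the aligned features
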

light-weight transformers

Published on 2022-11-05 |

papers

[tinyViT ECCV2022] TinyViT: Fast Pretraining Distillation for Small Vision Transformers: Microsoft; an on-device version of Swin, using CLIP for KD on unlabeled data

[topformer CVPR2022] TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation:

[mobileViT ICLR2022] MOBILEVIT: LIGHT-WEIGHT, GENERAL-PURPOSE, AND MOBILE-FRIENDLY VISION TRANSFORMER: Apple

[mobileformer CVPR2022] Mobile-Former: Bridging MobileNet and Transformer: Microsoft

TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation

  1. abstract

    • task:semantic segmentation
    • take advantage of CNNs & ViTs
      • CNN-based Token Pyramid Module:MobileNetV2 blocks
      • a ViT-based module Semantics Extractor
    • verified on
      • ADE20K: 5% higher mIoU than MobileNetV3 with lower latency
  2. method

    • overview

      • Token Pyramid Module: takes the image and outputs a token pyramid
      • Semantics Extractor: takes the token pyramid and outputs scale-aware semantics
      • Semantics Injection Module: fuses the global semantics into the local tokens of the corresponding level, producing augmented representations
      • Segmentation Head: fuse and predict
    • Token Pyramid Module

      • uses stacked MobileNet blocks to generate the token pyramid
      • the goal is neither rich semantics nor a larger receptive field, only multi-scale tokens, so relatively few layers are used

      • input is a 512x512 image

      • outputs x4/x8/x16/x32 feature maps
      • the multi-scale features are then all average-pooled to the x64 scale and concatenated along the channel dim, giving tokens of shape [b, (16+32+64+96), 64, 64]
    • Scale-aware Semantics Extractor

      • this part uses a vision transformer to extract semantic information
      • first, to keep the tensor shape, the linear layers are replaced with 1x1 convs and layernorm with batchnorm
      • all ViT-style GeLUs are replaced with ReLU6
      • the MSA projects Q/K to dim 16 and V to dim 32, reducing the cost of the similarity computation
      • a depth-wise conv is added between the two 1x1 convs of the FFN
    • Semantics Injection Module

      • fuses the local tokens with the scale-aware semantics
      • to alleviate the semantic gap
        • the local tokens go through a 1x1 conv-bn and are weighted by a weight map produced from the semantic tokens via 1x1 conv-bn-sigmoid
        • then a 1x1 conv-bn of the semantic tokens is added on top
        • (presumably the semantic tokens need to be upsampled here?)
    • Segmentation Head

      • all low-resolution tokens are upsampled to the scale of the high-resolution tokens (x8)
      • then element-wise sum
      • then two conv layers
  3. Experiments

    • datasets
      • ADE20K: 1+150 classes, 25k images (20k train, 2k val, 3k test)
      • PASCAL Context: 1+59 classes, 4998/5105
      • COCO-Stuff10k: 9k/1k
    • training details
      • iterations: 160k for ADE20K, 80k for PASCAL and COCO
      • syncBN
      • lr: baselr=0.00012, poly schedule with factor 1.0
      • weight decay: 0.01

TinyViT: Fast Pretraining Distillation for Small Vision Transformers

  1. abstract

    • tiny and efficient
    • KD on large-scale datasets
    • offline distillation using the logits of a large teacher model
    • accuracy: 21M params: ImageNet 84.8%; with larger resolution: 86.5%
    • good transfer ability on downstream tasks
      • linear probe
      • few-shot learning
      • object detection:COCO AP 50.2%
  2. overview

    • offline distillation
    • save the data-augmentation parameters and the teacher logits
  3. Method

    • Fast Pretraining Distillation
      • directly pretraining small models on massive data does not bring much gain, especially when transferring to downstream tasks, but the boost from distillation is significant
      • some methods distill during the finetuning stage; this paper focuses on pretraining distillation, which is normally inefficient and costly, hence the fast framework
      • given an image x
        • save the strong data augmentation A (RandAugment & CutMix)
        • save the teacher prediction $\hat y=T(A(x))$
      • because RandAugment has built-in randomness, the same augmentation gives different results each time, so A has to be saved for every iteration
      • distilling with the ground-truth one-hot labels actually works worse than with the unlabeled teacher logits, because one-hot labels overfit too easily
      • Sparse soft labels: ImageNet-21k logits have 21,841 entries, so to save space only the top-K logits and indices are stored and the rest is filled by label smoothing (see the sketch below)
      • Data augmentation encoding:
        • a set of data-augmentation parameters is encoded into a single scalar d
        • a decoder then restores them: a PCG takes the single parameter and outputs a sequence of parameters
        • in practice, a d0 is stored, and the PCG decodes the augmentation parameters for both T & S
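
A minimal sketch of reconstructing a dense soft label from the saved top-K (probability, index) pairs, spreading the remaining mass uniformly over the other classes as a label-smoothing stand-in (the exact scheme in the paper may differ; names are illustrative):

import torch

def reconstruct_soft_label(topk_vals, topk_idx, num_classes=21841):
    # topk_vals / topk_idx: [B, K] saved teacher probabilities and their class indices
    B, K = topk_vals.shape
    rest = (1.0 - topk_vals.sum(dim=1, keepdim=True)).clamp(min=0) / (num_classes - K)
    dense = rest.expand(B, num_classes).clone()        # uniform mass on the non-saved classes
    dense.scatter_(1, topk_idx, topk_vals)             # restore the saved top-K probabilities
    return dense
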
    • Model Architectures
      • adopt a hierarchical vision transformer; start from a large model, define a series of contraction factors, and scale down
        • patch embedding: two 3x3 convs, stride 2, pad 1
        • stage 1: MBConvs & downsampling blocks
        • stage 2/3/4: transformer blocks with window attention
        • attention biases
        • a 3x3 depthwise convolution between attention and MLP
        • the usual residuals and LayerNorm are kept, the convs use BN, and all activations are GELU
      • Contraction factors
        • embedding dimensions: 21M-[96,192,384,576], 11M-[64,128,256,448], 5M-[64,128,160,320]
        • number of blocks: [2,2,6,2]
        • window sizes: [7,14,7]
        • channel expansion ratio of the MBConv block: 4
        • expansion ratio of the MLP: 4
        • head dimension: 32

tiny dataset

Published on 2022-10-30 |

Note: this post records some commonly used datasets.

  1. Cityscapes
    • mainly used for semantic segmentation
    • contains 5000 finely annotated images (gtFine: 2975 training, 500 validation, 1525 testing) and 20000 coarsely annotated ones (gtCoarse)
    • leftImg8bit holds the original images, 8-bit LDR
    • 30 classes
    • processing scripts: https://github.com/mcordts/cityscapesScripts
    • (also includes 3D annotations / right stereo views & disparity)
  2. ADE20K
    • used for semantic segmentation
    • 25,000 images (20k train, 2k val, 3k test)
    • *.png are the original images
    • *_seg.png are the masks: the R and G channels encode the object class mask, the B channel encodes the instance mask
    • *_.txt are text description files
    • covers indoor and outdoor scenes
    • 3688 classes in total, of which the 150 most frequent (100 thing and 50 stuff) cover 89% of all pixels
  3. COCO-Stuff
    • COCO-Stuff adds pixel-level annotations to all 164K images of the COCO 2017 dataset
    • contains 80 thing classes, 91 stuff classes and 1 class 'unlabeled'
    • the images are shared; the labels come as stuffthingmaps_trainval2017.zip / stuff_trainval2017.zip (stuff-only) / annotations_trainval2017.zip (thing-only), stored as grayscale images
  4. PPM-100
    • portrait matting

real-time semantic segmentation

Published on 2022-10-30 |

Cityscapes leaderboard:

  • [PIDNet 2022] PIDNet: A Real-time Semantic Segmentation Network Inspired from PID Controller
  • [SFNet v1 2020] Semantic Flow for Fast and Accurate Scene Parsing
  • [SFNet v2 2022] SFNet: Faster, Accurate, and Domain Agnostic Semantic Segmentation via Semantic Flow
  • [PP-LiteSeg 2022] PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model
  • [DDRNet 2021] Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes
  • [STDC-Seg 2021CVPR] Rethinking BiSeNet For Real-time Semantic Segmentation

PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model

  1. main contributions

    • propose a Flexible and Lightweight Decoder (FLD): mainly a redesign of the FPN channel widths
    • propose a Unified Attention Fusion Module (UAFM): strengthens the feature representation
    • propose a Simple Pyramid Pooling Module (SPPM): a simplified PPM with low computation cost
  2. Design ideas for real-time networks

    • Strengthening feature representations: how low-level and high-level features are fused in the decoder
    • Contextual aggregation: various PPM flavors
  3. overview

  4. Method

    • Flexible and Lightweight Decoder

      • the decoders of recent lightweight models keep the channel count constant while recovering resolution, which causes computation redundancy
      • FLD gradually reduces the channel count from high-level to low-level
    • Unified Attention Fusion Module

      • an attention module produces a weight map
      • the features of different scales are then combined by a weighted sum

      • Spatial Attention Module

        • this is the variant used in the actual code
        • take the mean & max along the channel dimension; the two features yield four [1,h,w] maps, concatenated into [4,h,w]
        • then conv + sigmoid produce the weight map [1,h,w]
        • then element-wise mul & add
      • Channel Attention Module
        • apply avg pooling and max pooling over the spatial map; the two features yield four [c,1,1] vectors, concatenated into [4c,1,1]
        • then conv + sigmoid produce the channel importance vector [c,1,1]
        • then channel-wise mul & add
    • Simple Pyramid Pooling Module (see the sketch below)

      • main changes
        • reduce intermediate and output channels
        • remove the short-cut
        • replace concat with add
      • just 3 global average poolings giving 1x1, 2x2 and 4x4 maps, each followed by conv-bn-relu, then resized back to the original scale
      • then add & conv
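
A minimal sketch of SPPM as described: three adaptive average poolings to 1x1 / 2x2 / 4x4, per-branch conv-bn-relu, resize back, add, then a final conv (channel sizes and names are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SPPM(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch, bins=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, mid_ch, 1, bias=False),
                          nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
            for b in bins])
        self.out = nn.Sequential(nn.Conv2d(mid_ch, out_ch, 3, padding=1, bias=False),
                                 nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        size = x.shape[-2:]
        # pool to 1x1 / 2x2 / 4x4, conv-bn-relu, resize back to the input scale, then add
        y = sum(F.interpolate(b(x), size=size, mode='bilinear', align_corners=False) for b in self.branches)
        return self.out(y)   # final conv after the element-wise add
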

  5. Experiments

    • datasets
      • Cityscapes: 1+18 classes, 5,000 images (2975/500/1525), original size 2048x1024
      • CamVid: 11 classes, 701 images (367/101/233), original size 960x720
    • training settings
      • SGD: momentum=0.9
      • lr: warmup + poly schedule
      • Cityscapes: 160k iterations, batchsize=16, baselr=0.005, weight decay=5e-4
      • CamVid: 1k iterations, batchsize=24, baselr=0.01, weight decay=1e-4
      • data augmentation
        • random scale: scale range [0.125, 1.5] / [0.5, 2.5] for Cityscapes/CamVid
        • random crop: crop size=[1024,512] / [960,720]
        • random horizontal flipping / color jittering / normalization

SFNet: Faster, Accurate, and Domain Agnostic Semantic Segmentation via Semantic Flow

  1. Motivation

    • the widely used atrous convolutions & feature pyramids are computation intensive or ineffective
    • we propose SFNet & SFNet-Lite
      • Flow Alignment Module (FAM) to learn sematic flow
      • Gated Dual Flow Alignment Module (GD-FAM): directly align the highest & lowest resolution feature maps
    • speed & accuracy
      • verified on 4 driving datasets (Cityscapes, Mapillary, IDD and BDD)
      • SFNet-Lite-R18 back: 80.1 mIoU / 60 FPS
      • SFNet-Lite-STDC back: 78.8 mIoU / 120 FPS
  2. Overview

    • previous methods

      • FCN
        • a pioneering work
        • lack of detailed object boundary information due to down-sampling
      • deeplab
        • atrous convolutions (last 4/5 stage)
        • multiscale feature representation (ASPP)
      • vision transformer
        • formulated as query-based per-segment prediction
        • strong performance, but the real-time inference speed is unsatisfactory
    • trade-off

      • maintain detailed resolution: atrous convolutions; large resolution means heavy computation, hard to run in real time
      • get features that exhibit strong semantic representation: biFPN-style feature merging helps somewhat, but still falls short of methods that hold large feature maps
      • this paper's conjecture: the propagation of semantics from deep to shallow layers is ineffective, the cross-level semantics are not well aligned, and naive downsample-then-upsample cannot recover fine boundary details
    • this paper

      • propose FAM & SFNet
        • between features of different stages, first perform alignment, warp the low-level feature, then merge
        • R18-back: 79.8% mIoU / 33 FPS on cityscape
        • DF2-back:77.8% mIoU / 103 FPS
      • propose GD-FAM & SFNet-Lite

        • for more speed, replace the densely applied FAM with a GD-FAM applied only once
        • only merge the highest & lowest resolution features
        • ResNet-18 back:80.1% mIoU / 49 FPS on cityscape
        • STDCv1-back: 78.7 mIoU / 120 FPS

  3. Method

    • starting point: the misalignment between feature maps caused by residual connections and repeated downsampling and upsampling operations

    • inspiration:dynamic upsampling interpolation

    • FAM

      • build in the FPN framework

      • define Semantic Flow:between different levels in a feature pyramid

      • pipeline

        • first align the channels
        • then upsample, aligning the scales, $F \in R^{H\times W \times C}$
        • then concat the features of the two levels (but unlike a true optical-flow alignment such as FlowNet, the grid coordinates are not concatenated)
        • then two 3x3 conv layers predict the semantic flow field, $\Delta_{low-level} \in R^{H\times W\times 2}$
        • then warp the low-level feature
        • then add & halve

      • differences from deformable conv

        • first, this offset is predicted from the fusion of two scales, whereas DCN predicts it from the feature itself
        • second, DCN aims for a larger / freer receptive field, more like attention, while here the goal is to align the features
      • after warp & add, the feature map is more structurally neat than with plain upsample-then-add, and objects get a more consistent representation (see the sketch below)
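
A minimal sketch of the FAM pipeline as listed above: align channels, upsample, predict a 2-channel semantic flow from the concatenation, warp with grid_sample, then add & halve (which feature gets warped follows these notes; channel sizes and names are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_warp(feat, flow):
    # feat: [B, C, H, W], flow: [B, 2, H, W] offsets in pixels; bilinear warp via grid_sample
    B, _, H, W = feat.shape
    gy, gx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    grid = torch.stack([gx, gy], dim=-1).float().to(feat.device) + flow.permute(0, 2, 3, 1)
    grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1
    grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1
    return F.grid_sample(feat, grid, mode='bilinear', padding_mode='border', align_corners=True)

class FAM(nn.Module):
    def __init__(self, low_ch, high_ch, mid_ch=128):
        super().__init__()
        self.align_low = nn.Conv2d(low_ch, mid_ch, 1)        # channel alignment
        self.align_high = nn.Conv2d(high_ch, mid_ch, 1)
        self.flow = nn.Sequential(nn.Conv2d(mid_ch * 2, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(mid_ch, 2, 3, padding=1))  # two 3x3 convs predict the flow

    def forward(self, f_low, f_high):
        # f_low: fine-resolution feature, f_high: coarse feature from a deeper stage
        f_low = self.align_low(f_low)
        f_high = F.interpolate(self.align_high(f_high), size=f_low.shape[-2:],
                               mode='bilinear', align_corners=False)
        delta = self.flow(torch.cat([f_low, f_high], dim=1))  # semantic flow field [B, 2, H, W]
        return (flow_warp(f_low, delta) + f_high) / 2         # warp, then add & halve
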

    • the whole network

      • backbone
        • ImageNet pretrained ResNet / DF series
        • 4 stages
        • stride2 in the first place per stage
        • a Pyramid Pooling Module (PPM); ASPP/NL were also tried, and the experiments later show they are less accurate than PPM
      • Aligned FPN decoder
        • the low-level features of stages 2/3/4 from the encoder are aligned by FAM and then fused into their bottom levels
        • the final x4 prediction feature concatenates features from all scales; considering that some misalignment remains, a PPM was also tried here experimentally, but it is not added in the real-time application
    • Gated Dual Flow Alignment Module and SFNet-Lite

      • the SFNet above is slower than BiSeNet, thus we explore a more compact decoder

      • takes F1 & F4 as inputs

      • outputs a refined high-resolution map

      • pipeline

        • first upsample F4 to the resolution of F1
        • then concat, then 3x3 convs predict the offsets, $\Delta \in R^{H\times W\times 4}$
        • then split into $\Delta_{F1}$ and $\Delta_{F4}$, used to align F1 and F4 respectively

        • F1 and F4 also generate a gate map with an attention-style structure (pooling, 1x1 conv and sigmoid) that performs a gated sum of the two warped features; the idea is to make full use of the high-level semantic feature and let the low-level feature act as a supplement to it

      • SFNet-Lite structure

  4. Experiments

STDC-Seg: Rethinking BiSeNet For Real-time Semantic Segmentation

  1. Motivation

    • BiSeNet
      • use extra path to encode spatial information (low-level)
      • time consuming
      • not convenient to use pretrained backbones
      • the auxiliary path always lacks low-level information guidance
    • this paper
      • return to a single-stream manner at test time
      • use a Detail Guidance module to push the backbone's low-level stages to learn spatial features, with direct supervision and zero extra cost at test time
      • design the STDC backbone, built mainly from STDC modules (Short-Term Dense Concatenate), somewhat similar to DenseNet blocks
    • verified on
      • ImageNet
      • Cityscapes: STDC1-Seg50 / 71.9% mIoU / 250.4 FPS, STDC2-Seg75 / 76.8% mIoU / 97.0 FPS
      • CamVid
  2. overview

    • single-stream

    • STDC backbone

  3. Method

    • Short-Term Dense Concatenate Module

      • each module is separated into several blocks: block1 is always 1x1, blocks 2/3/4 are 3x3
      • the output gathers multi-scale information
      • there is a stride-1 variant and a stride-2 variant; their receptive fields differ accordingly

    • Classification Architecture

      • stages 1/2 are each a conv-bn-relu
      • stages 3/4/5 are STDC modules, with the first module of each stage using stride 2

    • Segmentation Architecture

      • STDC backbone: use the feature maps of stages 3/4/5 (x8/16/32)
      • the stage3 feature serves as the low-level feature
      • the stage4/5 features plus a globally pooled stage5 feature serve as the high-level context features, arranged as an FPN: stage4/5 go through an Attention Refine Module (similar to SE), are added to the previous feature level, upsampled, then conv
      • the above features are fused by a Feature Fusion Module (also similar to an SE block)
      • SegHead: 3x3 conv - 1x1 conv
      • a DetailHead is additionally attached to the stage3 feature

    • Detail Guidance of Low-level Features

      • the detail path encodes spatial detail (boundaries/corners)
      • it is modeled as a binary segmentation task
      • the ground-truth map is first converted into a detail map by the Detail Aggregation module (see the sketch below)
        • a Laplacian operator (conv kernel) at strides 1/2/4
        • then upsampling
        • then fuse and 1x1 conv
        • finally threshold at 0.1 to obtain the binary detail mask
      • detail loss: dice + bce
      • with this detail guidance, the backbone's stage3 feature is forced to keep more detailed low-level information
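
A minimal sketch of generating that binary detail ground truth: a Laplacian kernel applied at strides 1/2/4, upsampled back, fused (here with a fixed averaging 1x1 conv as a stand-in for the trainable fusion described in the paper), and thresholded at 0.1 (illustrative only):

import torch
import torch.nn.functional as F

def detail_ground_truth(gt, thresh=0.1):
    # gt: [B, 1, H, W] segmentation ground truth as float
    laplacian = torch.tensor([[-1., -1., -1.], [-1., 8., -1.], [-1., -1., -1.]]).view(1, 1, 3, 3)
    maps = []
    for s in (1, 2, 4):                                   # Laplacian at strides 1/2/4
        d = torch.abs(F.conv2d(gt, laplacian, stride=s, padding=1))
        d = F.interpolate(d, size=gt.shape[-2:], mode='bilinear', align_corners=False)  # upsample back
        maps.append((d > thresh).float())
    fused = torch.cat(maps, dim=1)                        # [B, 3, H, W]
    fuse_weight = torch.full((1, 3, 1, 1), 1.0 / 3)       # stand-in for the learnable 1x1 fusion conv
    detail = F.conv2d(fused, fuse_weight)
    return (detail > thresh).float()                      # final binary detail mask
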

  4. Experiments

    • backbone experiments

      • compared against MobileNetV3 and EfficientNet-B0, it is better in both accuracy and speed

      • ImageNet accuracy & FLOPs: the FLOPs are relatively high, but since everything is 3x3 convolutions, the inference speed is faster

mask2former

Published on 2022-10-23 |

papers

[maskformer 2021] Per-Pixel Classification is Not All You Need for Semantic Segmentation:Facebook,

[mask2former 2022] Masked-attention Mask Transformer for Universal Image Segmentation:

[mask2former+1] Mask2Former for Video Instance Segmentation: a follow-up report on video

[mask2former] Masked-attention Mask Transformer for Universal Image Segmentation

CLIP series

Published on 2022-10-22 |

papers

[2021 CLIP] Learning Transferable Visual Models From Natural Language Supervision

[2022 MaskCLIP] Extract Free Dense Labels from CLIP

[2022 DenseCLIP ]DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

CLIP: Learning Transferable Visual Models From Natural Language Supervision

  1. Motivation

    • visual tasks are usually formulated as predicting a fixed set of given categories for a given image, which greatly limits the amount of usable data
    • this paper
      • leverage language concepts
      • build a simple pretraining task:predicting which caption goes with which image
      • use 400 million image-text pairs from internet
      • train from scratch
      • enable zero-shot transfer to downstream tasks
  2. overview

    • Figure 1 is the pretraining model
    • Figure 2 takes a given text set, builds prompts, and encodes them into a fixed linear classifier
    • Figure 3 is zero-shot classification: for each sample's image embedding, the above linear classifier is applied to get the probability that it matches each class
  3. Method

    • Creating a Sufficiently Large Dataset

      • construct a new dataset of 400 million image-text pairs from a variety of publicly available sources on the Internet
      • base queries: words that appear at least 100 times in Wikipedia, a set of 500,000
      • search policy: search based on the base-query word list, balancing by 20,000 pairs per query
    • Selecting an Efficient Pre-Training Method

      • first attempt:

        • jointly train an image CNN & a text transformer
        • found the transformer model converges extremely slowly even for a limited set of 1000 classes, which is very unfriendly to an open vocabulary
      • an easier task was then set up

        • only predict which text matches which image, not the exact text
        • 4x efficiency in zero-shot transfer
      • formulation (see the sketch below)

        • given a batch of N (image, text) pairs
        • jointly train an image encoder & a text encoder, mapping both into a multi-modal embedding space
        • predict the NxN pairings
        • maximize the cosine similarity of the N true pairs and minimize that of the $N^2-N$ incorrect pairings
        • optimize a symmetric cross entropy loss
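
A minimal sketch of that symmetric cross-entropy over the NxN cosine-similarity matrix, with the temperature applied as a logit scale as described later (names are illustrative):

import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, logit_scale):
    # image_emb / text_emb: [N, D] encoder outputs for N (image, text) pairs
    image_emb = F.normalize(image_emb, dim=-1)            # L2-normalize so dot product = cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()       # [N, N] scaled pairwise similarities
    targets = torch.arange(len(logits), device=logits.device)   # the diagonal holds the correct pairings
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
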

    • Choosing and Scaling a Model

      • two different structures for Image CNN
        • ResNet & ViT
        • ResNet's global average pooling is replaced by a transformer-style attention pooling
        • a global query is used as the feature representation
      • text encoder
        • just an off-the-shelf transformer
        • the tokens operate on a lower-cased byte pair encoding (BPE)
        • sentences are capped at 76 tokens, with SOS & EOS added
        • the EOS token serves as the feature representation, layer-normalized & linearly projected into the embedding space
    • Model Zoo

      • 5 ResNets
        • ResNet-50, ResNet-101
        • RN50x4, RN50x16, and RN50x64:4x, 16x, and 64x computation models follow EfficientNet-style model scaling
      • 3 ViTs
        • ViT-B/32, a ViT-B/16, and a ViT-L/14
        • ViT-L/14@336px: starting from the 224 model, trained for one additional epoch at 336 resolution
      • temperature parameter

        • 0.07, clipped to prevent scaling the logits by more than 100
        • the raw logits are divided by T, but the new logits are capped at a 100x scaling
        • this is because cosine similarities lie in [-1,1] while classification logits are normally unbounded, so the temperature factor widens the gaps between the cosine logits and raises the confidence of positive pairs
        • necessary for training stability

  4. Experiments

    • Zero-Shot Transfer

      • the experiments show that CLIP transfers zero-shot very well to unseen datasets (datasets not used to train the current model), mainly because it has already seen so much on the internet
      • zero-shot classification pipeline: see the figures in the overview
        • use all categories of the target dataset as the text pairing set, then predict the most probable image-text pair for each sample
        • first obtain the respective feature embeddings, each L2-normalized
        • then compute the cosine similarity, scaled by the temperature factor
        • then normalize into probabilities with a softmax
      • accuracy

        • ResNet101 on the leaderboard: top1@80.98%, top5@95.51%
        • ResNet50 on the leaderboard: top1@79.25%, top5@94.65%

    • Representation Learning

      • linear-probe pipeline

        • freeze the pretrained model
        • fit a linear classifier
        • compared with finetuning, the advantage is fewer hyperparameters and more general / class-agnostic features
      • findings

        • small models (RN50/RN101) lose to their counterparts trained on ImageNet-21K
        • small models also lose to the EfficientNet family on the same datasets
        • but the large model (RN50x64) beats the current best (Noisy Student EfficientNet-L2)
        • CLIP transformers are 3x more compute efficient than CLIP ResNets, achieving better performance at the same compute budget

    • prompt engineering and ensembling

      • the class labels of image datasets are mostly ids / single words
      • prompting turns them into a sentence: a photo of {word} / a {specific} of {word}
      • ensembling averages the embeddings of multiple prompt constructions