Less is More



LV-ViT

发表于 2021-05-21 |

[LV-ViT 2021] Token Labeling: Training an 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet,新加坡国立&字节,主体结构还是ViT,deeper+narrower+multi-layer-cnn-patch-projection+auxiliary label&loss

同等参数量下,能够达到与CNN相当的分类精度

  • 26M——84.4% ImageNet top1 acc
  • 56M——85.4% ImageNet top1 acc
  • 150M——86.2% ImageNet top1 acc

ImageNet & ImageNet-1k:The ImageNet dataset consists of more than 14M images, divided into approximately 22k different labels/classes. However the ImageNet challenge is conducted on just 1k high-level categories (probably because 22k is just too much)

Token Labeling: Training an 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet

  1. 动机

    • develop a bag of training techniques on vision transformers
    • slightly tune the structure
    • introduce token labeling——a new training objective
    • ImageNet classification task
  2. 论点

    • former ViTs
      • 主要问题就是需要大数据集pretrain,不然精度上不去
      • 然后模型也比较大,need huge computation resources
      • DeiT和T2T-ViT探索了data augmentation/引入additional token,能够在有限的数据集上拉精度
    • our work
      • rely on purely ImageNet-1k data
      • rethink the way of performing patch embedding
      • introduce inductive bias
      • we add a token labeling objective loss besides the cls token prediction
      • provide practical advice on adjusting vision transformer structures
  3. 方法

    • overview & comparison

      • 主体结构不变,就是增加了两项
      • a MixToken method
      • a token labeling objective

    • review the vision transformer

      • patch embedding
        • 将固定尺寸的图片转换成patch sequence,例如224x224的图片,patch size=16,那就是14x14个small patches
        • 将每个patch(16x16x3=768-dim) linear project成一个token(embedding-dim)
        • concat a class token,构成全部的input tokens
      • position encoding
        • added to input tokens
        • fixed sinusoidal / learnable
      • multi-head self-attention
        • 用来建立long-range dependency
        • multi-heads:所有attention heads的输出在channel-dim上concat,然后linear project回单个head的channel-dim
      • feed-forward layers
        • fc1-activation-fc2
      • score prediction layer
        • 只用了cls token对应的输出embedding,其他的discard
    • training techniques

      • network depth

        • add more transformer blocks
        • 同时decrease the hidden dim of FFN
      • explicit inductive bias

        • CNN逐步扩大感受野,擅长提取局部特征,具有天然的平移不变性等
        • transformer被发现failed to capture the low-level and local structures
        • we use convolutions with a smaller stride to provide overlapped information for nearby tokens
        • 在patch embedding的时候不是independent crop,而是有overlap
        • 然后用多层conv,逐步扩大感受野,smaller kernel size同时降低了计算量
      • rethinking residual connection

        • 给残差分支add a smaller ratio $\alpha$

        • enhance the residual connection since less information will go to the residual branch

        • improve the generalization ability

      • re-labeling

        • label is not always accurate after cropping
        • situations are worse on smaller images

        • re-assign each image with a K-dim score map,在1k类数据集上K=1000

        • cheap operation compared to teacher-student
        • 这个label是针对whole image的label,是通过另一个预训练模型获取
      • token-labeling

        • based on the dense score map provided by re-labeling,we can assign each patch an individual label
        • auxiliary token labeling loss
          • 每个token都对应了一个K-dim score map
          • 可以计算一个ce
        • given
          • outputs of the transformer $[X^{cls}, X^1, …, X^N]$
          • K-dim score map $[y^1, y^2, …, y^N]$
          • whole image label $y^{cls}$
        • loss
          • auxiliary token labeling loss:$L_{aux} = \frac{1}{N} \sum_{i=1}^N CE(X^i, y^i)$
          • cls loss:$L_{cls} = CE(X^{cls}, y^{cls})$
          • total loss:$L_{total} = L_{cls}+\beta L_{aux}$,$\beta=0.5$(a code sketch follows at the end of this section)
      • MixToken

        • 从Mixup&CutMix启发来的
        • 为了确保each token have clear content,我们基于token embedding进行mixup
        • given
          • token sequence $T_1=[t^1_1, t^2_1, …, t^N_1]$ & $T_2=[t^1_2, t^2_2, …, t^N_2]$
          • token labels $Y_1=[y^1_1, y^2_1, …, y^N_1]$ & $Y_2=[y^1_2, y^2_2, …, y^N_2]$
          • binary mask M
        • MixToken
          • mixed token sequence:$\hat T = T_1 \odot M + T_2 \odot (1-M)$
          • mixed labels:$\hat Y = Y_1 \odot M + Y_2 \odot (1-M)$
          • mixed cls label:$\hat {Y^{cls}} = \overline M y_1^{cls} + (1-\overline M) y_2^{cls}$,$\overline M$ is the average of $M$
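
下面是MixToken和token labeling loss的一个minimal PyTorch-style sketch(非官方实现,shape和函数名`mix_token`/`lvvit_loss`均为假设;论文中的mask是在token网格上按block生成的2D mask,这里简化成一个[N]维binary mask):

```python
import torch
import torch.nn.functional as F

def mix_token(tokens1, tokens2, labels1, labels2, cls_label1, cls_label2, mask):
    # tokens*: [B, N, D] patch tokens after patch embedding; labels*: [B, N, K] dense token labels
    # cls_label*: [B, K] whole-image (soft) labels; mask: [N] binary, 1 -> take from sample 1
    m = mask.view(1, -1, 1).float()
    mixed_tokens = tokens1 * m + tokens2 * (1 - m)                 # \hat T
    mixed_labels = labels1 * m + labels2 * (1 - m)                 # \hat Y
    m_bar = mask.float().mean()
    mixed_cls = m_bar * cls_label1 + (1 - m_bar) * cls_label2      # \hat Y^{cls}
    return mixed_tokens, mixed_labels, mixed_cls

def lvvit_loss(cls_logits, token_logits, cls_label, token_labels, beta=0.5):
    # soft-label cross entropy: -sum(y * log_softmax(x))
    def soft_ce(logits, target):
        return -(target * F.log_softmax(logits, dim=-1)).sum(-1).mean()
    l_cls = soft_ce(cls_logits, cls_label)                                        # L_cls
    l_aux = soft_ce(token_logits.flatten(0, 1), token_labels.flatten(0, 1))       # L_aux over all N tokens
    return l_cls + beta * l_aux                                                   # L_total
```
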
  4. 实验

    • training details

      • AdamW
      • linear lr scaling:larger when using token labeling
      • weight decay
      • dropout:hurts small models,use Stochastic Depth instead

    • Training Technique Analysis

      • more convs in patch embedding

      • enhanced residual

        • smaller scaling factor

          • the weight get larger gradients in residual branch
          • more information can be preserved in main branch
          • better performance
          • faster convergence

      • re-labeling

        • use NFNet-F6 to re-label the ImageNet dataset and obtain the 1000-dimensional score map for each image
        • NFNet-F6 is trained from scratch
        • given input 576x576,获得的score map是18x18x1000(s32)
        • store the top5 probs for each position to save storage
      • MixToken

        • 比baseline的CutMix method要好
        • 同时看到token labeling比relabeling要好

      • token labeling

        • relabeling是在whole image上
        • token labeling是进一步地,在token level添加label和loss
      • augmentation techniques

        • 发现MixUp会hurt

      • Model Scaling

        • 越大越好

memory bank

发表于 2021-05-19 |
  • 2018年的paper
  • official code:https://github.com/zhirongw/lemniscate.pytorch
  • memory bank
  • NCE

Unsupervised Feature Learning via Non-Parametric Instance Discrimination

  1. 动机

    • unsupervised learning
      • can we learn good feature representation that captures apparent similarity among instances instead of classes
      • formulate a non-parametric classification problem at instance-level
      • use noise contrastive estimation
    • our non-parametric model
      • highly compact:128-d feature per image,only 600MB storage in total
      • enable fast nearest neighbour retrieval
    • 【QUESTION】无类别标签,单靠similarity,最终的分类模型是如何建立的?
    • verified on
      • ImageNet 1K classification
      • semi-supervised learning
      • object detection tasks
  2. 论点

    • observations
      • ImageNet top-5 err远比top-1 err小
      • second highest responding class is more likely to be visually related
      • 说明模型隐式地学到了similarity
      • apparent similarity is learned not from semantic annotations, but from the visual data themselves
    • 将class-wise supervision推到一个极限
      • 就变成了instance-level
      • 类别数变成了the whole training set:softmax to many more classes becomes infeasible
        • approximate the full softmax distribution with noise-contrastive estimation(NCE)
        • use a proximal regularization to stabilize the learning process
    • train & test
      • 通常的做法是learned representations加一个线性分类器
      • e.g. SVM:但是train和test的feature space是不一致的
      • 我们用了KNN:same metric space
  3. 方法

    • overview

      • to learn an embedding function $f_{\theta}$
      • distance metric $d_{\theta}(x,y) = ||f_{\theta}(x)-f_{\theta}(y)||$
      • to map visually similar images closer
      • instance-level:to distinct between instances

    • Non-Parametric Softmax Classifier

      • common parametric classifier

        • given网络预测的N-dim representation $v=f_{\theta}(x)$
        • 要预测C-classes的概率,需要一个$W \in R^{C \times N}$的projection:$P(i|v) = \frac{\exp (W^T_iv)}{\sum_j \exp (W^T_jv)}$
      • Non-Parametric version

        • enforce $||v||=1$ via L2 norm
        • replace $W^T$ with $v^T$
        • then the probability:$P(i|v) = \frac{exp (v^T_iv/\tau)}{\sum exp (v^T_jv / \tau)}$
        • temperature param $\tau$:controls the concentration level of the distribution
        • the goal is to minimize the negative log-likelihood

        • 意义:L2 norm将所有的representation映射到了一个128-d unit sphere上面,$v_i^T v_j$度量了两个projection vec的similarity,我们希望同类的vec尽可能重合,不同类的vec尽可能正交

          • class weights $W$ are not generalized to new classes
          • but feature representations $V$ does
      • memory bank

        • 因为是instance level,C-classes对应整个training set,也就是说$\{v_i\}$ for all the images are needed for loss
        • Let $V=\{v_i\}$ 表示memory bank,初始为unit random vectors
        • every learning iterations
          • $f_\theta$ is optimized by SGD
          • 输入$x_i$所对应的$f_i$更新到$v_i$上
          • 也就是只有mini-batch中包含的样本,在这一个step,更新projection vec
    • Noise-Contrastive Estimation

      • non-parametric softmax的计算量随着样本量线性增长,millions level样本量的情况下,计算太heavy了

      • we use NCE to approximate the full softmax

      • assume

        • noise samples的uniform distribution:$P_n =\frac{1}{n}$
        • noise samples are $m$ times frequent than data samples
      • 那么sample $i$ matches vec $v$的后验概率是:$h(i,v)=\frac{P(i|v)}{P(i|v)+mP_n}$

        • approximated training object is to minimize the negative log-likelihood of $h(i,v)$
      • normalizing constant $Z$的近似

        • 主要就是分母这个$Z_i$的计算比较heavy,我们用Monte Carlo采样来近似:

        • $\{j_k\}$ is a random subset of indices:随机抽了memory bank的一个子集来approx全集的分母,实验发现取batch size大小的子集就可以,m=4096

    • Proximal Regularization

      • the learning process oscillates a lot

        • we have one instance per class
        • during each training epoch each class is only visited once
      • we introduce an additional term

        • overall workflow:在每一个iteration t,feature representation是$v_i^t=f_{\theta}(x_i)$,而memory bank里面的representations来自上一个iteration step $V={v^{t-1}}$,我们从memory bank里面采样,并计算NCE loss,然后bp更新网络权重,然后将这一轮fp的representations update到memory bank的指定样本上,然后下一轮
        • 可以发现,在初始random阶段,梯度更新会比较快而且不稳定
        • 我们给positive sample的loss上额外加了一个$\lambda ||v_i^t-v_i^{t-1}||^2_2$,有点类似weight decay那种东西,开始阶段l2 loss会占主导,引导网络收敛

        • stabilize

        • speed up convergence
        • improve the learned representations
    • Weighted k-Nearest Neighbor Classifier

      • at test time,先计算feature representation,然后跟memory bank的vectors分别计算cosine similarity $s_i=\cos(v_i, f)$,选出topk neighbours $N_k$,然后进行weighted voting
      • weighted voting:
        • 对每个class c,计算它在topk neighbours的total weight,$w_c =\sum_{i \in N_k} \alpha_i 1(c_i=c)$
        • $\alpha_i = \exp(s_i/\tau)$
      • k = 200
      • $\tau = 0.07$(a code sketch of the memory bank & weighted kNN follows this section)
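
下面是memory bank + non-parametric softmax的一个简化sketch(full softmax版本,省略了NCE近似和proximal term;类名和接口均为示意,非官方代码):

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    def __init__(self, num_samples, dim=128, tau=0.07):
        self.bank = F.normalize(torch.randn(num_samples, dim), dim=1)  # init: unit random vectors
        self.tau = tau

    def loss(self, features, indices):
        # features: [B, dim] L2-normalized f_theta(x_i); indices: [B] instance ids (the "classes")
        logits = features @ self.bank.t() / self.tau      # v_i^T v / tau against every stored vector
        return F.cross_entropy(logits, indices)           # -log P(i | v)

    @torch.no_grad()
    def update(self, features, indices):
        # only the samples in the current mini-batch get their stored vector replaced
        self.bank[indices] = F.normalize(features, dim=1)

    @torch.no_grad()
    def weighted_knn(self, feature, bank_labels, k=200):
        # test time: cosine similarity against the whole bank, then weighted voting over top-k
        sims = self.bank @ feature                          # [num_samples]
        topk_sim, topk_idx = sims.topk(k)
        weights = (topk_sim / self.tau).exp()               # alpha_i = exp(s_i / tau)
        scores = torch.zeros(int(bank_labels.max()) + 1)
        scores.index_add_(0, bank_labels[topk_idx], weights)  # w_c = sum_i alpha_i * 1(c_i = c)
        return scores.argmax()
```

test time不需要再训练线性分类器,直接用`weighted_knn`在同一个metric space里投票,对应上面的weighted kNN classifier。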

MoCo系列

发表于 2021-04-30 |

papers:

[2019 MoCo v1] Momentum Contrast for Unsupervised Visual Representation Learning,kaiming

[2020 SimCLR] A Simple Framework for Contrastive Learning of Visual Representations,Google Brain,混进来是因为它improve based on MoCo v1,而MoCo v2/v3又都是基于它改进

[2020 MoCo v2] Improved Baselines with Momentum Contrastive Learning,kaiming

[2021 MoCo v3] An Empirical Study of Training Self-Supervised Visual Transformers,kaiming

preview: 自监督学习 Self-supervised Learning

  1. reference:https://ankeshanand.com/blog/2020/01/26/contrative-self-supervised-learning.html

  2. overview

    • 就是无监督
    • 针对的痛点(有监督训练模型)
      • 标注成本高
      • 迁移性差
    • 会基于数据特点,设置Pretext tasks(最常见的任务就是生成/重建),构造Pseudo Labels来训练网络
    • 通常模型用来作为其他学习任务的预训练模型
    • 被认为是用来学习图像的通用视觉表示
  3. methods

    • 从结构上区分主要就是两大类方法

      • 生成式:通过encoder-decoder结构还原输入,监督信号是输入输出尽可能相似
        • 重建任务开销大
        • 没有建立直接的语义学习
        • 外加GAN的判别器使得任务更加复杂难训
      • 判别式:输入两张图片,通过encoder编码,监督信号是判断两张图是否相似,判别式模型也叫Contrastive Learning

    • 从Pretext tasks上划分主要分为三类

      • 基于上下文(Context based) :如bert的MLM,在句子/图片中随机扣掉一部分,然后推动模型基于上下文/语义信息预测这部分/相对位置关系
      • 基于时序(Temporal Based):如bert的NSP,视频/语音,利用相邻帧的相似性,构建不同排序的序列,判断B是否是A的下一句/是否相邻帧
      • 基于对比(Contrastive Based):比较正负样本,最大化相似度的loss在这里面被叫做InfoNCE
  4. memory-bank

    • Contrastive Based方法最常见的方式是在一个batch中构建正负样本进行对比学习
      • end-to-end
      • 每个mini-batch中的图像增强前后的两张图片互为正样本
      • 字典大小就是minibatch大小
    • memory bank包含数据集中所有样本编码后特征
      • 随机采样一部分作为keys
      • 每个迭代只更新被采样的样本编码
      • 因为样本编码来自不同的training step,一致性差
    • MoCo

      • 动态编码库:out-of-date的编码出列
      • momentum update:一致性提升

  5. InfoNCE

    • deep mind在CPC(Contrastive Predictive Coding)提出,论文以后有机会再展开

      • unsupervised
      • encoder:encode x into latent space representations z,resnet blocks
      • autoregressive model:summarize each time-step set of {z} into a context representation c,GRUs
      • probabilistic contrastive loss

        • Noise-Contrastive Estimation
        • Importance Sampling

    • 训练目标是输入数据x和context vector c之间的mutual information

      • 每次从$p(x_{t+k}|c_t)$中采样一个正样本:正样本是这个序列接下来预测的东西,和c的相似性肯定要高于不相干的token
      • 从$p(x_{t+k})$中采样N-1个负样本:负样本是别的序列里面随机采样的东西
      • 目标是让正样本与context相关性高,负样本低(a generic InfoNCE sketch follows)
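
一个generic的InfoNCE sketch(仅示意,不是CPC的原始实现;`context`/`positive`/`negatives`的命名和shape均为假设):

```python
import torch
import torch.nn.functional as F

def info_nce(context, positive, negatives, tau=0.1):
    # context: [B, D], positive: [B, D], negatives: [B, M, D]
    pos = (context * positive).sum(-1, keepdim=True)            # [B, 1] score with the positive
    neg = torch.einsum('bd,bmd->bm', context, negatives)        # [B, M] scores with the negatives
    logits = torch.cat([pos, neg], dim=1) / tau                 # the positive sits at index 0
    target = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, target)                      # -log( e^{pos} / sum e^{all} )
```
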

MoCo v1: Momentum Contrast for Unsupervised Visual Representation Learning

  1. 动机

    • unsupervised visual representation learning
    • contrastive learning
    • dynamic dictionary

      • large
      • consistent
    • verified on

      • 7 down-stream tasks
      • ImageNet classification
      • VOC & COCO det/seg
  2. 论点

    • Unsupervised representation learning

      • highly successful in NLP,in CV supervised is still the mainstream
      • 两个核心
        • pretext tasks
        • loss functions
      • loss functions
        • 生成式方法的loss是基于prediction和一个fix target来计算的
        • contrastive-based的key target则是vary on-the-fly during training
        • Adversarial losses没展开
      • pretext tasks
        • tasks involving recover:auto-encoder
        • task involving pseudo-labels:通常有个exemplar/anchor,然后计算contrastive loss
      • contrastive learning VS pretext tasks
        • 大量pretext tasks可以通过设计一些contrastive loss来实现
    • recent approaches using contrastive loss

      • dynamic dictionaries
        • 由keys组成:sampled from data & represented by an encoder
      • train the encoder to perform dictionary look-up
        • given an encoded query
        • similar to its matching key and dissimilar to others
    • desirable dictionary

      • large:better sample
      • consistent:training target consistent
    • MoCo:Momentum Contrast

      • queue
      • 每个it step的mini-batch的编码入库
      • the oldest are dequeued
      • EMA:

        • a slowly progressing key encoder
        • momentum-based moving average of the query encoder

      • similar的定义:q & k are from the same image

  3. 方法

    • contrastive learning

      • a encoded query $q$
      • a set of encoded samples $\{k_0, k_1, …\}$
      • assume:there is a single key $k_+$ in the dictionary that $q$ matches
      • similarity measurement:dot product
      • InfoNCE:
        • $L_q = -\log \frac{\exp(q \cdot k_+/\tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i/\tau)}$
        • 1 positive & K negative samples
        • 本质上是个softmax-based classifier,尝试将$q$分类成$k_+$
      • unsupervised workflow
        • with encoder networks $f_q$ & $f_k$
        • thus we have query & sample representation $q=f_q(x^q)$ & $k=f_k(x^k)$
        • inputs $x$ can be images/patches/context(patches set)
        • $f_q$ & $f_k$ can be identical/partially shared/different
    • momentum contrast

      • dictionary as a queue

        • the dictionary always represents a sampled subset of all data
        • the current mini-batch入列
        • the oldest mini-batch出列
      • momentum update

        • large dictionary没法对keys进行back-propagation:因为sample太多了

        • only $f_q$ are updated by back-propagation:mini-batch

        • naive solution:copy $f_q$的参数给$f_k$,yields poor results,因为key encoder参数变化太频繁了,representation inconsistent issue

        • momentum update:$f_k = mf_k + (1-m)f_q$,$m=0.999$

        • 三种更新方式对比

          • 第一种end-to-end method:
            • use samples in current mini-batch as the dictionary
            • keys are consistently encoded
            • dictionary size is limited
          • 第二种memory bank
            • A memory bank consists of the representations of all samples in the dataset
            • the dictionary for each mini-batch is randomly sampled from the memory bank,不进行bp,thus enables large dictionary
            • key representation is updated when it was last seen:inconsistent
            • 有些也用momentum update,但是是用在representation上,而不是encoder参数
      • pretext task

        • define positive pair:if the query and the key come from the same image
        • 我们从图上take two random views under random augmentation to form a positive pair
        • 然后用各自的encoder编码成q & k
        • 每一对计算similarity:pos similarity
        • 然后再计算input queries和dictionary的similarity:neg similarity
        • 计算ce,update $f_q$
        • 用$f_q$ update $f_k$
        • 把k加入dictionary队列
        • 把最早的mini-batch出列

        • 技术细节

          • resnet:last fc dim=128,L2 norm
          • temperature $\tau=0.07$
          • augmentation
            • random resize + random(224,224) crop
            • random color jittering
            • random horizontal flip
            • random grayscale conversion
          • shuffling BN
            • 实验发现使用resnet里面的BN会导致不好的结果:猜测是intra-batch communication引导模型学习了一种cheating的low-loss solution
            • 具体做法是给$f_k$的输入mini-batch先shuffle the order,然后进行fp,然后再shuffle back,这样$f_q$和$f_k$的BN计算的mini-batch statistics就不同了(a sketch of the full training step follows this section)
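
按上面的流程,MoCo v1的一个training step大致如下(根据论文伪代码改写的PyTorch-style sketch;省略了shuffling BN和多卡细节,`x_q`/`x_k`是同一批图片的两个random views):

```python
import torch
import torch.nn.functional as F

def moco_step(f_q, f_k, queue, x_q, x_k, optimizer, m=0.999, tau=0.07):
    # x_q / x_k: two randomly augmented views of the same mini-batch of images
    q = F.normalize(f_q(x_q), dim=1)                 # [B, dim] queries
    with torch.no_grad():
        k = F.normalize(f_k(x_k), dim=1)             # [B, dim] keys, no gradient through f_k
    l_pos = (q * k).sum(-1, keepdim=True)            # [B, 1] positive logits
    l_neg = q @ queue.t()                            # [B, K] negative logits against the queue
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positives are class 0
    loss = F.cross_entropy(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # only f_q is updated by back-propagation

    with torch.no_grad():
        # momentum update of the key encoder: theta_k <- m * theta_k + (1 - m) * theta_q
        for p_k, p_q in zip(f_k.parameters(), f_q.parameters()):
            p_k.mul_(m).add_(p_q, alpha=1 - m)
        # enqueue the new keys, dequeue the oldest mini-batch
        queue = torch.cat([queue[k.size(0):], k], dim=0)
    return loss.item(), queue
```
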
  4. 实验

SimCLR: A Simple Framework for Contrastive Learning of Visual Representations

  1. 动机

    • simplify recently proposed contrastive self-supervised learning algorithms
    • systematically study the major components
      • data augmentations
      • learnable nonlinear projection head
      • larger batch size and more training steps
    • outperform previous self-supervised & semi-supervised learning methods on ImageNet

  2. 论点

    • discriminative approaches based on contrastive learning

      • maximizing agreement between differently augmented views of the same data sample
      • via a contrastive loss in the latent space
    • major components & conclusions

      • 数据增强很重要,unsupervised比supervised benefits more
      • 引入的learnable nonlinear transformation提升了representation quality
      • contrastive cross entropy loss受益于normalized embedding和adjusted temperature parameter
      • larger batch size and more training steps很重要,unsupervised比supervised benefits more
  3. 方法

    • common framework

      • 4 major components

        • 随机数据增强
          • results in two views of the same sample,构成positive pair
          • crop + resize back + color distortions + gaussian blur
        • base encoder
          • 用啥都行,本文用了resnet including the GAP
        • a projection head
          • 将representation映射到the space where contrastive loss is applied(一个低维的latent space,例如128-d)
          • 之前有方法直接用linear projection
          • 我们用了带一个hidden layer的MLP:fc-bn-relu-fc
        • a contrastive loss
      • overall workflow

        • random sample a minibatch of N
        • random augmentation results in 2N data points
        • 对每个样本来讲,有1个positive pair,其余2(N-1)个data points都是negative samples
        • set cosine similarity $sim(u,v)=u^Tv/|u||v|$
        • given positive pair $(i,j)$ then the loss is $l_{i,j} = -log \frac{exp(s_{i,j}/\tau)}{\sum_{k\neq i}^{2N} exp(s_{i,k}/\tau)}$
        • 对每个positive pair都计算,包括$(i,j)$和$(j,i)$,即所谓的symmetrized loss(see the sketch after this list)
        • update encoder

    • training with large batch size

      • batch 8192,negatives 16382
      • 大batch时,linear learning rate scaling可能不稳定,所以用了LARS optimizer
      • global BN,aggregate BN mean & variance over all devices
      • TPU
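
SimCLR的symmetrized NT-Xent loss的一个sketch(示意实现,假设batch内2N个view按(2i, 2i+1)成对排列;LARS、global BN、projection head都不在这个函数里):

```python
import torch
import torch.nn.functional as F

def nt_xent(z, tau=0.5):
    # z: [2N, D] projection outputs, ordered so that (2i, 2i+1) are the two views of sample i
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                          # [2N, 2N] pairwise cosine similarity / tau
    sim.fill_diagonal_(float('-inf'))              # exclude k == i from the denominator
    pos = torch.arange(z.size(0)) ^ 1              # positive index of each row: 0<->1, 2<->3, ...
    return F.cross_entropy(sim, pos)               # averages l_{i,j} and l_{j,i} over all 2N rows
```
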

MoCo v2: Improved Baselines with Momentum Contrastive Learning

  1. 动机

    • still working on contrastive unsupervised learning
    • simple modifications on MoCo
      • introduce two effective SimCLR’s designs:
      • an MLP head
      • more data augmentation
      • requires smaller batch size than SimCLR,making it possible to run on GPU
    • verified on
      • ImageNet classification
      • VOC detection
  2. 论点

    • MoCo & SimCLR
      • contrastive unsupervised learning frameworks
      • MoCo v1 shows promising
      • SimCLR further reduce the gap
      • we found two design improvements in SimCLR 在两个方法中都work,而且用在MoCo中shows better transfer learning results
        • an MLP projection head
        • stronger data augmentation
      • 同时MoCo framework相比较于SimCLR ,远不需要large training batches
        • SimCLR based on end-to-end mechanism,需要比较大的batch size,来提供足够多的negative pair
        • MoCo则用了动态队列,所以不限制batch size
    • SimCLR
      • improves the end-to-end method
      • larger batch:to provide more negative samples
      • output layer:replace fc with a MLP head
      • stronger data augmentation
    • MoCo

      • a large number of negative samples are readily available
      • 所以就把后两项引入进来了

  3. 方法

    • MLP head

      • 2-layer MLP(hidden dim=2048, ReLU),见本节末尾的sketch
      • 仅影响unsupervised training,有监督transfer learning的时候换头
      • temperature param调整:从default 0.07 调整成optimal value 0.2

    • augmentation

      • add blur
      • SimCLR还用了stronger color distortion:we found stronger color distortion in SimCLR hurts in our MoCo,所以没加
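
fc换成2-layer MLP head的一个sketch(假设backbone是ResNet-50、输出128-d embedding,仅示意):

```python
import torch.nn as nn

def mlp_head(in_dim=2048, hidden_dim=2048, out_dim=128):
    # replaces the single fc head used in MoCo v1; only used during unsupervised pre-training
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )
```
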
  4. 实验

    • ablation

      • MLP:在分类任务上的提升比检测大
      • augmentation:在检测上的提升比分类大

    • comparison

      • large batches are not necessary for good acc:SimCLR longer training那个版本精度更高
      • end-to-end的方法肯定more costly in memory and time:因为要bp两个encoder

MoCo v3: An Empirical Study of Training Self-Supervised Visual Transformers

  1. 动机

    • self-supervised frameworks that based on Siamese network, including MoCo
    • ViT:study the fundamental components for training self-supervised ViT
    • MoCo v3:an incremental improvement of MoCo v1/2,striking a better balance between simplicity & accuracy & scalability
    • instability is a major issue
    • scaling up ViT models
      • ViT-Large
      • ViT-Huge
  2. 论点

    • we go back to the basics and investigate the fundamental components of training deep neural networks
      • batch size
      • learning rate
      • optimizer
    • instability
      • instability is a major issue that impacts self-supervised ViT training
      • but may not result in catastrophic failure,只会导致精度损失
      • 所以称之为hidden degradation
      • use a simple trick to improve stability:freeze the patch projection layer in ViT
      • and observe an increase in acc
    • NLP里面基于masked auto-encoding的framework效果要比基于contrastive的framework好,图像正好反过来
  3. 方法

    • MoCo v3

      • take two crops for each image under random augmentation
      • encoded by two encoders $f_q$ & $f_k$ into vectors $q$ & $k$
      • we use the keys that naturally co-exist in the same batch
        • abandon the memory queue:因为发现batch size足够大(4096)的时候,memory queue就没啥acc gain了
        • 回归到batch-based sample pair
      • 但是encoder k仍旧不回传梯度,还是基于encoder q进行动量更新
      • symmetrized loss:

        • $ctr(q_1, k_2) + ctr(q_2,k_1)$
        • InfoNCE
        • temperature
        • 两个crops分别计算ctr(见本节末尾的sketch)

    • encoder

      • encoder $f_q$
        • a backbone
        • a projection head
        • an extra prediction head
      • encoder $f_k$
        • a backbone
        • a projection head
      • encoder $f_k$ is updated by the moving average of $f_q$,excluding the prediction head
    • baseline acc

      • basic settings,主要变动就是两个:
        • dynamic queue换成large batch
        • encoder $f_q$的extra prediction head
    • use ViT

      • 直接用ViT替换resnet backbone,遇到了instability issue

      • batch size

        • ViT里面的一个观点就是,model本身比较heavy,所以large batch is desirable

        • 实验发现

          • a batch of 1k & 2k produces reasonably smooth curves:In this regime, the larger batch improves accuracy thanks to more negative samples
          • a batch of 4k 有明显的unstable dips:
          • a batch of 6k has worse failure patterns:我们解读为在跳水点,training is partially restarted and jumps out of the current local optimum

      • learning rate

        • lr较小,training比较稳定,但是容易欠拟合
        • lr过大,会导致unstable,也会影响acc
        • 总体来说精度还是决定于stability

      • optimizer

        • default AdamW,batch size 4096
        • 有些方法用了LARS & LAMB for large-batch training
        • LAMB

          • sensitive to lr
          • optimal lr achieves slightly better accuracy than AdamW
          • 但是lr一旦过大,acc极速drop
          • 但是training curves still smooth,虽然中间过程有drop:我们解读为LAMB can avoid sudden change in the gradients,但是避免不了negative impact,还是会累加

      • a trick for improving stability

        • we found a spike in gradient causes a dip in the training curve
        • we also observe that gradient spikes happen earlier in the first layer (patch projection)
        • 所以尝试freezing the patch projection layer during training,也就是一个random的patch projection layer

          • This stability benefits the final accuracy
          • The improvement is bigger for a larger lr
          • 在别的ViT-back-framework上也有效(SimCLR、BYOL)

      • we also tried BN,WN,gradient clip

        • BN/WN does not improve
        • gradient clip在threshold足够小的时候有用,推到极限就是freezing了
    • implementation details

      • AdamW
      • batch size 4096
      • lr:warmup 40 eps then cosine decay
    • MLP heads

      • projection head:3-layers,4096-BN-ReLU-4096-BN-ReLU-256
      • prediction head:2-layers,4096-BN-ReLU-256
    • loss

      • ctr里面有个scale的参数,$2\tau$
      • makes it less sensitive to $\tau$ value
      • $\tau=0.2$
    • ViT architecture

      • 跟原论文保持一致
      • 输入是224x224的image,划分成16x16或14x14的patch,对应196或256个patch token,每个patch再linear project成embedding
      • 加上sine-cosine-2D的PE
      • 再concat一个cls token
      • 经过一系列transformer blocks
      • The class token after the last block (and after the final LayerNorm) is treated as the output of the backbone,and is the input to the MLP heads
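
MoCo v3的symmetrized loss(带$2\tau$ scaling)和freeze patch projection这个trick的sketch(示意代码,`vit.patch_embed`等属性名为假设,非官方实现):

```python
import torch
import torch.nn.functional as F

def ctr(q, k, tau=0.2):
    # q: [N, D] queries, k: [N, D] keys; positives are the in-batch diagonal pairs (no queue)
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    logits = q @ k.t() / tau
    labels = torch.arange(q.size(0))
    return 2 * tau * F.cross_entropy(logits, labels)     # 2*tau scaling,如上所述

def moco_v3_loss(q1, q2, k1, k2):
    # q1/q2: outputs of f_q (backbone + projection + prediction head) on the two crops
    # k1/k2: outputs of the momentum encoder f_k on the two crops
    return ctr(q1, k2) + ctr(q2, k1)                     # symmetrized loss

def freeze_patch_projection(vit):
    # stability trick: keep the (randomly initialized) patch projection fixed during training
    # 假设ViT module把patch embedding暴露为 vit.patch_embed
    for p in vit.patch_embed.parameters():
        p.requires_grad = False
```
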

optimizers优化器

发表于 2021-03-15 |

0. overview

keywords:SGD, moment, Nesterov, adaptive, ADAM, Weight decay

  1. 优化问题Optimization
    • to minimize目标函数
    • gradient descent
      • gradient
        • numerical:数值法,approx,slow
        • analytical:解析法,exact,fast
      • Stochastic
        • 用minibatch的梯度来approximate全集
        • $\theta_{t+1} = \theta_t - \eta_t(x_i,y_i)$
      • classic optimizers:SGD,Momentum,Nesterov‘s momentum
      • adaptive optimizers:AdaGrad,Adadelta,RMSProp,Adam
    • Newton
    • modern optimizers for large-batch
      • AdamW
      • LARS
      • LAMB
  1. common updating steps

    for current step t:

    step1:计算直接梯度,$g_t = \nabla f(w_t)$

    step2:计算一阶动量和二阶动量,$m_t \& V_t$

    step3:计算当前时刻的下降梯度,$\eta_t = \alpha m_t/\sqrt {V_t}$

    step4:参数更新,$w_{t+1} = w_t - \eta_t$

    • 各种优化算法的主要差别在step1和step2上(下面给出这四步的一个可运行sketch)
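
把上面四个step写成可运行的代码(这里$m_t$/$V_t$取了Adam式的滑动平均,只是为了让框架完整可跑,并非某个优化器的官方实现):

```python
import torch

def step(w, g, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # step1: g_t is the raw gradient, passed in as `g`
    # step2: first / second moments; here both are exponential moving averages (Adam-style choice)
    state['m'] = beta1 * state['m'] + (1 - beta1) * g
    state['V'] = beta2 * state['V'] + (1 - beta2) * g * g
    # step3: descent step eta_t = alpha * m_t / sqrt(V_t)
    eta = lr * state['m'] / (state['V'].sqrt() + eps)
    # step4: parameter update
    return w - eta

# usage: state = {'m': torch.zeros_like(w), 'V': torch.zeros_like(w)}; w = step(w, grad, state)
```
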
  2. 滑动平均/指数加权平均/moving average/EMA

    • 局部均值,与一段时间内的历史相关
    • $v_t = \beta v_{t-1}+(1-\beta)\theta_t$,大致等于过去$1/(1-\beta)$个时刻的$\theta$的平均值,但是在起始点附近偏差较大
    • $v_{tbiased} = \frac{v_t}{1-\beta^t}$,做了bias correction
    • t越大,越不需要修正,两个滑动均值的结果越接近
    • 优缺点:不用保存历史,但是近似(sketch见下)
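
滑动平均和bias correction的一个纯Python sketch:

```python
def ema(values, beta=0.9):
    # returns the bias-corrected moving averages of a list of scalars
    v, out = 0.0, []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta      # v_t = beta * v_{t-1} + (1 - beta) * theta_t
        out.append(v / (1 - beta ** t))        # bias correction: v_t / (1 - beta^t)
    return out
```
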

  3. SGD

    • SGD没有动量的概念,$m_t=g_t$,$V_t=I^2$,$w_{t+1} = w_t - \alpha g_t$
    • 仅依赖当前计算的梯度
    • 缺点:下降速度慢,可能陷在local optima上持续震荡
  1. SGDW (with weight decay)

    • 在权重更新的同时进行权重衰减
    • $w_{t+1} = (1-\lambda)w_t - \alpha g_t$
    • 在SGD form的优化器中weight decay等价于在loss上L2 regularization
    • 但是在adaptive form的优化器中是不等价的!!因为historical func(ERM)中regularizer和gradient一起被downscale了,因此not as much as they would get regularized in SGDW
  2. SGD with Momentum

    • 引入一阶动量,$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$,使用滑动均值,抑制震荡
    • 梯度下降的主要方向是此前累积的下降方向,略微向当前时刻的方向调整
  3. SGD with Nesterov Acceleration

    • look ahead SGD-momentum
    • 在local minima的时候,四周没有下降的方向,但是如果走一步再看,可能就会找到优化方向
    • 先跟着累积动量走一步,求梯度:$g_t = \nabla f(w_t-\alpha m_{t-1}/\sqrt {V_{t-1}})$
    • 用这个点的梯度方向来计算滑动平均,并更新梯度
  4. Adagrad

    • 引入二阶动量,开启“自适应学习率”,$V_t = \sum_{k=0}^t g_k^2$,度量历史更新频率
    • 对于经常更新的参数,我们已经积累了大量关于它的知识,不希望被单个样本影响太大,希望学习速率慢一些;对于偶尔更新的参数,我们了解的信息太少,希望能从每个偶然出现的样本身上多学一些,即学习速率大一些
    • $\eta_t = \alpha m_t / \sqrt{V_t}$,本质上为每个参数,对学习率分别rescale
    • 缺点:二阶动量单调递增,导致学习率单调衰减,可能会使得训练过程提前结束
  5. AdaDelta/RMSProp

    • 参考momentum,对二阶动量也计算滑动平均,$V_t = \beta_2 V_{t-1} + (1-\beta_2)g_t^2$
    • 避免了二阶动量持续累积、导致训练过程提前结束
  6. Adam

    • 集大成者:把一阶动量和二阶动量都用起来,Adaptive Momentum
      • SGD-M在SGD基础上增加了一阶动量
      • AdaGrad和AdaDelta在SGD基础上增加了二阶动量
    • 一阶动量滑动平均:$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$
    • 二阶动量滑动平均:$V_t = \beta_2 V_{t-1} + (1-\beta_2)g_t^2$
  7. Nadam

    • look ahead Adam
    • 把Nesterov的one step try加上:$g_t = \nabla f(w_t-\alpha m_{t-1}/\sqrt {V_{t-1}})$
    • 再Adam更新两个动量
  8. 经验超参

    • $momentum=0.9$
    • $\beta_1=0.9$
    • $\beta_2=0.999$
    • $m_0 = 0$
    • $V_0 = 0$
    • 由于$m_0=V_0=0$,初期的$m_t$和$V_t$会无限接近于0,此时可以进行误差修正:$factor=\frac{1}{1-\beta^t}$
  9. AdamW

    • 在adaptive methods中,解耦weight-decay和loss-based gradient在ERM过程中被一起downscale的绑定关系
    • 实质就是把weight decay项从梯度中移出、直接作用在权重更新上(对比sketch见下)
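
L2 regularization(把decay折进梯度)和AdamW的decoupled weight decay的对比sketch(省略bias correction,仅示意):

```python
import torch

def adam_l2_step(w, g, state, lr, lam, beta1=0.9, beta2=0.999, eps=1e-8):
    g = g + lam * w                                          # L2 reg: decay folded into the gradient
    state['m'] = beta1 * state['m'] + (1 - beta1) * g
    state['V'] = beta2 * state['V'] + (1 - beta2) * g * g
    return w - lr * state['m'] / (state['V'].sqrt() + eps)   # decay项也被1/sqrt(V_t) downscale

def adamw_step(w, g, state, lr, lam, beta1=0.9, beta2=0.999, eps=1e-8):
    state['m'] = beta1 * state['m'] + (1 - beta1) * g
    state['V'] = beta2 * state['V'] + (1 - beta2) * g * g
    # decoupled weight decay: 直接作用在权重上,不经过adaptive rescale
    return w - lr * state['m'] / (state['V'].sqrt() + eps) - lr * lam * w
```
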

regnet

发表于 2021-03-11 |

RegNet: Designing Network Design Spaces

  1. 动机

    • study the network design principles
    • design RegNet
    • outperforms efficientNet and 5x faster
      • top1 error:20.1 (eff-b5:21.5)
      • larger batch size
      • 1/4 的 train/test latency
  2. 论点

    • manual network design
      • AlexNet, ResNet family, DenseNet, MobileNet
      • focus on discovering new design choices that improve acc
    • the recent popular approach NAS
      • search the best in a fixed search space of possible networks
      • limitations:难以generalize to new settings,lack of interpretability
    • network scaling
      • 上面两个focus on 找出一个basenet for a specific regime
      • scaling rules aims at tuning the optimal network in any target regime
    • comparing networks
      • the reliable comparison metric to guide the design process
    • our method
      • combines the advantages of manual design and NAS
      • first AnyNet
      • then RegNet
  3. 方法

mongodb

发表于 2021-03-09 |
  1. download:https://www.mongodb.com/try/download/enterprise

  2. install

    # 将解压以后的文件夹放在/usr/local下
    sudo mv mongodb-osx-x86_64-4.0.9/ /usr/local/
    sudo ln -s mongodb-macos-x86_64-4.4.4 mongodb

    # ENV PATH
    export PATH=/usr/local/mongodb/bin:$PATH

    # 创建日志及数据存放的目录
    sudo mkdir -p /usr/local/var/mongodb
    sudo mkdir -p /usr/local/var/log/mongodb
    sudo chown [amber] /usr/local/var/mongodb
    sudo chown [amber] /usr/local/var/log/mongodb
  3. configuration

    # 后台启动
    mongod --dbpath /usr/local/var/mongodb --logpath /usr/local/var/log/mongodb/mongo.log --fork

    # 控制台启动
    mongod --config /usr/local/etc/mongod.conf

    # 查看状态
    ps aux | grep -v grep | grep mongod
  4. run

    # 在db环境下启动一个终端
    cd /usr/local/mongodb/bin
    ./mongo
  5. original settings

    # 显示所有数据的列表
    > show dbs
    admin 0.000GB
    config 0.000GB
    local 0.000GB
    # 三个系统保留的特殊数据库

    # 连接/创建一个指定的数据库
    > use local
    switched to db local

    # 显示当前数据库, 如果没use默认为test
    > db
    test

    # 【!!重要】关闭服务
    之前服务器被kill -9强制关闭,数据库丢失了
    > use admin
    switched to db admin
    > db.shutdownServer()
    server should be down...
  6. concepts

  7. 文档document

    一组key-value对,如上面左图中的一行记录,如上面右图中的一个dict

  8. 集合collection

    一张表,如上面左图和上面右图

  9. 主键primary key

    唯一主键,ObjectId类型,自定生成,有标准格式

  10. 常用命令

    10.1 创建/删除/重命名db

    # 切换至数据库test1
    > use test1

    # 插入一条doc, db.COLLECTION_NAME.insert(document)
    # db要包含至少一条文档,才能在show dbs的时候显示(才真正创建)
    > db.sheet1.insert({'name': 'img0'})

    # 显示当前已有数据库
    > show dbs

    # 删除指定数据库
    > use test1
    > db.dropDatabase()

    # 旧版本(before4.0)重命名:先拷贝一份,再删除旧的
    > db.copyDatabase('OLDNAME', 'NEWNAME');
    > use old_name
    > db.dropDatabase()
    # 新版本重命名:dump&restore,这个东西在mongodb tools里面,要另外下载,可执行文件放在bin下
    # mongodump # 将所有数据库导出到bin/dump/以每个db名字命名的文件夹下
    # mongodump -h dbhost -d dbname -o dbdirectory
    # -h: 服务器地址:端口号
    # -d: 需要备份的数据库
    # -o: 存放位置(需要已存在)
    mongodump -d test -o tmp/
    # 在恢复备份数据库的时候换个名字:mongorestore -h dbhost -d dbname path
    mongorestore -d test_bkp tmp/test
    # 这时候可以看到一个新增了一个叫test_bkp的db

    10.2 创建/删除/重命名collection

    # 创建:db.createCollection(name, options)
    > db.createCollection('case2img')

    # 显示已有tables
    > show collections

    # 不用显示创建,在db insert的时候会自动创建集合
    > db.sheet2.insert({"name" : "img2"})

    # 删除:db.COLLECTION_NAME.drop()
    > db.sheet2.drop()

    # 重命名:db.COLLECTION_NAME.renameCollection('NEWNAME')
    > db.sheet2.renameCollection('sheet3')
    # 复制:db.COLLECTION_NAME.aggregate({$out: 'NEWNAME'})
    > db.sheet2.aggregate({ $out : "sheet3" })

    10.3 插入/显示/更新/删除document

    # 插入
    db.COLLECTION_NAME.insert(document)
    db.COLLECTION_NAME.save(document)
    db.COLLECTION_NAME.insertOne()
    db.COLLECTION_NAME.insertMany()

    # 显示已有doc
    db.COLLECTION_NAME.find()

    # 更新doc的部分内容
    db.COLLECTION_NAME.update(
    <query>, # 查询条件
    <update>, # 更新操作
    {
    upsert: <boolean>, # if true 如果不存在则插入
    multi: <boolean>, # find fist/all match
    writeConcern: <document>
    }
    )
    > db.case2img.insert({"case": "s0", "name": "img0"})
    > db.case2img.insert({"case": "s1", "name": "img1"})
    > db.case2img.find()
    > db.case2img.update({'case': 's1'}, {$set: {'case': 's2', 'name': 'img2'}})
    > db.case2img.find()

    # 给doc的某个key重命名
    db.COLLECTION_NAME.updateMany(
    {},
    {'$rename': {"old_key": "new_key"}}
    )

    # 更新整条文档by object_id
    db.COLLECTION_NAME.save(
    <document>,
    {
    writeConcern: <document>
    }
    )
    > db.case2img.save({"_id": ObjectId("60474e4b77e21bad9bd4655a"), "case":"s3", "name":"img3"})

    # 删除满足条件的doc
    db.COLLECTION_NAME.remove(
    <query>,
    {
    justOne: <boolean>, # find fist/all match
    writeConcern: <document>
    }
    )
    > db.case2img.remove({"case": "s0"})

    # 删除所有doc
    > db.case2img.remove({})

    10.4 简单查询find

    > db.case2img.insert({"case": "s0", "name": "img0"})
    > db.case2img.insert({"case": "s1", "name": "img1"})
    > db.case2img.insert({"case": "s2", "name": "img2"})
    > db.case2img.insert({"case": "s2", "name": "img3"})

    # 查询表中的doc:db.COLLECTION_NAME.find({query})
    > db.case2img.find({'case': 's2'})
    > db.case2img.find({'case': 's1'}, {"name":1}) # projection的value在对应的key-value是list的时候有意义

    # 格式化显示查询结果:db.COLLECTION_NAME.find({query}).pretty()
    > db.case2img.find({'case': 's2'}).pretty()

    # 读取指定数量的数据记录:db.COLLECTION_NAME.find({query}).limit(NUMBER)
    > db.case2img.find({'case': {$type: 'string'}}).limit(1)

    # 跳过指定数量的数据:db.COLLECTION_NAME.find({query}).skip(NUMBER)
    > db.case2img.find({'case': {$type: 'string'}}).skip(1)

    10.5 条件操作符

    (>) 大于 - $gt
    (<) 小于 - $lt
    (>=) 大于等于 - $gte
    (<=) 小于等于 - $lte
    (or) 或 - $or

    > db.case2img.update({'case':'s1'}, {$set: {"name":'img1', 'size':100}})
    WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
    > db.case2img.update({'case':'s2'}, {$set: {"name":'img2', 'size':200}})

    # 查询size>150的doc
    > db.case2img.find({'size': {$gt: 150}})

    # 查询满足任意一个条件的doc
    > db.case2img.find({'$or': [{'case':'s1'}, {'size': {$gt: 150}}]})

    10.6 数据类型操作符

    type(KEY)等于 - $type

    # 比较对象可以是字符串/对应的reflect NUM
    > db.case2img.find({'case': {$type: 'string'}})
    > db.case2img.find({'case': {$type: 2}})

    10.7 排序find().sort

    # 通过指定字段&指定升序/降序来对数据排序:db.COLLECTION_NAME.find().sort({KEY:1/-1})
    > db.case2img.find().sort({'name':1})

    # skip(), limit(), sort()三个放在一起执行的时候,执行的顺序是先 sort(), 然后是 skip(),最后是显示的 limit()。

    10.8 索引

    skip

    10.9 聚合aggregate

    # 用于统计 db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)

    # by group
    > db.case2img.aggregate([{$group: {_id: '$case', img_num:{$sum:1}}}])
    group by key value 'case'
    count number of items in each group
    refer to the number as img_num

    > db.case2img.aggregate([{$group: {_id: '$case', img_num:{$sum:'$size'}}}])
    计算每一个group内,size值的总和

    # by match
    > db.case2img.aggregate([{$match: {'size': {$gt:150}}},
    {$group:{_id: null, totalsize: {$sum: '$size'}}}])
    类似shell的管道,match用来筛选条件,符合条件的送入下一步统计

    > db.case2img.aggregate([{$skip: 4},
    {$group:{_id: null, totalsize: {$sum: '$size'}}}])
  11. 快速统计distinct

    db.case2img.distinct(TAG_NAME)
    # 注意如果distinct的内容太长,超过16M,会报distinct too big的error,推荐用聚合来做统计
12. pymongo

用python代码来操作数据库

先安装:pip install pymongo

11.1 连接client

```python
from pymongo import MongoClient
Client = MongoClient()
```

11.2 获取数据库

db = Client.DB_NAME
db = Client['DB_NAME']

11.3 获取collection

collection = db.COLLECTION_NAME
collection = db['COLLECTION_NAME']

11.4 插入doc

# insert one
document1 = {'x':1}
document2 = {'x':2}
post_1 = collection.insert_one(document1).inserted_id
post_2 = collection.insert_one(document2).inserted_id
print(post_1)

# insert many
new_document = [{'x':1},{'x':2}]
# new_document = [document1,document2] 注意这里传入的是引用而非深拷贝,同一个dict对象只能作为一条doc被插入一次(insert后会被加上_id)
result = collection.insert_many(new_document).inserted_ids
print(result)

11.5 查找

from bson.objectid import ObjectId

# find one 返回一条doc
result = collection.find_one()
result = collection.find_one({'case': 's0'})
result = collection.find_one({'_id': ObjectId('604752f277e21bad9bd46560')})

# find 返回一个迭代器
for _, item in enumerate(collection.find()):
print(item)

11.6 更新

# update one
collection.update_one({'case':'s1'},{'$set':{'size':300}})
collection.update_one({'case':'s1'},{'$push':{'add':1}}) # 追加数组内容

# update many
collection.update_many({'case':'s1'},{'$set':{'size':300}})

11.7 删除

# 在mongo shell里面是remove方法,在pymongo里面被deprecated成delete方法
collection.delete_one({"case": "s2"})
collection.delete_many({"case": "s1"})

11.8 统计

# 计数:count方法已经被重构
print(collection.count_documents({'case':'s0'}))

# unique:distinct方法
print(collection.distinct('case'))

11.9 正则

mongo shell命令行里的正则和pymongo脚本里的正则写法是不一样的,因为python里面有封装正则方法,然后通过bson将python的正则转换成数据库的正则

# pymongo
import re
import bson

pattern = re.compile(r'(.*)-0[345]-(.*)')
regex = bson.regex.Regex.from_native(pattern)
result = collection.aggregate([{'$match': {'date': regex}}])

# mongo shell
> db.collection.find({date:{$regex:"(.*)-0[345]-(.*)"}})

docker

发表于 2021-03-04 |
  1. startup

    • 部署方案

      • 古早年代

      • 虚拟机

      • docker

    • image镜像 & container容器 & registry仓库

      • 镜像:相当于是一个 root 文件系统,提供容器运行时所需的程序、库、资源、配置等
      • 容器:镜像运行时的实体,可以被创建、启动、停止、删除、暂停等
      • 仓库:用来保存镜像
        • 官方仓库:docker hub:https://hub.docker.com/r/floydhub/tensorflow/tags?page=1&ordering=last_updated
  2. 常用命令

    • 拉镜像
      • docker pull [选项] [Docker Registry 地址[:端口号]/]仓库名[:标签]
      • 地址可以是官方地址,也可以是第三方(如Harbor)
      • 仓库名由作者名和软件名组成(如zhangruiming/skin)
      • 标签用来指定某个版本的image,省略则默认latest
    • 列出所有镜像
      • docker images
    • 删除镜像
      • docker rmi [-f] [镜像id]
      • 删除镜像之前要kill/rm所有使用该镜像的container:docker rm [容器id]
    • 运行镜像并创建一个容器
      • docker run [-it] [仓库名] [命令]
      • 选项 -it:为容器配置一个交互终端
      • 选项 -d:后台运行容器,并返回容器ID(不直接进入终端)
      • 选项 --name='xxx':为容器指定一个名称
      • 选项-v /host_dir:/container_dir:将主机上指定目录映射到容器的指定目录
      • [命令]参数必须要加,而且要是那种一直挂起的命令(/bin/bash),如果是ls/cd/直接不填,那么命令运行完容器就会停止运行,docker ps -a查看状态,发现都是Exited
    • 创建容器
      • docker run
    • 查看所有容器
      • docker ps
    • 启动一个已经停止的容器/停止正在运行的容器
      • docker start [容器id]
      • docker stop [容器id]
    • 进入容器
      • docker exec -it [容器id] [linux命令]
    • 删除容器
      • docker rm [容器id]
    • 删除所有不活跃的容器
      • docker container prune
    • 提交镜像到远端仓库
      • docker tag [镜像id] [用户名]/[仓库]:[标签] # 重命名
      • docker login # 登陆用户
      • docker push
  3. 案例

    # 拉镜像
    docker pull 地址/仓库:标签

    # 显示镜像
    docker images

    # 运行指定镜像
    docker run -itd --name='test' 地址/仓库:标签

    # 查看运行的容器
    docker ps

    # 进入容器
    docker exec -it 容器id或name /bin/bash

    # 一顿操作完退出容器
    exit

    # 将修改后的容器保存为镜像
    docker commit 容器id或name 新镜像名字
    docker images可以看到这个镜像了

    # 保存镜像到本地
    docker save -o tf_torch.tar tf_torch

    # 还原镜像
    docker load --input tf_torch.tar

    # 重命名镜像
    docker tag 3db0b2f40a70 amberzzzz/tf1.14_torch1.4_cuda10.0:v1

    # 提交镜像
    docker push amberzzzz/tf1.14-torch0.5-cuda10.0:v1
    1. dockerfile

      • Dockerfile 是用来说明如何自动构建 docker image 的指令集文件

      • 常用命令

        • FROM image_name,指定依赖的镜像
        • RUN command,在 shell 或者 exec 的环境下执行的命令
        • COPY srcfile_path_inhost dstfile_incontainer,将本机文件复制到容器中
        • ADD srcfile_path dstfile_incontainer,将本机文件复制到容器中,src文件不仅可以是local host,也可以是网络地址
        • CMD [“executable”,”param1”,”param2”],指定容器启动默认执行的命令
        • WORKDIR path_incontainer,指定 RUN、CMD 与 ENTRYPOINT 命令的工作目录
        • VOLUME [“/data”],授权访问从容器内到主机上的目录
      • basic image

        • 从nvidia docker开始:https://hub.docker.com/r/nvidia/cuda/tags?page=1&ordering=last_updated&name=10.

        • 选一个喜欢的:如docker pull nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04

        • 然后编辑dockerfile

          FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
          MAINTAINER amber <amber.zhang@tum.de>

          # install basic dependencies
          RUN apt-get update
          RUN apt-get install -y wget vim cmake

          # install Anaconda3
          RUN wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-5.2.0-Linux-x86_64.sh -O ~/anaconda3.sh
          RUN bash ~/anaconda3.sh -b -p /home/anaconda3 && rm ~/anaconda3.sh
          ENV PATH /home/anaconda3/bin:$PATH
          # RUN echo "export PATH=/home/anaconda3/bin:$PATH" >> ~/.bashrc && /bin/bash -c "source /root/.bashrc"

          # change mirror
          RUN mkdir ~/.pip \
          && cd ~/.pip
          RUN echo '[global]\nindex-url = https://pypi.tuna.tsinghua.edu.cn/simple/' >> ~/.pip/pip.conf

          # install tensorflow
          RUN /home/anaconda3/bin/pip install tensorflow-gpu==1.8.0
        • 然后build dockerfile

          docker build -t <docker_name> .

layer norm

发表于 2021-03-02 |

综述

  1. papers

[batch norm 2015] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,inceptionV2,Google Team,归一化层的始祖,加速训练&正则,BN被后辈追着打的主要痛点:approximation by mini-batch,test phase frozen

[layer norm 2016] Layer Normalization,Toronto+Google,针对BN不适用small batch和RNN的问题,主要用于RNN,在CNN上不好,在test的时候也是active的,因为mean&variance由于当前数据决定,有负责rescale和reshift的layer params

[weight norm 2016] Weight normalization: A simple reparameterization to accelerate training of deep neural networks,OpenAI,

[cosine norm 2017] Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks,中科院,

[instance norm 2017] Instance Normalization: The Missing Ingredient for Fast Stylization,高校report,针对风格迁移,IN在test的时候也是active的,而不是freeze的,单纯的instance-independent norm,没有layer params

[group norm 2018] Group Normalization,FAIR Kaiming,针对BN在small batch上性能下降的问题,提出batch-independent的

[weight standardization 2019] Weight Standardization,Johns Hopkins,

[batch-channel normalization & weight standardization 2020] BCN&WS: Micro-Batch Training with Batch-Channel Normalization and Weight Standardization,Johns Hopkins,

  1. why Normalization

    • 独立同分布:independently and identically distributed

    • 白化:whitening([PCA whitening](http://ufldl.stanford.edu/tutorial/unsupervised/PCAWhitening/))

      • 去除特征之间的相关性
      • 使所有特征具有相同的均值和方差
    • 样本分布变化:Internal Covariate Shift

      • 对于神经网络的各层输入,由于stacking internal byproduct,每层的分布显然各不相同,但是对于某个特定的样本输入,他们所指示的label是不变的
      • 即源空间和目标空间的条件概率是一致的,但是边缘概率是不同的

      • 每个神经元的数据不再是独立同分布,网络需要不断适应新的分布,上层神经元容易饱和:网络训练又慢又不稳定

  2. how to Normalization

    • preparation

      • unit:一个神经元(一个op),输入[b,N,C_in],输出[b,N,1]
      • layer:一层的神经元(一系列op,$W\in R^{M*N}$),在channel-dim上concat当前层所有unit的输出[b,N,C_out]
      • dims
        • b:batch dimension
        • N:spatial dimension,1/2/3-dims
        • C:channel dimension
      • unified representation:本质上都是对数据在规范化
        • $h = f(g*\frac{x-\mu}{\sigma}+b)$:先归一化,再rescale & reshift
        • $\mu$ & $\sigma$:compute from上一层的特征值
        • $g$ & $b$:learnable params基于当前层
        • $f$:neurons’ weighting operation
        • 各方法的主要区别在于mean & variance的计算维度
    • 对数据

      • BN:以一层每个神经元的输出为单位,即每个channel的mean&var相互独立
      • LN:以一层所有神经元的输出为单位,即每个sample的mean&var相互独立
      • IN:以每个sample在每个神经元的输出为单位,每个sample在每个channel的mean&var都相互独立
      • GN:以每个sample在一组神经元的输出为单位,一组包含一个神经元的时候变成IN,一组包含一层所有神经元的时候就是LN
      • 示意图:

    • 对权重

      • WN:将权重分解为单位向量和一个固定标量,相当于神经元的任意输入vec点乘了一个单位vec(downscale),再rescale,进一步地相当于没有做shift和reshift的数据normalization
      • WS:对权重做全套(归一化再recale),比WN多了shift,“zero-center is the key”
    • 对op

      • CosN:

        • 将线性变换op替换成cos op:$f_w(x) = \cos\theta = \frac{w \cdot x}{|w||x|}$

        • 数学本质上又退化成了只有downscale的变换,表征能力不足

Whitening白化

  1. purpose

    • images的adjacent pixel values are highly correlated,thus redundant
    • linearly move the origin distribution,making the inputs share the same mean & variance
  2. method

    • 首先进行PCA预处理,去掉correlation

      • mean on sample(注意不是mean on image)

      • 协方差矩阵

      • 奇异值分解svd(S)

        • $\Sigma$为对角矩阵,对角上的元素为奇异值
        • $U=[u_1,u_2,…u_N]$中是奇异值对应的正交向量
      • 投影变换

        • 取投影矩阵$U_p$ from $U$,$U_p \in R^{N*d}$表示将数据空间从N维投影到$U_p$所在的d维空间上
      • recover(投影逆变换)

        • 取投影矩阵$U_r=U_p^T$,就是将数据空间从d维空间再投影回N维空间上

    • PCA白化:

      • 对PCA投影后的新坐标,做归一化处理:基于特征值进行缩放
        $$
        X_{PCAwhite} = \Sigma^{-\frac{1}{2}}X^{'} = \Sigma^{-\frac{1}{2}}U^TX
        $$

      • $X_{PCAwhite}$的协方差矩阵$S_{PCAwhite} = I$,因此是去了correlation的

    • ZCA白化:在上一步做完之后,再把它变换到原始空间,所以ZCA白化后的特征图更接近原始数据

      • 对PCA白化后的数据,再做一步recover
        $$
        X_{ZCAwhite} = U X_{PCAwhite}
        $$

      • 协方差矩阵仍旧是I,合法白化(下面给出numpy sketch)
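
按上面的步骤,PCA/ZCA白化的一个numpy sketch(假设X的shape是[特征维数, 样本数]):

```python
import numpy as np

def whiten(X, eps=1e-5):
    X = X - X.mean(axis=1, keepdims=True)                 # zero-center each feature over the samples
    S = X @ X.T / X.shape[1]                              # covariance matrix
    U, sigma, _ = np.linalg.svd(S)                        # S = U diag(sigma) U^T
    X_rot = U.T @ X                                       # project onto the principal axes
    X_pca_white = X_rot / np.sqrt(sigma[:, None] + eps)   # Sigma^{-1/2} U^T X
    X_zca_white = U @ X_pca_white                         # rotate back to the original space
    return X_pca_white, X_zca_white
```
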

Layer Normalization

  1. 动机

    • BN reduces training time

      • compute by each neuron
      • require moving average
      • depend on mini-batch size
      • how to apply to recurrent neural nets
    • propose layer norm

      • [unlike BN] compute by each layer
      • [like BN] with adaptive bias & gain
      • [unlike BN] perform the same computation at training & test time
      • [unlike BN] straightforward to apply to recurrent nets
      • work well for RNNs
  2. 论点

    • BN
      • reduce training time & serves as regularizer
      • require moving average:introduce dependencies between training cases
      • the approximation of the mean & variance expectations puts constraints on the size of a mini-batch
    • intuition
      • norm layer提升训练速度的核心是限制神经元输入输出的变化幅度,稳定梯度
      • 只要控制数据分布,就能保持训练速度
  3. 方法

    • compute over all hidden units in the same layer
    • different training cases have different normalization terms
    • 没啥好说的,就是在channel维度计算norm
    • further的GN把channel维度分组做norm,IN则直接在每个sample的每个channel上计算norm
    • gain & bias
      • 也是在对应维度:(hwd)c-dim
      • https://tobiaslee.top/2019/11/21/understanding-layernorm/
      • 后续有实验发现,去掉两个learnable rescale params反而提点
      • 考虑是在training set上的过拟合
  4. 实验

    • RNN上有用
    • CNN上比没有norm layer好,但是没有BN好:因为channel是特征维度,特征维度之间有明显的有用/没用,不能简单的norm

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

  1. 动机
    • reparameterizing the weights
      • decouple length & direction
      • no dependency between samples which suits well for
        • recurrent
        • reinforcement
        • generative
    • no additional memory and computation
    • verified on
      • MLP with CIFAR
      • generative model VAE & DRAW
      • reinforcement DQN
  2. 论点
    • a neuron:
      • get inputs from former layers(neurons)
      • weighted sum over the inputs
      • add a bias
      • elementwise nonlinear transformation
      • batch outputs:one value per sample
    • intuition of normalization:
      • give gradients that are more like whitened natural gradients
      • BN:make the outputs of each neuron服从std norm
      • our WN:
        • inspired by BN
        • does not share BN’s across-sample property
        • no addition memory and tiny addition computation

Instance Normalization: The Missing Ingredient for Fast Stylization

  1. 动机

    • stylization:针对风格迁移网络
    • with a small change:swapping BN with IN
    • achieve qualitative improvement
  2. 论点

    • stylized image
      • a content image + a style image
      • both style and content statistics are obtained from a pretrained CNN for image classification
      • methods
        • optimization-based:iterative thus computationally inefficient
        • generator-based:single pass but never as good as
    • our work
      • revisit the feed-forward method
      • replace BN in the generator with IN
      • keep them at test time as opposed to freeze
  3. 方法

    • formulation

      • given a fixed style image $x_0$
      • given a set of content images $x_t, t= 1,2,…,n$
      • given a pre-trained CNN
      • with a variable z controlling the generation of stylization results
      • compute the stylized image g($x_t$, z)
      • compare the statistics:$min_g \frac{1}{n} \sum^n_{t=1} L(x_0, x_t, g(x_t, z))$
      • comparing target:the contrast of the stylized image is similar to the contrast of the style image
    • observations

      • the more training examples, the poorer the qualitative results
      • the result of stylization still depends on the contrast of the content image
    • intuition

      • 风格迁移本质上就是将style image的contrast用在content image上:也就是rescale content image的contrast
      • contrast是per sample的:$\frac{pixel}{\sum pixels\ on\ the\ map}$

      • BN在norm的时候将batch samples搅合在了一起

    • IN

      • instance-specific normalization
      • also known as contrast normalization

      • 就是per image做标准化,没有trainable/frozen params,在test phase也一样用

Group Normalization

  1. 动机

    • for small batch size
    • do normalization in channel groups
    • batch-independent
    • behaves stably over different batch sizes
    • approach BN’s accuracy

  2. 论点

    • BN
      • requires sufficiently large batch size (e.g. 32)
      • Mask R-CNN frameworks use a batch size of 1 or 2 images because of higher resolution, where BN is “frozen” by transforming to a linear layer
      • synchronized BN 、BR
    • LN & IN
      • effective for training sequential models or generative models
      • but have limited success in visual recognition
      • GN能转换成LN/IN
    • WN
      • normalize the filter weights, instead of operating on features
  3. 方法

    • group

      • it is not necessary to think of deep neural network features as unstructured vectors
        • 第一层卷积核通常存在一组对称的filter,这样就能捕获到相似特征
        • 这些特征对应的channel can be normalized together
    • normalization

      • transform the feature x:$\hat x_i = \frac{1}{\sigma}(x_i-\mu_i)$

      • the mean and the standard deviation:

      • the set $S_i$

        • BN:
          • $S_i=\{k|k_C = i_C\}$
          • pixels sharing the same channel index are normalized together
          • for each channel, BN computes μ and σ along the (N, H, W) axes
        • LN
          • $S_i=\{k|k_N = i_N\}$
          • pixels sharing the same batch index (per sample) are normalized together
          • LN computes μ and σ along the (C,H,W) axes for each sample
        • IN
          • $S_i=\{k|k_N = i_N, k_C=i_C\}$
          • pixels sharing the same batch index and the same channel index are normalized together
          • IN computes μ and σ along the (H,W) axes for each sample and each channel
        • GN
          • $S_i=\{k|k_N = i_N, [\frac{k_C}{C/G}]=[\frac{i_C}{C/G}]\}$
          • computes μ and σ along the (H, W ) axes and along a group of C/G channels
      • linear transform

        • to keep representational ability
        • per channel
        • scale and shift:$y_i = \gamma \hat x_i + \beta$

    • relation

      • to LN
        • LN assumes all channels in a layer make “similar contributions”
        • which is less valid with the presence of convolutions
        • GN improved representational power over LN
      • to IN
        • IN can only rely on the spatial dimension for computing the mean and variance
        • it misses the opportunity of exploiting the channel dependence
        • 【QUESTION】BN也没考虑通道间的联系啊,但是计算mean和variance时跨了sample
    • implementation(见下面的sketch)

      • reshape
      • learnable $\gamma$ & $\beta$
      • computable mean & var
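
Group Norm的reshape实现sketch(与论文伪代码思路一致,这里用PyTorch写,函数名为示意):

```python
import torch

def group_norm(x, gamma, beta, G=32, eps=1e-5):
    # x: [N, C, H, W]; gamma/beta: [C] learnable per-channel scale and shift
    N, C, H, W = x.shape
    x = x.view(N, G, C // G, H, W)
    mu = x.mean(dim=(2, 3, 4), keepdim=True)                   # mean over (C/G, H, W) per sample per group
    var = x.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
    x = (x - mu) / torch.sqrt(var + eps)
    x = x.view(N, C, H, W)
    return x * gamma.view(1, C, 1, 1) + beta.view(1, C, 1, 1)  # per-channel linear transform
```
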

  4. 实验

    • GN相比于BN,training error更低,但是val error略高于BN
      • GN is effective for easing optimization
      • loses some regularization ability
      • it is possible that GN combined with a suitable regularizer will improve results
    • 选取不同的group数,所有的group>1均好于group=1(LN)
    • 选取不同的channel数(C/G),所有的channel>1均好于channel=1(IN)
    • Object Detection
      • frozen:因为higher resolution,batch size通常设置为2/GPU,这时的BN frozen成一个线性层$y=\gamma(x-\mu)/\sigma+\beta$,其中的$\mu$和$\sigma$是load了pre-trained model中保存的值,并且frozen掉,不再更新
      • denote as BN*
      • replace BN* with GN during fine-tuning
      • use a weight decay of 0 for the γ and β parameters

WS: Weight Standardization

  1. 动机

    • accelerate training
    • micro-batch:
      • 以BN with large-batch为基准
      • 目前BN with micro-batch及其他normalization methods都不能match这个baseline
    • operates on weights instead of activations
    • 效果
      • match or outperform BN
      • smooth the loss
  2. 论点

    • two facts

      • BN的performance gain与reduction of internal covariate shift没什么关系
      • BN使得optimization landscape significantly smoother
      • 因此our target is to find another technique
        • achieves smooth landscape
        • work with micro-batch
    • normalization methods

      • focus on activations
        • 不展开
      • focus on weights

        • WN:just length-direction decoupling

  3. 方法

    • Lipschitz constants

      • BN reduces the Lipschitz constants of the loss function
      • makes the gradient more Lipschitz
      • BN considers the Lipschitz constants with respect to activations,not the weights that the optimizer is directly optimizing
    • our inspiration

      • standardize the weights也同样能够smooth the landscape
      • 更直接
      • smoothing effects on activations and weights是可以累积的,因为是线性运算
    • Weight Standardization

      • reparameterize the original weights $W$
        • 对卷积层的权重参数做变换,no bias
        • $W \in R^{O * I}$
        • $O=C_{out}$
        • $I=C_{in} \times k_h \times k_w$(kernel size)
      • optimize the loss on $\hat W$
      • compute mean & var on I-dim
      • 只做标准化,无需affine,因为默认后续还要接一个normalization layer对神经元进行refine

    • WS normalizes gradients

      • 拆解:

        • eq5:$W$ to $\dot W$,减均值,zero-centered
        • eq6:$\dot W$ to $\hat W$,除方差,one-varianced
        • eq8:$\delta \hat W$由前一步的梯度normalize得到
        • eq9:$\delta \dot W$也由前一步的梯度normalize
        • 最终用于梯度更新的梯度是zero-centered

    • WS smooths landscape

      • 判定是否smooth就看Lipschitz constant的大小
      • eq5和eq6都能reduce the Lipschitz constant
      • 其中eq5 makes the major improvements
      • eq6 slightly improves,因为计算量不大,所以保留
  4. 实验

    • ImageNet

      • BN的batchsize是64,其余都是1,其余的梯度更新iterations改成64——使得参数更新次数同步
      • 所有的normalization methods加上WS都有提升
      • 裸的normalization methods里面batchsize1的GN最好,所以选用GN+WS做进一步实验
      • GN+WS+AF:加上conv weight的affine会harm

  5. code

# official release
# 放在WSConv2D子类的call里面
kernel_mean = tf.math.reduce_mean(kernel, axis=[0, 1, 2], keepdims=True, name='kernel_mean')
kernel = kernel - kernel_mean
kernel_std = tf.keras.backend.std(kernel, axis=[0, 1, 2], keepdims=True)
kernel = kernel / (kernel_std + 1e-5)

NFNet

发表于 2021-02-22 |

NFNet: High-Performance Large-Scale Image Recognition Without Normalization

  1. 动机

    • NF:
      • normalization-free
      • aims to match the test acc of batch-normalized networks
        • attain new SOTA 86.5%
        • pre-training + fine-tuning上也表现更好89.2%
    • batch normalization
      • 不是完美解决方案
      • depends on batch size
    • non-normalized networks
      • accuracy
      • instabilities:develop adaptive gradient clipping
  2. 论点

    • vast majority models
      • variants of deep residual + BN
      • allow deeper, stable and regularizing
    • disadvantages of batch normalization
      • computational expensive
      • introduces a discrepancy between training & inference behavior & increases parameters/state
      • breaks the independence among samples
    • methods seeks to replace BN
      • alternative normalizers
      • study the origin benefits of BN
      • train deep ResNets without normalization layers
    • key theme when removing normalization
      • suppress the scale of the residual branch
      • simplest way:apply a learnable scalar
      • recent work:suppress the branch at initialization & apply Scaled Weight Standardization,能追上ResNet家族,但是没追上Eff家族
    • our NFNets’ main contributions
      • propose AGC:解决unstable问题,allow larger batch size and stronger augmentatons
      • NFNets家族刷新SOTA:又快又准
      • pretraining + finetuning的成绩也比batch normed models好
  3. 方法

    • Understanding Batch Normalization

      • four main benefits
        • downscale the residual branch:从initialization就保证残差分支的scale比较小,使得网络has well-behaved gradients early in training,从而efficient optimization
        • eliminates mean-shift:ReLU是不对称的,stacking layers以后数据分布会累积偏移
        • regularizing effect:mini-batch作为subset对于全集是有偏的,这种noise可以看作是regularizer
        • allows efficient large-batch training:数据分布稳定所以loss变化稳定,同时大batch更接近真实分布,因此我们可以使用更大的learning rate,但是这个property仅在使用大batch size的时候有效
    • NF-ResNets

      • recovering the benefits of BN:对residual branch进行scale和mean-shift
    • residual block:$h_{i+1} = h_i + \alpha f_i (h_i/\beta_i)$

    • $\beta_i = \sqrt{Var(h_i)}$:用输入的expected std对其进行标准化(方差归一),这是个解析推导的expected value,不是从数据里统计出来的,结构定死就定死了

    • Scaled Weight Standardization & scaled activation

      • 比原版的WS多了一个$\sqrt N$的分母
        • 源码实现中比原版WS还多了learnable affine gain
      • 使得conv-relu以后输出还是标准分布
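        • 按上面的描述,Scaled WS 大致可重建为:$\hat W_{i,j} = \gamma \cdot \frac{W_{i,j} - \mu_{W_{i,\cdot}}}{\sigma_{W_{i,\cdot}} \sqrt N}$,其中 $N$ 为fan-in,$\mu_{W_{i,\cdot}}$、$\sigma_{W_{i,\cdot}}$ 是每行(每个输出channel)权重的均值和标准差,$\gamma$ 为gain(重建公式,符号以论文为准)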

    • $\alpha=0.2$:rescale

    • residual branch上,最终的输出为$\alpha*$标准分布,方差是$\alpha^2$

    • id path上,输出还是$h_{i}$,方差是$Var(h_i)$

      • update这个block输出的方差为$Var(h_{i+1}) = Var(h_i)+\alpha^2$,来更新下一个block的 $\beta$

      • variance reset

        • 每个transition block以后,把variance重新设定为$1+\alpha^2$
        • 在接下来的non-transition block中,用上面的update公式更新expected std
      • 再加上additional regularization(Dropout和Stochastic Depth两种正则手段),就满足了BN benefits的前三条

        • 在batch size较小的时候能够catch up甚至超越batch normalized models
        • 但是large batch size的时候perform worse
      • 对于一个标准的conv-bn-relu,从workflow上看

        • origin:input → free的conv weighting → BN(norm & rescale)→ activation
        • NFNet:input → standard norm → normed weighting & activation → rescale(下面附一个variance bookkeeping的小sketch)
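上面的variance bookkeeping可以用一小段代码示意(纯示意,stage配置和初始variance均为假设,按本文描述推演):

alpha = 0.2

def expected_betas(blocks_per_stage=(1, 2, 6, 3)):
    # analytically track the expected variance of each block's input;
    # beta_i = sqrt(expected Var(h_i)), used to downscale the residual branch input
    betas = []
    var = 1.0  # assume unit variance entering the first stage
    for num_blocks in blocks_per_stage:
        for i in range(num_blocks):
            betas.append(var ** 0.5)
            if i == 0:
                var = 1.0 + alpha ** 2   # transition block: variance reset
            else:
                var += alpha ** 2        # non-transition: Var(h_{i+1}) = Var(h_i) + alpha^2
    return betas

print(expected_betas())  # the per-block beta values a fixed architecture would use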
    • Adaptive Gradient Clipping for Efficient Large-Batch Training

      • 梯度裁剪:

        • clip by norm:用一个clipping threshold $\lambda$ 进行rescale,training stability was extremely sensitive to 超参的选择,settings(model depth、batch size、learning rate)一变超参就要重新调
        • clip by value:用一个clipping value进行上下限截断
      • AGC

        • given 某层的权重 $W \in R^{N \times M}$ 和对应梯度 $G \in R^{N \times M}$
        • ratio $\frac{||G||_F}{||W||_F}$ 可以看作是梯度变化大小的measurement
        • 所以我们直观地想到对这个ratio进行限幅:所谓的adaptive就是在梯度裁剪时不对所有梯度一刀切,而是考虑其对应权重的大小,从而进行更合理的调节
          • 实验中发现unit-wise的gradient norm要比layer-wise的好:每个unit对应weight matrix的一行,对conv weights来说就是一个output channel(fan-in为 $h \times w \times C_{in}$)
        • scalar hyperparameter $\lambda$
          • the optimal value may depend on the choice of optimizer, learning rate and batch size
          • empirically we found $\lambda$ should be smaller for larger batches
        • ablations for AGC
          • 用pre-activation NF-ResNet-50 和 NF-ResNet-200 做实验,batch size从256到4096,学习率从0.1开始随batch size线性增长,超参$\lambda$的取值见右图
          • 左图结论1:batch size较小时,NF-Nets能够追上甚至超越normed models的精度,但batch size一大(2048以上)情况就恶化;加了AGC的NF-Nets则能maintain performance comparable or better than batch-normalized models
          • 左图结论2:the benefits of using AGC are smaller when the batch size is small
          • 右图结论1:超参$\lambda$取值较小时,对梯度的clipping更strong,这对大batch size训练的稳定性非常重要
        • whether or not AGC is beneficial for all layers
          • it is always better to not clip the final linear layer
          • 最开始的卷积不做梯度裁剪也能稳定训练
          • 最终we apply AGC to every layer except for the final linear layer
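一个unit-wise AGC的minimal sketch(TensorFlow,shape约定、函数名、eps取值均为假设,不是official实现):

import tensorflow as tf

def adaptive_gradient_clip(grad, weight, clip_lambda=0.01, eps=1e-3):
    # unit-wise norms: one Frobenius norm per output unit (per row / per output channel)
    w = tf.reshape(weight, [-1, weight.shape[-1]])
    g = tf.reshape(grad, [-1, grad.shape[-1]])
    w_norm = tf.maximum(tf.norm(w, axis=0), eps)   # guard against (near-)zero weights
    g_norm = tf.norm(g, axis=0)
    max_norm = clip_lambda * w_norm
    # rescale only the units whose gradient norm exceeds lambda * ||W_i||_F
    scale = tf.where(g_norm > max_norm, max_norm / (g_norm + 1e-6), tf.ones_like(g_norm))
    return grad * scale                            # broadcast over the last (output) axis

# usage sketch: apply to every (grad, weight) pair from tape.gradient(...), skipping the final linear layer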
    • Normalizer-Free Architectures

      • begin with SE-ResNeXt-D model
      • about group width
        • set group width to 128
        • the reduction in compute density means that 只减少了理论上的FLOPs,没有实际加速
      • about stages
        • R系列模型加深时是非线性增长,疯狂叠加stage3的block数,因为这一层resolution不大,channel也不是最多,兼顾了两侧的计算量
        • 我们给F0设置为[1,2,6,3],然后在deeper variants中对每个stage的block数乘以一个scalar N线性增长
      • about width
        • 仍旧对stage3下手,width设为[256, 512, 1536, 1536]
        • roughly preserves the training speed
        • 一个论点:stage3 is the best place to add capacity,因为它deep enough、have access to deeper levels of the feature hierarchy,同时又比最后一个stage有slightly higher resolution
      • about block
        • 实验发现最有用的操作是adding an additional 3x3 grouped conv after the first
        • overview
      • about scaling variants
        • eff系列采用的是R、W、D一起增长,因为eff的block比较轻量
        • 但是对R系列来说,只增长D和R就够了
      • 补充细节
        • 在inference阶段使用比训练阶段slightly higher resolution
        • 随着模型加大increase the regularization strength:
          • scale the drop rate of Dropout
          • 调整stochastic depth rate和weight decay则not effective
        • se-block的scale乘个2
        • SGD params:
          • Nesterov=True, momentum=0.9, clipnorm=0.01
          • lr:先warmup再余弦退火(cosine annealing):increase from 0 to 1.6 over 5 epochs, then decay to zero
      • summary
        • 总结来说,就是拿来一个SE-ResNeXt-D
        • 先做结构上的调整:modified width and depth patterns、a second spatial convolution,以及drop rate和resolution
        • 再做对梯度的调整:除了最后一个线性分类层以外,全部使用AGC,$\lambda=0.01$
        • 最后是训练上的trick:strong regularization and data augmentation

      • detailed view of NFBlocks

        • transition block:有下采样的block
          • 残差branch上,bottleneck的narrow ratio是0.5
          • 每个stage的3x3 conv的group width永远是128,而group数目是在随着block width变的
          • skip path接在 $\beta$ downscaling 之后
          • skip path上是avg pooling + 1x1 conv
        • non-transition block:无下采样的block

          • bottleneck-ratio仍旧是0.5
          • 3x3conv的group width仍旧是128
          • skip path接在$\beta$ downscaling 之前
          • skip path就是id

  4. 实验

repVGG

发表于 2021-02-09 |

RepVGG: Making VGG-style ConvNets Great Again

  1. 动机

    • plain ConvNets
      • simply efficient but poor performance
    • propose a CNN architecture RepVGG
      • 能够decouple为training-time和inference-time两个结构
      • 通过structure re-paramterization technique
      • inference-time architecture has a VGG-like plain body
    • faster
      • 83% faster than ResNet-50 or 101% faster than ResNet-101
    • accuracy-speed trade-off
      • reaches over 80% top-1 accuracy
      • outperforms ResNets by a large margin
    • verify on classification & semantic segmentation tasks
  2. 论点

    • well-designed CNN architectures

      • Inception,ResNet,DenseNet,NAS models
      • deliver higher accuracy
      • drawbacks
        • multi-branch designs:slow down inference and reduce memory utilization,对高并行化的设备不友好
        • some components:depthwise & channel shuffle,increase memory access cost
      • MAC(memory access cost) constitutes a large time usage in groupwise convolution:我的groupconv实现里cardinality维度上计算不并行
      • FLOPs并不能precisely reflect actual speed,一些结构看似比old fashioned VGG/resnet的FLOPs少,但实际并没有快
    • multi-branch

      • 通常multi-branch model要比plain model表现好
      • 因为makes the model an implicit ensemble of numerous shallower models
      • so that avoids gradient vanishing
      • benefits are all for training
      • drawbacks are undesired for inference
    • the proposed RepVGG

      • advantages
        • plain architecture:no branches
        • 3x3 conv & ReLU组成
        • 没有过重的人工设计痕迹
      • training time use identity & 1x1 conv branches
      • at inference time

        • identity 可以看做degraded 1x1 conv
        • 1x1 conv 可以看做degraded 3x3 conv
        • 最终整个conv-bn branches能够整合成一个3x3 conv
        • inference-time model只包含conv和ReLU:没有max pooling!!
        • fewer memory units:分支会占内存,直到分支计算结束,plain结构的memory则是immediately released

  3. 方法

    • training-time

      • ResNet-like block
        • id + 1x1 conv + 3x3 conv multi-branches
        • use BN in each branch
        • with n blocks, the model can be interpreted as an ensemble of $3^n$ models
        • stride2的block没有identity path:identity branch只在输入输出shape相同(stride=1 且 $C_{in}=C_{out}$)时使用,下采样block只保留1x1和3x3两个分支
      • simply stack serveral blocks to construct the training model
    • inference-time

      • re-param

        • inference-time BN也是一个线性计算
        • 两个1x1 conv(真实的1x1分支,以及视作degraded 1x1的identity分支)都可以zero-pad成只有中心位置非零的3x3 kernel,一个带学习到的权重、一个是固定的单位权重(融合过程的sketch见下)
        • 要求各branch has the same strides & padding pixel要对齐
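上面的融合过程可以用一小段numpy sketch表示(假设kernel layout为[kh, kw, C_in, C_out],bn参数按(gamma, beta, moving_mean, moving_var)打包;仅为示意,不是官方转换代码):

import numpy as np

def fuse_conv_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    # conv followed by BN == conv with a rescaled kernel plus a bias
    scale = gamma / np.sqrt(var + eps)              # shape [C_out]
    return kernel * scale, beta - mean * scale

def reparam_block(k3, bn3, k1, bn1, bn_id=None):
    kernel, bias = fuse_conv_bn(k3, *bn3)           # 3x3 branch
    k1_fused, b1 = fuse_conv_bn(k1, *bn1)           # 1x1 branch, kernel shape [1, 1, C_in, C_out]
    kernel[1:2, 1:2] += k1_fused                    # place the 1x1 kernel at the centre of the 3x3
    bias += b1
    if bn_id is not None:                           # identity branch only exists when stride=1, C_in == C_out
        c = k3.shape[2]
        k_id = np.zeros_like(k3)
        for i in range(c):
            k_id[1, 1, i, i] = 1.0                  # identity viewed as a degraded conv
        k_id, b_id = fuse_conv_bn(k_id, *bn_id)
        kernel += k_id
        bias += b_id
    return kernel, bias                             # load these into a single plain 3x3 conv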

      • architectural specification

        • variety:depth and width
        • does not use maxpooling:只有一种operator:3x3 conv+relu
        • head:GAP + fc / task specific
        • 5 stages
          • 第一个stage处理high resolution,stride2
          • 第五个stage shall have more channels,所以只用一层,save parameters
          • 给倒数第二个stage最多层,考虑params和computation的balance
        • RepVGG-A:[1,2,4,14,1],用来compete against轻量和中量级model
        • RepVGG-B:deeper in s2,3,4,[1,4,6,16,1],用来compete against high-performance ones

        • basic width:[64, 128, 256, 512]

          • width multiplier a & b
          • a控制前4个stage宽度,b控制最后一个stage
          • [64a, 128a, 256a, 512b]
          • 第一个stage的宽度只接受变小不接受变大,因为大resolution影响计算量,取min(64, 64a)(stage宽度的计算见本节末尾的小例子)

        • further reduce params & computation

          • groupwise 3x3 conv
          • 跳着层换:从第3层开始,每隔一层使用groupwise conv(第3、5、7、…层)
          • number of groups:1,2,4 globally
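按上面的width规则,可以用一个小函数算出每个stage的宽度(示意代码,函数名和示例的a、b取值均为假设):

def repvgg_stage_widths(a, b):
    # stage widths follow [min(64, 64a), 64a, 128a, 256a, 512b]
    return [min(64, int(64 * a)), int(64 * a), int(128 * a), int(256 * a), int(512 * b)]

print(repvgg_stage_widths(0.75, 2.5))   # example multipliers -> [48, 48, 96, 192, 1280]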
  4. 实验

    • 分支的作用

    • 结构上的微调

      • id path去掉BN
      • 把所有的BN移动到add的后面
      • 每个path加上relu

    • ImageNet分类任务上对标其他模型

      • simple augmentation

      • strong:Autoaugment, label smoothing and mixup
