Less is More



LV-ViT

发表于 2021-05-21 |

[LV-ViT 2021] Token Labeling: Training an 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet,新加坡国立&字节,主体结构还是ViT,deeper+narrower+multi-layer-cnn-patch-projection+auxiliary label&loss

同等参数量下,能够达到与CNN相当的分类精度

  • 26M——84.4% ImageNet top1 acc
  • 56M——85.4% ImageNet top1 acc
  • 150M——86.2% ImageNet top1 acc

ImageNet & ImageNet-1k:The ImageNet dataset consists of more than 14M images, divided into approximately 22k different labels/classes. However the ImageNet challenge is conducted on just 1k high-level categories (probably because 22k is just too much)

Token Labeling: Training an 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet

  1. 动机

    • develop a bag of training techniques on vision transformers
    • slightly tune the structure
    • introduce token labeling——a new training objective
    • ImageNet classification task
  2. 论点

    • former ViTs
      • 主要问题就是需要大数据集pretrain,不然精度上不去
      • 然后模型也比较大,need huge computation resources
      • DeiT和T2T-ViT探索了data augmentation/引入additional token,能够在有限的数据集上拉精度
    • our work
      • rely on purely ImageNet-1k data
      • rethink the way of performing patch embedding
      • introduce inductive bias
      • we add a token labeling objective loss besides the cls token prediction
      • provide practical advice on adjusting vision transformer structures
  3. 方法

    • overview & comparison

      • 主体结构不变,就是增加了两项
      • a MixToken method
      • a token labeling objective

    • review the vision transformer

      • patch embedding
        • 将固定尺寸的图片转换成patch sequence,例如224x224的图片,patch size=16,那就是14x14个small patches
        • 将每个patch(16x16x3=768-dim) linear project成一个token(embedding-dim)
        • concat a class token,构成全部的input tokens
      • position encoding
        • added to input tokens
        • fixed sinusoidal / learnable
      • multi-head self-attention
        • 用来建立long-range dependency
        • multi-heads:所有attention heads的输出在channel-dim上concat,然后linear project回单个head的channel-dim
      • feed-forward layers
        • fc1-activation-fc2
      • score prediction layer
        • 只用了cls token对应的输出embedding,其他的discard
    • training techniques

      • network depth

        • add more transformer blocks
        • 同时decrease the hidden dim of FFN
      • explicit inductive bias

        • CNN逐步扩大感受野,擅长提取局部特征,具有天然的平移不变性等
        • transformer被发现failed to capture the low-level and local structures
        • we use convolutions with a smaller stride to provide overlapped information for nearby tokens
        • 在patch embedding的时候不是independent crop,而是有overlap
        • 然后用多层conv,逐步扩大感受野,smaller kernel size同时降低了计算量
      • rethinking residual connection

        • 给残差分支add a smaller ratio $\alpha$

        • enhance the residual connection since less information will go to the residual branch

        • improve the generalization ability

      • re-labeling

        • label is not always accurate after cropping
        • situations are worse on smaller images

        • re-assign each image with a K-dim score map,在1k类数据集上K=1000

        • cheap operation compared to teacher-student
        • 这个label是针对whole image的label,是通过另一个预训练模型获取
      • token-labeling

        • based on the dense score map provided by re-labeling,we can assign each patch an individual label
        • auxiliary token labeling loss
          • 每个token都对应了一个K-dim score map
          • 可以计算一个ce
        • given
          • outputs of the transformer $[X^{cls}, X^1, …, X^N]$
          • K-dim score map $[y^1, y^2, …, y^N]$
          • whole image label $y^{cls}$
        • loss
          • auxiliary token labeling loss:$L_{aux} = \frac{1}{N} \sum_{i=1}^N CE(X^i, y^i)$
          • cls loss:$L_{cls} = CE(X^{cls}, y^{cls})$
          • total loss:$L_{total} = L_{cls}+\beta L_{aux}$,$\beta=0.5$(a code sketch follows at the end of this section)
      • MixToken

        • 从Mixup&CutMix启发来的
        • 为了确保each token have clear content,我们基于token embedding进行mixup
        • given
          • token sequence $T_1=[t^1_1, t^2_1, …, t^N_1]$ & $T_2=[t^1_2, t^2_2, …, t^N_2]$
          • token labels $Y_1=[y^1_1, y^2_1, …, y^N_1]$ & $Y_2=[y^1_2, y^2_2, …, y^N_2]$
          • binary mask M
        • MixToken
          • mixed token sequence:$\hat T = T_1 \odot M + T_2 \odot (1-M)$
          • mixed labels:$\hat Y = Y_1 \odot M + Y_2 \odot (1-M)$
          • mixed cls label:$\hat {Y^{cls}} = \overline M y_1^{cls} + (1-\overline M) y_2^{cls}$,$\overline M$ is the average of $M$
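
下面是MixToken和token labeling loss的一个minimal PyTorch-style sketch(非官方实现,shape和函数名`mix_token`/`lvvit_loss`均为假设;论文中的mask是在token网格上按block生成的2D mask,这里简化成一个[N]维binary mask):

```python
import torch
import torch.nn.functional as F

def mix_token(tokens1, tokens2, labels1, labels2, cls_label1, cls_label2, mask):
    # tokens*: [B, N, D] patch tokens after patch embedding; labels*: [B, N, K] dense token labels
    # cls_label*: [B, K] whole-image (soft) labels; mask: [N] binary, 1 -> take from sample 1
    m = mask.view(1, -1, 1).float()
    mixed_tokens = tokens1 * m + tokens2 * (1 - m)                 # \hat T
    mixed_labels = labels1 * m + labels2 * (1 - m)                 # \hat Y
    m_bar = mask.float().mean()
    mixed_cls = m_bar * cls_label1 + (1 - m_bar) * cls_label2      # \hat Y^{cls}
    return mixed_tokens, mixed_labels, mixed_cls

def lvvit_loss(cls_logits, token_logits, cls_label, token_labels, beta=0.5):
    # soft-label cross entropy: -sum(y * log_softmax(x))
    def soft_ce(logits, target):
        return -(target * F.log_softmax(logits, dim=-1)).sum(-1).mean()
    l_cls = soft_ce(cls_logits, cls_label)                                        # L_cls
    l_aux = soft_ce(token_logits.flatten(0, 1), token_labels.flatten(0, 1))       # L_aux over all N tokens
    return l_cls + beta * l_aux                                                   # L_total
```
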
  4. 实验

    • training details

      • AdamW
      • linear lr scaling:larger when using token labeling
      • weight decay
      • dropout:hurts small models,use Stochastic Depth instead

    • Training Technique Analysis

      • more convs in patch embedding

      • enhanced residual

        • smaller scaling factor

          • the weight get larger gradients in residual branch
          • more information can be preserved in main branch
          • better performance
          • faster convergence

      • re-labeling

        • use NFNet-F6 to re-label the ImageNet dataset and obtain the 1000-dimensional score map for each image
        • NFNet-F6 is trained from scratch
        • given input 576x576,获得的score map是18x18x1000(s32)
        • store the top5 probs for each position to save storage
      • MixToken

        • 比baseline的CutMix method要好
        • 同时看到token labeling比relabeling要好

      • token labeling

        • relabeling是在whole image上
        • token labeling是进一步地,在token level添加label和loss
      • augmentation techniques

        • 发现MixUp会hurt

      • Model Scaling

        • 越大越好

memory bank

发表于 2021-05-19 |
  • 2018年的paper
  • official code:https://github.com/zhirongw/lemniscate.pytorch
  • memory bank
  • NCE

Unsupervised Feature Learning via Non-Parametric Instance Discrimination

  1. 动机

    • unsupervised learning
      • can we learn good feature representation that captures apparent similarity among instances instead of classes
      • formulate a non-parametric classification problem at instance-level
      • use noise contrastive estimation
    • our non-parametric model
      • highly compact:128-d feature per image,only 600MB storage in total
      • enable fast nearest neighbour retrieval
    • 【QUESTION】无类别标签,单靠similarity,最终的分类模型是如何建立的?
    • verified on
      • ImageNet 1K classification
      • semi-supervised learning
      • object detection tasks
  2. 论点

    • observations
      • ImageNet top-5 err远比top-1 err小
      • second highest responding class is more likely to be visually related
      • 说明模型隐式地学到了similarity
      • apparent similarity is learned not from semantic annotations, but from the visual data themselves
    • 将class-wise supervision推到一个极限
      • 就变成了instance-level
      • 类别数变成了the whole training set:softmax to many more classes becomes infeasible
        • approximate the full softmax distribution with noise-contrastive estimation(NCE)
        • use a proximal regularization to stabilize the learning process
    • train & test
      • 通常的做法是learned representations加一个线性分类器
      • e.g. SVM:但是train和test的feature space是不一致的
      • 我们用了KNN:same metric space
  3. 方法

    • overview

      • to learn an embedding function $f_{\theta}$
      • distance metric $d_{\theta}(x,y) = ||f_{\theta}(x)-f_{\theta}(y)||$
      • to map visually similar images closer
      • instance-level:to distinct between instances

    • Non-Parametric Softmax Classifier

      • common parametric classifier

        • given网络预测的N-dim representation $v=f_{\theta}(x)$
        • 要预测C-classes的概率,需要一个$W \in R^{C \times N}$的projection:$P(i|v) = \frac{\exp (W^T_iv)}{\sum_j \exp (W^T_jv)}$
      • Non-Parametric version

        • enforce $||v||=1$ via L2 norm
        • replace $W^T$ with $v^T$
        • then the probability:$P(i|v) = \frac{exp (v^T_iv/\tau)}{\sum exp (v^T_jv / \tau)}$
        • temperature param $\tau$:controls the concentration level of the distribution
        • the goal is to minimize the negative log-likelihood

        • 意义:L2 norm将所有的representation映射到了一个128-d unit sphere上面,$v_i^T v_j$度量了两个projection vec的similarity,我们希望同类的vec尽可能重合,不同类的vec尽可能正交

          • class weights $W$ are not generalized to new classes
          • but feature representations $V$ does
      • memory bank

        • 因为是instance level,C-classes对应整个training set,也就是说$\{v_i\}$ for all the images are needed for loss
        • Let $V=\{v_i\}$ 表示memory bank,初始为unit random vectors
        • every learning iterations
          • $f_\theta$ is optimized by SGD
          • 输入$x_i$所对应的$f_i$更新到$v_i$上
          • 也就是只有mini-batch中包含的样本,在这一个step,更新projection vec
    • Noise-Contrastive Estimation

      • non-parametric softmax的计算量随着样本量线性增长,millions level样本量的情况下,计算太heavy了

      • we use NCE to approximate the full softmax

      • assume

        • noise samples的uniform distribution:$P_n =\frac{1}{n}$
        • noise samples are $m$ times frequent than data samples
      • 那么sample $i$ matches vec $v$的后验概率是:$h(i,v)=\frac{P(i|v)}{P(i|v)+mP_n}$

        • approximated training object is to minimize the negative log-likelihood of $h(i,v)$
      • normalizing constant $Z$的近似

        • 主要就是分母这个$Z_i$的计算比较heavy,我们用Monte Carlo采样来近似:

        • $\{j_k\}$ is a random subset of indices:随机抽了memory bank的一个子集来approx全集的分母,实验发现取batch size大小的子集就可以,m=4096

    • Proximal Regularization

      • the learning process oscillates a lot

        • we have one instance per class
        • during each training epoch each class is only visited once
      • we introduce an additional term

        • overall workflow:在每一个iteration t,feature representation是$v_i^t=f_{\theta}(x_i)$,而memory bank里面的representations来自上一个iteration step $V={v^{t-1}}$,我们从memory bank里面采样,并计算NCE loss,然后bp更新网络权重,然后将这一轮fp的representations update到memory bank的指定样本上,然后下一轮
        • 可以发现,在初始random阶段,梯度更新会比较快而且不稳定
        • 我们给positive sample的loss上额外加了一个$\lambda ||v_i^t-v_i^{t-1}||^2_2$,有点类似weight decay那种东西,开始阶段l2 loss会占主导,引导网络收敛

        • stabilize

        • speed up convergence
        • improve the learned representations
    • Weighted k-Nearest Neighbor Classifier

      • at test time,先计算feature representation,然后跟memory bank的vectors分别计算cosine similarity $s_i=\cos(v_i, f)$,选出topk neighbours $N_k$,然后进行weighted voting
      • weighted voting:
        • 对每个class c,计算它在topk neighbours的total weight,$w_c =\sum_{i \in N_k} \alpha_i 1(c_i=c)$
        • $\alpha_i = \exp(s_i/\tau)$
      • k = 200
      • $\tau = 0.07$(a code sketch of the memory bank & weighted kNN follows this section)
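
下面是memory bank + non-parametric softmax的一个简化sketch(full softmax版本,省略了NCE近似和proximal term;类名和接口均为示意,非官方代码):

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    def __init__(self, num_samples, dim=128, tau=0.07):
        self.bank = F.normalize(torch.randn(num_samples, dim), dim=1)  # init: unit random vectors
        self.tau = tau

    def loss(self, features, indices):
        # features: [B, dim] L2-normalized f_theta(x_i); indices: [B] instance ids (the "classes")
        logits = features @ self.bank.t() / self.tau      # v_i^T v / tau against every stored vector
        return F.cross_entropy(logits, indices)           # -log P(i | v)

    @torch.no_grad()
    def update(self, features, indices):
        # only the samples in the current mini-batch get their stored vector replaced
        self.bank[indices] = F.normalize(features, dim=1)

    @torch.no_grad()
    def weighted_knn(self, feature, bank_labels, k=200):
        # test time: cosine similarity against the whole bank, then weighted voting over top-k
        sims = self.bank @ feature                          # [num_samples]
        topk_sim, topk_idx = sims.topk(k)
        weights = (topk_sim / self.tau).exp()               # alpha_i = exp(s_i / tau)
        scores = torch.zeros(int(bank_labels.max()) + 1)
        scores.index_add_(0, bank_labels[topk_idx], weights)  # w_c = sum_i alpha_i * 1(c_i = c)
        return scores.argmax()
```

test time不需要再训练线性分类器,直接用`weighted_knn`在同一个metric space里投票,对应上面的weighted kNN classifier。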

MoCo系列

发表于 2021-04-30 |

papers:

[2019 MoCo v1] Momentum Contrast for Unsupervised Visual Representation Learning,kaiming

[2020 SimCLR] A Simple Framework for Contrastive Learning of Visual Representations,Google Brain,混进来是因为它improve based on MoCo v1,而MoCo v2/v3又都是基于它改进

[2020 MoCo v2] Improved Baselines with Momentum Contrastive Learning,kaiming

[2021 MoCo v3] An Empirical Study of Training Self-Supervised Visual Transformers,kaiming

preview: 自监督学习 Self-supervised Learning

  1. reference:https://ankeshanand.com/blog/2020/01/26/contrative-self-supervised-learning.html

  2. overview

    • 就是无监督
    • 针对的痛点(有监督训练模型)
      • 标注成本高
      • 迁移性差
    • 会基于数据特点,设置Pretext tasks(最常见的任务就是生成/重建),构造Pseudo Labels来训练网络
    • 通常模型用来作为其他学习任务的预训练模型
    • 被认为是用来学习图像的通用视觉表示
  3. methods

    • 从结构上区分主要就是两大类方法

      • 生成式:通过encoder-decoder结构还原输入,监督信号是输入输出尽可能相似
        • 重建任务开销大
        • 没有建立直接的语义学习
        • 外加GAN的判别器使得任务更加复杂难训
      • 判别式:输入两张图片,通过encoder编码,监督信号是判断两张图是否相似,判别式模型也叫Contrastive Learning

    • 从Pretext tasks上划分主要分为三类

      • 基于上下文(Context based) :如bert的MLM,在句子/图片中随机扣掉一部分,然后推动模型基于上下文/语义信息预测这部分/相对位置关系
      • 基于时序(Temporal Based):如bert的NSP,视频/语音,利用相邻帧的相似性,构建不同排序的序列,判断B是否是A的下一句/是否相邻帧
      • 基于对比(Contrastive Based):比较正负样本,最大化相似度的loss在这里面被叫做InfoNCE
  4. memory-bank

    • Contrastive Based方法最常见的方式是在一个batch中构建正负样本进行对比学习
      • end-to-end
      • 每个mini-batch中的图像增强前后的两张图片互为正样本
      • 字典大小就是minibatch大小
    • memory bank包含数据集中所有样本编码后特征
      • 随机采样一部分作为keys
      • 每个迭代只更新被采样的样本编码
      • 因为样本编码来自不同的training step,一致性差
    • MoCo

      • 动态编码库:out-of-date的编码出列
      • momentum update:一致性提升

  5. InfoNCE

    • deep mind在CPC(Contrastive Predictive Coding)提出,论文以后有机会再展开

      • unsupervised
      • encoder:encode x into latent space representations z,resnet blocks
      • autoregressive model:summarize each time-step set of {z} into a context representation c,GRUs
      • probabilistic contrastive loss

        • Noise-Contrastive Estimation
        • Importance Sampling

    • 训练目标是输入数据x和context vector c之间的mutual information

      • 每次从$p(x_{t+k}|c_t)$中采样一个正样本:正样本是这个序列接下来预测的东西,和c的相似性肯定要高于不相干的token
      • 从$p(x_{t+k})$中采样N-1个负样本:负样本是别的序列里面随机采样的东西
      • 目标是让正样本与context相关性高,负样本低(a generic InfoNCE sketch follows)
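
一个generic的InfoNCE sketch(仅示意,不是CPC的原始实现;`context`/`positive`/`negatives`的命名和shape均为假设):

```python
import torch
import torch.nn.functional as F

def info_nce(context, positive, negatives, tau=0.1):
    # context: [B, D], positive: [B, D], negatives: [B, M, D]
    pos = (context * positive).sum(-1, keepdim=True)            # [B, 1] score with the positive
    neg = torch.einsum('bd,bmd->bm', context, negatives)        # [B, M] scores with the negatives
    logits = torch.cat([pos, neg], dim=1) / tau                 # the positive sits at index 0
    target = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, target)                      # -log( e^{pos} / sum e^{all} )
```
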

MoCo v1: Momentum Contrast for Unsupervised Visual Representation Learning

  1. 动机

    • unsupervised visual representation learning
    • contrastive learning
    • dynamic dictionary

      • large
      • consistent
    • verified on

      • 7 down-stream tasks
      • ImageNet classification
      • VOC & COCO det/seg
  2. 论点

    • Unsupervised representation learning

      • highly successful in NLP,in CV supervised is still the mainstream
      • 两个核心
        • pretext tasks
        • loss functions
      • loss functions
        • 生成式方法的loss是基于prediction和一个fix target来计算的
        • contrastive-based的key target则是vary on-the-fly during training
        • Adversarial losses没展开
      • pretext tasks
        • tasks involving recover:auto-encoder
        • task involving pseudo-labels:通常有个exemplar/anchor,然后计算contrastive loss
      • contrastive learning VS pretext tasks
        • 大量pretext tasks可以通过设计一些contrastive loss来实现
    • recent approaches using contrastive loss

      • dynamic dictionaries
        • 由keys组成:sampled from data & represented by an encoder
      • train the encoder to perform dictionary look-up
        • given an encoded query
        • similar to its matching key and dissimilar to others
    • desirable dictionary

      • large:better sample
      • consistent:training target consistent
    • MoCo:Momentum Contrast

      • queue
      • 每个it step的mini-batch的编码入库
      • the oldest are dequeued
      • EMA:

        • a slowly progressing key encoder
        • momentum-based moving average of the query encoder

      • similar的定义:q & k are from the same image

  3. 方法

    • contrastive learning

      • a encoded query $q$
      • a set of encoded samples $\{k_0, k_1, …\}$
      • assume:there is a single key $k_+$ in the dictionary that $q$ matches
      • similarity measurement:dot product
      • InfoNCE:
        • $L_q = -\log \frac{\exp(q \cdot k_+/\tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i/\tau)}$
        • 1 positive & K negative samples
        • 本质上是个softmax-based classifier,尝试将$q$分类成$k_+$
      • unsupervised workflow
        • with encoder networks $f_q$ & $f_k$
        • thus we have query & sample representation $q=f_q(x^q)$ & $k=f_k(x^k)$
        • inputs $x$ can be images/patches/context(patches set)
        • $f_q$ & $f_k$ can be identical/partially shared/different
    • momentum contrast

      • dictionary as a queue

        • the dictionary always represents a sampled subset of all data
        • the current mini-batch入列
        • the oldest mini-batch出列
      • momentum update

        • large dictionary没法对keys进行back-propagation:因为sample太多了

        • only $f_q$ are updated by back-propagation:mini-batch

        • naive solution:copy $f_q$的参数给$f_k$,yields poor results,因为key encoder参数变化太频繁了,representation inconsistent issue

        • momentum update:$f_k = mf_k + (1-m)f_q$,$m=0.999$

        • 三种更新方式对比

          • 第一种end-to-end method:
            • use samples in current mini-batch as the dictionary
            • keys are consistently encoded
            • dictionary size is limited
          • 第二种memory bank
            • A memory bank consists of the representations of all samples in the dataset
            • the dictionary for each mini-batch is randomly sampled from the memory bank,不进行bp,thus enables large dictionary
            • key representation is updated when it was last seen:inconsistent
            • 有些也用momentum update,但是是用在representation上,而不是encoder参数
      • pretext task

        • define positive pair:if the query and the key come from the same image
        • 我们从图上take two random views under random augmentation to form a positive pair
        • 然后用各自的encoder编码成q & k
        • 每一对计算similarity:pos similarity
        • 然后再计算input queries和dictionary的similarity:neg similarity
        • 计算ce,update $f_q$
        • 用$f_q$ update $f_k$
        • 把k加入dictionary队列
        • 把最早的mini-batch出列

        • 技术细节

          • resnet:last fc dim=128,L2 norm
          • temperature $\tau=0.07$
          • augmentation
            • random resize + random(224,224) crop
            • random color jittering
            • random horizontal flip
            • random grayscale conversion
          • shuffling BN
            • 实验发现使用resnet里面的BN会导致不好的结果:猜测是intra-batch communication引导模型学习了一种cheating的low-loss solution
            • 具体做法是给$f_k$的输入mini-batch先shuffle the order,然后进行fp,然后再shuffle back,这样$f_q$和$f_k$的BN计算的mini-batch statistics就不同了(a sketch of the full training step follows this section)
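
按上面的流程,MoCo v1的一个training step大致如下(根据论文伪代码改写的PyTorch-style sketch;省略了shuffling BN和多卡细节,`x_q`/`x_k`是同一批图片的两个random views):

```python
import torch
import torch.nn.functional as F

def moco_step(f_q, f_k, queue, x_q, x_k, optimizer, m=0.999, tau=0.07):
    # x_q / x_k: two randomly augmented views of the same mini-batch of images
    q = F.normalize(f_q(x_q), dim=1)                 # [B, dim] queries
    with torch.no_grad():
        k = F.normalize(f_k(x_k), dim=1)             # [B, dim] keys, no gradient through f_k
    l_pos = (q * k).sum(-1, keepdim=True)            # [B, 1] positive logits
    l_neg = q @ queue.t()                            # [B, K] negative logits against the queue
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positives are class 0
    loss = F.cross_entropy(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # only f_q is updated by back-propagation

    with torch.no_grad():
        # momentum update of the key encoder: theta_k <- m * theta_k + (1 - m) * theta_q
        for p_k, p_q in zip(f_k.parameters(), f_q.parameters()):
            p_k.mul_(m).add_(p_q, alpha=1 - m)
        # enqueue the new keys, dequeue the oldest mini-batch
        queue = torch.cat([queue[k.size(0):], k], dim=0)
    return loss.item(), queue
```
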
  4. 实验

SimCLR: A Simple Framework for Contrastive Learning of Visual Representations

  1. 动机

    • simplify recently proposed contrastive self-supervised learning algorithms
    • systematically study the major components
      • data augmentations
      • learnable nonlinear projection head
      • larger batch size and more training steps
    • outperform previous self-supervised & semi-supervised learning methods on ImageNet

  2. 论点

    • discriminative approaches based on contrastive learning

      • maximizing agreement between differently augmented views of the same data sample
      • via a contrastive loss in the latent space
    • major components & conclusions

      • 数据增强很重要,unsupervised比supervised benefits more
      • 引入的learnable nonlinear transformation提升了representation quality
      • contrastive cross entropy loss受益于normalized embedding和adjusted temperature parameter
      • larger batch size and more training steps很重要,unsupervised比supervised benefits more
  3. 方法

    • common framework

      • 4 major components

        • 随机数据增强
          • results in two views of the same sample,构成positive pair
          • crop + resize back + color distortions + gaussian blur
        • base encoder
          • 用啥都行,本文用了resnet including the GAP
        • a projection head
          • 将representation映射到the space where contrastive loss is applied(一个低维的latent space,例如128-d)
          • 之前有方法直接用linear projection
          • 我们用了带一个hidden layer的MLP:fc-bn-relu-fc
        • a contrastive loss
      • overall workflow

        • random sample a minibatch of N
        • random augmentation results in 2N data points
        • 对每个样本来讲,有1个positive pair,其余2(N-1)个data points都是negative samples
        • set cosine similarity $sim(u,v)=u^Tv/|u||v|$
        • given positive pair $(i,j)$ then the loss is $l_{i,j} = -log \frac{exp(s_{i,j}/\tau)}{\sum_{k\neq i}^{2N} exp(s_{i,k}/\tau)}$
        • 对每个positive pair都计算,包括$(i,j)$和$(j,i)$,即所谓的symmetrized loss(see the sketch after this list)
        • update encoder

    • training with large batch size

      • batch 8192,negatives 16382
      • 大batch时,linear learning rate scaling可能不稳定,所以用了LARS optimizer
      • global BN,aggregate BN mean & variance over all devices
      • TPU
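
SimCLR的symmetrized NT-Xent loss的一个sketch(示意实现,假设batch内2N个view按(2i, 2i+1)成对排列;LARS、global BN、projection head都不在这个函数里):

```python
import torch
import torch.nn.functional as F

def nt_xent(z, tau=0.5):
    # z: [2N, D] projection outputs, ordered so that (2i, 2i+1) are the two views of sample i
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                          # [2N, 2N] pairwise cosine similarity / tau
    sim.fill_diagonal_(float('-inf'))              # exclude k == i from the denominator
    pos = torch.arange(z.size(0)) ^ 1              # positive index of each row: 0<->1, 2<->3, ...
    return F.cross_entropy(sim, pos)               # averages l_{i,j} and l_{j,i} over all 2N rows
```
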

MoCo v2: Improved Baselines with Momentum Contrastive Learning

  1. 动机

    • still working on contrastive unsupervised learning
    • simple modifications on MoCo
      • introduce two effective SimCLR’s designs:
      • an MLP head
      • more data augmentation
      • requires smaller batch size than SimCLR,making it possible to run on GPU
    • verified on
      • ImageNet classification
      • VOC detection
  2. 论点

    • MoCo & SimCLR
      • contrastive unsupervised learning frameworks
      • MoCo v1 shows promising
      • SimCLR further reduce the gap
      • we found two design improvements in SimCLR 在两个方法中都work,而且用在MoCo中shows better transfer learning results
        • an MLP projection head
        • stronger data augmentation
      • 同时MoCo framework相比较于SimCLR ,远不需要large training batches
        • SimCLR based on end-to-end mechanism,需要比较大的batch size,来提供足够多的negative pair
        • MoCo则用了动态队列,所以不限制batch size
    • SimCLR
      • improves the end-to-end method
      • larger batch:to provide more negative samples
      • output layer:replace fc with a MLP head
      • stronger data augmentation
    • MoCo

      • a large number of negative samples are readily available
      • 所以就把后两项引入进来了

  3. 方法

    • MLP head

      • 2-layer MLP(hidden dim=2048, ReLU),见本节末尾的sketch
      • 仅影响unsupervised training,有监督transfer learning的时候换头
      • temperature param调整:从default 0.07 调整成optimal value 0.2

    • augmentation

      • add blur
      • SimCLR还用了stronger color distortion:we found stronger color distortion in SimCLR hurts in our MoCo,所以没加
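
fc换成2-layer MLP head的一个sketch(假设backbone是ResNet-50、输出128-d embedding,仅示意):

```python
import torch.nn as nn

def mlp_head(in_dim=2048, hidden_dim=2048, out_dim=128):
    # replaces the single fc head used in MoCo v1; only used during unsupervised pre-training
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )
```
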
  4. 实验

    • ablation

      • MLP:在分类任务上的提升比检测大
      • augmentation:在检测上的提升比分类大

    • comparison

      • large batches are not necessary for good acc:SimCLR longer training那个版本精度更高
      • end-to-end的方法肯定more costly in memory and time:因为要bp两个encoder

MoCo v3: An Empirical Study of Training Self-Supervised Visual Transformers

  1. 动机

    • self-supervised frameworks that based on Siamese network, including MoCo
    • ViT:study the fundamental components for training self-supervised ViT
    • MoCo v3:an incremental improvement of MoCo v1/2,striking a better balance between simplicity & accuracy & scalability
    • instability is a major issue
    • scaling up ViT models
      • ViT-Large
      • ViT-Huge
  2. 论点

    • we go back to the basics and investigate the fundamental components of training deep neural networks
      • batch size
      • learning rate
      • optimizer
    • instability
      • instability is a major issue that impacts self-supervised ViT training
      • but may not result in catastrophic failure,只会导致精度损失
      • 所以称之为hidden degradation
      • use a simple trick to improve stability:freeze the patch projection layer in ViT
      • and observe an increase in acc
    • NLP里面基于masked auto-encoding的framework效果要比基于contrastive的framework好,图像正好反过来
  3. 方法

    • MoCo v3

      • take two crops for each image under random augmentation
      • encoded by two encoders $f_q$ & $f_k$ into vectors $q$ & $k$
      • we use the keys that naturally co-exist in the same batch
        • abandon the memory queue:因为发现batch size足够大(4096)的时候,memory queue就没啥acc gain了
        • 回归到batch-based sample pair
      • 但是encoder k仍旧不回传梯度,还是基于encoder q进行动量更新
      • symmetrized loss:

        • $ctr(q_1, k_2) + ctr(q_2,k_1)$
        • InfoNCE
        • temperature
        • 两个crops分别计算ctr(见本节末尾的sketch)

    • encoder

      • encoder $f_q$
        • a backbone
        • a projection head
        • an extra prediction head
      • encoder $f_k$
        • a backbone
        • a projection head
      • encoder $f_k$ is updated by the moving average of $f_q$,excluding the prediction head
    • baseline acc

      • basic settings,主要变动就是两个:
        • dynamic queue换成large batch
        • encoder $f_q$的extra prediction head
    • use ViT

      • 直接用ViT替换resnet backbone,遇到了instability issue

      • batch size

        • ViT里面的一个观点就是,model本身比较heavy,所以large batch is desirable

        • 实验发现

          • a batch of 1k & 2k produces reasonably smooth curves:In this regime, the larger batch improves accuracy thanks to more negative samples
          • a batch of 4k 有明显的unstable dips:
          • a batch of 6k has worse failure patterns:我们解读为在跳水点,training is partially restarted and jumps out of the current local optimum

      • learning rate

        • lr较小,training比较稳定,但是容易欠拟合
        • lr过大,会导致unstable,也会影响acc
        • 总体来说精度还是决定于stability

      • optimizer

        • default AdamW,batch size 4096
        • 有些方法用了LARS & LAMB for large-batch training
        • LAMB

          • sensitive to lr
          • optimal lr achieves slightly better accuracy than AdamW
          • 但是lr一旦过大,acc极速drop
          • 但是training curves still smooth,虽然中间过程有drop:我们解读为LAMB can avoid sudden change in the gradients,但是避免不了negative impact,还是会累加

      • a trick for improving stability

        • we found a spike in gradient causes a dip in the training curve
        • we also observe that gradient spikes happen earlier in the first layer (patch projection)
        • 所以尝试freezing the patch projection layer during training,也就是一个random的patch projection layer

          • This stability benefits the final accuracy
          • The improvement is bigger for a larger lr
          • 在别的ViT-back-framework上也有效(SimCLR、BYOL)

      • we also tried BN,WN,gradient clip

        • BN/WN does not improve
        • gradient clip在threshold足够小的时候有用,推到极限就是freezing了
    • implementation details

      • AdamW
      • batch size 4096
      • lr:warmup 40 eps then cosine decay
    • MLP heads

      • projection head:3-layers,4096-BN-ReLU-4096-BN-ReLU-256
      • prediction head:2-layers,4096-BN-ReLU-256
    • loss

      • ctr里面有个scale的参数,$2\tau$
      • makes it less sensitive to $\tau$ value
      • $\tau=0.2$
    • ViT architecture

      • 跟原论文保持一致
      • 输入是224x224的image,划分成16x16或14x14的patch,对应196或256个patch token,每个patch再linear project成embedding
      • 加上sine-cosine-2D的PE
      • 再concat一个cls token
      • 经过一系列transformer blocks
      • The class token after the last block (and after the final LayerNorm) is treated as the output of the backbone,and is the input to the MLP heads
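
MoCo v3的symmetrized loss(带$2\tau$ scaling)和freeze patch projection这个trick的sketch(示意代码,`vit.patch_embed`等属性名为假设,非官方实现):

```python
import torch
import torch.nn.functional as F

def ctr(q, k, tau=0.2):
    # q: [N, D] queries, k: [N, D] keys; positives are the in-batch diagonal pairs (no queue)
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    logits = q @ k.t() / tau
    labels = torch.arange(q.size(0))
    return 2 * tau * F.cross_entropy(logits, labels)     # 2*tau scaling,如上所述

def moco_v3_loss(q1, q2, k1, k2):
    # q1/q2: outputs of f_q (backbone + projection + prediction head) on the two crops
    # k1/k2: outputs of the momentum encoder f_k on the two crops
    return ctr(q1, k2) + ctr(q2, k1)                     # symmetrized loss

def freeze_patch_projection(vit):
    # stability trick: keep the (randomly initialized) patch projection fixed during training
    # 假设ViT module把patch embedding暴露为 vit.patch_embed
    for p in vit.patch_embed.parameters():
        p.requires_grad = False
```
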

optimizers优化器

发表于 2021-03-15 |

0. overview

keywords:SGD, moment, Nesterov, adaptive, ADAM, Weight decay

  1. 优化问题Optimization
    • to minimize目标函数
    • gradient descent
      • gradient
        • numerical:数值法,approx,slow
        • analytical:解析法,exact,fast
      • Stochastic
        • 用minibatch的梯度来approximate全集
        • $\theta_{t+1} = \theta_t - \eta_t(x_i,y_i)$
      • classic optimizers:SGD,Momentum,Nesterov‘s momentum
      • adaptive optimizers:AdaGrad,Adadelta,RMSProp,Adam
    • Newton
    • modern optimizers for large-batch
      • AdamW
      • LARS
      • LAMB
  1. common updating steps

    for current step t:

    step1:计算直接梯度,$g_t = \nabla f(w_t)$

    step2:计算一阶动量和二阶动量,$m_t \& V_t$

    step3:计算当前时刻的下降梯度,$\eta_t = \alpha m_t/\sqrt {V_t}$

    step4:参数更新,$w_{t+1} = w_t - \eta_t$

    • 各种优化算法的主要差别在step1和step2上(下面给出这四步的一个可运行sketch)
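
把上面四个step写成可运行的代码(这里$m_t$/$V_t$取了Adam式的滑动平均,只是为了让框架完整可跑,并非某个优化器的官方实现):

```python
import torch

def step(w, g, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # step1: g_t is the raw gradient, passed in as `g`
    # step2: first / second moments; here both are exponential moving averages (Adam-style choice)
    state['m'] = beta1 * state['m'] + (1 - beta1) * g
    state['V'] = beta2 * state['V'] + (1 - beta2) * g * g
    # step3: descent step eta_t = alpha * m_t / sqrt(V_t)
    eta = lr * state['m'] / (state['V'].sqrt() + eps)
    # step4: parameter update
    return w - eta

# usage: state = {'m': torch.zeros_like(w), 'V': torch.zeros_like(w)}; w = step(w, grad, state)
```
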
  2. 滑动平均/指数加权平均/moving average/EMA

    • 局部均值,与一段时间内的历史相关
    • $v_t = \beta v_{t-1}+(1-\beta)\theta_t$,大致等于过去$1/(1-\beta)$个时刻的$\theta$的平均值,但是在起始点附近偏差较大
    • $v_{tbiased} = \frac{v_t}{1-\beta^t}$,做了bias correction
    • t越大,越不需要修正,两个滑动均值的结果越接近
    • 优缺点:不用保存历史,但是近似(sketch见下)
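
滑动平均和bias correction的一个纯Python sketch:

```python
def ema(values, beta=0.9):
    # returns the bias-corrected moving averages of a list of scalars
    v, out = 0.0, []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta      # v_t = beta * v_{t-1} + (1 - beta) * theta_t
        out.append(v / (1 - beta ** t))        # bias correction: v_t / (1 - beta^t)
    return out
```
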

  3. SGD

    • SGD没有动量的概念,$m_t=g_t$,$V_t=I^2$,$w_{t+1} = w_t - \alpha g_t$
    • 仅依赖当前计算的梯度
    • 缺点:下降速度慢,可能陷在local optima上持续震荡
  1. SGDW (with weight decay)

    • 在权重更新的同时进行权重衰减
    • $w_{t+1} = (1-\lambda)w_t - \alpha g_t$
    • 在SGD form的优化器中weight decay等价于在loss上L2 regularization
    • 但是在adaptive form的优化器中是不等价的!!因为historical func(ERM)中regularizer和gradient一起被downscale了,因此not as much as they would get regularized in SGDW
  2. SGD with Momentum

    • 引入一阶动量,$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$,使用滑动均值,抑制震荡
    • 梯度下降的主要方向是此前累积的下降方向,略微向当前时刻的方向调整
  3. SGD with Nesterov Acceleration

    • look ahead SGD-momentum
    • 在local minima的时候,四周没有下降的方向,但是如果走一步再看,可能就会找到优化方向
    • 先跟着累积动量走一步,求梯度:$g_t = \nabla f(w_t-\alpha m_{t-1}/\sqrt {V_{t-1}})$
    • 用这个点的梯度方向来计算滑动平均,并更新梯度
  4. Adagrad

    • 引入二阶动量,开启“自适应学习率”,$V_t = \sum_{k=0}^t g_k^2$,度量历史更新频率
    • 对于经常更新的参数,我们已经积累了大量关于它的知识,不希望被单个样本影响太大,希望学习速率慢一些;对于偶尔更新的参数,我们了解的信息太少,希望能从每个偶然出现的样本身上多学一些,即学习速率大一些
    • $\eta_t = \alpha m_t / \sqrt{V_t}$,本质上为每个参数,对学习率分别rescale
    • 缺点:二阶动量单调递增,导致学习率单调衰减,可能会使得训练过程提前结束
  5. AdaDelta/RMSProp

    • 参考momentum,对二阶动量也计算滑动平均,$V_t = \beta_2 V_{t-1} + (1-\beta_2)g_t^2$
    • 避免了二阶动量持续累积、导致训练过程提前结束
  6. Adam

    • 集大成者:把一阶动量和二阶动量都用起来,Adaptive Momentum
      • SGD-M在SGD基础上增加了一阶动量
      • AdaGrad和AdaDelta在SGD基础上增加了二阶动量
    • 一阶动量滑动平均:$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$
    • 二阶动量滑动平均:$V_t = \beta_2 V_{t-1} + (1-\beta_2)g_t^2$
  7. Nadam

    • look ahead Adam
    • 把Nesterov的one step try加上:$g_t = \nabla f(w_t-\alpha m_{t-1}/\sqrt {V_{t-1}})$
    • 再Adam更新两个动量
  8. 经验超参

    • $momentum=0.9$
    • $\beta_1=0.9$
    • $\beta_2=0.999$
    • $m_0 = 0$
    • $V_0 = 0$
    • 由于$m_0=V_0=0$,初期的$m_t$和$V_t$会无限接近于0,此时可以进行误差修正:$factor=\frac{1}{1-\beta^t}$
  9. AdamW

    • 在adaptive methods中,解耦weight-decay和loss-based gradient在ERM过程中被一起downscale的绑定关系
    • 实质就是把weight decay项从梯度中移出、直接作用在权重更新上(对比sketch见下)
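
L2 regularization(把decay折进梯度)和AdamW的decoupled weight decay的对比sketch(省略bias correction,仅示意):

```python
import torch

def adam_l2_step(w, g, state, lr, lam, beta1=0.9, beta2=0.999, eps=1e-8):
    g = g + lam * w                                          # L2 reg: decay folded into the gradient
    state['m'] = beta1 * state['m'] + (1 - beta1) * g
    state['V'] = beta2 * state['V'] + (1 - beta2) * g * g
    return w - lr * state['m'] / (state['V'].sqrt() + eps)   # decay项也被1/sqrt(V_t) downscale

def adamw_step(w, g, state, lr, lam, beta1=0.9, beta2=0.999, eps=1e-8):
    state['m'] = beta1 * state['m'] + (1 - beta1) * g
    state['V'] = beta2 * state['V'] + (1 - beta2) * g * g
    # decoupled weight decay: 直接作用在权重上,不经过adaptive rescale
    return w - lr * state['m'] / (state['V'].sqrt() + eps) - lr * lam * w
```
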

regnet

发表于 2021-03-11 |

RegNet: Designing Network Design Spaces

  1. 动机

    • study the network design principles
    • design RegNet
    • outperforms efficientNet and 5x faster
      • top1 error:20.1 (eff-b5:21.5)
      • larger batch size
      • 1/4 的 train/test latency
  2. 论点

    • manual network design
      • AlexNet, ResNet family, DenseNet, MobileNet
      • focus on discovering new design choices that improve acc
    • the recent popular approach NAS
      • search the best in a fixed search space of possible networks
      • limitations:难以generalize to new settings,lack of interpretability
    • network scaling
      • 上面两个focus on 找出一个basenet for a specific regime
      • scaling rules aims at tuning the optimal network in any target regime
    • comparing networks
      • the reliable comparison metric to guide the design process
    • our method
      • combines the advantages of manual design and NAS
      • first AnyNet
      • then RegNet
  3. 方法

mongodb

发表于 2021-03-09 |
  1. download:https://www.mongodb.com/try/download/enterprise

  2. install

    # 将解压以后的文件夹放在/usr/local下
    sudo mv mongodb-osx-x86_64-4.0.9/ /usr/local/
    sudo ln -s mongodb-macos-x86_64-4.4.4 mongodb

    # ENV PATH
    export PATH=/usr/local/mongodb/bin:$PATH

    # 创建日志及数据存放的目录
    sudo mkdir -p /usr/local/var/mongodb
    sudo mkdir -p /usr/local/var/log/mongodb
    sudo chown [amber] /usr/local/var/mongodb
    sudo chown [amber] /usr/local/var/log/mongodb
  3. configuration

    # 后台启动
    mongod --dbpath /usr/local/var/mongodb --logpath /usr/local/var/log/mongodb/mongo.log --fork

    # 控制台启动
    mongod --config /usr/local/etc/mongod.conf

    # 查看状态
    ps aux | grep -v grep | grep mongod
  4. run

    # 在db环境下启动一个终端
    cd /usr/local/mongodb/bin
    ./mongo
  5. original settings

    # 显示所有数据的列表
    > show dbs
    admin 0.000GB
    config 0.000GB
    local 0.000GB
    # 三个系统保留的特殊数据库

    # 连接/创建一个指定的数据库
    > use local
    switched to db local

    # 显示当前数据库, 如果没use默认为test
    > db
    test

    # 【!!重要】关闭服务
    之前服务器被kill -9强制关闭,数据库丢失了
    > use admin
    switched to db admin
    > db.shutdownServer()
    server should be down...
  6. concepts

  7. 文档document

    一组key-value对,如上面左图中的一行记录,如上面右图中的一个dict

  8. 集合collection

    一张表,如上面左图和上面右图

  9. 主键primary key

    唯一主键,ObjectId类型,自定生成,有标准格式

  10. 常用命令

    10.1 创建/删除/重命名db

    # 切换至数据库test1
    > use test1

    # 插入一条doc, db.COLLECTION_NAME.insert(document)
    # db要包含至少一条文档,才能在show dbs的时候显示(才真正创建)
    > db.sheet1.insert({'name': 'img0'})

    # 显示当前已有数据库
    > show dbs

    # 删除指定数据库
    > use test1
    > db.dropDatabase()

    # 旧版本(before4.0)重命名:先拷贝一份,再删除旧的
    > db.copyDatabase('OLDNAME', 'NEWNAME');
    > use old_name
    > db.dropDatabase()
    # 新版本重命名:dump&restore,这个东西在mongodb tools里面,要另外下载,可执行文件放在bin下
    # mongodump # 将所有数据库导出到bin/dump/以每个db名字命名的文件夹下
    # mongodump -h dbhost -d dbname -o dbdirectory
    # -h: 服务器地址:端口号
    # -d: 需要备份的数据库
    # -o: 存放位置(需要已存在)
    mongodump -d test -o tmp/
    # 在恢复备份数据库的时候换个名字:mongorestore -h dbhost -d dbname path
    mongorestore -d test_bkp tmp/test
    # 这时候可以看到一个新增了一个叫test_bkp的db

    10.2 创建/删除/重命名collection

    # 创建:db.createCollection(name, options)
    > db.createCollection('case2img')

    # 显示已有tables
    > show collections

    # 不用显示创建,在db insert的时候会自动创建集合
    > db.sheet2.insert({"name" : "img2"})

    # 删除:db.COLLECTION_NAME.drop()
    > db.sheet2.drop()

    # 重命名:db.COLLECTION_NAME.renameCollection('NEWNAME')
    > db.sheet2.renameCollection('sheet3')
    # 复制:db.COLLECTION_NAME.aggregate({$out: 'NEWNAME'})
    > db.sheet2.aggregate({ $out : "sheet3" })

    10.3 插入/显示/更新/删除document

    # 插入
    db.COLLECTION_NAME.insert(document)
    db.COLLECTION_NAME.save(document)
    db.COLLECTION_NAME.insertOne()
    db.COLLECTION_NAME.insertMany()

    # 显示已有doc
    db.COLLECTION_NAME.find()

    # 更新doc的部分内容
    db.COLLECTION_NAME.update(
    <query>, # 查询条件
    <update>, # 更新操作
    {
    upsert: <boolean>, # if true 如果不存在则插入
    multi: <boolean>, # find fist/all match
    writeConcern: <document>
    }
    )
    > db.case2img.insert({"case": "s0", "name": "img0"})
    > db.case2img.insert({"case": "s1", "name": "img1"})
    > db.case2img.find()
    > db.case2img.update({'case': 's1'}, {$set: {'case': 's2', 'name': 'img2'}})
    > db.case2img.find()

    # 给doc的某个key重命名
    db.COLLECTION_NAME.updateMany(
    {},
    {'$rename': {"old_key": "new_key"}}
    )

    # 更新整条文档by object_id
    db.COLLECTION_NAME.save(
    <document>,
    {
    writeConcern: <document>
    }
    )
    > db.case2img.save({"_id": ObjectId("60474e4b77e21bad9bd4655a"), "case":"s3", "name":"img3"})

    # 删除满足条件的doc
    db.COLLECTION_NAME.remove(
    <query>,
    {
    justOne: <boolean>, # find fist/all match
    writeConcern: <document>
    }
    )
    > db.case2img.remove({"case": "s0"})

    # 删除所有doc
    > db.case2img.remove({})

    10.4 简单查询find

    > db.case2img.insert({"case": "s0", "name": "img0"})
    > db.case2img.insert({"case": "s1", "name": "img1"})
    > db.case2img.insert({"case": "s2", "name": "img2"})
    > db.case2img.insert({"case": "s2", "name": "img3"})

    # 查询表中的doc:db.COLLECTION_NAME.find({query})
    > db.case2img.find({'case': 's2'})
    > db.case2img.find({'case': 's1'}, {"name":1}) # projection的value在对应的key-value是list的时候有意义

    # 格式化显示查询结果:db.COLLECTION_NAME.find({query}).pretty()
    > db.case2img.find({'case': 's2'}).pretty()

    # 读取指定数量的数据记录:db.COLLECTION_NAME.find({query}).limit(NUMBER)
    > db.case2img.find({'case': {$type: 'string'}}).limit(1)

    # 跳过指定数量的数据:db.COLLECTION_NAME.find({query}).skip(NUMBER)
    > db.case2img.find({'case': {$type: 'string'}}).skip(1)

    10.5 条件操作符

    (>) 大于 - $gt
    (<) 小于 - $lt
    (>=) 大于等于 - $gte
    (<=) 小于等于 - $lte
    (or) 或 - $or

    > db.case2img.update({'case':'s1'}, {$set: {"name":'img1', 'size':100}})
    WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
    > db.case2img.update({'case':'s2'}, {$set: {"name":'img2', 'size':200}})

    # 查询size>150的doc
    > db.case2img.find({'size': {$gt: 150}})

    # 查询满足任意一个条件的doc
    > db.case2img.find({'$or': [{'case':'s1'}, {'size': {$gt: 150}}]})

    10.6 数据类型操作符

    type(KEY)等于 - $type

    # 比较对象可以是字符串/对应的reflect NUM
    > db.case2img.find({'case': {$type: 'string'}})
    > db.case2img.find({'case': {$type: 2}})

    10.7 排序find().sort

    # 通过指定字段&指定升序/降序来对数据排序:db.COLLECTION_NAME.find().sort({KEY:1/-1})
    > db.case2img.find().sort({'name':1})

    # skip(), limit(), sort()三个放在一起执行的时候,执行的顺序是先 sort(), 然后是 skip(),最后是显示的 limit()。

    10.8 索引

    skip

    10.9 聚合aggregate

    # 用于统计 db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)

    # by group
    > db.case2img.aggregate([{$group: {_id: '$case', img_num:{$sum:1}}}])
    group by key value 'case'
    count number of items in each group
    refer to the number as img_num

    > db.case2img.aggregate([{$group: {_id: '$case', img_num:{$sum:'$size'}}}])
    计算每一个group内,size值的总和

    # by match
    > db.case2img.aggregate([{$match: {'size': {$gt:150}}},
    {$group:{_id: null, totalsize: {$sum: '$size'}}}])
    类似shell的管道,match用来筛选条件,符合条件的送入下一步统计

    > db.case2img.aggregate([{$skip: 4},
    {$group:{_id: null, totalsize: {$sum: '$size'}}}])
  11. 快速统计distinct

    db.case2img.distinct(TAG_NAME)
    # 注意如果distinct的内容太长,超过16M,会报distinct too big的error,推荐用聚合来做统计
12. pymongo

用python代码来操作数据库

先安装:pip install pymongo

11.1 连接client

```python
from pymongo import MongoClient
Client = MongoClient()
```

11.2 获取数据库

db = Client.DB_NAME
db = Client['DB_NAME']

11.3 获取collection

collection = db.COLLECTION_NAME
collection = db['COLLECTION_NAME']

11.4 插入doc

# insert one
document1 = {'x':1}
document2 = {'x':2}
post_1 = collection.insert_one(document1).inserted_id
post_2 = collection.insert_one(document2).inserted_id
print(post_1)

# insert many
new_document = [{'x':1},{'x':2}]
# new_document = [document1,document2] 注意这里传入的是引用而非深拷贝,同一个dict对象只能作为一条doc被插入一次(insert后会被加上_id)
result = collection.insert_many(new_document).inserted_ids
print(result)

11.5 查找

from bson.objectid import ObjectId

# find one 返回一条doc
result = collection.find_one()
result = collection.find_one({'case': 's0'})
result = collection.find_one({'_id': ObjectId('604752f277e21bad9bd46560')})

# find 返回一个迭代器
for _, item in enumerate(collection.find()):
print(item)

11.6 更新

# update one
collection.update_one({'case':'s1'},{'$set':{'size':300}})
collection.update_one({'case':'s1'},{'$push':{'add':1}}) # 追加数组内容

# update many
collection.update_many({'case':'s1'},{'$set':{'size':300}})

11.7 删除

# 在mongo shell里面是remove方法,在pymongo里面被deprecated成delete方法
collection.delete_one({"case": "s2"})
collection.delete_many({"case": "s1"})

11.8 统计

# 计数:count方法已经被重构
print(collection.count_documents({'case':'s0'}))

# unique:distinct方法
print(collection.distinct('case'))

11.9 正则

mongo shell命令行里的正则和pymongo脚本里的正则写法是不一样的,因为python里面有封装正则方法,然后通过bson将python的正则转换成数据库的正则

# pymongo
import re
import bson

pattern = re.compile(r'(.*)-0[345]-(.*)')
regex = bson.regex.Regex.from_native(pattern)
result = collection.aggregate([{'$match': {'date': regex}}])

# mongo shell
> db.collection.find({date:{$regex:"(.*)-0[345]-(.*)"}})

docker

发表于 2021-03-04 |
  1. startup

    • 部署方案

      • 古早年代

      • 虚拟机

      • docker

    • image镜像 & container容器 & registry仓库

      • 镜像:相当于是一个 root 文件系统,提供容器运行时所需的程序、库、资源、配置等
      • 容器:镜像运行时的实体,可以被创建、启动、停止、删除、暂停等
      • 仓库:用来保存镜像
        • 官方仓库:docker hub:https://hub.docker.com/r/floydhub/tensorflow/tags?page=1&ordering=last_updated
  2. 常用命令

    • 拉镜像
      • docker pull [选项] [Docker Registry 地址[:端口号]/]仓库名[:标签]
      • 地址可以是官方地址,也可以是第三方(如Harbor)
      • 仓库名由作者名和软件名组成(如zhangruiming/skin)
      • 标签用来指定某个版本的image,省略则默认latest
    • 列出所有镜像
      • docker images
    • 删除镜像
      • docker rmi [-f] [镜像id]
      • 删除镜像之前要kill/rm所有使用该镜像的container:docker rm [容器id]
    • 运行镜像并创建一个容器
      • docker run [-it] [仓库名] [命令]
      • 选项 -it:为容器配置一个交互终端
      • 选项 -d:后台运行容器,并返回容器ID(不直接进入终端)
      • 选项 --name='xxx':为容器指定一个名称
      • 选项-v /host_dir:/container_dir:将主机上指定目录映射到容器的指定目录
      • [命令]参数必须要加,而且要是那种一直挂起的命令(/bin/bash),如果是ls/cd/直接不填,那么命令运行完容器就会停止运行,docker ps -a查看状态,发现都是Exited
    • 创建容器
      • docker run
    • 查看所有容器
      • docker ps
    • 启动一个已经停止的容器/停止正在运行的容器
      • docker start [容器id]
      • docker stop [容器id]
    • 进入容器
      • docker exec -it [容器id] [linux命令]
    • 删除容器
      • docker rm [容器id]
    • 删除所有不活跃的容器
      • docker container prune
    • 提交镜像到远端仓库
      • docker tag [镜像id] [用户名]/[仓库]:[标签] # 重命名
      • docker login # 登陆用户
      • docker push
  3. 案例

    # 拉镜像
    docker pull 地址/仓库:标签

    # 显示镜像
    docker images

    # 运行指定镜像
    docker run -itd --name='test' 地址/仓库:标签

    # 查看运行的容器
    docker ps

    # 进入容器
    docker exec -it 容器id或name /bin/bash

    # 一顿操作完退出容器
    exit

    # 将修改后的容器保存为镜像
    docker commit 容器id或name 新镜像名字
    docker images可以看到这个镜像了

    # 保存镜像到本地
    docker save -o tf_torch.tar tf_torch

    # 还原镜像
    docker load --input tf_torch.tar

    # 重命名镜像
    docker tag 3db0b2f40a70 amberzzzz/tf1.14_torch1.4_cuda10.0:v1

    # 提交镜像
    docker push amberzzzz/tf1.14-torch0.5-cuda10.0:v1
    1. dockerfile

      • Dockerfile 是用来说明如何自动构建 docker image 的指令集文件

      • 常用命令

        • FROM image_name,指定依赖的镜像
        • RUN command,在 shell 或者 exec 的环境下执行的命令
        • COPY srcfile_path_inhost dstfile_incontainer,将本机文件复制到容器中
        • ADD srcfile_path dstfile_incontainer,将本机文件复制到容器中,src文件不仅可以是local host,也可以是网络地址
        • CMD [“executable”,”param1”,”param2”],指定容器启动默认执行的命令
        • WORKDIR path_incontainer,指定 RUN、CMD 与 ENTRYPOINT 命令的工作目录
        • VOLUME [“/data”],授权访问从容器内到主机上的目录
      • basic image

        • 从nvidia docker开始:https://hub.docker.com/r/nvidia/cuda/tags?page=1&ordering=last_updated&name=10.

        • 选一个喜欢的:如docker pull nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04

        • 然后编辑dockerfile

          FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
          MAINTAINER amber <amber.zhang@tum.de>

          # install basic dependencies
          RUN apt-get update
          RUN apt-get install -y wget vim cmake

          # install Anaconda3
          RUN wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-5.2.0-Linux-x86_64.sh -O ~/anaconda3.sh
          RUN bash ~/anaconda3.sh -b -p /home/anaconda3 && rm ~/anaconda3.sh
          ENV PATH /home/anaconda3/bin:$PATH
          # RUN echo "export PATH=/home/anaconda3/bin:$PATH" >> ~/.bashrc && /bin/bash -c "source /root/.bashrc"

          # change mirror
          RUN mkdir ~/.pip \
          && cd ~/.pip
          RUN echo '[global]\nindex-url = https://pypi.tuna.tsinghua.edu.cn/simple/' >> ~/.pip/pip.conf

          # install tensorflow
          RUN /home/anaconda3/bin/pip install tensorflow-gpu==1.8.0
        • 然后build dockerfile

          docker build -t <docker_name> .

layer norm

发表于 2021-03-02 |

综述

  1. papers

[batch norm 2015] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,inceptionV2,Google Team,归一化层的始祖,加速训练&正则,BN被后辈追着打的主要痛点:approximation by mini-batch,test phase frozen

[layer norm 2016] Layer Normalization,Toronto+Google,针对BN不适用small batch和RNN的问题,主要用于RNN,在CNN上不好,在test的时候也是active的,因为mean&variance由于当前数据决定,有负责rescale和reshift的layer params

[weight norm 2016] Weight normalization: A simple reparameterization to accelerate training of deep neural networks,OpenAI,

[cosine norm 2017] Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks,中科院,

[instance norm 2017] Instance Normalization: The Missing Ingredient for Fast Stylization,高校report,针对风格迁移,IN在test的时候也是active的,而不是freeze的,单纯的instance-independent norm,没有layer params

[group norm 2018] Group Normalization,FAIR Kaiming,针对BN在small batch上性能下降的问题,提出batch-independent的

[weight standardization 2019] Weight Standardization,Johns Hopkins,

[batch-channel normalization & weight standardization 2020] BCN&WS: Micro-Batch Training with Batch-Channel Normalization and Weight Standardization,Johns Hopkins,

  1. why Normalization

    • 独立同分布:independently and identically distributed

    • 白化:whitening([PCA whitening](http://ufldl.stanford.edu/tutorial/unsupervised/PCAWhitening/))

      • 去除特征之间的相关性
      • 使所有特征具有相同的均值和方差
    • 样本分布变化:Internal Covariate Shift

      • 对于神经网络的各层输入,由于stacking internal byproduct,每层的分布显然各不相同,但是对于某个特定的样本输入,他们所指示的label是不变的
      • 即源空间和目标空间的条件概率是一致的,但是边缘概率是不同的

      • 每个神经元的数据不再是独立同分布,网络需要不断适应新的分布,上层神经元容易饱和:网络训练又慢又不稳定

  2. how to Normalization

    • preparation

      • unit:一个神经元(一个op),输入[b,N,C_in],输出[b,N,1]
      • layer:一层的神经元(一系列op,$W\in R^{M*N}$),在channel-dim上concat当前层所有unit的输出[b,N,C_out]
      • dims
        • b:batch dimension
        • N:spatial dimension,1/2/3-dims
        • C:channel dimension
      • unified representation:本质上都是对数据在规范化
        • $h = f(g*\frac{x-\mu}{\sigma}+b)$:先归一化,再rescale & reshift
        • $\mu$ & $\sigma$:compute from上一层的特征值
        • $g$ & $b$:learnable params基于当前层
        • $f$:neurons’ weighting operation
        • 各方法的主要区别在于mean & variance的计算维度
    • 对数据

      • BN:以一层每个神经元的输出为单位,即每个channel的mean&var相互独立
      • LN:以一层所有神经元的输出为单位,即每个sample的mean&var相互独立
      • IN:以每个sample在每个神经元的输出为单位,每个sample在每个channel的mean&var都相互独立
      • GN:以每个sample在一组神经元的输出为单位,一组包含一个神经元的时候变成IN,一组包含一层所有神经元的时候就是LN
      • 示意图:

    • 对权重

      • WN:将权重分解为单位向量和一个固定标量,相当于神经元的任意输入vec点乘了一个单位vec(downscale),再rescale,进一步地相当于没有做shift和reshift的数据normalization
      • WS:对权重做全套(归一化再recale),比WN多了shift,“zero-center is the key”
    • 对op

      • CosN:

        • 将线性变换op替换成cos op:$f_w(x) = \cos\theta = \frac{w \cdot x}{|w||x|}$

        • 数学本质上又退化成了只有downscale的变换,表征能力不足

Whitening白化

  1. purpose

    • images的adjacent pixel values are highly correlated,thus redundant
    • linearly move the origin distribution,making the inputs share the same mean & variance
  2. method

    • 首先进行PCA预处理,去掉correlation

      • mean on sample(注意不是mean on image)

      • 协方差矩阵

      • 奇异值分解svd(S)

        • $\Sigma$为对角矩阵,对角上的元素为奇异值
        • $U=[u_1,u_2,…u_N]$中是奇异值对应的正交向量
      • 投影变换

        • 取投影矩阵$U_p$ from $U$,$U_p \in R^{N*d}$表示将数据空间从N维投影到$U_p$所在的d维空间上
      • recover(投影逆变换)

        • 取投影矩阵$U_r=U_p^T$,就是将数据空间从d维空间再投影回N维空间上

    • PCA白化:

      • 对PCA投影后的新坐标,做归一化处理:基于特征值进行缩放
        $$
        X_{PCAwhite} = \Sigma^{-\frac{1}{2}}X^{'} = \Sigma^{-\frac{1}{2}}U^TX
        $$

      • $X_{PCAwhite}$的协方差矩阵$S_{PCAwhite} = I$,因此是去了correlation的

    • ZCA白化:在上一步做完之后,再把它变换到原始空间,所以ZCA白化后的特征图更接近原始数据

      • 对PCA白化后的数据,再做一步recover
        $$
        X_{ZCAwhite} = U X_{PCAwhite}
        $$

      • 协方差矩阵仍旧是I,合法白化(下面给出numpy sketch)
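
按上面的步骤,PCA/ZCA白化的一个numpy sketch(假设X的shape是[特征维数, 样本数]):

```python
import numpy as np

def whiten(X, eps=1e-5):
    X = X - X.mean(axis=1, keepdims=True)                 # zero-center each feature over the samples
    S = X @ X.T / X.shape[1]                              # covariance matrix
    U, sigma, _ = np.linalg.svd(S)                        # S = U diag(sigma) U^T
    X_rot = U.T @ X                                       # project onto the principal axes
    X_pca_white = X_rot / np.sqrt(sigma[:, None] + eps)   # Sigma^{-1/2} U^T X
    X_zca_white = U @ X_pca_white                         # rotate back to the original space
    return X_pca_white, X_zca_white
```
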

Layer Normalization

  1. 动机

    • BN reduces training time

      • compute by each neuron
      • require moving average
      • depend on mini-batch size
      • how to apply to recurrent neural nets
    • propose layer norm

      • [unlike BN] compute by each layer
      • [like BN] with adaptive bias & gain
      • [unlike BN] perform the same computation at training & test time
      • [unlike BN] straightforward to apply to recurrent nets
      • work well for RNNs
  2. 论点

    • BN
      • reduce training time & serves as regularizer
      • require moving average:introduce dependencies between training cases
      • the approximation of the mean & variance expectations puts constraints on the size of a mini-batch
    • intuition
      • norm layer提升训练速度的核心是限制神经元输入输出的变化幅度,稳定梯度
      • 只要控制数据分布,就能保持训练速度
  3. 方法

    • compute over all hidden units in the same layer
    • different training cases have different normalization terms
    • 没啥好说的,就是在channel维度计算norm
    • further的GN把channel维度分组做norm,IN则直接在每个sample的每个channel上计算norm
    • gain & bias
      • 也是在对应维度:(hwd)c-dim
      • https://tobiaslee.top/2019/11/21/understanding-layernorm/
      • 后续有实验发现,去掉两个learnable rescale params反而提点
      • 考虑是在training set上的过拟合
  4. 实验

    • RNN上有用
    • CNN上比没有norm layer好,但是没有BN好:因为channel是特征维度,特征维度之间有明显的有用/没用,不能简单的norm

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

  1. 动机
    • reparameterizing the weights
      • decouple length & direction
      • no dependency between samples which suits well for
        • recurrent
        • reinforcement
        • generative
    • no additional memory and computation
    • verified on
      • MLP with CIFAR
      • generative model VAE & DRAW
      • reinforcement DQN
  2. 论点
    • a neuron:
      • get inputs from former layers(neurons)
      • weighted sum over the inputs
      • add a bias
      • elementwise nonlinear transformation
      • batch outputs:one value per sample
    • intuition of normalization:
      • give gradients that are more like whitened natural gradients
      • BN:make the outputs of each neuron服从std norm
      • our WN:
        • inspired by BN
        • does not share BN’s across-sample property
        • no addition memory and tiny addition computation

Instance Normalization: The Missing Ingredient for Fast Stylization

  1. 动机

    • stylization:针对风格迁移网络
    • with a small change:swapping BN with IN
    • achieve qualitative improvement
  2. 论点

    • stylized image
      • a content image + a style image
      • both style and content statistics are obtained from a pretrained CNN for image classification
      • methods
        • optimization-based:iterative thus computationally inefficient
        • generator-based:single pass but never as good as
    • our work
      • revisit the feed-forward method
      • replace BN in the generator with IN
      • keep them at test time as opposed to freeze
  3. 方法

    • formulation

      • given a fixed style image $x_0$
      • given a set of content images $x_t, t= 1,2,…,n$
      • given a pre-trained CNN
      • with a variable z controlling the generation of stylization results
      • compute the stylized image g($x_t$, z)
      • compare the statistics:$min_g \frac{1}{n} \sum^n_{t=1} L(x_0, x_t, g(x_t, z))$
      • comparing target:the contrast of the stylized image is similar to the contrast of the style image
    • observations

      • the more training examples, the poorer the qualitative results
      • the result of stylization still depends on the contrast of the content image
    • intuition

      • 风格迁移本质上就是将style image的contrast用在content image上:也就是rescale content image的contrast
      • contrast是per sample的:$\frac{pixel}{\sum pixels\ on\ the\ map}$

      • BN在norm的时候将batch samples搅合在了一起

    • IN

      • instance-specific normalization
      • also known as contrast normalization

      • 就是per image做标准化,没有trainable/frozen params,在test phase也一样用

Group Normalization

  1. 动机

    • for small batch size
    • do normalization in channel groups
    • batch-independent
    • behaves stably over different batch sizes
    • approach BN’s accuracy

  2. 论点

    • BN
      • requires sufficiently large batch size (e.g. 32)
      • Mask R-CNN frameworks use a batch size of 1 or 2 images because of higher resolution, where BN is “frozen” by transforming to a linear layer
      • synchronized BN 、BR
    • LN & IN
      • effective for training sequential models or generative models
      • but have limited success in visual recognition
      • GN能转换成LN/IN
    • WN
      • normalize the filter weights, instead of operating on features
  3. 方法

    • group

      • it is not necessary to think of deep neural network features as unstructured vectors
        • 第一层卷积核通常存在一组对称的filter,这样就能捕获到相似特征
        • 这些特征对应的channel can be normalized together
    • normalization

      • transform the feature x:$\hat x_i = \frac{1}{\sigma}(x_i-\mu_i)$

      • the mean and the standard deviation:

      • the set $S_i$

        • BN:
          • $S_i=\{k|k_C = i_C\}$
          • pixels sharing the same channel index are normalized together
          • for each channel, BN computes μ and σ along the (N, H, W) axes
        • LN
          • $S_i=\{k|k_N = i_N\}$
          • pixels sharing the same batch index (per sample) are normalized together
          • LN computes μ and σ along the (C,H,W) axes for each sample
        • IN
          • $S_i=\{k|k_N = i_N, k_C=i_C\}$
          • pixels sharing the same batch index and the same channel index are normalized together
          • IN computes μ and σ along the (H,W) axes for each sample and each channel
        • GN
          • $S_i=\{k|k_N = i_N, [\frac{k_C}{C/G}]=[\frac{i_C}{C/G}]\}$
          • computes μ and σ along the (H, W ) axes and along a group of C/G channels
      • linear transform

        • to keep representational ability
        • per channel
        • scale and shift:$y_i = \gamma \hat x_i + \beta$

    • relation

      • to LN
        • LN assumes all channels in a layer make “similar contributions”
        • which is less valid with the presence of convolutions
        • GN improved representational power over LN
      • to IN
        • IN can only rely on the spatial dimension for computing the mean and variance
        • it misses the opportunity of exploiting the channel dependence
        • 【QUESTION】BN也没考虑通道间的联系啊,但是计算mean和variance时跨了sample
    • implementation(见下面的sketch)

      • reshape
      • learnable $\gamma$ & $\beta$
      • computable mean & var
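
Group Norm的reshape实现sketch(与论文伪代码思路一致,这里用PyTorch写,函数名为示意):

```python
import torch

def group_norm(x, gamma, beta, G=32, eps=1e-5):
    # x: [N, C, H, W]; gamma/beta: [C] learnable per-channel scale and shift
    N, C, H, W = x.shape
    x = x.view(N, G, C // G, H, W)
    mu = x.mean(dim=(2, 3, 4), keepdim=True)                   # mean over (C/G, H, W) per sample per group
    var = x.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
    x = (x - mu) / torch.sqrt(var + eps)
    x = x.view(N, C, H, W)
    return x * gamma.view(1, C, 1, 1) + beta.view(1, C, 1, 1)  # per-channel linear transform
```
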

  4. 实验

    • GN相比于BN,training error更低,但是val error略高于BN
      • GN is effective for easing optimization
      • loses some regularization ability
      • it is possible that GN combined with a suitable regularizer will improve results
    • 选取不同的group数,所有的group>1均好于group=1(LN)
    • 选取不同的channel数(C/G),所有的channel>1均好于channel=1(IN)
    • Object Detection
      • frozen:因为higher resolution,batch size通常设置为2/GPU,这时的BN frozen成一个线性层$y=\gamma(x-\mu)/\sigma+\beta$,其中的$\mu$和$\sigma$是load了pre-trained model中保存的值,并且frozen掉,不再更新
      • denote as BN*
      • replace BN* with GN during fine-tuning
      • use a weight decay of 0 for the γ and β parameters

WS: Weight Standardization

  1. 动机

    • accelerate training
    • micro-batch:
      • 以BN with large-batch为基准
      • 目前BN with micro-batch及其他normalization methods都不能match这个baseline
    • operates on weights instead of activations
    • 效果
      • match or outperform BN
      • smooth the loss
  2. 论点

    • two facts

      • BN的performance gain与reduction of internal covariate shift没什么关系
      • BN使得optimization landscape significantly smoother
      • 因此our target is to find another technique
        • achieves smooth landscape
        • work with micro-batch
    • normalization methods

      • focus on activations
        • 不展开
      • focus on weights

        • WN:just length-direction decoupling

  3. 方法

    • Lipschitz constants

      • BN reduces the Lipschitz constants of the loss function
      • makes the gradient more Lipschitz
      • BN considers the Lipschitz constants with respect to activations,not the weights that the optimizer is directly optimizing
    • our inspiration

      • standardize the weights也同样能够smooth the landscape
      • 更直接
      • smoothing effects on activations and weights是可以累积的,因为是线性运算
    • Weight Standardization

      • reparameterize the original weights $W$
        • 对卷积层的权重参数做变换,no bias
        • $W \in R^{O * I}$
        • $O=C_{out}$
        • $I=C_{in} \times k_h \times k_w$(kernel size)
      • optimize the loss on $\hat W$
      • compute mean & var on I-dim
      • 只做标准化,无需affine,因为默认后续还要接一个normalization layer对神经元进行refine

    • WS normalizes gradients

      • 拆解:

        • eq5:$W$ to $\dot W$,减均值,zero-centered
        • eq6:$\dot W$ to $\hat W$,除方差,one-varianced
        • eq8:$\delta \hat W$由前一步的梯度normalize得到
        • eq9:$\delta \dot W$也由前一步的梯度normalize
        • 最终用于梯度更新的梯度是zero-centered

    • WS smooths landscape

      • 判定是否smooth就看Lipschitz constant的大小
      • eq5和eq6都能reduce the Lipschitz constant
      • 其中eq5 makes the major improvements
      • eq6 slightly improves,因为计算量不大,所以保留
  4. 实验

    • ImageNet

      • BN的batchsize是64,其余都是1,其余的梯度更新iterations改成64——使得参数更新次数同步
      • 所有的normalization methods加上WS都有提升
      • 裸的normalization methods里面batchsize1的GN最好,所以选用GN+WS做进一步实验
      • GN+WS+AF:加上conv weight的affine会harm

  5. code

# official release
# 放在WSConv2D子类的call里面
kernel_mean = tf.math.reduce_mean(kernel, axis=[0, 1, 2], keepdims=True, name='kernel_mean')
kernel = kernel - kernel_mean
kernel_std = tf.keras.backend.std(kernel, axis=[0, 1, 2], keepdims=True)
kernel = kernel / (kernel_std + 1e-5)

NFNet

发表于 2021-02-22 |

NFNet: High-Performance Large-Scale Image Recognition Without Normalization

  1. 动机

    • NF:
      • normalization-free
      • aims to match the test acc of batch-normalized networks
        • attain new SOTA 86.5%
        • pre-training + fine-tuning上也表现更好89.2%
    • batch normalization
      • 不是完美解决方案
      • depends on batch size
    • non-normalized networks
      • accuracy
      • instabilities:develop adaptive gradient clipping
  2. 论点

    • vast majority models
      • variants of deep residual + BN
      • allow deeper, stable and regularizing
    • disadvantages of batch normalization
      • computational expensive
      • introduces a discrepancy between training & inference behavior & increases parameters/state
      • breaks the independence among samples
    • methods seeks to replace BN
      • alternative normalizers
      • study the origin benefits of BN
      • train deep ResNets without normalization layers
    • key theme when removing normalization
      • suppress the scale of the residual branch
      • simplest way:apply a learnable scalar
      • recent work:suppress the branch at initialization & apply Scaled Weight Standardization,能追上ResNet家族,但是没追上Eff家族
    • our NFNets’ main contributions
      • propose AGC:解决unstable问题,allow larger batch size and stronger augmentatons
      • NFNets家族刷新SOTA:又快又准
      • pretraining + finetuning的成绩也比batch normed models好
  3. 方法

    • Understanding Batch Normalization

      • four main benefits
        • downscale the residual branch:从initialization就保证残差分支的scale比较小,使得网络has well-behaved gradients early in training,从而efficient optimization
        • eliminates mean-shift:ReLU是不对称的,stacking layers以后数据分布会累积偏移
        • regularizing effect:mini-batch作为subset对于全集是有偏的,这种noise可以看作是regularizer
        • allows efficient large-batch training:数据分布稳定所以loss变化稳定,同时大batch更接近真实分布,因此我们可以使用更大的learning rate,但是这个property仅在使用大batch size的时候有效
    • NF-ResNets

      • recovering the benefits of BN:对residual branch进行scale和mean-shift
    • residual block:$h_{i+1} = h_i + \alpha f_i (h_i/\beta_i)$

    • $\beta_i = \sqrt{Var(h_i)}$:用输入的expected std对其进行标准化(方差归一),这是个解析推导的expected value,不是从数据里统计出来的,结构定死就定死了

    • Scaled Weight Standardization & scaled activation

      • 比原版的WS多了一个$\sqrt N$的分母
        • 源码实现中比原版WS还多了learnable affine gain
      • 使得conv-relu以后输出还是标准分布
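        • 按上面的描述,Scaled WS 大致可重建为:$\hat W_{i,j} = \gamma \cdot \frac{W_{i,j} - \mu_{W_{i,\cdot}}}{\sigma_{W_{i,\cdot}} \sqrt N}$,其中 $N$ 为fan-in,$\mu_{W_{i,\cdot}}$、$\sigma_{W_{i,\cdot}}$ 是每行(每个输出channel)权重的均值和标准差,$\gamma$ 为gain(重建公式,符号以论文为准)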

    • $\alpha=0.2$:rescale

    • residual branch上,最终的输出为$\alpha*$标准分布,方差是$\alpha^2$

    • id path上,输出还是$h_{i}$,方差是$Var(h_i)$

      • update这个block输出的方差为$Var(h_{i+1}) = Var(h_i)+\alpha^2$,来更新下一个block的 $\beta$

      • variance reset

        • 每个transition block以后,把variance重新设定为$1+\alpha^2$
        • 在接下来的non-transition block中,用上面的update公式更新expected std
      • 再加上additional regularization(Dropout和Stochastic Depth两种正则手段),就满足了BN benefits的前三条

        • 在batch size较小的时候能够catch up甚至超越batch normalized models
        • 但是large batch size的时候perform worse
      • 对于一个标准的conv-bn-relu,从workflow上看

        • origin:input → free的conv weighting → BN(norm & rescale)→ activation
        • NFNet:input → standard norm → normed weighting & activation → rescale(下面附一个variance bookkeeping的小sketch)
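上面的variance bookkeeping可以用一小段代码示意(纯示意,stage配置和初始variance均为假设,按本文描述推演):

alpha = 0.2

def expected_betas(blocks_per_stage=(1, 2, 6, 3)):
    # analytically track the expected variance of each block's input;
    # beta_i = sqrt(expected Var(h_i)), used to downscale the residual branch input
    betas = []
    var = 1.0  # assume unit variance entering the first stage
    for num_blocks in blocks_per_stage:
        for i in range(num_blocks):
            betas.append(var ** 0.5)
            if i == 0:
                var = 1.0 + alpha ** 2   # transition block: variance reset
            else:
                var += alpha ** 2        # non-transition: Var(h_{i+1}) = Var(h_i) + alpha^2
    return betas

print(expected_betas())  # the per-block beta values a fixed architecture would use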
    • Adaptive Gradient Clipping for Efficient Large-Batch Training

      • 梯度裁剪:

        • clip by norm:用一个clipping threshold $\lambda$ 进行rescale,training stability was extremely sensitive to 超参的选择,settings(model depth、batch size、learning rate)一变超参就要重新调
        • clip by value:用一个clipping value进行上下限截断
      • AGC

        • given 某层的权重 $W \in R^{N \times M}$ 和对应梯度 $G \in R^{N \times M}$
        • ratio $\frac{||G||_F}{||W||_F}$ 可以看作是梯度变化大小的measurement
        • 所以我们直观地想到对这个ratio进行限幅:所谓的adaptive就是在梯度裁剪时不对所有梯度一刀切,而是考虑其对应权重的大小,从而进行更合理的调节
          • 实验中发现unit-wise的gradient norm要比layer-wise的好:每个unit对应weight matrix的一行,对conv weights来说就是一个output channel(fan-in为 $h \times w \times C_{in}$)
        • scalar hyperparameter $\lambda$
          • the optimal value may depend on the choice of optimizer, learning rate and batch size
          • empirically we found $\lambda$ should be smaller for larger batches
        • ablations for AGC
          • 用pre-activation NF-ResNet-50 和 NF-ResNet-200 做实验,batch size从256到4096,学习率从0.1开始随batch size线性增长,超参$\lambda$的取值见右图
          • 左图结论1:batch size较小时,NF-Nets能够追上甚至超越normed models的精度,但batch size一大(2048以上)情况就恶化;加了AGC的NF-Nets则能maintain performance comparable or better than batch-normalized models
          • 左图结论2:the benefits of using AGC are smaller when the batch size is small
          • 右图结论1:超参$\lambda$取值较小时,对梯度的clipping更strong,这对大batch size训练的稳定性非常重要
        • whether or not AGC is beneficial for all layers
          • it is always better to not clip the final linear layer
          • 最开始的卷积不做梯度裁剪也能稳定训练
          • 最终we apply AGC to every layer except for the final linear layer
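一个unit-wise AGC的minimal sketch(TensorFlow,shape约定、函数名、eps取值均为假设,不是official实现):

import tensorflow as tf

def adaptive_gradient_clip(grad, weight, clip_lambda=0.01, eps=1e-3):
    # unit-wise norms: one Frobenius norm per output unit (per row / per output channel)
    w = tf.reshape(weight, [-1, weight.shape[-1]])
    g = tf.reshape(grad, [-1, grad.shape[-1]])
    w_norm = tf.maximum(tf.norm(w, axis=0), eps)   # guard against (near-)zero weights
    g_norm = tf.norm(g, axis=0)
    max_norm = clip_lambda * w_norm
    # rescale only the units whose gradient norm exceeds lambda * ||W_i||_F
    scale = tf.where(g_norm > max_norm, max_norm / (g_norm + 1e-6), tf.ones_like(g_norm))
    return grad * scale                            # broadcast over the last (output) axis

# usage sketch: apply to every (grad, weight) pair from tape.gradient(...), skipping the final linear layer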
    • Normalizer-Free Architectures

      • begin with SE-ResNeXt-D model
      • about group width
        • set group width to 128
        • the reduction in compute density means that 只减少了理论上的FLOPs,没有实际加速
      • about stages
        • R系列模型加深时是非线性增长,疯狂叠加stage3的block数,因为这一层resolution不大,channel也不是最多,兼顾了两侧的计算量
        • 我们给F0设置为[1,2,6,3],然后在deeper variants中对每个stage的block数乘以一个scalar N线性增长
      • about width
        • 仍旧对stage3下手,width设为[256, 512, 1536, 1536]
        • roughly preserves the training speed
        • 一个论点:stage3 is the best place to add capacity,因为它deep enough、have access to deeper levels of the feature hierarchy,同时又比最后一个stage有slightly higher resolution
      • about block
        • 实验发现最有用的操作是adding an additional 3x3 grouped conv after the first
        • overview
      • about scaling variants
        • eff系列采用的是R、W、D一起增长,因为eff的block比较轻量
        • 但是对R系列来说,只增长D和R就够了
      • 补充细节
        • 在inference阶段使用比训练阶段slightly higher resolution
        • 随着模型加大increase the regularization strength:
          • scale the drop rate of Dropout
          • 调整stochastic depth rate和weight decay则not effective
        • se-block的scale乘个2
        • SGD params:
          • Nesterov=True, momentum=0.9, clipnorm=0.01
          • lr:先warmup再余弦退火(cosine annealing):increase from 0 to 1.6 over 5 epochs, then decay to zero
      • summary
        • 总结来说,就是拿来一个SE-ResNeXt-D
        • 先做结构上的调整:modified width and depth patterns、a second spatial convolution,以及drop rate和resolution
        • 再做对梯度的调整:除了最后一个线性分类层以外,全部使用AGC,$\lambda=0.01$
        • 最后是训练上的trick:strong regularization and data augmentation

      • detailed view of NFBlocks

        • transition block:有下采样的block
          • 残差branch上,bottleneck的narrow ratio是0.5
          • 每个stage的3x3 conv的group width永远是128,而group数目是在随着block width变的
          • skip path接在 $\beta$ downscaling 之后
          • skip path上是avg pooling + 1x1 conv
        • non-transition block:无下采样的block

          • bottleneck-ratio仍旧是0.5
          • 3x3conv的group width仍旧是128
          • skip path接在$\beta$ downscaling 之前
          • skip path就是id

  4. 实验

repVGG

发表于 2021-02-09 |

RepVGG: Making VGG-style ConvNets Great Again

  1. 动机

    • plain ConvNets
      • simply efficient but poor performance
    • propose a CNN architecture RepVGG
      • 能够decouple为training-time和inference-time两个结构
      • 通过structure re-paramterization technique
      • inference-time architecture has a VGG-like plain body
    • faster
      • 83% faster than ResNet-50 or 101% faster than ResNet-101
    • accuracy-speed trade-off
      • reaches over 80% top-1 accuracy
      • outperforms ResNets by a large margin
    • verify on classification & semantic segmentation tasks
  2. 论点

    • well-designed CNN architectures

      • Inception,ResNet,DenseNet,NAS models
      • deliver higher accuracy
      • drawbacks
        • multi-branch designs:slow down inference and reduce memory utilization,对高并行化的设备不友好
        • some components:depthwise & channel shuffle,increase memory access cost
      • MAC(memory access cost) constitutes a large time usage in groupwise convolution:我的groupconv实现里cardinality维度上计算不并行
      • FLOPs并不能precisely reflect actual speed,一些结构看似比old fashioned VGG/resnet的FLOPs少,但实际并没有快
    • multi-branch

      • 通常multi-branch model要比plain model表现好
      • 因为makes the model an implicit ensemble of numerous shallower models
      • so that avoids gradient vanishing
      • benefits are all for training
      • drawbacks are undesired for inference
    • the proposed RepVGG

      • advantages
        • plain architecture:no branches
        • 3x3 conv & ReLU组成
        • 没有过重的人工设计痕迹
      • training time use identity & 1x1 conv branches
      • at inference time

        • identity 可以看做degraded 1x1 conv
        • 1x1 conv 可以看做degraded 3x3 conv
        • 最终整个conv-bn branches能够整合成一个3x3 conv
        • inference-time model只包含conv和ReLU:没有max pooling!!
        • fewer memory units:分支会占内存,直到分支计算结束,plain结构的memory则是immediately released

  3. 方法

    • training-time

      • ResNet-like block
        • id + 1x1 conv + 3x3 conv multi-branches
        • use BN in each branch
        • with n blocks, the model can be interpreted as an ensemble of $3^n$ models
        • stride2的block没有identity path:identity branch只在输入输出shape相同(stride=1 且 $C_{in}=C_{out}$)时使用,下采样block只保留1x1和3x3两个分支
      • simply stack serveral blocks to construct the training model
    • inference-time

      • re-param

        • inference-time BN也是一个线性计算
        • 两个1x1 conv(真实的1x1分支,以及视作degraded 1x1的identity分支)都可以zero-pad成只有中心位置非零的3x3 kernel,一个带学习到的权重、一个是固定的单位权重(融合过程的sketch见下)
        • 要求各branch has the same strides & padding pixel要对齐
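上面的融合过程可以用一小段numpy sketch表示(假设kernel layout为[kh, kw, C_in, C_out],bn参数按(gamma, beta, moving_mean, moving_var)打包;仅为示意,不是官方转换代码):

import numpy as np

def fuse_conv_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    # conv followed by BN == conv with a rescaled kernel plus a bias
    scale = gamma / np.sqrt(var + eps)              # shape [C_out]
    return kernel * scale, beta - mean * scale

def reparam_block(k3, bn3, k1, bn1, bn_id=None):
    kernel, bias = fuse_conv_bn(k3, *bn3)           # 3x3 branch
    k1_fused, b1 = fuse_conv_bn(k1, *bn1)           # 1x1 branch, kernel shape [1, 1, C_in, C_out]
    kernel[1:2, 1:2] += k1_fused                    # place the 1x1 kernel at the centre of the 3x3
    bias += b1
    if bn_id is not None:                           # identity branch only exists when stride=1, C_in == C_out
        c = k3.shape[2]
        k_id = np.zeros_like(k3)
        for i in range(c):
            k_id[1, 1, i, i] = 1.0                  # identity viewed as a degraded conv
        k_id, b_id = fuse_conv_bn(k_id, *bn_id)
        kernel += k_id
        bias += b_id
    return kernel, bias                             # load these into a single plain 3x3 conv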

      • architectural specification

        • variety:depth and width
        • does not use maxpooling:只有一种operator:3x3 conv+relu
        • head:GAP + fc / task specific
        • 5 stages
          • 第一个stage处理high resolution,stride2
          • 第五个stage shall have more channels,所以只用一层,save parameters
          • 给倒数第二个stage最多层,考虑params和computation的balance
        • RepVGG-A:[1,2,4,14,1],用来compete against轻量和中量级model
        • RepVGG-B:deeper in s2,3,4,[1,4,6,16,1],用来compete against high-performance ones

        • basic width:[64, 128, 256, 512]

          • width multiplier a & b
          • a控制前4个stage宽度,b控制最后一个stage
          • [64a, 128a, 256a, 512b]
          • 第一个stage的宽度只接受变小不接受变大,因为大resolution影响计算量,取min(64, 64a)(stage宽度的计算见本节末尾的小例子)

        • further reduce params & computation

          • groupwise 3x3 conv
          • 跳着层换:从第3层开始,每隔一层使用groupwise conv(第3、5、7、…层)
          • number of groups:1,2,4 globally
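按上面的width规则,可以用一个小函数算出每个stage的宽度(示意代码,函数名和示例的a、b取值均为假设):

def repvgg_stage_widths(a, b):
    # stage widths follow [min(64, 64a), 64a, 128a, 256a, 512b]
    return [min(64, int(64 * a)), int(64 * a), int(128 * a), int(256 * a), int(512 * b)]

print(repvgg_stage_widths(0.75, 2.5))   # example multipliers -> [48, 48, 96, 192, 1280]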
  4. 实验

    • 分支的作用

    • 结构上的微调

      • id path去掉BN
      • 把所有的BN移动到add的后面
      • 每个path加上relu

    • ImageNet分类任务上对标其他模型

      • simple augmentation

      • strong:Autoaugment, label smoothing and mixup
