The GoogLeNet Family

Overview

  1. papers

    • [V1] Going Deeper with Convolutions, 6.67% test error

    • [V2] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 4.8% test error

    • [V3] Rethinking the Inception Architecture for Computer Vision, 3.5% test error

    • [V4] Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, 3.08% test error

    • [Xception] Xception: Deep Learning with Depthwise Separable Convolutions

    • [EfficientNet] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

    • [EfficientDet] EfficientDet: Scalable and Efficient Object Detection

    • [EfficientNetV2] EfficientNetV2: Smaller Models and Faster Training

  2. Main ideas

    • Inception V1: breaks away from the traditional conv block by designing the Inception block, which concatenates the outputs of 1*1, 3*3, and 5*5 convolutions to increase network width
    • Inception V2: adds BN layers to reduce internal covariate shift, and replaces each 5*5 conv with two 3*3 convs to cut the parameter count
    • Inception V3: proposes factorization, e.g. splitting 7*7 into 7*1 and 1*7, which reduces parameters, speeds up computation, and adds depth and non-linearity
    • Inception V4: combines Inception with residual connections
    • Xception: improves on the factorized structure of Inception V3 by using depthwise separable convolutions
    • EfficientNet: studies model scaling, scaling CNNs effectively along network depth, width, and image resolution
    • EfficientDet: extends EfficientNet from classification to object detection

review

  1. review0122: fusing the conv and BN layers into one operation

    • reference:https://nenadmarkus.com/p/fusing-batchnorm-and-conv/

    • a frozen BN layer can be viewed as a 1x1 convolution

    • two consecutive linear operations can be merged into one

    • given $W_{conv} \in \mathbb{R}^{C \times C_{prev} \times k \times k}$, $b_{conv} \in \mathbb{R}^C$, $W_{bn} \in \mathbb{R}^{C \times C}$, $b_{bn} \in \mathbb{R}^C$, the fused layer has $W = W_{bn} \cdot W_{conv}$ (suitably reshaped) and $b = W_{bn} \cdot b_{conv} + b_{bn}$ (see the sketch below)
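
A minimal numpy sketch of the fusion, using the diagonal form of $W_{bn}$ (the function name and eps default are illustrative):

```python
import numpy as np

def fuse_conv_bn(W_conv, b_conv, gamma, beta, mean, var, eps=1e-5):
    """Fold a frozen BN (gamma, beta, running mean/var) into the preceding conv.

    W_conv: (C, C_prev, k, k), b_conv: (C,), all BN parameters: (C,)
    """
    scale = gamma / np.sqrt(var + eps)             # diagonal of W_bn
    W_fused = W_conv * scale[:, None, None, None]  # W = W_bn · W_conv
    b_fused = beta + scale * (b_conv - mean)       # b = W_bn · b_conv + b_bn
    return W_fused, b_fused
```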

V1: Going deeper with convolutions

  1. Motivation

    • improved utilization of the computing resources
    • increasing the depth and width of the network while keeping the computational budget
  2. Arguments

    • the recent trend has been to increase the number of layers and layer size, while using dropout to address the problem of overfitting
    • major bottleneck: larger networks mean a larger number of params, and with a limited dataset this leads to overfitting
    • methods use filters of different sizes in order to handle multiple scales
    • NiN uses 1x1 convolutional layers that easily integrate into the current CNN pipelines
    • we use 1x1 convs with a dual purpose of dimension reduction
  3. Methods

    • Architectural

      • 1x1 conv+ReLU for compute reductions
      • an alternative parallel pooling path since pooling operations have been essential for the success

      • overall architecture:

      • Details:

        • rectified linear activation

        • mean subtraction

        • a move from fully connected layers to average pooling improves acc

        • the use of dropout remained essential

        • adding auxiliary classifiers (on 4a & 4d) with a discount weight (see the sketch at the end of this list)

          • 5x5 avg pool, stride 3
          • 1x1 conv+relu, 128 filters
          • 1024 fc+relu
          • 70% dropout
          • 1000 fc+softmax

        • asynchronous stochastic gradient descent with 0.9 momentum

        • fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs)

        • photometric distortions useful to combat overfitting

        • random interpolation methods for resizing
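
A minimal Keras sketch of the auxiliary head described above (layer choices follow the list; the attachment points and loss weighting are omitted):

```python
from tensorflow.keras import layers

def auxiliary_classifier(x, num_classes=1000):
    """GoogLeNet-style auxiliary head attached to an intermediate feature map."""
    x = layers.AveragePooling2D(pool_size=5, strides=3)(x)      # 5x5 avg pool, stride 3
    x = layers.Conv2D(128, 1, activation="relu")(x)             # 1x1 conv + relu, 128 filters
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation="relu")(x)                # 1024 fc + relu
    x = layers.Dropout(0.7)(x)                                  # 70% dropout
    return layers.Dense(num_classes, activation="softmax")(x)   # 1000 fc + softmax
```

In the paper the auxiliary losses are added to the total loss with a discount weight of 0.3.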

V2: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

  1. Motivation

    • use much higher learning rates
    • be less careful about initialization
    • also acts as a regularizer, eliminating the need for Dropout
  2. Arguments

    • SGD:optimizes the parameters $\theta$ of the network, so as to minimize the loss

      gradient update: $\theta \leftarrow \theta - \frac{\alpha}{N} \sum_{i=1}^{N} \frac{\partial loss(x_i, \theta)}{\partial \theta}$, where $x_i$ ranges over the full training set

      batch approximation: use $\frac{1}{m} \sum_{i=1}^{m} \frac{\partial loss(x_i, \theta)}{\partial \theta}$, where $x_i$ ranges over the mini-batch

      • quality improves as the batch size increases
      • computation over a batch is much more efficient than m computations for individual examples
      • the learning rate and the initial values require careful tuning
    • Internal covariate shift

      • the input distribution of the layers changes
      • consider a gradient descent step above: when the data distribution of $x$ changes, $\theta$ has to readjust to compensate for the change in the distribution of $x$
    • activation

      • for a neuron $z = sigmoid(Wx+b)$, parameter changes in earlier layers easily push the pre-activation outside the effective range of the activation, so the gradient vanishes after the nonlinearity and slows down the convergence
      • In practice, using ReLU & careful initialization & small learning rates
      • if we can make the distribution of nonlinearity inputs remain more stable as the network trains, the neurons will not saturate
    • whitening

      • preprocessing of the training set: linearly transformed to have zero means and unit variances, and decorrelated
      • keeps the distribution of the input data stable (approximately normal)
      • and removes correlations between the inputs
    • batch normalization

      • fixes the means and variances of layer inputs
      • reducing the dependence of gradients on the scale of the parameters or of their initial values
      • makes it possible to use saturating nonlinearities
  3. Methods

    • batch normalization

      • full whitening of each layer is costly
      • so we normalize each layer independently, full set—> mini-batch
      • a standard normal distribution is not what every neuron needs (e.g. it would break an identity transform): introduce, for each activation $x^{(k)}$, a pair of parameters $\gamma^{(k)}$, $\beta^{(k)}$, which scale and shift the normalized value to maintain the representation ability of the neuron

    • back-propagation: the BN transform is differentiable, so gradients are propagated through $\hat{x}_i$, $\mu_B$ and $\sigma_B^2$ via the chain rule

    • at inference time:

      • the two learnable parameters $\gamma$ and $\beta$ are frozen
      • and the mean and variance are no longer computed from the current batch; instead the moving averages maintained during training are loaded

    • for convolutional networks

      • we add the BN transform immediately before the nonlinearity, $z = g(Wx+b)$ to $z = g(BN(Wx))$

      • since we normalize $Wx+b$, the bias b can be ignored

      • obey the convolutional property: different elements of the same feature map, at different locations, are normalized in the same way
      • We learn a pair of parameters $\gamma^{(k)}$ and $\beta^{(k)}$ per feature map, rather than per activation (see the sketch after this list)
    • properties

      • back-propagation through a layer is unaffected by the scale of its parameters
      • Moreover, larger weights lead to smaller gradients, thus stabilizing the parameter growth
      • regularizes the model: within a mini-batch the examples affect each other through the batch statistics, so they are no longer processed independently
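
A minimal numpy sketch of the per-feature-map BN described above, with moving averages used at inference (NHWC layout; hyperparameter defaults are illustrative):

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, moving_mean, moving_var,
                    training=True, momentum=0.99, eps=1e-3):
    """BN for NHWC feature maps: one (gamma, beta) pair per feature map."""
    if training:
        mean = x.mean(axis=(0, 1, 2))                 # statistics over batch and spatial dims
        var = x.var(axis=(0, 1, 2))
        moving_mean = momentum * moving_mean + (1 - momentum) * mean
        moving_var = momentum * moving_var + (1 - momentum) * var
    else:                                             # inference: use the maintained moving averages
        mean, var = moving_mean, moving_var
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta, moving_mean, moving_var
```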

V3: Rethinking the Inception Architecture for Computer Vision

  1. Motivation

    • go deeper and wider:
      • enough labeled data
      • computational efficiency
      • parameter count
    • to scale up networks
      • utilizing the added computation as efficiently
      • give general design principles and optimization ideas
        • factorized convolutions
        • aggressive regularization
  2. Arguments

    • GoogleNet does not provide a clear description about the contributing factors that lead to the various design
  3. Methods

    • General Design Principles

      • Avoid representational bottlenecks: the feature-map size should decrease gently, and a drop in resolution should be accompanied by an increase in channels; avoid downsampling with a plain max-pooling layer, since that loses too much information
      • Higher dimensional representations are easier to process locally within a network. Increasing the activations per tile in a convolutional network allows for more disentangled features. The resulting networks will train faster: the first half is clear (high-resolution feature maps focus on local information), but the second half is unclear; according to the previous paper, with batch norm scaling up a neuron does not affect back-propagation and leads to smaller gradients, so why would it train faster?
      • Spatial aggregation can be done over lower dimensional embeddings: adjacent units are strongly correlated, so the input representation can be reduced in dimension before the spatial aggregation without much information loss, and this promotes faster learning
      • The computational budget should therefore be distributed in a balanced way between the depth and width of the network.
    • Factorizing Convolutions with Large Filter Size

      • into smaller convolutions
        • any large filter can be decomposed into a stack of 3x3 convs
        • a purely linear factorization would need no nonlinear activation between the factors, but since batch norm is used (increasing variety), adding ReLU in between is observed to fit better
      • into Asymmetric Convolutions
        • an n*n filter is factorized into a 1*n followed by an n*1 conv
        • this factorization does not work well on early layers, but gives very good results on medium grid-sizes (ranging between 12 and 20, using 1x7 and 7x1)
    • Utility of Auxiliary Classifiers

      • did not result in improved convergence early in the training: they are of little use at the start of training and give a small accuracy boost near convergence
      • removal of the lower auxiliary branch did not have any adverse effect on the final quality: taking it away does not change the final result
      • so the original conjecture (that they help evolve the low-level features) is wrong; they merely act as regularizers, and adding batch norm to the auxiliary head makes the final result better
    • Efficient Grid Size Reduction: the downsampling module no longer uses a plain max-pooling layer (see the sketch after this list)

      • expand a d x d x k feature map to (d/2) x (d/2) x 2k:

        • 1x1 conv with 2k filters, then stride-2 pool: k x d x d x 2k computation
        • 1x1 conv with 2k filters at stride 2: k x (d/2) x (d/2) x 2k computation; cheaper, but violates principle 1 (representational bottleneck)
        • parallel stride-2 pooling (P) and conv (C) blocks: k x (d/2) x (d/2) x k computation; satisfies principle 1: reduces the grid size while expanding the filter banks
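
A simplified Keras sketch of the parallel grid-reduction block (in the paper the conv branch is itself a small Inception-style stack; here a single stride-2 conv stands in for it):

```python
from tensorflow.keras import layers

def reduction_block(x, k):
    """Halve the grid while expanding the filter bank: d x d x k -> d/2 x d/2 x 2k."""
    conv = layers.Conv2D(k, 3, strides=2, padding="same", activation="relu")(x)  # C branch
    pool = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)        # P branch
    return layers.Concatenate()([conv, pool])                                    # 2k channels
```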

    • Inception-v3

      • the initial 7x7 conv has been replaced by several 3x3 convs
      • once the intermediate feature maps are reduced to 17x17, the asymmetric factorization blocks are used
      • at 8x8, the filter bank outputs are expanded

    • Label Smoothing (https://zhuanlan.zhihu.com/p/116466239)

      • used the uniform distribution $u(k)=1/K$

      • for the softmax $p(k)=\frac{exp(y_k)}{\sum_i exp(y_i)}$, training against a one-hot target pushes $p(k_{gt})$ towards 1 and the other $p(k_i)$ towards 0, which requires the ground-truth logit to grow arbitrarily larger than the others

      • cross-entropy loss: $ce = -\sum_k q(k) \log p(k)$; with label smoothing the target becomes $q'(k) = (1-\epsilon)\delta_{k,gt} + \epsilon u(k)$, which adds a regularization term over the negative classes: the optimal positive and negative logits become bounded, and by limiting the gap between positive and negative outputs the network generalizes better (see the sketch below)
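
A minimal numpy sketch of label-smoothed cross entropy for a single example (function name and the $\epsilon$ default are illustrative):

```python
import numpy as np

def smoothed_cross_entropy(logits, target, epsilon=0.1):
    """Cross entropy against q'(k) = (1 - eps) * one_hot(target) + eps * u(k), u(k) = 1/K."""
    K = logits.shape[-1]
    logits = logits - logits.max()                 # numerical stability
    log_p = logits - np.log(np.exp(logits).sum())  # log-softmax
    q = np.full(K, epsilon / K)                    # uniform part u(k)
    q[target] += 1.0 - epsilon                     # ground-truth part
    return -(q * log_p).sum()
```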

V4: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

  1. Motivation

    • residual:whether there are any benefit in combining the Inception architecture with residual connections
    • inceptionV4:simplify the inception blocks
  2. Arguments

    • residual connections seem to improve the training speed greatly: but very deep networks can also be trained without them
    • made uniform choices for the Inception blocks for each grid size
      • Inception-A for 35x35
      • Inception-B for 17x17
      • Inception-C for 8x8
    • for residual versions
      • use cheaper Inception blocks for residual versions: the modules are simplified, since the identity path (the direct shortcut) already carries rich feature information
      • no pooling path inside the blocks
      • replace the filter concatenation stage of the Inception architecture with residual connections: the concatenation body of the original block is moved onto the residual path
      • Each Inception block is followed by a filter-expansion layer (1 × 1 convolution without activation) to match the depth of the input for addition: this makes the channel counts equal before the addition
      • used batch-normalization only on top of the traditional layers, but not on top of the summations: BN on the summations would waste memory
      • when the number of filters exceeds 1000, training becomes unstable
      • scaling down the residuals by factors between 0.1 and 0.3 before adding: keeps the responses on the residual path from getting too large (see the sketch after the blocks)
  3. blocks

    • V4 ABC:

    • Res ABC:
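
A minimal Keras sketch of the residual version: the Inception branch (a placeholder callable here) is depth-matched by a 1x1 conv without activation, scaled down, and added to the shortcut:

```python
from tensorflow.keras import layers

def inception_resnet_block(x, inception_branch, scale=0.2):
    """Residual Inception block with scaled residuals (scale in the 0.1-0.3 range)."""
    branch = inception_branch(x)                                      # concatenated Inception outputs
    branch = layers.Conv2D(x.shape[-1], 1, activation=None)(branch)   # filter-expansion, no activation
    scaled = layers.Lambda(lambda t: t * scale)(branch)               # scale down the residuals
    return layers.Activation("relu")(layers.Add()([x, scaled]))
```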

Xception: Deep Learning with Depthwise Separable Convolutions

  1. Motivation

    • Inception modules have been replaced with depthwise separable convolutions
    • significantly outperforms Inception V3 on a larger dataset
    • due to more efficient use of model parameters
  2. Arguments

    • early LeNet-style models

      • simple stacks of convolutions for feature extraction and max-pooling operations for spatial sub-sampling
      • increasingly deeper
    • complex blocks

      • Inception modules inspired by NiN
      • be capable of learning richer representations with less parameters
    • The Inception hypothesis

      • a single convolution kernel is tasked with simultaneously mapping cross-channel correlations and spatial correlations
      • while Inception factors it into a series of operations that independently look at cross-channel correlations(1x1 convs) and at spatial correlations(3x3/5x5 convs)
      • suggesting that cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly

    • an Inception block first uses 1x1 convs to map the input onto 3-4 lower-dimensional spaces (cross-channel correlations), then applies regular convs on these smaller 3D spaces (which still map all correlations); the further hypothesis is to decouple them completely, so that the second step handles only spatial correlations

    • main differences between “extreme ” Inception and depthwise separable convolution

      • order of the operations:1x1 first or latter
      • non-linearity: depthwise separable convolutions are usually implemented without non-linearities [QUESTION: this differs from MobileNet, where every depthwise layer is also followed by BN and ReLU]
  3. Key elements

    • Convolutional neural networks
    • The Inception family
    • Depthwise separable convolutions
    • Residual connections
  4. Methods

    • architecture

      • a linear stack of depthwise separable convolution layers with residual connections
      • all conv are followed by BN
      • Keras's SeparableConv2D vs DepthwiseConv2D: the former is composed of the latter plus a pointwise conv; the activation (if any) is applied at the end, with none in between (see the sketch at the end of this section)

      • cmp

        • Xception and Inception V3 have nearly the same number of parameters
        • marginally better on ImageNet
        • a much larger performance improvement on JFT
        • Residual connections are clearly essential in helping with convergence, both in terms of speed and final classification performance.
        • Effect of intermediate activation:the absence of any non-linearity leads to both faster convergence and better final performance
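
A minimal Keras sketch of the Xception-style unit: SeparableConv2D is equivalent to a DepthwiseConv2D followed by a pointwise Conv2D, with no non-linearity between the two steps and BN after the conv:

```python
from tensorflow.keras import layers

def xception_unit(x, filters):
    x = layers.DepthwiseConv2D(3, padding="same", use_bias=False)(x)  # spatial correlations only
    x = layers.Conv2D(filters, 1, use_bias=False)(x)                  # pointwise: cross-channel correlations
    x = layers.BatchNormalization()(x)                                # all convs are followed by BN
    return layers.Activation("relu")(x)
```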

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

  1. Motivation

    • common sense:scaled up the network for better accuracy
    • we systematically study model scaling
    • and identify that carefully balancing network depth, width, and resolution can lead to better performance
    • propose a new scaling method:using compound coefficient to uniformly scale all dimensions of depth/width/resolution
      • on MobileNets and ResNet
      • a new baseline network family EfficientNets
    • much better accuracy and efficiency
  2. Arguments

    • previous work scale up one of the three dimensions
      • depth:more layers
      • width:more channels
      • image resolution:higher resolution
    • arbitrary scaling requires tedious manual tuning and often yields sub-optimal accuracy and efficiency
    • uniformly scaling:Our empirical study shows that it is critical to balance all dimensions of network width/depth/resolution, and surprisingly such balance can be achieved by simply scaling each of them with constant ratio.

    • neural architecture search:becomes increasingly popular in designing efficient mobile-size ConvNets

  3. Methods

    • problem formulation

      • ConvNets: $N = \bigodot_{i=1 \dots s} F_i^{L_i}(X_{\langle H_i, W_i, C_i \rangle})$, where s indexes the stages, $L_i$ the number of repeats, and $F_i$ the layer function

      • simplify the design problem

        • fixing $F_i$
        • all layers must be scaled uniformly with constant ratio
      • an optimization problem: maximize $Accuracy(N(d, w, r))$ under memory and FLOPS budgets, where d, w, r are the depth, width, and resolution coefficients

    • observation 1

      • Scaling up any dimension of network (width, depth, or resolution) improves accuracy, but the accuracy gain diminishes for bigger models. Accuracy improves in every case, but eventually saturates.
      • depth: a deeper ConvNet can capture richer and more complex features
      • width: wider networks tend to be able to capture more fine-grained features and are easier to train (commonly used for small size models), but depth and width should be matched; merely widening a shallow network makes it hard to extract high-level features
      • resolution:higher resolution input can potentially capture more fine-grained patterns

    • observation 2

      • compound scaling:it is critical to balance all dimensions of network width, depth, and resolution

      • different scaling dimensions are not independent: a higher input resolution needs a deeper network for a larger receptive field, and also a wider network to capture more fine-grained features

      • compound coefficient $\phi$: depth $d=\alpha^{\phi}$, width $w=\beta^{\phi}$, resolution $r=\gamma^{\phi}$, subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ and $\alpha, \beta, \gamma \ge 1$

        • $\alpha, \beta, \gamma$ are constants determined by a small grid search, controlling the allocation among the 3 dimensions [d,w,r]
        • $\phi$ controls how many more resources are available for model scaling
        • the total FLOPS will approximately increase by $2^\phi$
    • efficientNet architecture

      • having a good baseline network is also critical

      • thus we developed a new mobile-size baseline called EfficientNet by leveraging a multi-objective neural architecture search that optimizes both accuracy and FLOPS

      • compound scaling: fix $\phi=1$ and grid search $\alpha, \beta, \gamma$; then fix $\alpha, \beta, \gamma$ and scale up with different $\phi$ (see the sketch below)
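
A minimal sketch of the compound scaling rule; $\alpha=1.2$, $\beta=1.1$, $\gamma=1.15$ are the grid-search values reported for EfficientNet-B0, and the base resolution 224 is that of B0:

```python
def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15, base_res=224):
    """Depth/width/resolution factors from one compound coefficient phi,
    under the constraint alpha * beta**2 * gamma**2 ~= 2."""
    d = alpha ** phi                                         # depth multiplier
    w = beta ** phi                                          # width (channel) multiplier
    r = int(round(base_res * gamma ** phi))                  # input resolution
    flops_factor = (alpha * beta ** 2 * gamma ** 2) ** phi   # ~ 2 ** phi
    return d, w, r, flops_factor
```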

  4. Experiments

    • on MobileNets and ResNets

      • compared to other single-dimension scaling methods
      • compound scaling method improves the accuracy on all
    • on EfficientNet

      • model with compound scaling tends to focus on more relevant regions with more object details
      • while other models either lack object details or are unable to capture all objects in the images

  5. Implementation details

    • RMSProp: decay=0.9, momentum(rho)=0.9; LARS is used on TPU
    • BN: momentum=0.99
    • weight decay = 1e-5
    • lr: initial=0.256, decays by 0.97 every 2.4 epochs
    • SiLU activation
    • AutoAugment
    • Stochastic depth: survive_prob = 0.8
    • dropout rate: 0.2 to 0.5 for B0 to B7

EfficientDet: Scalable and Efficient Object Detection

  1. Motivation

    • model efficiency
    • for object detection:based on one-stage detector
    • feature fusion: propose a weighted bi-directional feature pyramid network (BiFPN)
    • model rescaling: uniformly scales the resolution, depth, and width of the backbone, feature network, and prediction networks
    • achieve better accuracy with much fewer parameters and FLOPs
    • also test on Pascal VOC 2012 semantic segmentation
  2. Arguments

    • previous work tends to achieve better efficiency by sacrificing accuracy
    • previous work fuse feature at different resolutions by simply summing up without distinction
    • EfficientDet
      • backbone:combine EfficientNet backbones with the proposed BiFPN
      • scale up:jointly scales up the resolution/depth/width for all backbone, feature network, box/class prediction network
    • Existing object detectors
      • two-stage:have a region-of-interest proposal step
      • one-stage:have not, use predefined anchors
  3. Methods

    • BiFPN: efficient bidirectional cross-scale connections and weighted feature fusion

      • FPN: its limitation is a single top-down information flow

      • PANet: adds an extra bottom-up path; better accuracy but more parameters and computation

      • NAS-FPN: an architecture found by neural architecture search; irregular and difficult to interpret or modify

      • BiFPN

        • remove those nodes that only have one input edge: a node with a single input edge performs no feature fusion
        • add an extra edge from the original input to output node if they are at the same level:fuse more features without adding much cost
        • repeat blocks

      • Weighted Feature Fusion

        • since different input features are at different resolutions, they usually contribute to the output feature unequally
        • learnable weight that can be a scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel)
        • weight normalization

          • Softmax-based:$O=\sum_i \frac{e^{w_i}}{\sum_j e^{w_j}}*I_i$
          • Fast normalized: $O=\sum_i \frac{w_i}{\epsilon + \sum_j w_j}*I_i$, where ReLU is applied to each $w_i$ to keep it non-negative (see the sketch below)
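
A minimal numpy sketch of the fast normalized fusion over same-shape feature maps (names are illustrative):

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """O = sum_i (w_i / (eps + sum_j w_j)) * I_i, with ReLU keeping each w_i non-negative."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)
    w = w / (eps + w.sum())
    return sum(wi * f for wi, f in zip(w, features))
```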
      • EfficientDet

        • ImageNet-pretrained EfficientNets as the backbone
        • BiFPN serves as the feature network
        • the fused features(level 3-7) are fed to a class and box network respectively

        • compound scaling

          • backbone:reuse the same width/depth scaling coefficients of EfficientNet-B0 to B6

          • feature network:

            • depth(layers):$D=3+\phi$
            • width(channes):$W=64 \cdot (1.35^{\phi}) $
          • box/class prediction network:

            • depth:$D=3+\lfloor \phi/3 \rfloor$
            • width:same as the BiFPN
          • resolution

            • use feature levels 3-7: the input resolution must be divisible by $2^7$
            • $R=512+128*\phi$
          • EfficientDet-D0 ($\phi=0$) to D7 ($\phi=7$)

  4. Experiments

    • for object detection
      • train
        • Learning rate is linearly increased from 0 to 0.16 in the first training epoch and then annealed down
        • employ commonly-used focal loss
        • 3x3 anchors
      • compare
        • low-accuracy regime: EfficientDet-D0 is on par with YOLOv3
        • at moderate accuracy, EfficientDet-D1 is on par with Mask R-CNN
        • EfficientDet-D7 achieves a new state-of-the-art
    • for semantic segmentation
      • modify
        • keep feature level {P2,P3,…,P7} in BiFPN
        • but only use P2 for the final per-pixel classification
        • set the channel size to 128 for BiFPN and 256 for classification head
        • Both BiFPN and classification head are repeated by 3 times
      • compare
        • compared against DeepLabV3+ on Pascal VOC 2012
        • better accuracy and fewer FLOPs
    • ablation study
      • backbone improves accuracy v.s. resnet50
      • BiFPN improves accuracy v.s. FPN
      • BiFPN achieves similar accuracy as repeated FPN+PANet
      • BiFPN + weighting achieves the best accuracy
      • Normalization: the softmax and fast normalized versions perform about the same; each node's weights change rapidly at the start of training (suggesting different features contribute to the feature fusion unequally)
      • Compound Scaling: unsurprisingly better than scaling only a single dimension
  5. Hyperparameters:

    EfficientNet and EfficientDet use different resolutions: detection adds the neck and head, so the network is deeper and a larger input resolution is used

EfficientNetV2: Smaller Models and Faster Training

  1. Motivation

    • faster training speed and better parameter efficiency
    • use a new op: Fused-MBConv
    • propose progressive learning: adaptively adjusts regularization & image size
  2. Methods

    • review of EfficientNet

      • large image size

        • large memory usage,small batch size,long training time
        • thus propose increasing image size gradually in V2
      • extensive depthwise conv

        • often cannot fully utilize modern accelerators
        • thus introduce Fused-MBConv in V2:When applied in early stage 1-3, Fused-MBConv can improve training speed with a small overhead on parameters and FLOPs

      • equally scaling up

        • proved sub-optimal in nfnets
        • since the stages are not equally contributed to the efficiency & accuracy
        • thus in V2
          • use a non-uniform scaling strategy:gradually add more layers to later stages(s5 & s6)
          • restrict the max image size
    • EfficientNet V2 Architecture

      • basic ConvBlock

        • use fused-MBConv in the early layers
        • use MBConv in the latter layers
      • expansion ratios

        • use smaller expansion ratios
        • because, for the same channel count, Fused-MBConv has more parameters than MBConv
      • kernel size

        • 3x3 kernels throughout; no more 5x5
        • add more layers to compensate the reduced receptive field
      • last stride 1 stage

        • EfficientNetV1 has 7 stages
        • EfficientNetV2 has 6 stages (the last stride-1 stage is removed)

      • scaling policy

        • compound scaling: R, W, and D are scaled together
        • but the maximum inference image size is restricted to 480 (training uses 384)
        • gradually add more layers to later stages (s5 & s6)
    • progressive learning

      • large models require stronger regularization

      • larger image size leads to more computations with larger capacity,thus also needs stronger regularization

      • training process

        • in the early training epochs, we train the network with smaller images and weak regularization
        • gradually increase image size but also making learning more difficult by adding stronger regularization

      • adaptive params

        • image size
        • dropout rate
        • randAug magnitude
        • mixup alpha
        • given the minimum and maximum values and N training stages, the per-stage values are obtained by linear interpolation (see the sketch below)
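
A minimal sketch of the per-stage linear interpolation (the stage count and example ranges are illustrative):

```python
def progressive_value(stage, num_stages, v_min, v_max):
    """Linearly interpolate a training setting from v_min (first stage) to v_max (last stage)."""
    t = stage / max(num_stages - 1, 1)
    return v_min + t * (v_max - v_min)

# e.g. 4 stages: image size ramps up while regularization is strengthened
image_sizes = [int(progressive_value(s, 4, 128, 300)) for s in range(4)]
dropout_rates = [progressive_value(s, 4, 0.1, 0.3) for s in range(4)]
```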

    • train&test details

      • RMSProp optimizer with decay 0.9 and momentum 0.9
      • batch norm momentum 0.99
      • weight decay 1e-5
      • trained for 350 epochs with total batch size 4096
      • Learning rate is first warmed up from 0 to 0.256, and then decayed by 0.97 every 2.4 epochs
      • exponential moving average with 0.9999 decay rate
      • stochastic depth with 0.8 survival probability

      • 4 stages (87 epochs per stage): early stages use weak regularization, later stages stronger

      • maximum image size for training is about 20% smaller than inference & no further finetuning