preview
Motivation
- limited computational budget on target devices
- model compression / use small models directly
Depthwise Separable Convolution
- splits a standard convolution into two operations: a depthwise convolution and a pointwise convolution
- standard convolution: parameter count k*k*input_channel*output_channel
- depthwise convolution: applies a separate kernel to each input channel; parameter count k*k*input_channel
- pointwise convolution: an ordinary convolution with a 1x1 kernel; parameter count 1*1*input_channel*output_channel
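A minimal PyTorch sketch of the factorization (layer sizes are illustrative; this is not the paper's code):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Standard conv (k*k*C_in*C_out params) factorized into
    depthwise (k*k*C_in) + pointwise (1*1*C_in*C_out)."""
    def __init__(self, c_in, c_out, k=3, stride=1):
        super().__init__()
        # depthwise: one k x k filter per input channel (groups=c_in)
        self.dw = nn.Conv2d(c_in, c_in, k, stride, padding=k // 2,
                            groups=c_in, bias=False)
        self.bn1 = nn.BatchNorm2d(c_in)
        # pointwise: 1x1 conv that mixes channels
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.dw(x)))
        return self.relu(self.bn2(self.pw(x)))

# parameter check: 3*3*32 + 1*1*32*64 = 288 + 2048 = 2336 (conv weights only)
block = DepthwiseSeparableConv(32, 64)
print(sum(p.numel() for p in block.parameters() if p.dim() == 4))  # 2336
```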
with BN and ReLU: both the depthwise and pointwise layers are followed by BN and ReLU
DW has no ability to change the channel count, so if the input has few channels, DW can only extract features in a low-dimensional space. V2 therefore first expands the input with a non-linear PW convolution, applies DW, and then projects back down with a second PW. Notably, the second PW uses no non-linear activation, because the authors argue that ReLU applied in a low-dimensional space causes information loss.
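A hypothetical PyTorch sketch of this V2 block, following the description above (non-linear PW expansion, then DW, then a linear PW projection; residual only when shapes match):

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, c_in, c_out, stride=1, expand=6):
        super().__init__()
        hidden = c_in * expand
        self.use_res = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            # 1) expansion PW: lift to a higher-dimensional space
            nn.Conv2d(c_in, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 2) DW in the expanded space
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 3) linear PW projection: no activation, to avoid
            #    ReLU information loss in the low-dimensional space
            nn.Conv2d(hidden, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out
```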
Further reducing computation
- channel reduction: width multiplier alpha
- resolution reduction: resolution multiplier rho
papers
- [V1 CVPR2017] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, Google; main contribution: depthwise separable convolution
- [V2 CVPR2018] MobileNetV2: Inverted Residuals and Linear Bottlenecks, Google; main contribution: inverted residual with linear bottleneck
- [V3 ICCV2019] Searching for MobileNetV3, Google; the block gains an SE module (inverted-res-block + SE-block), and the architecture is found via NAS rather than designed by hand
- [EfficientNet-lite ICML2019] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks; the scale-down branch of the EfficientNet family; the original EfficientNet is built on the MobileNetV3-style basic block, while the Lite version makes several mobile-oriented changes: removing SE, switching to ReLU6, and fixing the stem/head during scaling
- [MobileOne 2022] An Improved One millisecond Mobile Backbone, Apple; main contribution: train-time over-parameterized branches re-parameterized away at inference
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
- Motivation
- efficient models: use depthwise separable convolutions and two simple global hyper-parameters
- resource and accuracy tradeoffs
- a class of network architectures that allows a model developer to specifically choose a small network that matches the resource restrictions (latency, size) for their application
Arguments
- the general trend has been to make deeper and more complicated networks in order to achieve higher accuracy
- not efficient on computationally limited platforms
- building small and efficient neural networks:either compressing pretrained networks or training small networks directly
- Many papers on small networks focus only on size but do not consider speed
- speed and size are not fully equivalent
- size: depthwise separable convolutions, bottleneck approaches, compressing pretrained networks, distillation
Methods
depthwise separable convolutions
- a form of factorized convolutions: a standard conv splits into 2 layers
- factorize the filtering and combination steps of standard conv
- drastically reduces computation and model size, to a fraction $\frac{1}{N} + \frac{1}{D_K^2}$ of a standard convolution ($N$ output channels, $D_K \times D_K$ kernels)
- use both batchnorm and ReLU nonlinearities for both layers
MobileNet uses 3 × 3 depthwise separable convolutions, which use between 8 and 9 times less computation than standard convolutions
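A quick sanity check of the 8-9x claim for 3x3 kernels (illustrative Python, not from the paper):

```python
def separable_cost_ratio(n_out, k=3):
    """Cost(depthwise separable) / Cost(standard conv) = 1/N + 1/Dk^2."""
    return 1 / n_out + 1 / k**2

print(1 / separable_cost_ratio(256))  # ~8.7x fewer mult-adds for N=256, 3x3 kernels
```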
MobileNet
- the first layer is a full convolution, the rest depthwise separable convolutions
- down sampling is handled with strided convolution
- all layers are followed by a BN and ReLU nonlinearity
- a final average pooling reduces the spatial resolution to 1 before the fully connected layer.
- the final fully connected layer has no nonlinearity and feeds into a softmax layer for classification
training models with so few parameters
- RMSprop
- less regularization and data augmentation techniques because small models have less trouble with overfitting
- it was important to put very little or no weight decay (l2 regularization)
- do not use side heads or label smoothing or image distortions
Width Multiplier: Thinner Models
- thin a network uniformly at each layer
- the input channels $M$ and output channels $N$ become $\alpha M$ and $\alpha N$
- $\alpha = 1$: baseline MobileNet; $\alpha < 1$: reduced MobileNet
- reduces parameters and computational cost roughly by $\alpha^2$
Resolution Multiplier: Reduced Representation
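Combining the two multipliers, the V1 paper writes the cost of one depthwise separable layer (input resolution $D_F$, kernel $D_K$, channels $M \to N$) as:

$$D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F$$

so computation drops roughly by $\alpha^2 \rho^2$ overall.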
Conclusions
- using depthwise separable convolutions instead of full convolutions reduces ImageNet accuracy by only 1%, while saving tremendously on mult-adds and parameters
- at similar computation and parameter counts, making MobileNets thinner is about 3% better than making them shallower
- the two hyper-parameters provide smooth accuracy/efficiency trade-offs
MobileNetV2: Inverted Residuals and Linear Bottlenecks
Motivation
- a new mobile architecture
- based on an inverted residual structure
- remove non-linearities in the narrow layers in order to maintain representational power
- validated on multiple tasks
- object detection:SSDLite
- semantic segmentation:Mobile DeepLabv3
Methods
Depthwise Separable Convolutions
- replace a full convolutional operator with a factorized version
- depthwise convolution, it performs lightweight filtering per input channel
- pointwise convolution, computing linear combinations of the input channels
Linear Bottlenecks
Inverted residuals
- bottlenecks actually contain all the necessary information
- expansion layer acts merely as an implementation detail that accompanies a non-linear transformation
- parameter count per block (in the paper's notation): $t\,d'\,(d' + k^2 + d'')$ for input channels $d'$, expansion factor $t$, kernel size $k$, output channels $d''$; the mult-add count multiplies this by the spatial size $h \times w$
- basic building block is a bottleneck depth-separable convolution with residuals
* interpretation
* provides a natural separation between the input/output
* expansion:capacity
* layer transformation:expressiveness
* MobileNetV2 model architecture
* initial filters: 32
* ReLU6: use ReLU6 as the non-linearity because of its robustness when used with low-precision computation
* use a constant expansion rate between 5 and 10 throughout the network (except the 1st bottleneck): smaller networks do slightly better with smaller rates, larger networks with larger ones
<img src="MobileNets/MobileNetV2.png" width="40%" />
* comparison with other architectures
<img src="MobileNets/cmpV2.png" width="40%" />
Experiments
Object Detection
- evaluate the performance as feature extractors
- replace all the regular convolutions with separable convolutions in the SSD prediction layers: the backbone is untouched; only the head convolutions are swapped, to reduce computation
- achieves competitive accuracy with significantly fewer parameters and smaller computational complexity
Semantic Segmentation
- build DeepLabv3 heads on top of the second last feature map of MobileNetV2
- DeepLabv3 heads are computationally expensive and removing the ASPP module significantly reduces the MAdds
ablation
Searching for MobileNetV3
Motivation
- automated search algorithms and network design work together
- classification & detection & segmentation
- a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP)
- new efficient versions of nonlinearities
Arguments
- reducing
- the number of parameters
- the number of operations (MAdds)
- inference latency
- related work
- SqueezeNet: 1x1 convolutions
- MobileNetV1: separable convolution
- MobileNetV2: inverted residuals
- ShuffleNet: group convolutions
- CondenseNet: group convolutions
- ShiftNet: shift operation
- MnasNet: MobileNetV2 + SE-block; attention modules are placed after the depthwise filters in the expansion
Methods
base blocks
- combination of ideas from [MobileNetV1, MobileNetV2, MnasNet]
- inverted-res-block + SE-block
- modified swish nonlinearity (h-swish)
- hard sigmoid (used as the SE gate); see the sketch below
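The V3 paper defines these with ReLU6 so they stay cheap and quantization-friendly; a small PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def hard_sigmoid(x):
    # piecewise-linear replacement for sigmoid: ReLU6(x + 3) / 6
    return F.relu6(x + 3.0) / 6.0

def hard_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6, approximates x * sigmoid(x)
    return x * hard_sigmoid(x)

x = torch.linspace(-5, 5, 5)
print(hard_swish(x))
```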
Network Search
- use platform-aware NAS to search for the global network structures
- use the NetAdapt algorithm to search per layer for the number of filters
Network Improvements
redesign the computationally expensive layers at the beginning and the end of the network
- the last block of MobileNetV2's inverted bottleneck structure is among the most expensive layers
- move this layer past the final average pooling, so it operates on a 1x1 feature map instead of 7x7: an indirect route that keeps the expressiveness at a fraction of the cost
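Rough arithmetic for this move (a sketch assuming V3-Large's 160-to-960 final expansion; numbers are illustrative):

```python
# cost of the final 1x1 expansion conv, before vs after moving it past avg-pool
c_in, c_out = 160, 960
madds_before = 7 * 7 * c_in * c_out  # applied on the 7x7 feature map: ~7.5M
madds_after = 1 * 1 * c_in * c_out   # applied on the pooled 1x1 map: ~0.15M
print(madds_before / madds_after)    # 49x fewer mult-adds for this layer
```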
a new nonlinearity, h-swish
the initial set of filters is also expensive: networks usually start with 32 filters in a full 3x3 convolution to build initial filter banks for edge detection
reduce the number of filters to 16 and use the hard-swish nonlinearity
most of the benefit of swish is realized by using it only in the deeper layers (only the second half of the network)
SE-block
- reduction ratio: fixed to 1/4 of the number of channels in the expansion layer for all blocks
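A minimal SE-block sketch with this fixed 1/4 reduction and the hard-sigmoid gate V3 uses (hypothetical PyTorch, not the official code):

```python
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    def __init__(self, channels, ratio=4):  # reduction fixed to 1/4
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // ratio, 1)
        self.fc2 = nn.Conv2d(channels // ratio, channels, 1)

    def forward(self, x):
        s = F.adaptive_avg_pool2d(x, 1)          # squeeze: global context
        s = F.relu(self.fc1(s))                  # excitation MLP
        s = F.relu6(self.fc2(s) + 3.0) / 6.0     # hard-sigmoid gate (V3)
        return x * s                             # channel-wise re-weighting
```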
MobileNetV3 architecture
Experiments
Detection
- use MobileNetV3 as a replacement for the backbone feature extractor in SSDLite: this time the backbone itself is swapped
- reduce the channel counts of the C4 and C5 blocks: MobileNetV3 was designed to output 1000 classes, so transferred to the 90-class COCO dataset some of that capacity is redundant
Segmentation
EfficientNet-lite
no dedicated paper
model zoo
| model | width | depth | resolution | dropout rate |
| --- | --- | --- | --- | --- |
| efficientnet-lite0 | 1.0 | 1.0 | 224 | 0.2 |
| efficientnet-lite1 | 1.0 | 1.1 | 240 | 0.2 |
| efficientnet-lite2 | 1.1 | 1.2 | 260 | 0.3 |
| efficientnet-lite3 | 1.2 | 1.4 | 280 | 0.3 |
| efficientnet-lite4 | 1.4 | 1.8 | 300 | 0.3 |
<img src="MobileNets/lite-size.png" width="40%;" />
challenges
* Quantization: mobile devices have limited floating-point support; use post-training quantization to convert the float TF model into a TFLite model (full-integer int8 or half-precision float16)
* Heterogeneous hardware: mobile devices vary widely, and many ops are unsupported; replace them with natively supported low-level ops wherever possible
modifications
* Removed squeeze-and-excitation networks: SE is not well supported on mobile hardware
* Replaced swish with ReLU6: friendlier to post-training quantization (see the sketch after this list)
* when first converting float to integer, a huge accuracy drop was observed: 75 -> 48
* the cause: swish activations are too wide-ranged, so mapping them directly to int8 loses too much precision
* hence the switch to ReLU6, whose activation range is bounded to [0, 6]
<img src="MobileNets/int8.png" width="40%;" /><img src="MobileNets/quantization.png" width="40%;" />
* Fixed the stem and head while scaling models up: the stem sees the largest resolution and the head has the most channels, so scaling them up strongly affects parameter and compute counts; only the middle stages are scaled
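A hedged sketch of the post-training full-integer quantization flow mentioned above, using the standard TF Lite converter API (keras_model and representative_data are placeholders):

```python
import tensorflow as tf

def quantize_int8(keras_model, representative_data):
    """Post-training full-integer quantization to a .tflite flatbuffer."""
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    # calibration samples used to estimate activation ranges
    def rep_dataset():
        for sample in representative_data:
            yield [sample]
    converter.representative_dataset = rep_dataset

    # force every op onto an int8 kernel; conversion fails on unsupported ops
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()
```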
MobileOne: An Improved One millisecond Mobile Backbone
Motivation
Arguments
- previous methods
- most focus on optimizing FLOPs
- and they introduce new architecture designs & custom layers, such as hard-swish, which are usually not natively supported on mobile devices
- existing metrics vs. latency
- FLOPs does not account for memory cost and degree of parallelism
- sharing parameters leads to higher FLOPs but smaller model size
- skip-connections / branching incur memory costs
- architecture optimization
- identify the factors that actually limit on-device latency
- training optimization
- training a small model directly certainly hurts accuracy; the usual remedy is decoupling the train-time and test-time architectures
- further, regularization is relaxed during training
- the proposed MobileOne
- use basic operators
- introduces linear branches which get re-parameterized at inference time: the difference from prior methods is the over-parameterization branches (RepVGG collapses a regular res-block into a single linear op; here each layer has k branches that are all merged, presumably hence the name "over"-parameterization); see the fusion sketch after this list
- inference time model has only feed-forward structure
- generalizes well to other tasks: outperforms on classification, detection, and segmentation
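A minimal sketch of the train-time/test-time decoupling: fold each branch's BN into its conv, then sum the branch kernels into a single conv (RepVGG-style fusion; MobileOne applies this with k over-parameterized branches per layer; hypothetical code, not the official implementation):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv, bn):
    """Fold a BatchNorm into the preceding (bias-free) conv; returns (weight, bias)."""
    scale = bn.weight / (bn.running_var + bn.eps).sqrt()
    return conv.weight * scale.reshape(-1, 1, 1, 1), bn.bias - bn.running_mean * scale

def reparameterize(branches):
    """Sum k parallel conv+BN branches (same kernel shape) into one conv's params."""
    fused = [fuse_conv_bn(conv, bn) for conv, bn in branches]
    weight = torch.stack([w for w, _ in fused]).sum(0)
    bias = torch.stack([b for _, b in fused]).sum(0)
    return weight, bias

# k over-parameterized 3x3 depthwise branches collapse into one conv at inference
branches = [(nn.Conv2d(32, 32, 3, padding=1, groups=32, bias=False).eval(),
             nn.BatchNorm2d(32).eval()) for _ in range(3)]
w, b = reparameterize(branches)
print(w.shape, b.shape)  # torch.Size([32, 1, 3, 3]) torch.Size([32])
```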
Methods
Metric Correlations
Key Bottlenecks
MobileOne Architecture
MobileOne Block
Model Scaling
Training