MobileNets

preview

  1. Motivation

    • limited compute on mobile / embedded devices
    • either compress a large model or use a small model directly
  2. Depthwise Separable Convolution

    • splits a standard convolution into two operations: a depthwise convolution and a pointwise convolution
    • standard convolution: parameter count k*k*input_channel*output_channel
    • depthwise convolution: applies a separate k*k filter to each input channel; parameter count k*k*input_channel
    • pointwise convolution: an ordinary convolution with a 1x1 kernel; parameter count 1*1*input_channel*output_channel
    • each of the two layers is followed by BN and ReLU (see the sketch after this list)

    • A depthwise convolution cannot change the number of channels, so when the input has few channels it can only extract features in a low-dimensional space. V2 therefore first expands the input with a non-linear pointwise convolution, then applies the depthwise convolution, and finally projects back down with another pointwise convolution. Notably, this second pointwise convolution has no non-linear activation, because the authors argue that ReLU applied in a low-dimensional space destroys information.
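
A minimal PyTorch sketch of the depthwise separable block described above (layer sizes are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv + pointwise 1x1 conv, each followed by BN and ReLU."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # depthwise: one 3x3 filter per input channel (groups=in_ch), k*k*in_ch weights
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                                   groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        # pointwise: 1x1 conv mixes channels, 1*1*in_ch*out_ch weights
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

With in_ch=32 and out_ch=64 the conv weights are 3*3*32 + 32*64 = 2336, versus 3*3*32*64 = 18432 for a standard 3x3 convolution.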

  3. Further reducing computation

    • channel reduction: width multiplier alpha
    • resolution reduction: resolution multiplier rho
  4. papers

    • [V1 2017] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, Google; main contribution: depthwise separable convolutions
    • [V2 CVPR2018] MobileNetV2: Inverted Residuals and Linear Bottlenecks, Google; main contribution: inverted residuals with linear bottlenecks
    • [V3 ICCV2019] Searching for MobileNetV3, Google; upgrades the block with SE (inverted-res-block + SE-block), and the architecture is found via NAS rather than designed by hand
    • [EfficientNet-lite ICML2019] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks; the scaled-down branch of the EfficientNet family. The original EfficientNet builds on the MobileNetV3 basic block, while the Lite version makes many mobile-specific changes: SE removed, ReLU6 instead of swish, etc.
    • [MobileOne 2022] MobileOne: An Improved One millisecond Mobile Backbone, Apple

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

  1. Motivation
    • efficient models: use depthwise separable convolutions and two simple global hyper-parameters
    • resource and accuracy tradeoffs
    • a class of network architectures that allows a model developer to specifically choose a small network that matches the resource restrictions (latency, size) for their application
  2. Arguments

    • the general trend has been to make deeper and more complicated networks in order to achieve higher accuracy
    • not efficient on computationally limited platforms
    • building small and efficient neural networks: either compressing pretrained networks or training small networks directly
    • many papers on small networks focus only on size but do not consider speed
    • speed & size are not fully equivalent
    • size-oriented approaches: depthwise separable convolutions, bottleneck approaches, compressing pretrained networks, distillation
  3. Method

    • depthwise separable convolutions

      • a form of factorized convolutions: a standard conv splits into 2 layers
      • factorize the filtering and combination steps of standard conv
      • drastically reduces computation and model size, to $\frac{1}{N} + \frac{1}{D_K^2}$ of a standard convolution ($N$ = output channels, $D_K$ = kernel size)
      • use both batchnorm and ReLU nonlinearities for both layers
      • MobileNet uses 3 × 3 depthwise separable convolutions, which need between 8 and 9 times less computation than standard convolutions
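
A quick back-of-the-envelope check of the $\frac{1}{N} + \frac{1}{D_K^2}$ ratio (plain Python; the layer sizes are made up for illustration):

```python
# mult-adds of a conv layer = kernel area * in_ch * out_ch * output spatial positions
k, M, N, Df = 3, 128, 128, 56                 # kernel size, in/out channels, feature map size
standard  = k*k * M * N * Df*Df               # standard convolution
separable = k*k * M * Df*Df + M * N * Df*Df   # depthwise + pointwise
print(separable / standard)                   # 0.119 == 1/N + 1/k^2, i.e. ~8.4x fewer mult-adds
```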

    • MobileNet

      • the first layer is a full convolution, the rest are depthwise separable convolutions
      • downsampling is handled with strided convolutions
      • all layers are followed by a BN and ReLU nonlinearity
      • a final average pooling reduces the spatial resolution to 1 before the fully connected layer.
      • the final fully connected layer has no nonlinearity and feeds into a softmax layer for classification
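
A truncated PyTorch sketch of how these pieces stack up (channel counts loosely follow the paper; the full model keeps adding blocks up to 1024 channels):

```python
import torch
import torch.nn as nn

def ds_block(cin, cout, stride=1):
    """Depthwise 3x3 + pointwise 1x1, each followed by BN and ReLU."""
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, stride, 1, groups=cin, bias=False),
        nn.BatchNorm2d(cin), nn.ReLU(inplace=True),
        nn.Conv2d(cin, cout, 1, bias=False),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

net = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),  # first layer: full 3x3 conv
    nn.BatchNorm2d(32), nn.ReLU(inplace=True),
    ds_block(32, 64),
    ds_block(64, 128, stride=2),      # downsampling via strided convolution
    # ... more ds_blocks, truncated here ...
    nn.AdaptiveAvgPool2d(1),          # global average pooling down to spatial size 1
    nn.Flatten(),
    nn.Linear(128, 1000))             # final FC, no nonlinearity; softmax lives in the loss

print(net(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```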

    • training with so few parameters

      • RMSprop
      • less regularization and data augmentation techniques because small models have less trouble with overfitting
      • it was important to put very little or no weight decay (l2 regularization)
      • do not use side heads or label smoothing or image distortions
    • Width Multiplier: Thinner Models

      • thin a network uniformly at each layer
      • the input channels $M$ and output channels $N$ become $\alpha M$ and $\alpha N$
      • $\alpha=1$: baseline MobileNet; $\alpha<1$: reduced MobileNet
      • reduces the computational cost and parameter count roughly by $\alpha^2$
    • Resolution Multiplier: Reduced Representation

      • applied to the input image (every layer's internal representation shrinks accordingly)
      • the input resolution of the network is typically 224, 192, 160 or 128
      • $\rho=1$: baseline MobileNet; $\rho<1$: reduced MobileNet
      • reduces the computational cost roughly by $\rho^2$ (the parameter count is unchanged)
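
Putting both multipliers into the mult-add count of one depthwise separable layer, $D_K^2 \cdot \alpha M \cdot \rho^2 D_F^2 + \alpha M \cdot \alpha N \cdot \rho^2 D_F^2$ (plain Python; the layer sizes are made up):

```python
def ds_conv_madds(k, M, N, Df, alpha=1.0, rho=1.0):
    """Mult-adds of a depthwise separable conv under width/resolution multipliers."""
    M, N, Df = int(alpha * M), int(alpha * N), int(rho * Df)
    return k*k * M * Df*Df + M * N * Df*Df   # depthwise + pointwise

base = ds_conv_madds(3, 128, 128, 56)
print(ds_conv_madds(3, 128, 128, 56, alpha=0.5) / base)  # ~0.27: roughly alpha^2
print(ds_conv_madds(3, 128, 128, 56, rho=0.5) / base)    # 0.25: exactly rho^2
```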

  4. Conclusions

    • using depthwise separable convolutions instead of full convolutions only reduces ImageNet accuracy by 1%, while saving tremendously on mult-adds and parameters

    • at similar computation and number of parameters, making MobileNets thinner is about 3% better than making them shallower

    • trade-offs based on the two hyper-parameters

MobileNetV2: Inverted Residuals and Linear Bottlenecks

  1. Motivation

    • a new mobile architecture
      • based on an inverted residual structure
      • remove non-linearities in the narrow layers in order to maintain representational power
    • proven on multiple tasks
      • object detection: SSDLite
      • semantic segmentation: Mobile DeepLabv3
  2. Method

    • Depthwise Separable Convolutions

      • replaces the full convolutional operator with a factorized version
      • depthwise convolution: performs lightweight filtering per input channel
      • pointwise convolution: computes linear combinations of the input channels
    • Linear Bottlenecks

      • ReLU results in information loss in lower dimension space
      • expansion ratio: with many channels, information destroyed by ReLU in one channel may still be preserved in the others
      • linear: the bottleneck (projection) layer carries no non-linear activation

    • Inverted residuals

      • bottlenecks actually contain all the necessary information
      • expansion layer acts merely as an implementation detail that accompanies a non-linear transformation

      • parameter count per block: $t \cdot d' \,(d' + k^2 + d'')$ for a $d'$-channel input, expansion ratio $t$, kernel size $k$ and $d''$ output channels (mult-adds are $h \cdot w$ times this)
      • basic building block is a bottleneck depth-separable convolution with residuals
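
A minimal PyTorch sketch of the inverted residual block with a linear bottleneck (expansion ratio t=6 as a typical default; layer sizes illustrative):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """expand (1x1, ReLU6) -> depthwise (3x3, ReLU6) -> project (1x1, linear)."""
    def __init__(self, cin, cout, stride=1, t=6):
        super().__init__()
        hidden = cin * t
        self.use_res = (stride == 1 and cin == cout)   # shortcut connects the bottlenecks
        self.block = nn.Sequential(
            nn.Conv2d(cin, hidden, 1, bias=False),                            # expansion PW
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),  # DW
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, cout, 1, bias=False),                           # linear projection PW
            nn.BatchNorm2d(cout))                                             # no ReLU here

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out

x = torch.randn(1, 32, 28, 28)
print(InvertedResidual(32, 32)(x).shape)  # torch.Size([1, 32, 28, 28])
```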

* interpretation 

    * provides a natural separation between the input/output domains of the blocks and the layer transformation
    * bottleneck (input/output): capacity
    * layer transformation (expansion): expressiveness

* MobileNetV2 model architecture

    * initial filters: 32
    * ReLU6: used as the non-linearity because of its robustness in low-precision computation
    * a constant expansion rate is used throughout (except the first bottleneck); rates between 5 and 10 work similarly, with smaller networks favoring slightly smaller rates and larger networks slightly larger ones

    <img src="MobileNets/MobileNetV2.png" width="40%" />

    * comparison with other architectures

    <img src="MobileNets/cmpV2.png" width="40%" />
  3. Experiments

    • Object Detection

      • evaluate the performance as feature extractors
      • replace all the regular convolutions with separable convolutions in the SSD prediction layers (SSDLite): the backbone is unchanged; only the head convolutions are swapped to cut computation
      • achieves competitive accuracy with significantly fewer parameters and smaller computational complexity
    • Semantic Segmentation

      • build DeepLabv3 heads on top of the second last feature map of MobileNetV2
      • DeepLabv3 heads are computationally expensive and removing the ASPP module significantly reduces the MAdds
    • ablation

      • inverted residual connections: shortcuts connecting the bottlenecks perform better than shortcuts connecting the expanded layers, i.e. the skip connection sits on the low-channel features
      • linear bottlenecks: linear bottlenecks improve performance, supporting the claim that non-linearity destroys information in low-dimensional space

Searching for MobileNetV3

  1. Motivation

    • automated search algorithms and network design work together
    • classification & detection & segmentation
    • a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP)
    • new efficient versions of nonlinearities
  2. Arguments

    • reducing
      • the number of parameters
      • the number of operations (MAdds)
      • inference latency
    • related work
      • SqueezeNet: 1x1 convolutions
      • MobileNetV1: separable convolutions
      • MobileNetV2: inverted residuals
      • ShuffleNet: group convolutions
      • CondenseNet: group convolutions
      • ShiftNet: shift operation
      • MnasNet: MobileNetV2 + SE-block; attention modules are placed after the depthwise filters in the expansion
  3. Method

    • base blocks

      • combination of ideas from [MobileNetV1, MobileNetV2, MnasNet]
      • inverted-res-block + SE-block
      • swish nonlinearity
      • hard sigmoid

    • Network Search

      • use platform-aware NAS to search for the global network structures
      • use the NetAdapt algorithm to search per layer for the number of filters
    • Network Improvements

      • redesign the computationally expensive layers at the beginning and the end of the network

        • the last block of MobileNetV2’s inverted bottleneck structure
        • move this layer past the final average pooling: it then operates on a 1x1 feature map instead of 7x7, keeping the high-dimensional features at a fraction of the cost
      • a new nonlinearity, h-swish

        • the initial set of filters is also expensive: networks usually start with 32 filters in a full 3x3 convolution to build initial filter banks for edge detection

        • reduce the number of filters to 16 and use the hard swish nonlinearity

        • most of the benefit of swish is realized by using it only in the deeper layers, so h-swish is applied only in the second half of the network
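
The hard non-linearities written out (PyTorch also ships them as nn.Hardsigmoid / nn.Hardswish):

```python
import torch
import torch.nn.functional as F

def hard_sigmoid(x):
    # piecewise-linear approximation of sigmoid: ReLU6(x + 3) / 6
    return F.relu6(x + 3.0) / 6.0

def hard_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6   (swish = x * sigmoid(x), with the hard sigmoid)
    return x * F.relu6(x + 3.0) / 6.0

print(hard_swish(torch.linspace(-4.0, 4.0, 9)))
```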

      • SE-block

        • ratio: fixed to 1/4 of the number of channels in the expansion layer for all blocks
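
A minimal sketch of the V3-style squeeze-and-excite module with the 1/4 reduction and a hard-sigmoid gate; it sits after the depthwise conv in the expanded representation (the 1x1-conv implementation detail is my choice):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel attention over the expanded (depthwise) features."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        squeezed = max(1, channels // reduction)       # 1/4 of the expansion channels
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.gate = nn.Sequential(
            nn.Conv2d(channels, squeezed, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(squeezed, channels, 1),
            nn.Hardsigmoid())                          # hard-sigmoid gate
    def forward(self, x):
        return x * self.gate(self.pool(x))             # excite: per-channel rescaling

x = torch.randn(1, 96, 14, 14)
print(SqueezeExcite(96)(x).shape)  # torch.Size([1, 96, 14, 14])
```
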
    • MobileNetV3 architecture

  4. Experiments

    • Detection

      • use MobileNetV3 as a replacement for the backbone feature extractor in SSDLite (here the backbone itself is swapped, not just the head)
      • reduce the channel counts of the C4 & C5 blocks: MobileNetV3 was originally sized for 1000 output classes, so those channels are somewhat redundant when transferring to the 90-class COCO dataset
    • Segmentation

      • as network backbone

      • compare two segmentation heads

        • R-ASPP: reduced design of the Atrous Spatial Pyramid Pooling module with only two branches
        • Lite R-ASPP: an SE-block-like design, with a large kernel and a large stride

EfficientNet-lite

  1. no dedicated paper

  2. model zoo

    | model | width | depth | resolution | dropout rate |
    | --- | --- | --- | --- | --- |
    | efficientnet-lite0 | 1.0 | 1.0 | 224 | 0.2 |
    | efficientnet-lite1 | 1.0 | 1.1 | 240 | 0.2 |
    | efficientnet-lite2 | 1.1 | 1.2 | 260 | 0.3 |
    | efficientnet-lite3 | 1.2 | 1.4 | 280 | 0.3 |
    | efficientnet-lite4 | 1.4 | 1.8 | 300 | 0.3 |

    • key numbers:

      • lite4 reaches 80.4% top-1 accuracy while still running in real time on a Pixel 4 CPU: 30ms/image
      • latency: 10-30ms
      • model size: 5M-15M

<img src="MobileNets/lite-size.png" width="40%;" />

3. challenges

    * Quantization: mobile devices have limited floating-point support, so use post-training quantization to convert the float TF model into a TFLite model (full-integer int8 or half-precision float16); see the sketch after this list
    * Heterogeneous hardware: mobile devices vary widely and many ops are unsupported, so replace them with ops the underlying hardware natively supports wherever possible
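
A minimal sketch of full-integer post-training quantization with the TFLite converter; `saved_model_dir` and the calibration generator are placeholders, not from the original notes:

```python
import tensorflow as tf

def representative_data():
    # placeholder calibration generator: in practice, yield ~100 real preprocessed batches
    for _ in range(100):
        yield [tf.random.uniform([1, 224, 224, 3])]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
# full-integer int8; for float16 use target_spec.supported_types = [tf.float16] instead
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```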

4. modifications

    * Removed squeeze-and-excitation blocks: SE is not well supported on mobile hardware

    * Replaced swish with ReLU6: friendlier to post-training quantization

        * when the float model was first converted to integer, a huge accuracy drop was observed: 75 -> 48
        * the cause: the float activations are too wide-ranged, so mapping them directly to int8 loses too much precision
        * hence the switch to ReLU6, whose activation range is bounded to [0, 6]

        <img src="MobileNets/int8.png" width="40%;" /><img src="MobileNets/quantization.png" width="40%;" />

    * Fixed the stem and head while scaling models up: the stem sees a large resolution and the head has many channels, so scaling either strongly inflates parameters/computation; only the middle stages are scaled up

MobileOne: An Improved One millisecond Mobile Backbone

  1. Motivation

    • metrics such as FLOPs & parameter count do not correlate directly with on-device latency
    • this paper

      • surveys the architectural and optimization bottlenecks of the various MobileNet-style models
      • proposes MobileOne
    • accuracy

      • under 1ms/image, 75.9% top-1 accuracy
      • the eff-lite0 above needs ~10ms at ~74% top-1, so MobileOne gains 2.3%
      • matches MobileFormer's accuracy while being 38x faster

  2. Arguments

    • previous methods
      • mostly focus on optimizing FLOPs
      • and often introduce new architecture designs & custom layers such as hard-swish, which are usually not natively supported on mobile devices
    • existing metrics vs. latency
      • FLOPs does not account for memory cost and degree of parallelism
      • sharing parameters leads to higher FLOPs but smaller model size
      • skip-connections / branching incur memory costs
    • architecture optimization
      • identify the factors that actually limit on-device latency
    • training optimization
      • training a small model directly gives poor accuracy, so the usual approach is decoupling the train-time & test-time architectures
      • further, regularization is relaxed over the course of training
    • the proposed MobileOne
      • use basic operators
      • introduces linear branches which get re-parameterized at inference time: the difference from previous methods is the over-parameterization branches (RepVGG folds a regular res-block into a single linear op; here each block has k branches that all get merged, hence perhaps the name "over"-parameterization?)
      • inference time model has only feed-forward structure
      • generalizes well to other tasks: outperforms on classification, detection and segmentation
  3. Method

    • Metric Correlations

      • parameter count and FLOPs

      • many models have a large parameter count yet lower latency: any point in the scatter plot that lies to the right but lower than another, e.g. points 3 (efficientnet-b0) & 2 (shufflenet-v2)

      • at similar FLOPs and parameter counts, convolutional models usually have lower latency than their transformer counterparts: the transformer models all sit in the lower-right corner of the plot

      • CPU correlation

        • mobile-device latency is moderately correlated with FLOPs and barely correlated with parameter count
        • desktop-CPU latency correlates even less
        • architecture matters more: SE-blocks and skip connections both have a sizable impact
    • Key Bottlenecks

      • Activation Functions

        • benchmark the same network architecture with different activation functions

        • the fancier the activation, the higher the latency, attributed to synchronization cost; a few activations do have specialized hardware-accelerated implementations

        • for generality, this paper uses plain ReLU
      • Architectural Blocks

        • the root factors are memory access cost & degree of parallelism
        • more branches mean higher memory access cost, because more intermediate activations have to be kept around and exchanged
        • operations that force synchronization, such as global pooling, also hurt the overall run-time
        • (see the figure above)
    • MobileOne Architecture

      • MobileOne Block

        • the training-time and inference-time structures differ
        • training time
          • the basic block is still MobileNetV1-style depthwise + pointwise
          • each 3x3 dconv and 1x1 pconv is turned into a multi-branch block
          • a block has k re-parameterizable branches; k is a hyper-parameter in the range 1-5
          • besides those there are two regular branches: a 1x1 dconv branch and a BN-only branch
        • inference time

          • only a single data stream
          • the merging follows RepVGG: all the linear computations can be folded together (see the sketch below)
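
A minimal sketch of the fold-at-inference idea, assuming k parallel conv+BN branches of the same shape (the BN-folding math is standard; MobileOne's actual blocks also fold 1x1 and identity branches, which is omitted here):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv, bn):
    """Fold an eval-mode BN into its conv: w' = w * g/sqrt(var+eps), b' = beta - mean * g/sqrt(var+eps)."""
    scale = bn.weight / (bn.running_var + bn.eps).sqrt()
    return conv.weight * scale.reshape(-1, 1, 1, 1), bn.bias - bn.running_mean * scale

def reparameterize(branches):
    """Sum the folded weights of k parallel conv+BN branches into a single conv's weight/bias."""
    ws, bs = zip(*(fuse_conv_bn(c, b) for c, b in branches))
    return sum(ws), sum(bs)

# two parallel 3x3 depthwise conv+BN branches over 8 channels (k = 2 for illustration)
branches = [(nn.Conv2d(8, 8, 3, padding=1, groups=8, bias=False), nn.BatchNorm2d(8).eval())
            for _ in range(2)]
fused = nn.Conv2d(8, 8, 3, padding=1, groups=8)

with torch.no_grad():
    w, b = reparameterize(branches)
    fused.weight.copy_(w)
    fused.bias.copy_(b)
    x = torch.randn(1, 8, 16, 16)
    multi = sum(bn(conv(x)) for conv, bn in branches)   # training-time multi-branch output
    print(torch.allclose(multi, fused(x), atol=1e-5))   # True: one conv reproduces it
```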

      • Model Scaling

        • five model sizes are provided; the scaling is mainly in channel width, the depth stays the same

    • Training

      • small models need less regularization, but weight decay is still important early in training
      • cosine schedule for both the LR & the weight decay (sketch below)
      • also uses the progressive learning from EfficientNetV2: start with an easier task and gradually increase the difficulty (resolution & data augmentation) and the regularization
      • plus EMA (exponential moving average of the weights) as a form of model ensembling
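
A minimal sketch of annealing both the learning rate and the weight-decay coefficient with a cosine schedule (the model, optimizer and schedule endpoints are placeholders, not the paper's settings):

```python
import math
import torch

model = torch.nn.Linear(10, 2)                    # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
total_steps, lr0, wd0 = 1000, 0.1, 1e-4

for step in range(total_steps):
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))   # decays 1 -> 0
    for group in opt.param_groups:
        group["lr"] = lr0 * cos                   # cosine-annealed learning rate
        group["weight_decay"] = wd0 * cos         # cosine-annealed (relaxed) weight decay
    loss = model(torch.randn(4, 10)).pow(2).mean()   # dummy batch / loss
    opt.zero_grad(); loss.backward(); opt.step()
```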