preview
Motivation
- limited computational budget on target devices
- model compression / use small models directly
Depthwise Separable Convolution
- splits a standard convolution into two operations: a depthwise convolution and a pointwise convolution
- standard convolution: parameter count k*k*input_channel*output_channel
- depthwise convolution: applies a separate kernel to each input channel; parameter count k*k*input_channel
- pointwise convolution: an ordinary convolution with a 1x1 kernel; parameter count 1*1*input_channel*output_channel
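A minimal PyTorch sketch of the factorization (layer sizes are illustrative; this is not the paper's code):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Standard conv (k*k*C_in*C_out params) factorized into
    depthwise (k*k*C_in) + pointwise (1*1*C_in*C_out)."""
    def __init__(self, c_in, c_out, k=3, stride=1):
        super().__init__()
        # depthwise: one k x k filter per input channel (groups=c_in)
        self.dw = nn.Conv2d(c_in, c_in, k, stride, padding=k // 2,
                            groups=c_in, bias=False)
        self.bn1 = nn.BatchNorm2d(c_in)
        # pointwise: 1x1 conv that mixes channels
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.dw(x)))
        return self.relu(self.bn2(self.pw(x)))

# parameter check: 3*3*32 + 1*1*32*64 = 288 + 2048 = 2336 (conv weights only)
block = DepthwiseSeparableConv(32, 64)
print(sum(p.numel() for p in block.parameters() if p.dim() == 4))  # 2336
```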
with BN and ReLU: both the depthwise and pointwise layers are followed by BN and ReLU
DW has no ability to change the channel count, so if the input has few channels, DW can only extract features in a low-dimensional space. V2 therefore first expands the input with a non-linear PW convolution, applies DW, and then projects back down with a second PW. Notably, the second PW uses no non-linear activation, because the authors argue that ReLU applied in a low-dimensional space causes information loss.
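A hypothetical PyTorch sketch of this V2 block, following the description above (non-linear PW expansion, then DW, then a linear PW projection; residual only when shapes match):

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, c_in, c_out, stride=1, expand=6):
        super().__init__()
        hidden = c_in * expand
        self.use_res = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            # 1) expansion PW: lift to a higher-dimensional space
            nn.Conv2d(c_in, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 2) DW in the expanded space
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 3) linear PW projection: no activation, to avoid
            #    ReLU information loss in the low-dimensional space
            nn.Conv2d(hidden, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out
```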
Further reducing computation
- channel reduction: width multiplier alpha
- resolution reduction: resolution multiplier rho
papers
- [V1 CVPR2017] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, Google; main contribution: depthwise separable convolution
- [V2 CVPR2018] MobileNetV2: Inverted Residuals and Linear Bottlenecks, Google; main contribution: inverted residual with linear bottleneck
- [V3 ICCV2019] Searching for MobileNetV3, Google; the block gains an SE module (inverted-res-block + SE-block), and the architecture is found via NAS rather than designed by hand
- [EfficientNet-lite ICML2019] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks; the scale-down branch of the EfficientNet family; the original EfficientNet is built on the MobileNetV3-style basic block, while the Lite version makes several mobile-oriented changes: removing SE, switching to ReLU6, and fixing the stem/head during scaling
- [MobileOne 2022] An Improved One millisecond Mobile Backbone, Apple; main contribution: train-time over-parameterized branches re-parameterized away at inference
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
- Motivation
- efficient models: use depthwise separable convolutions and two simple global hyper-parameters
- resource and accuracy tradeoffs
- a class of network architectures that allows a model developer to specifically choose a small network that matches the resource restrictions (latency, size) for their application
Arguments
- the general trend has been to make deeper and more complicated networks in order to achieve higher accuracy
- not efficient on computationally limited platforms
- building small and efficient neural networks:either compressing pretrained networks or training small networks directly
- Many papers on small networks focus only on size but do not consider speed
- speed and size are not fully equivalent
- size: depthwise separable convolutions, bottleneck approaches, compressing pretrained networks, distillation
Methods
depthwise separable convolutions
- a form of factorized convolutions: a standard conv splits into 2 layers
- factorize the filtering and combination steps of standard conv
- drastically reduces computation and model size, to a fraction $\frac{1}{N} + \frac{1}{D_K^2}$ of a standard convolution ($N$ output channels, $D_K \times D_K$ kernels)
- use both batchnorm and ReLU nonlinearities for both layers
MobileNet uses 3 × 3 depthwise separable convolutions, which use between 8 and 9 times less computation than standard convolutions
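A quick sanity check of the 8-9x claim for 3x3 kernels (illustrative Python, not from the paper):

```python
def separable_cost_ratio(n_out, k=3):
    """Cost(depthwise separable) / Cost(standard conv) = 1/N + 1/Dk^2."""
    return 1 / n_out + 1 / k**2

print(1 / separable_cost_ratio(256))  # ~8.7x fewer mult-adds for N=256, 3x3 kernels
```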
MobileNet
- the first layer is a full convolution, the rest depthwise separable convolutions
- down sampling is handled with strided convolution
- all layers are followed by a BN and ReLU nonlinearity
- a final average pooling reduces the spatial resolution to 1 before the fully connected layer.
- the final fully connected layer has no nonlinearity and feeds into a softmax layer for classification
training models with so few parameters
- RMSprop
- less regularization and data augmentation techniques because small models have less trouble with overfitting
- it was important to put very little or no weight decay (l2 regularization)
- do not use side heads or label smoothing or image distortions
Width Multiplier: Thinner Models
- thin a network uniformly at each layer
- the input channels $M$ and output channels $N$ become $\alpha M$ and $\alpha N$
- $\alpha = 1$: baseline MobileNet; $\alpha < 1$: reduced MobileNet
- reduces parameters and computational cost roughly by $\alpha^2$
Resolution Multiplier: Reduced Representation
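Combining the two multipliers, the V1 paper writes the cost of one depthwise separable layer (input resolution $D_F$, kernel $D_K$, channels $M \to N$) as:

$$D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F$$

so computation drops roughly by $\alpha^2 \rho^2$ overall.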
Conclusions
- using depthwise separable convolutions instead of full convolutions reduces ImageNet accuracy by only 1%, while saving tremendously on mult-adds and parameters
- at similar computation and parameter counts, making MobileNets thinner is about 3% better than making them shallower
- the two hyper-parameters provide smooth accuracy/efficiency trade-offs
MobileNetV2: Inverted Residuals and Linear Bottlenecks
Motivation
- a new mobile architecture
- based on an inverted residual structure
- remove non-linearities in the narrow layers in order to maintain representational power
- validated on multiple tasks
- object detection:SSDLite
- semantic segmentation:Mobile DeepLabv3
Methods
Depthwise Separable Convolutions
- replace a full convolutional operator with a factorized version
- depthwise convolution, it performs lightweight filtering per input channel
- pointwise convolution, computing linear combinations of the input channels
Linear Bottlenecks
Inverted residuals
- bottlenecks actually contain all the necessary information
- expansion layer acts merely as an implementation detail that accompanies a non-linear transformation
- parameter count per block (in the paper's notation): $t\,d'\,(d' + k^2 + d'')$ for input channels $d'$, expansion factor $t$, kernel size $k$, output channels $d''$; the mult-add count multiplies this by the spatial size $h \times w$
- basic building block is a bottleneck depth-separable convolution with residuals
* interpretation
* provides a natural separation between the input/output
* expansion:capacity
* layer transformation:expressiveness
* MobileNetV2 model architecture
* initial filters: 32
* ReLU6: use ReLU6 as the non-linearity because of its robustness when used with low-precision computation
* use a constant expansion rate between 5 and 10 throughout the network (except the 1st bottleneck): smaller networks do slightly better with smaller rates, larger networks with larger ones
<img src="MobileNets/MobileNetV2.png" width="40%" />
* comparison with other architectures
<img src="MobileNets/cmpV2.png" width="40%" />
Experiments
Object Detection
- evaluate the performance as feature extractors
- replace all the regular convolutions with separable convolutions in the SSD prediction layers: the backbone is untouched; only the head convolutions are swapped, to reduce computation
- achieves competitive accuracy with significantly fewer parameters and smaller computational complexity
Semantic Segmentation
- build DeepLabv3 heads on top of the second last feature map of MobileNetV2
- DeepLabv3 heads are computationally expensive and removing the ASPP module significantly reduces the MAdds
ablation
Searching for MobileNetV3
Motivation
- automated search algorithms and network design work together
- classification & detection & segmentation
- a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP)
- new efficient versions of nonlinearities
Arguments
- reducing
- the number of parameters
- the number of operations (MAdds)
- inference latency
- related work
- SqueezeNet: 1x1 convolutions
- MobileNetV1: separable convolution
- MobileNetV2: inverted residuals
- ShuffleNet: group convolutions
- CondenseNet: group convolutions
- ShiftNet: shift operation
- MnasNet: MobileNetV2 + SE-block; attention modules are placed after the depthwise filters in the expansion
Methods
base blocks
- combination of ideas from [MobileNetV1, MobileNetV2, MnasNet]
- inverted-res-block + SE-block
- modified swish nonlinearity (h-swish)
- hard sigmoid (used as the SE gate); see the sketch below
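The V3 paper defines these with ReLU6 so they stay cheap and quantization-friendly; a small PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def hard_sigmoid(x):
    # piecewise-linear replacement for sigmoid: ReLU6(x + 3) / 6
    return F.relu6(x + 3.0) / 6.0

def hard_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6, approximates x * sigmoid(x)
    return x * hard_sigmoid(x)

x = torch.linspace(-5, 5, 5)
print(hard_swish(x))
```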
Network Search
- use platform-aware NAS to search for the global network structures
- use the NetAdapt algorithm to search per layer for the number of filters
Network Improvements
redesign the computationally expensive layers at the beginning and the end of the network
- the last block of MobileNetV2's inverted bottleneck structure is among the most expensive layers
- move this layer past the final average pooling, so it operates on a 1x1 feature map instead of 7x7: an indirect route that keeps the expressiveness at a fraction of the cost
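Rough arithmetic for this move (a sketch assuming V3-Large's 160-to-960 final expansion; numbers are illustrative):

```python
# cost of the final 1x1 expansion conv, before vs after moving it past avg-pool
c_in, c_out = 160, 960
madds_before = 7 * 7 * c_in * c_out  # applied on the 7x7 feature map: ~7.5M
madds_after = 1 * 1 * c_in * c_out   # applied on the pooled 1x1 map: ~0.15M
print(madds_before / madds_after)    # 49x fewer mult-adds for this layer
```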
a new nonlinearity, h-swish
the initial set of filters is also expensive: networks usually start with 32 filters in a full 3x3 convolution to build initial filter banks for edge detection
reduce the number of filters to 16 and use the hard-swish nonlinearity
most of the benefit of swish is realized by using it only in the deeper layers (only the second half of the network)
SE-block
- reduction ratio: fixed to 1/4 of the number of channels in the expansion layer for all blocks
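A minimal SE-block sketch with this fixed 1/4 reduction and the hard-sigmoid gate V3 uses (hypothetical PyTorch, not the official code):

```python
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    def __init__(self, channels, ratio=4):  # reduction fixed to 1/4
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // ratio, 1)
        self.fc2 = nn.Conv2d(channels // ratio, channels, 1)

    def forward(self, x):
        s = F.adaptive_avg_pool2d(x, 1)          # squeeze: global context
        s = F.relu(self.fc1(s))                  # excitation MLP
        s = F.relu6(self.fc2(s) + 3.0) / 6.0     # hard-sigmoid gate (V3)
        return x * s                             # channel-wise re-weighting
```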
MobileNetV3 architecture
Experiments
Detection
- use MobileNetV3 as a replacement for the backbone feature extractor in SSDLite: this time the backbone itself is swapped
- reduce the channel counts of the C4 and C5 blocks: MobileNetV3 was designed to output 1000 classes, so transferred to the 90-class COCO dataset some of that capacity is redundant
Segmentation
EfficientNet-lite
no dedicated paper
model zoo
| model | width | depth | resolution | dropout rate |
| --- | --- | --- | --- | --- |
| efficientnet-lite0 | 1.0 | 1.0 | 224 | 0.2 |
| efficientnet-lite1 | 1.0 | 1.1 | 240 | 0.2 |
| efficientnet-lite2 | 1.1 | 1.2 | 260 | 0.3 |
| efficientnet-lite3 | 1.2 | 1.4 | 280 | 0.3 |
| efficientnet-lite4 | 1.4 | 1.8 | 300 | 0.3 |
<img src="MobileNets/lite-size.png" width="40%;" />
challenges
* Quantization: mobile devices have limited floating-point support; use post-training quantization to convert the float TF model into a TFLite model (full-integer int8 or half-precision float16)
* Heterogeneous hardware: mobile devices vary widely, and many ops are unsupported; replace them with natively supported low-level ops wherever possible
modifications
* Removed squeeze-and-excitation networks: SE is not well supported on mobile hardware
* Replaced swish with ReLU6: friendlier to post-training quantization (see the sketch after this list)
* when first converting float to integer, a huge accuracy drop was observed: 75 -> 48
* the cause: swish activations are too wide-ranged, so mapping them directly to int8 loses too much precision
* hence the switch to ReLU6, whose activation range is bounded to [0, 6]
<img src="MobileNets/int8.png" width="40%;" /><img src="MobileNets/quantization.png" width="40%;" />
* Fixed the stem and head while scaling models up: the stem sees the largest resolution and the head has the most channels, so scaling them up strongly affects parameter and compute counts; only the middle stages are scaled
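A hedged sketch of the post-training full-integer quantization flow mentioned above, using the standard TF Lite converter API (keras_model and representative_data are placeholders):

```python
import tensorflow as tf

def quantize_int8(keras_model, representative_data):
    """Post-training full-integer quantization to a .tflite flatbuffer."""
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    # calibration samples used to estimate activation ranges
    def rep_dataset():
        for sample in representative_data:
            yield [sample]
    converter.representative_dataset = rep_dataset

    # force every op onto an int8 kernel; conversion fails on unsupported ops
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()
```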
MobileOne: An Improved One millisecond Mobile Backbone
Motivation
Arguments
- previous methods
- most focus on optimizing FLOPs
- and they introduce new architecture designs & custom layers, such as hard-swish, which are usually not natively supported on mobile devices
- existing metrics vs. latency
- FLOPs does not account for memory cost and degree of parallelism
- sharing parameters leads to higher FLOPs but smaller model size
- skip-connections / branching incur memory costs
- architecture optimization
- identify the factors that actually limit on-device latency
- training optimization
- training a small model directly certainly hurts accuracy; the usual remedy is decoupling the train-time and test-time architectures
- further, regularization is relaxed during training
- the proposed MobileOne
- use basic operators
- introduces linear branches which get re-parameterized at inference time: the difference from prior methods is the over-parameterization branches (RepVGG collapses a regular res-block into a single linear op; here each layer has k branches that are all merged, presumably hence the name "over"-parameterization); see the fusion sketch after this list
- inference time model has only feed-forward structure
- generalizes well to other tasks: outperforms on classification, detection, and segmentation
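A minimal sketch of the train-time/test-time decoupling: fold each branch's BN into its conv, then sum the branch kernels into a single conv (RepVGG-style fusion; MobileOne applies this with k over-parameterized branches per layer; hypothetical code, not the official implementation):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv, bn):
    """Fold a BatchNorm into the preceding (bias-free) conv; returns (weight, bias)."""
    scale = bn.weight / (bn.running_var + bn.eps).sqrt()
    return conv.weight * scale.reshape(-1, 1, 1, 1), bn.bias - bn.running_mean * scale

def reparameterize(branches):
    """Sum k parallel conv+BN branches (same kernel shape) into one conv's params."""
    fused = [fuse_conv_bn(conv, bn) for conv, bn in branches]
    weight = torch.stack([w for w, _ in fused]).sum(0)
    bias = torch.stack([b for _, b in fused]).sum(0)
    return weight, bias

# k over-parameterized 3x3 depthwise branches collapse into one conv at inference
branches = [(nn.Conv2d(32, 32, 3, padding=1, groups=32, bias=False).eval(),
             nn.BatchNorm2d(32).eval()) for _ in range(3)]
w, b = reparameterize(branches)
print(w.shape, b.shape)  # torch.Size([32, 1, 3, 3]) torch.Size([32])
```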
Methods
Metric Correlations
Key Bottlenecks
MobileOne Architecture
MobileOne Block
Model Scaling
Training