MobileNet

{{Short description|Family of computer vision models designed for efficient inference on mobile devices}}

{{Infobox software

| name = MobileNet

| developer = Google

| released = April 2017

| latest_release_version = v4

| latest_release_date = September 2024

| repo = {{URL|https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet}}

| programming_language = Python

| license = Apache License 2.0

}}

MobileNet is a family of convolutional neural network (CNN) architectures designed for image classification, object detection, and other computer vision tasks. The models are designed for small size, low latency, and low power consumption, making them suitable for on-device inference and edge computing on resource-constrained devices such as mobile phones and embedded systems. They were originally designed to run efficiently on mobile devices with TensorFlow Lite.

The need for efficient deep learning models on mobile devices led researchers at Google to develop MobileNet. {{As of|2024|October}}, the family has four versions, each improving upon the previous one in terms of performance and efficiency.

Features

= V1 =

MobileNetV1 was published in April 2017.{{cite arXiv | eprint=1704.04861 | last1=Howard | first1=Andrew G. | last2=Zhu | first2=Menglong | last3=Chen | first3=Bo | last4=Kalenichenko | first4=Dmitry | last5=Wang | first5=Weijun | last6=Weyand | first6=Tobias | last7=Andreetto | first7=Marco | last8=Adam | first8=Hartwig | title=MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications | date=2017 | class=cs.CV }}{{Cite web |date=June 14, 2017 |title=MobileNets: Open-Source Models for Efficient On-Device Vision |url=https://research.google/blog/mobilenets-open-source-models-for-efficient-on-device-vision/ |access-date=2024-10-18 |website=research.google |language=en}} Its main architectural innovation was the incorporation of depthwise separable convolutions. The depthwise separable convolution was first developed by Laurent Sifre during an internship at Google Brain in 2013, as an architectural variation on AlexNet to improve convergence speed and model size.{{Cite book |last1=Chollet |first1=François |chapter=Xception: Deep Learning with Depthwise Separable Convolutions |date=2017 |title=2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) |chapter-url=https://openaccess.thecvf.com/content_cvpr_2017/html/Chollet_Xception_Deep_Learning_CVPR_2017_paper.html |pages=1800–1807 |doi=10.1109/CVPR.2017.195|arxiv=1610.02357 |isbn=978-1-5386-0457-1 }}

The depthwise separable convolution decomposes a single standard convolution into two convolutions: a depthwise convolution that filters each input channel independently and a pointwise convolution (a <math>1 \times 1</math> convolution) that combines the outputs of the depthwise convolution. This factorization significantly reduces computational cost.
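The factorization can be written directly with grouped convolutions. The following is a minimal sketch in PyTorch (used here only for illustration; the layer sizes are hypothetical, and MobileNetV1 additionally applies batch normalization and a ReLU after each of the two convolutions):

<syntaxhighlight lang="python">
# Minimal sketch of a depthwise separable convolution (illustrative sizes).
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        # Pointwise: 1x1 convolution that mixes the channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 112, 112)       # one 112x112 feature map with 32 channels
y = DepthwiseSeparableConv(32, 64)(x)  # shape (1, 64, 112, 112)
</syntaxhighlight>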

MobileNetV1 has two hyperparameters: a width multiplier <math>\alpha</math>, which controls the number of channels in each layer, and a resolution multiplier <math>\rho</math>, which controls the input resolution of the images. Smaller values of <math>\alpha</math> yield smaller and faster models at the cost of reduced accuracy, and lower resolutions likewise result in faster processing but potentially lower accuracy.
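As a rough illustration of how the two multipliers trade accuracy for computation, the sketch below counts the multiply-adds of a single depthwise separable layer; the layer shape and multiplier values are hypothetical:

<syntaxhighlight lang="python">
# Approximate multiply-add count of one depthwise separable layer under a
# width multiplier alpha and a resolution multiplier rho (illustrative numbers).
def mult_adds(in_ch, out_ch, feat, kernel=3, alpha=1.0, rho=1.0):
    m, n = int(alpha * in_ch), int(alpha * out_ch)  # scaled channel counts
    f = int(rho * feat)                             # scaled feature-map side
    depthwise = kernel * kernel * m * f * f         # per-channel spatial filtering
    pointwise = m * n * f * f                       # 1x1 channel mixing
    return depthwise + pointwise

print(mult_adds(32, 64, 112))                        # full-size layer
print(mult_adds(32, 64, 112, alpha=0.5, rho=0.714))  # reduced width and resolution
</syntaxhighlight>

Since both multipliers enter the dominant pointwise term multiplicatively, the computational cost shrinks roughly with <math>\alpha^2</math> and <math>\rho^2</math>.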

= V2 =

MobileNetV2 was published in 2018.{{cite arXiv | eprint=1801.04381 | last1=Sandler | first1=Mark | last2=Howard | first2=Andrew | last3=Zhu | first3=Menglong | last4=Zhmoginov | first4=Andrey | last5=Chen | first5=Liang-Chieh | title=MobileNetV2: Inverted Residuals and Linear Bottlenecks | date=2018 | class=cs.CV }}{{Cite web |date=April 3, 2018 |title=MobileNetV2: The Next Generation of On-Device Computer Vision Networks |url=https://research.google/blog/mobilenetv2-the-next-generation-of-on-device-computer-vision-networks/ |access-date=2024-10-18 |website=research.google |language=en}} It uses inverted residual layers and linear bottlenecks.

Inverted residuals modify the traditional residual block structure. Instead of compressing the input channels before the depthwise convolution, they first expand them with a <math>1 \times 1</math> convolution. This expansion is followed by a <math>3 \times 3</math> depthwise convolution and then a <math>1 \times 1</math> projection layer that reduces the number of channels back down. This inverted structure helps maintain representational capacity by letting the depthwise convolution operate in a higher-dimensional feature space, preserving more information as it passes through the block.

Linear bottlenecks remove the typical ReLU activation function from the projection layer. The rationale is that nonlinear activations lose information in low-dimensional spaces, which is problematic when the number of channels is already small.
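A block combining both ideas can be sketched as follows (a simplified PyTorch rendering; the expansion factor of 6 and the ReLU6 activations follow the MobileNetV2 paper, but the exact settings vary from layer to layer):

<syntaxhighlight lang="python">
# Minimal sketch of an inverted residual block with a linear bottleneck.
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expansion=6):
        super().__init__()
        hidden = in_ch * expansion
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 expansion into a higher-dimensional space
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution in the expanded space
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 linear projection back down: no activation here
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
</syntaxhighlight>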

= V3 =

MobileNetV3 was published in 2019.{{Cite web |date=November 13, 2019 |title=Introducing the Next Generation of On-Device Vision Models: MobileNetV3 and MobileNetEdgeTPU |url=https://research.google/blog/introducing-the-next-generation-of-on-device-vision-models-mobilenetv3-and-mobilenetedgetpu/ |access-date=2024-10-18 |website=research.google |language=en}}{{Cite journal |last1=Howard |first1=Andrew |last2=Sandler |first2=Mark |last3=Chu |first3=Grace |last4=Chen |first4=Liang-Chieh |last5=Chen |first5=Bo |last6=Tan |first6=Mingxing |last7=Wang |first7=Weijun |last8=Zhu |first8=Yukun |last9=Pang |first9=Ruoming |last10=Vasudevan |first10=Vijay |last11=Le |first11=Quoc V. |last12=Adam |first12=Hartwig |date=2019 |journal=ICCV 2019|title=Searching for MobileNetV3 |url=https://openaccess.thecvf.com/content_ICCV_2019/html/Howard_Searching_for_MobileNetV3_ICCV_2019_paper.html |pages=1314–1324|arxiv=1905.02244 }} The publication included MobileNetV3-Small, MobileNetV3-Large, and MobileNetEdgeTPU (optimized for Pixel 4). The architectures were found by a form of neural architecture search (NAS) that takes mobile latency into account, to achieve a good trade-off between accuracy and latency.{{Cite book |last1=Tan |first1=Mingxing |last2=Chen |first2=Bo |last3=Pang |first3=Ruoming |last4=Vasudevan |first4=Vijay |last5=Sandler |first5=Mark |last6=Howard |first6=Andrew |last7=Le |first7=Quoc V. |chapter=MnasNet: Platform-Aware Neural Architecture Search for Mobile |date=June 2019 |title=2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |chapter-url=https://ieeexplore.ieee.org/document/8954198 |publisher=IEEE |pages=2815–2823 |doi=10.1109/CVPR.2019.00293 |arxiv=1807.11626 |isbn=978-1-7281-3293-8}}{{Cite journal |last1=Yang |first1=Tien-Ju |last2=Howard |first2=Andrew |last3=Chen |first3=Bo |last4=Zhang |first4=Xiao |last5=Go |first5=Alec |last6=Sandler |first6=Mark |last7=Sze |first7=Vivienne |last8=Adam |first8=Hartwig |date=2018 |title=NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications |journal=ECCV 2018|url=https://openaccess.thecvf.com/content_ECCV_2018/html/Tien-Ju_Yang_NetAdapt_Platform-Aware_Neural_ECCV_2018_paper.html |pages=285–300|arxiv=1804.03230 }} It used piecewise-linear approximations of the swish and sigmoid activation functions (which the authors called "h-swish" and "h-sigmoid"), squeeze-and-excitation modules,{{Cite journal |last1=Hu |first1=Jie |last2=Shen |first2=Li |last3=Sun |first3=Gang |date=2018 |title=Squeeze-and-Excitation Networks |journal=CVPR 2018|url=https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_paper.html |pages=7132–7141}} and the inverted bottlenecks of MobileNetV2.
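The "hard" activations replace the exponential in the sigmoid and swish functions with a clipped linear segment, which is cheaper to compute and to quantize on mobile hardware. A minimal sketch of the two functions:

<syntaxhighlight lang="python">
# Piecewise-linear ("hard") approximations used by MobileNetV3.
import torch

def hard_sigmoid(x):
    # ReLU6(x + 3) / 6 approximates sigmoid(x)
    return torch.clamp(x + 3, 0, 6) / 6

def hard_swish(x):
    # x * hard_sigmoid(x) approximates swish(x) = x * sigmoid(x)
    return x * hard_sigmoid(x)

print(hard_swish(torch.linspace(-5, 5, 11)))
</syntaxhighlight>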

= V4 =

MobileNetV4 was published in September 2024.{{cite arXiv | eprint=2404.10518 | last1=Qin | first1=Danfeng | last2=Leichner | first2=Chas | last3=Delakis | first3=Manolis | last4=Fornoni | first4=Marco | last5=Luo | first5=Shixin | last6=Yang | first6=Fan | last7=Wang | first7=Weijun | last8=Banbury | first8=Colby | last9=Ye | first9=Chengxi | last10=Akin | first10=Berkin | last11=Aggarwal | first11=Vaibhav | last12=Zhu | first12=Tenghui | last13=Moro | first13=Daniele | last14=Howard | first14=Andrew | title=MobileNetV4 -- Universal Models for the Mobile Ecosystem | date=2024 | class=cs.CV }}{{Cite web |last=Wightman |first=Ross |title=MobileNet-V4 (now in timm) |url=https://huggingface.co/blog/rwightman/mobilenetv4 |access-date=2024-10-18 |website=huggingface.co}} The publication included a large number of architectures found by NAS.

Inspired by Vision Transformers, the V4 series included multi-query attention.{{cite arXiv | eprint=1911.02150 | last1=Shazeer | first1=Noam | title=Fast Transformer Decoding: One Write-Head is All You Need | date=2019 | class=cs.NE }} It also unified the inverted residual and inverted bottleneck blocks used in the earlier versions into the "universal inverted bottleneck", which includes both as special cases.
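The universal inverted bottleneck can be rendered schematically as an inverted bottleneck with two optional depthwise convolutions, whose on/off combinations recover the earlier block types. The PyTorch framing, activation choice, and layer sizes below are illustrative assumptions rather than settings taken from the paper:

<syntaxhighlight lang="python">
# Schematic sketch of a "universal inverted bottleneck"-style block.
import torch
import torch.nn as nn

def depthwise(ch):
    """3x3 depthwise convolution followed by batch normalization."""
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False),
        nn.BatchNorm2d(ch),
    )

class UniversalInvertedBottleneck(nn.Module):
    def __init__(self, in_ch, out_ch, expansion=4, start_dw=False, middle_dw=True):
        super().__init__()
        hidden = in_ch * expansion
        layers = []
        if start_dw:                      # optional depthwise before the expansion
            layers.append(depthwise(in_ch))
        layers += [nn.Conv2d(in_ch, hidden, 1, bias=False),   # 1x1 expansion
                   nn.BatchNorm2d(hidden),
                   nn.ReLU(inplace=True)]
        if middle_dw:                     # optional depthwise in the expanded space
            layers.append(depthwise(hidden))
        layers += [nn.Conv2d(hidden, out_ch, 1, bias=False),  # 1x1 linear projection
                   nn.BatchNorm2d(out_ch)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

# middle_dw=True with start_dw=False reproduces a MobileNetV2-style inverted
# bottleneck; other combinations of the two flags give the remaining variants.
x = torch.randn(1, 32, 56, 56)
print(UniversalInvertedBottleneck(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
</syntaxhighlight>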

See also

References