
4.6-Bit Quantization for Fast and Accurate Neural Network Inference on CPUs

Author

Listed:
  • Anton Trusov

    (Department of Mathematical Software for Computer Science, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 119333 Moscow, Russia
    Smart Engines Service LLC, 117312 Moscow, Russia
    Phystech School of Applied Mathematics and Informatics, Moscow Institute of Physics and Technology, 141701 Dolgoprudny, Russia)

  • Elena Limonova

    (Department of Mathematical Software for Computer Science, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 119333 Moscow, Russia
    Smart Engines Service LLC, 117312 Moscow, Russia)

  • Dmitry Nikolaev

    (Smart Engines Service LLC, 117312 Moscow, Russia
    Vision Systems Laboratory, Institute for Information Transmission Problems of Russian Academy of Sciences, 127051 Moscow, Russia)

  • Vladimir V. Arlazarov

    (Department of Mathematical Software for Computer Science, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 119333 Moscow, Russia
    Smart Engines Service LLC, 117312 Moscow, Russia)

Abstract

Quantization is a widespread method for reducing the inference time of neural networks on mobile Central Processing Units (CPUs). Eight-bit quantized networks achieve quality comparable to that of full-precision models and fit the hardware architecture well, with one-byte coefficients and thirty-two-bit dot-product accumulators. Lower-precision quantizations usually suffer from noticeable quality loss and require specialized computational algorithms to outperform eight-bit quantization. In this paper, we propose a novel 4.6-bit quantization scheme that allows for more efficient use of CPU resources. This scheme has more quantization bins than four-bit quantization and is therefore more accurate, while preserving the computational efficiency of the latter (it runs only 4% slower). Our multiplication uses a combination of 16- and 32-bit accumulators and avoids the multiplication-depth limitation of the previous 4-bit multiplication algorithm. Experiments with different convolutional neural networks on the CIFAR-10 and ImageNet datasets show that 4.6-bit quantized networks are 1.5–1.6 times faster than eight-bit networks on an ARMv8 CPU. In terms of quality, the results of a 4.6-bit quantized network are close to the mean of the four-bit and eight-bit networks of the same architecture. Therefore, 4.6-bit quantization may serve as an intermediate solution between fast but inaccurate low-bit quantizations and accurate but relatively slow eight-bit ones.
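
The mixed-accumulator idea in the abstract can be sketched in a few lines of C. The fragment below is an illustrative scalar model, not the paper's actual ARMv8 SIMD kernel; it assumes, for illustration, that quantized values lie in [-12, 12] (25 bins, log2(25) ≈ 4.64 bits, consistent with the "4.6-bit" name), and the constant of 227 products per 16-bit partial sum follows from that assumption.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative sketch (assumed parameters): with |a[i]|, |b[i]| <= 12,
     * each product is at most 144, so an int16_t partial sum can absorb
     * floor(32767 / 144) = 227 products before it risks overflow and must
     * be spilled into a 32-bit accumulator. Spilling periodically removes
     * any cap on the dot-product length (the "multiplication depth"). */
    #define MAX_PARTIAL 227

    int32_t dot_quantized(const int8_t *a, const int8_t *b, size_t n)
    {
        int32_t acc32 = 0;   /* wide accumulator, overflow-safe */
        int16_t acc16 = 0;   /* narrow, fast partial accumulator */
        size_t  count = 0;   /* products absorbed since last spill */

        for (size_t i = 0; i < n; ++i) {
            acc16 += (int16_t)(a[i] * b[i]); /* fits in 16 bits by assumption */
            if (++count == MAX_PARTIAL) {
                acc32 += acc16;              /* spill before overflow is possible */
                acc16 = 0;
                count = 0;
            }
        }
        return acc32 + acc16;                /* fold in the last partial sum */
    }

On SIMD hardware the narrow accumulator is what buys speed: a vector register holds twice as many 16-bit lanes as 32-bit ones, so most of the accumulation runs at roughly double throughput, while the periodic spill into 32 bits frees the scheme from the fixed depth limit of earlier 4-bit algorithms.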

Suggested Citation

  • Anton Trusov & Elena Limonova & Dmitry Nikolaev & Vladimir V. Arlazarov, 2024. "4.6-Bit Quantization for Fast and Accurate Neural Network Inference on CPUs," Mathematics, MDPI, vol. 12(5), pages 1-22, February.
  • Handle: RePEc:gam:jmathe:v:12:y:2024:i:5:p:651-:d:1344481

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/12/5/651/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/12/5/651/
    Download Restriction: no

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:gam:jmathe:v:12:y:2024:i:5:p:651-:d:1344481. See general information about how to correct material in RePEc.

If you have authored this item and are not yet registered with RePEc, we encourage you to register here. This allows you to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

We have no bibliographic references for this item. You can help add them by using this form.

If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: MDPI Indexing Manager (email available below). General contact details of provider: https://www.mdpi.com.

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.