INT8 quantization - Quantizing models


#1

For background on quantization, please read this link (INT8 quantization proposal).

This thread focuses only on quantizing the models, i.e., converting the weights/biases from their current FP32 format to INT8 format, while controlling the drop in accuracy introduced by quantization.

High-level overview

A popular technique for quantizing models is to start from a pre-trained model and then quantize the weights/biases. MKLDNN (https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training), Apache MxNet (https://github.com/apache/incubator-mxnet/pull/9552) and TensorRT (https://towardsdatascience.com/low-precision-inference-with-tensorrt-6eb3cda0730b) quantize in this manner. TensorRT and MxNet additionally use smart calibration over a cross-validation dataset to further control the drop in accuracy.
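As a concrete illustration of the simplest (calibration-free) variant of this idea, here is a minimal numpy sketch of symmetric per-tensor INT8 weight quantization. The function and variable names are mine, not any framework's API.

import numpy as np

def quantize_weights_int8(w_fp32):
    # Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127].
    scale = np.max(np.abs(w_fp32)) / 127.0
    w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
    return w_int8, scale      # keep the scale to dequantize or fold into the output

def dequantize(w_int8, scale):
    return w_int8.astype(np.float32) * scale

w = np.random.randn(64, 32).astype(np.float32)
w_q, s = quantize_weights_int8(w)
print("max abs quantization error:", np.max(np.abs(w - dequantize(w_q, s))))

Calibration-based schemes differ mainly in how the clipping range (and hence the scale) is chosen, especially for activations.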

Another technique is to perform the quantization while training. Tensorflow (https://www.tensorflow.org/performance/quantization) supports this type of quantization.

Proposal

For this task, I propose to start from a trained model (similar to MxNet) and quantize the weights/biases, both naively and using calibration. This task is independent of the target edge device.

Action Item

This is a high-level action item.

NNVM - Add a new Python API that takes a pre-trained NNVM model and generates a quantized model. We can look at the MxNet PR (https://github.com/apache/incubator-mxnet/pull/9552) to get started. My understanding is that this step does not require any quantized layer implementation. The details of the calibration technique can be found at this link (http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf).

def quantize_model(sym, params, validation_dataset, calibration):
    """Return the quantized weights and biases (plus the modified symbol).

    Inputs:
      sym                - the network symbol; modified to support the calibration
      params             - input FP32 params (weights/biases) that will be quantized
      validation_dataset - calibration data, needed for KL-divergence calibration
      calibration        - "naive" or "kl_divergence" (smarter calibration)
    """

Comments/suggestions are welcome.


#2

Can we take this proposal further and propose a generalized fine-tuning approach for quantizing FP32 models down to an arbitrary integer precision, so that we also support hardware accelerators where the precision can be tweaked?

I'm happy to discuss ideas on how to get this to work in more detail.


#3

There are many ways quantized models can be generated. The current NNVM operator set is sufficient to cover most quantization/dequantization via rounding and integer casting.

Each of these approaches comes with a different scheme. I would recommend that we list out the options in detail and try them out.


#4

@thierry Good point. Theoretically, the calibration technique (http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf) should be able to help us quantize to arbitrary precision. I will keep this in mind while looking for options.


#5

Calibration works well down to 8 bits, but more specialized fine-tuning is needed for lower bit widths. That is why I say there are a lot of possible options for generating quantized models, and we can do a bit of exploration.
For obtaining the models themselves, initial exploration in frameworks like GluonCV would be helpful before going to a full-scale implementation.


#6

Makes sense. Let me explore the different options out there, and then we can go from there.


#7

Quantizing models

TLDR - For INT8 quantization, we can go with calibration (implemented in MxNet). For more aggressive quantization, we need fine-tuning, and there are a large number of options. Implementation-wise, we might be looking at enabling low-precision training.

My proposal - Support INT8 quantization for now using calibration, ensuring support for Intel VNNI and Nvidia DP4A INT8 instructions. Meanwhile, for more aggressive quantization, try different options to narrow down the choices.
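To make the hardware angle concrete: both VNNI and DP4A multiply INT8 operands and accumulate into INT32, so the quantized graph must keep 32-bit accumulators. A rough numpy emulation of that pattern (illustrative only, not the actual intrinsics):

import numpy as np

# INT8 x INT8 products accumulated into INT32 - the pattern behind VNNI / DP4A.
# Accumulating in INT8 would overflow almost immediately; INT32 keeps the sum exact here.
a = np.random.randint(-128, 128, size=256, dtype=np.int8)
b = np.random.randint(-128, 128, size=256, dtype=np.int8)
acc = np.dot(a.astype(np.int32), b.astype(np.int32))   # 32-bit accumulator
print(acc, acc.dtype)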


Survey

There are two categories of research efforts:

  • No fine-tuning - These efforts start from off-the-shelf models and apply quantization methods, using some smart calibration to reduce the quantization noise. Applicable to 8 bits (< 1% accuracy loss); not tested below 8 bits. Typically, these use uniform quantizers.
  • Fine-tuning (requires training infrastructure) - Either fine-tune after training or train with low precision to obtain small bit-width weights. These go down to very small bit widths, but typically suffer high accuracy losses.

Category 1 - Start from off-the-shelf models

Fixed point quantization of deep convolutional networks

  • https://arxiv.org/pdf/1511.06393.pdf
  • Starts from an off-the-shelf model and collects statistics about the weights, activations and biases
  • Performs an SQNR analysis to figure out the best bit-width for each layer (a sketch follows this list)
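A hedged sketch of what a per-layer SQNR analysis could look like, assuming a uniform symmetric quantizer; this is my simplification, not the paper's exact procedure.

import numpy as np

def sqnr_db(weights, bits):
    # Quantize with a uniform symmetric quantizer at the given bit width,
    # then report the signal-to-quantization-noise ratio in dB.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(weights)) / qmax
    quantized = np.clip(np.round(weights / scale), -qmax - 1, qmax) * scale
    noise = weights - quantized
    return 10.0 * np.log10(np.sum(weights ** 2) / np.sum(noise ** 2))

# Compare candidate bit widths for one layer; pick the smallest width whose SQNR is acceptable.
layer_w = np.random.randn(256, 128).astype(np.float32)
print({bits: round(sqnr_db(layer_w, bits), 1) for bits in (4, 6, 8)})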

KL Divergence Calibration
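The idea from the TensorRT slides linked above (also referenced for MxNet earlier): collect a histogram of activation magnitudes on a calibration set, then pick the saturation threshold whose clipped-and-quantized distribution has the smallest KL divergence from the original one. A simplified sketch; the function names, bin counts and re-binning details are my own choices, not the exact TensorRT/MxNet algorithm.

import numpy as np

def kl_divergence(p, q):
    p = p / p.sum()
    q = q / q.sum()
    return np.sum(np.where(p > 0, p * np.log(p / (q + 1e-12)), 0.0))

def choose_saturation_threshold(activations, num_bins=2048, num_levels=128):
    # Histogram of |activation| collected over the calibration dataset.
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    hist = hist.astype(np.float64)
    best_t, best_kl = edges[-1], np.inf
    for i in range(num_levels, num_bins + 1):
        # Reference distribution: clip everything beyond bin i into the last kept bin.
        ref = hist[:i].copy()
        ref[-1] += hist[i:].sum()
        # Candidate distribution: re-bin the kept range into num_levels quantization
        # bins, then spread each bin's mass back uniformly for the comparison.
        cand = np.concatenate([np.full(len(c), c.sum() / len(c))
                               for c in np.array_split(hist[:i], num_levels)])
        kl = kl_divergence(ref, cand)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t   # activations are clipped to [-best_t, best_t] before the INT8 mapping

# Usage: threshold = choose_saturation_threshold(collected_layer_outputs)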


Category 2 - Training on lower precision

Train and fine-tune at the end

Tensorflow implementation


INT16 training

Mixed precision training of CNNs using integer operations

  • https://openreview.net/pdf?id=H135uzZ0-
  • Uses dynamic fixed point (a representation sketch follows this list)
  • Performs INT16 calculations; accumulations are in INT32, with occasional accumulation into FP32
  • Good results, coverage of newer CNNs, and lots of detail about the implementation.
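A minimal illustration of the dynamic fixed point representation the paper builds on: a tensor is stored as INT16 mantissas that share one exponent, chosen so the largest magnitude just fits the INT16 range. The helper names and rounding choice are mine, and the sketch assumes a nonzero tensor.

import numpy as np

def to_dfp16(x):
    # INT16 mantissas plus one shared exponent per tensor (dynamic fixed point).
    qmax = 2 ** 15 - 1
    exp = int(np.ceil(np.log2(np.max(np.abs(x)) / qmax)))   # shared exponent
    mant = np.round(x / 2.0 ** exp).astype(np.int16)
    return mant, exp

def from_dfp16(mant, exp):
    return mant.astype(np.float32) * 2.0 ** exp

x = np.random.randn(1024).astype(np.float32)
mant, exp = to_dfp16(x)
print("reconstruction error:", np.max(np.abs(x - from_dfp16(mant, exp))))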

Lower than 8 bits

Quantized Neural Networks

  • https://arxiv.org/pdf/1609.07061.pdf
  • Very low-precision training - 1-bit weights
  • The forward and backward passes happen with 1-bit weights; all the other tensors appear to be single precision.
  • However, the parameter update happens at 32 bits; a 32-bit master copy of the weights is always kept (see the sketch after this list).
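A schematic of that pattern (also used by Binary Connect below): the network sees 1-bit weights in the forward/backward pass, while the update is applied to a full-precision master copy. The training-loop details here are placeholders, purely for illustration.

import numpy as np

# 1-bit weights in the forward/backward pass, FP32 master copy for the update.
master_w = np.random.randn(128, 64).astype(np.float32) * 0.01   # 32-bit master weights
lr = 0.01
for step in range(3):
    binary_w = np.sign(master_w)                  # 1-bit weights used by the network
    # ... forward pass with binary_w; backward pass yields a gradient w.r.t. binary_w ...
    grad = np.random.randn(*master_w.shape).astype(np.float32)   # placeholder gradient
    master_w -= lr * grad                         # the update lands on the FP32 copy
    master_w = np.clip(master_w, -1.0, 1.0)       # keep the master copy bounded, as in BinaryConnect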

Binary Connect

Ternary weight networks

Concerns - Poor accuracy compared to the original FP32 models; evaluation only on small datasets

Other papers along similar lines


Other cool ideas

Apprentice

  • https://arxiv.org/pdf/1711.05852.pdf
  • Uses knowledge distillation to train a smaller network using a large network (a generic sketch of the distillation loss follows this section)
  • Improves over ternary precision (within < 1% of the original)
  • Good results, cool technique

Concerns - It may be a little too early to go for a full-blown implementation of this
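For reference, a generic sketch of a knowledge-distillation loss, in which the student is trained against softened teacher outputs in addition to the true labels. This is the standard formulation, not necessarily the exact setup used in the Apprentice paper.

import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft loss: cross-entropy between softened teacher and student distributions.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft_loss = -np.mean(np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1))
    # Hard loss: usual cross-entropy of the student against the ground-truth labels.
    p_hard = softmax(student_logits)[np.arange(len(labels)), labels]
    hard_loss = -np.mean(np.log(p_hard + 1e-12))
    return alpha * soft_loss + (1.0 - alpha) * hard_loss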

High-accuracy low precision training

Flexpoint


#8

Thank you for this thorough review. Your proposal is sound; calibration should be sufficient for INT8 at this point.

@tqchen @ziheng have worked on supporting fine-tuning of model parameters for INT8 inference. Do you two have something to add to @janimesh's plan, or would you like to point them to some already implemented work?


#9

@janimesh May I ask what the latest status is? I recently wanted to use TVM to quantize my model. Is there any work or patch you can share with me? Thanks!


#10

Hi, there are active ongoing efforts to support quantization, but it is not fully supported yet. @ziheng @tqchen will be able to share more details. I would suggest subscribing to this PR - https://github.com/dmlc/tvm/pull/2116


#11

Hi, I have implemented the basic quantization algorithm on top of our new graph IR, Relay. Hopefully, it can be merged into the master branch next week.


#12

Thanks @janimesh @ziheng , I will take a look at that PR.