INT8 quantization proposal

In the current context, quantization means reducing the number of bits (i.e., reducing precision) required to represent the data elements, for example, going from an IEEE 32-bit floating-point format to an 8-bit integer/fixed-point format.
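For example, a common affine (scale and zero-point) mapping between FP32 and INT8 looks like the sketch below. This is plain NumPy and purely illustrative; the function names are not an existing TVM API.

import numpy as np

def quantize_fp32_to_int8(x, scale, zero_point):
    # q = round(x / scale) + zero_point, clamped to the INT8 range [-128, 127]
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_int8_to_fp32(q, scale, zero_point):
    # x ~= (q - zero_point) * scale approximately recovers the real value
    return (q.astype(np.float32) - zero_point) * scale

# Example: quantize a small tensor whose values lie in [-1, 1]
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q = quantize_fp32_to_int8(x, scale=2.0 / 255.0, zero_point=0)
x_hat = dequantize_int8_to_fp32(q, scale=2.0 / 255.0, zero_point=0)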

This document gives a high-level overview of the quantization process and presents a proposal for implementing it in TVM.

High-level overview

TVM is highly useful for enabling high-performance inference on edge devices. Efficient quantization is a key step in that direction. Edge devices have tight power budgets and much less compute than their server counterparts. Quantization reduces both the power and the compute requirements, benefiting edge devices.

Overall, there are two major steps in implementing quantization:

Step 1 - Controlling the drop in accuracy

A popular technique is to take a pre-trained model and then quantize the weights/biases. MKLDNN (https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training), Apache MxNet (https://github.com/apache/incubator-mxnet/pull/9552) and TensorRT (https://towardsdatascience.com/low-precision-inference-with-tensorrt-6eb3cda0730b) quantize in this manner. TensorRT and MxNet additionally use smarter calibration on a cross-validation dataset to further reduce the drop in accuracy. Another technique is to perform quantization during training; Tensorflow (https://www.tensorflow.org/performance/quantization) supports this type of quantization.
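As a minimal sketch of the "naive" approach, the scale for a weight tensor can simply be taken from its observed absolute maximum; calibration-based schemes (TensorRT, MxNet) instead search for a tighter clipping threshold, e.g. by minimizing the KL divergence between the FP32 and quantized distributions on a calibration set. NumPy-only illustration, not an existing API:

import numpy as np

def naive_symmetric_scale(w):
    # Naive calibration: cover the full observed range of the tensor
    return np.max(np.abs(w)) / 127.0

def quantize_weights(w, scale):
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8)

w = np.random.randn(64, 3, 3, 3).astype(np.float32)   # e.g. a conv kernel
w_q = quantize_weights(w, naive_symmetric_scale(w))

# A KL-divergence calibrator would instead sweep thresholds T < max|w| and
# pick the T whose clipped + quantized histogram best matches the original
# distribution, trading some clipping error for finer resolution.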

Step 2 - Generating efficient code for the backend

Hardware vendors are adding support for optimized INT8 operations in hardware (Intel (https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training), Nvidia (https://devblogs.nvidia.com/mixed-precision-programming-cuda-8/)). To take full advantage of the hardware, we need to emit code that uses these new instructions. In addition, since time-consuming layers like convolution have high data reuse, we also have to find new schedules that use the hardware efficiently.
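For reference, these INT8 instructions (e.g. the vpmaddubsw/vpmaddwd sequence on Skylake, or dp4a on Nvidia GPUs) compute small INT8 dot products and accumulate into INT32 registers. A scalar sketch of the pattern the generated code has to express:

import numpy as np

def int8_dot_int32(a, b):
    # Products of two int8 values need up to 16 bits, and summing many of
    # them needs a 32-bit accumulator -- the same pattern the INT8
    # hardware instructions implement.
    acc = np.int32(0)
    for x, y in zip(a.astype(np.int32), b.astype(np.int32)):
        acc += x * y
    return acc

a = np.random.randint(-128, 128, size=16, dtype=np.int8)
b = np.random.randint(-128, 128, size=16, dtype=np.int8)
print(int8_dot_int32(a, b))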

Proposal

My current proposal is to focus on Intel Skylake and resnet-18 for now and complete an end-to-end implementation.

  • For Step 1, we can start from a trained model (similar to MxNet) and quantize the parameters, both naively and using calibration.
  • For Step 2, we can start with the current optimized TVM convolution schedules and explore how the new instructions change them. Similarly, we can generate quantized implementations for the other layers in resnet-18.

Once both of these steps are fleshed out, we can add more backends (Nvidia, ARM).

Action Items

There will likely be many design decisions within each step, but this list only covers the high-level action items. We can open threads for individual action items if need be.

Step 1 - Controlling the drop in accuracy

  1. NNVM - Add a new Python API that takes a pre-trained NNVM model and generates a quantized model with quantized weights/biases. This step does not require any quantized layer implementations, so it can be developed in parallel with Step 2. We can look at the MxNet PR (https://github.com/apache/incubator-mxnet/pull/9552) to get started. A rough sketch of the API:
def quantize_model(sym, params, validation_dataset, calibration):
    # Returns quantized weights and biases

    # Inputs
    # calibration        - naive or KL divergence (smarter calibration)
    # sym                - the network; modified to support the calibration
    # params             - input params that will be quantized
    # validation_dataset - used for KL-divergence calibration
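Assuming the signature above, usage could look roughly like this. The loader and dataset helpers are placeholders, and the exact return values are open for discussion.

# Hypothetical usage of the proposed API; helper names are illustrative only
sym, params = load_pretrained_resnet18()        # placeholder model loader
calib_data = load_calibration_dataset()         # small held-out dataset

# Naive calibration: ranges taken from the observed min/max of each tensor
qsym, qparams = quantize_model(sym, params, validation_dataset=None,
                               calibration="naive")

# Smarter calibration: KL divergence on the validation dataset
qsym, qparams = quantize_model(sym, params, validation_dataset=calib_data,
                               calibration="kl_divergence")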

Step 2 - Generating efficient code for Intel Skylake processor

  1. TOPI - Generate the optimized quantized convolution schedule using the new hardware instructions.
    1. Understand how it affects the data layout within and across kernels.
    2. Intermediate outputs need higher precision (INT32) to avoid overflow. This will require adding support for mixed-precision arithmetic in TVM.
    3. The code generation will rely on LLVM to pattern-match to INT8 operations. The Intel LLVM team is currently working on that.
  2. TOPI - Generate the optimized quantized schedules for fully connected, pooling and relu layers. The goal is to enable quantization for resnet-18.
  3. NNVM - Modify the input graph to support quantization - e.g. add input/output quantization layers and use the quantized layer implementations instead of the precise ones. A rough sketch of the API:
def deploy_quantized_model(sym, quantized_params):
    # Runs the quantized model

    # Inputs
    # sym              - input network; NNVM modifies it to support quantized inference
    # quantized_params - the quantized input params
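Putting both proposed entry points together, the end-to-end flow for resnet-18 might look roughly like this. Again, this is only a sketch; the helper names and return values are assumptions.

# Hypothetical end-to-end flow built from the two proposed entry points
sym, params = load_pretrained_resnet18()        # placeholder model loader
calib_data = load_calibration_dataset()         # placeholder dataset loader

qsym, qparams = quantize_model(sym, params, calib_data,
                               calibration="kl_divergence")

# NNVM inserts the input/output (de)quantization layers and swaps in the
# INT8 layer implementations before running inference
outputs = deploy_quantized_model(qsym, qparams)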

Comments/suggestions are welcome. We are planning to start on this soon, and we should avoid duplicated effort if anybody is already working on this. Since there are many steps that can be done in parallel, we can have multiple contributors.


This is good. It would be great if we can break things into separate threads, i.e. getting quantized models and generating efficient Intel CPU 8-bit code, so the community can have focused threads on each topic.

Good point @tqchen

I have added separate threads. This thread can act as a background post.

  1. Quantizing models (INT8 quantization - Quantizing models)
  2. Code generation for backend (INT8 Quantization - Code generation for backends)