For background on quantization, please read this link (INT8 quantization proposal).
This thread focuses only on quantizing the models, i.e., converting the weights/biases from their current FP32 format to INT8 format, while controlling the drop in accuracy introduced by the quantization.
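To make the conversion concrete, here is a minimal sketch of naive symmetric per-tensor quantization, which maps the FP32 range [-max|w|, max|w|] onto the INT8 range [-127, 127]. This is an illustration in NumPy, not part of any existing API, and the function names are hypothetical:

```python
import numpy as np

def quantize_tensor_naive(w_fp32):
    # Pick the scale so the largest absolute value maps to 127; the full
    # FP32 range is preserved, but small values lose precision.
    scale = np.max(np.abs(w_fp32)) / 127.0
    w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
    return w_int8, scale

def dequantize_tensor(w_int8, scale):
    # Recover an FP32 approximation; the gap to the original tensor is
    # the quantization error that calibration tries to control.
    return w_int8.astype(np.float32) * scale
```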
A popular technique to quantize the models is to start from a pre-trained model and then quantize the weights/biases. MKLDNN (https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training), Apache MxNet (https://github.com/apache/incubator-mxnet/pull/9552) and TensorRT (https://towardsdatascience.com/low-precision-inference-with-tensorrt-6eb3cda0730b) quantize in this manner. TensorRT and MxNet additionally use smart calibration on a cross-validation dataset to further control the drop in accuracy.
Another technique is to perform the quantization while training. Tensorflow (https://www.tensorflow.org/performance/quantization) supports this type of quantization.
For this task, I propose to start from a trained model (similar to MxNet) and quantize the weights/biases (both naively and using calibration). This task is independent of the target edge device.
The high-level action item is as follows.
NNVM - Add a new python API that takes a pre-trained NNVM model and generates a quantized model. We can look at the MxNet PR (https://github.com/apache/incubator-mxnet/pull/9552) to get started. My understanding is that this step does not require any quantized layer implementation. The details of the calibration technique can be found at this link (http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf); a sketch of it follows the proposed signature below.
```python
def quantize_model(sym, params, validation_dataset, calibration):
    # Returns quantized weights and biases.
    # Inputs:
    #   calibration        - naive or KL divergence (smarter calibration)
    #   sym                - the network; modified to support the calibration
    #   params             - input params that will be quantized
    #   validation_dataset - used for the KL-divergence calibration
```
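For the KL-divergence option, the calibration step described in the TensorRT slides above histograms the FP32 values observed on the validation dataset and searches for the saturation threshold whose quantized distribution stays closest, in KL divergence, to the original one. The sketch below follows that procedure; it is illustrative only, and every name in it is hypothetical:

```python
import numpy as np
from scipy import stats  # stats.entropy computes the KL divergence

def find_kl_threshold(values, num_bins=2048, num_quant_levels=128):
    # Histogram the absolute values collected during calibration runs.
    hist, bin_edges = np.histogram(np.abs(values), bins=num_bins)
    best_kl, best_i = np.inf, num_quant_levels
    for i in range(num_quant_levels, num_bins + 1):
        # Reference distribution P: keep the first i bins and fold the
        # clipped outlier mass into the last kept bin.
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()
        # Candidate distribution Q: merge the first i bins into 128
        # quantization levels, then expand back over the same support,
        # spreading each level's mass over its originally nonzero bins.
        q = np.zeros(i, dtype=np.float64)
        for chunk in np.array_split(np.arange(i), num_quant_levels):
            nonzero = hist[chunk] != 0
            if nonzero.any():
                q[chunk] = np.where(nonzero, hist[chunk].sum() / nonzero.sum(), 0.0)
        kl = stats.entropy(p, q + 1e-12)  # entropy() normalizes both inputs
        if kl < best_kl:
            best_kl, best_i = kl, i
    # Map the winning bin index back to a saturation threshold; values
    # beyond it are clipped before the INT8 scale is computed.
    return bin_edges[best_i]
```

The naive option would skip this search and simply use max|value| as the threshold; the KL search trades a little clipping of outliers for finer resolution where most of the distribution lives.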
Comments/suggestions are welcome.