Quantization Story


#1

I think it is worthwhile to have a high-level quantization post explaining the flow and mentioning the developers involved in the different steps. This should improve collaboration, while also giving a high-level story to anybody who wants to explore TVM for quantization.

Frameworks to Relay

As shown in the figure above, there are two parallel efforts ongoing:

  • Automatic Integer Quantization (@ziheng, @vinx13) - This takes an FP32 framework graph and automatically converts it to Int8 within Relay (a minimal sketch follows this list).
  • Accepting Pre-quantized Integer models (@janimesh, @shoubhik) - This approach accepts a pre-quantized model, introduces a Relay dialect called QNN and generates an Int8 Relay graph.
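
To make the automatic path concrete, here is a minimal sketch, assuming `mod` and `params` come from a framework frontend; the exact `qconfig` options vary across TVM versions, so treat this as illustrative:

```python
import tvm
from tvm import relay

# Assumption: mod/params were imported by a frontend, e.g.
#   mod, params = relay.frontend.from_mxnet(model, shape_dict)

# Path 1: automatic integer quantization of the FP32 Relay graph.
with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
    qmod = relay.quantize.quantize(mod, params)

# Path 2: pre-quantized models instead come in through QNN ops
# (qnn.conv2d, qnn.dense, ...) emitted by frontends such as the TFLite
# parser, and are later lowered to plain Int8 Relay ops.
```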

There have been a few discussions around automatic FP16 downcasting in Relay, but no RFC yet. @xyzhou and @janimesh are exploring/prototyping this and plan to put up an RFC in the next couple of weeks.

Relay Optimizations

  • Target-independent Relay passes - The TVM community is continuously adding these passes. Examples are constant folding, common subexpression elimination, etc.
  • Target-dependent Relay passes - These passes transform the Relay graph to optimize it for the target. Examples are the Legalize and AlterOpLayout transforms, where, depending on the target, we change the layouts of convolution/dense layers. The TVM community is working on both improving the infrastructure to enable such transformations and adding target-specific layout transformations; some of this infrastructure work is a prerequisite for a good overall design (https://github.com/dmlc/tvm/issues/3670). See the pass-chaining sketch after this list.
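
A minimal sketch of chaining such passes, assuming a Relay module `mod`; the Sequential/PassContext APIs and pass availability differ between TVM releases, so treat this as illustrative:

```python
import tvm
from tvm import relay

# Assumption: mod is a Relay IRModule from a frontend or from quantization.
seq = tvm.transform.Sequential([
    relay.transform.FoldConstant(),            # target-independent
    relay.transform.EliminateCommonSubexpr(),  # target-independent
    relay.transform.Legalize(),                # target-dependent op rewrites
    relay.transform.AlterOpLayout(),           # target-dependent layout changes
])

# Legalize/AlterOpLayout consult the current target, so run inside a target scope.
with tvm.target.Target("llvm -mcpu=skylake-avx512"):
    with tvm.transform.PassContext(opt_level=3):
        mod = seq(mod)
```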

Relay to Hardware

Once we have an optimized Relay graph, we need to write optimized schedules; a sketch of the build step that dispatches to these schedules follows below. As with FP32, we have to focus our efforts on the expensive ops like conv2d, dense, etc. There are scattered efforts here, and the TVM community is working on unifying them. Some of the developers who have worked on different backends (not necessarily Int8):

Others who are interested/heavily involved - tqchen, zhiics, ramana-arm, yzhliu
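
To illustrate how the optimized graph reaches those schedules, here is a hedged sketch of the build step, assuming a quantized module `qmod` and its `params` from the earlier steps; the return values of `relay.build` have changed across TVM versions:

```python
import tvm
from tvm import relay

# Assumption: qmod/params are the quantized Relay module and weights.
target = "llvm -mcpu=skylake-avx512"  # selects Int8-friendly x86 schedules where available

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(qmod, target=target, params=params)

# The TOPI schedules registered for conv2d/dense on this target do the heavy lifting here.
```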


#2

Great to see this post @janimesh, great visual! I think that layout optimization is highly tied to getting good target-specific performance results, and it wouldn’t hurt to include in the picture where these layout optimizations come into place. Especially since different frameworks will dictate certain layouts (NCHW vs. NHWC), and backends will impose their own layouts (NCHWc, NCHWnc).
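
For instance, here is a hedged sketch of converting a framework layout to a backend-friendly one with the ConvertLayout pass, assuming an NHWC module `mod` (e.g. from TFLite); whether this belongs in ConvertLayout or AlterOpLayout is exactly the kind of placement question raised above, and pass availability depends on the TVM version:

```python
import tvm
from tvm import relay

# Assumption: mod holds an NHWC graph; the x86 backend prefers NCHW/NCHWc.
desired_layouts = {"nn.conv2d": ["NCHW", "default"]}
seq = tvm.transform.Sequential([
    relay.transform.RemoveUnusedFunctions(),
    relay.transform.ConvertLayout(desired_layouts),
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)
```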


#3

One more thing - for the x86 target, we currently only support the int8 operators on 1x1 convolutions. Is there a reason why we don’t support them on non-pointwise convolutions?


#4

Somehow missed this. We do support the int8 operators on non-pointwise computations as well. The int8 support only works for Skylake; older Intel generations don’t have any instructions to speed up int8, as far as I know.