Lowering QNN conv2d / tflite

@janimesh -

The Requantize operation at the end of conv2d ends up with an input_scale that is the product of the input tensor scale and the weight scale, which means that after the TFLite frontend the individual scales are lost to the optimization pipeline.
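To make the information loss concrete, here is a small sketch in plain Python (the numbers are made up for illustration) of how the requantize scale is folded:

```python
# Hypothetical example values; a real network gets these from the frontend.
input_tensor_scale = 0.05   # scale of the (u)int8 input tensor
weight_scale = 0.002        # scale of the (u)int8 weight tensor
output_scale = 0.1          # scale of the (u)int8 output tensor

# The TFLite-style requantize folds the two scales into one value:
requant_input_scale = input_tensor_scale * weight_scale  # 0.0001

# The multiplier actually applied to the int32 accumulator:
requant_multiplier = requant_input_scale / output_scale  # 0.001

# At this point only the product survives; the individual
# input_tensor_scale and weight_scale cannot be recovered from it.
```

Given only `requant_input_scale`, any pair of factors with the same product is indistinguishable, which is exactly the problem for a backend that needs the separate scales.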

There are two options here. One is creating a conv2d "mega node" that carries all of the quantization information, with the qnn_lowering pass producing the form we want. Alternatively, the Requantize operation could carry the input_scale and the weight_scale separately?

Any thoughts?


Trying to understand - why do we need to store the scales?

In terms of our NPU integration and the other libraries we are looking at, we have an interface for conv2d that takes an (u)int8 quantized input tensor, an (u)int8 weight tensor, and the output quantization parameters, and returns an (u)int8 tensor.

Once the input tensor scale and the weight scale are multiplied, the individual values are lost, from what I can see.
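For concreteness, a reference model of the kind of interface described above (the function name and parameter layout are hypothetical, not an actual NPU API) shows why the separate scales are needed to program the requantize stage:

```python
import numpy as np

def npu_quantized_conv2d(inp, w, in_scale, in_zp, w_scale, w_zp,
                         out_scale, out_zp):
    """Hypothetical (u)int8-in / (u)int8-out conv2d reference model.

    inp: uint8 array of shape (H, W); w: uint8 kernel of shape (kh, kw).
    Accumulates in wide integers, then requantizes to uint8 using the
    *separate* input and weight scales.
    """
    kh, kw = w.shape
    oh, ow = inp.shape[0] - kh + 1, inp.shape[1] - kw + 1
    acc = np.zeros((oh, ow), dtype=np.int64)
    for i in range(oh):
        for j in range(ow):
            patch = inp[i:i + kh, j:j + kw].astype(np.int64) - in_zp
            acc[i, j] = np.sum(patch * (w.astype(np.int64) - w_zp))
    # This is the point where the two scales get multiplied; from here
    # on, only their product matters to the hardware.
    multiplier = (in_scale * w_scale) / out_scale
    out = np.round(acc * multiplier) + out_zp
    return np.clip(out, 0, 255).astype(np.uint8)
```

If the frontend has already folded the scales into one number, a backend with this shape of interface cannot reconstruct `in_scale` and `w_scale` to hand them over individually.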

I understand it now. I looked into the TF and MXNet frameworks earlier; they like to have a separate requantize operator. I think the reason is that requantize is used very often, and TF's quantized_conv2d does not have a requantize inside.

In your case, I think the NPU quantized_conv2d requires "qnn.conv2d + requantize" wrapped up together, or maybe we want to call it a fused operator. I think there are multiple ways to do this:

  • If I think from a HW accelerator viewpoint, it will have requirements that certain ops be fused. This fused operator will go through a 3rd-party compiler, which might use TVM or its own codegen. So, we can write a pattern detector that detects the sequence of Relay operators and replaces it with an accelerator-friendly fused operator. This approach has its cons: it might be difficult to detect the patterns because the IR has already become too low-level.

  • The other option is to create another dialect for your NPU, with a new operator called NPU.conv2d. You can give this operator to the 3rd-party codegen. If you want TVM codegen, it can lower to "qnn.conv2d + qnn.requantize", which is further lowered to pure-Relay ops.
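A toy sketch of the second option, using made-up node classes rather than the real Relay/TVM API, to show the shape of the lowering. The `Call` class and the `lower_npu_conv2d` function are invented for illustration:

```python
from dataclasses import dataclass, field

# Toy IR node standing in for a Relay call expression. Operator names
# like "npu.conv2d" and "qnn.requantize" mirror the discussion, but
# none of this is the actual TVM API.
@dataclass
class Call:
    op: str
    args: list
    attrs: dict = field(default_factory=dict)

def lower_npu_conv2d(call):
    """Lower a fused npu.conv2d into qnn.conv2d + qnn.requantize.

    Because the dialect node carries input_scale and weight_scale
    separately, the lowering can hand both to the codegen intact, and
    only fold them when producing the requantize step.
    """
    assert call.op == "npu.conv2d"
    a = call.attrs
    conv = Call("qnn.conv2d", call.args, {
        "input_scale": a["input_scale"],
        "weight_scale": a["weight_scale"],
    })
    return Call("qnn.requantize", [conv], {
        "input_scale": a["input_scale"] * a["weight_scale"],
        "output_scale": a["output_scale"],
    })
```

The key property is that the fold happens inside the lowering, so a 3rd-party codegen that consumes the dialect node never sees the pre-multiplied scale.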

If we are prototyping, the second option might be faster. The first one requires serious changes to the Graph Fusion Relay pass.

Our initial expectation is to split the graph before any Relay optimizations run, as we would like to identify the required subgraph in our compiler module.

Thank you, Animesh, for a quick real-time discussion on the topic.

  1. QNN nodes are a layer on top of the Relay nodes. The design philosophy is that QNN lowers down to normal Relay operations as the IR is transformed through Relay. The idea with the current lowering from MXNet and TFLite is that we lower to a common subset as much as possible.

  2. QNN nodes need to keep as much information from the frontends as possible. However, finding a common qnn.conv2d representation across all the frameworks is something we are worried about, given the variety that a conv2d can represent; creating a supernode that contains all the combinations of conv2d + relu + bias as a common subset across all the frontends seems, from my understanding of our discussion, to be a hard problem.

  3. For the current problem, adding input_scale and weight_scale as additional attributes to QNNConv2DAttributes seems simple enough. These parameters will remain as long as the node stays at the QNN level, but will be lost as we lower from QNN to Relay.

  4. A final design point, important with respect to lowering from the frontends down to Relay, is to keep as much of the input graph information for pre-quantized networks as is necessary for the passes. This is another way of saying point 2 above. We will likely need to audit our lowering at some point, figure out what else is missing, and clean this up as we go along.
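Point 3 could look roughly like this. The field names below are invented for illustration and are not the actual C++ QNNConv2DAttrs definition:

```python
from dataclasses import dataclass

@dataclass
class QNNConv2DAttributes:
    """Sketch of the proposal in point 3: keep the separate scales
    alongside the usual quantization attributes while the node lives
    at the QNN level. Field names here are illustrative only."""
    input_zero_point: int
    kernel_zero_point: int
    # The additional attributes proposed above; available at the QNN
    # level, dropped when QNN lowers to plain Relay ops.
    input_scale: float
    weight_scale: float

attrs = QNNConv2DAttributes(input_zero_point=128, kernel_zero_point=0,
                            input_scale=0.05, weight_scale=0.002)

# The folded requantize scale can always be recomputed on demand...
folded = attrs.input_scale * attrs.weight_scale
# ...but the reverse recovery is impossible once only `folded` is kept,
# which is why the attributes carry the two scales separately.
```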

Regards,
Ramana