How to build a VTA quantized model like the ResNet tutorial?

anon · February 22, 2019, 10:37am

How do I build a quantized model like this ResNet tutorial from other deep learning frameworks (TensorFlow, PyTorch, etc.) and run it on the VTA?

The resnet18_qt8.json file from the ResNet tutorial contains many operations like these:

    {
      "op": "cast", 
      "name": "resnetv20_conv0_weight_quantized_cast", 
      "attrs": {"dtype": "int32"}, 
      "inputs": [[22, 0, 0]]
    }, 
    {
      "op": "conv2d", 
      "name": "conv2d0", 
      "attrs": {
        "channels": "64", 
        "dilation": "(1, 1)", 
        "groups": "1", 
        "kernel_size": "[7, 7]", 
        "layout": "NCHW", 
        "out_dtype": "int32", 
        "padding": "(3, 3)", 
        "strides": "(2, 2)", 
        "use_bias": "False"
      }, 
      "inputs": [[17, 0, 0], [23, 0, 0]]
    },

But how is this built? When I build the model from other frameworks, the graph consists of nodes/operations that are not supported by the TVM compiler, such as QuantizeV2, QuantizedConv2D, and so on. So to summarize my questions:

Questions

What is the true, intended way or workflow of running a quantized model on the VTA?
How can we build a VTA quantized model like in the ResNet tutorial?

aca88 · February 25, 2019, 1:03pm

Hey there,

I have also looked into the resnet example of the VTA and although I am no expert, I think I can share my two cents and see if more experienced users reply.

I am going to interpret this question as “what is the intended way of defining a quantized model?”:

I had the feeling that the description of the graph was hand-made or semi-automatic.
I think they started from a floating-point model, which they could run natively on the ARM core.
Then they included operations to quantize the floating-point intermediate results (so edges in the graph which are not the classification output) to some integer variant (in some cases int8 and int32).
- Int8 for example is used to quantize the input and weight tensors
- Int32 is used for the accumulated values of the convolutions, but these again get later quantized to int8
- The way they quantize is by first clipping to the minimum/maximum value of the target data type ([-128,127] for int8) and then typecasting.
  - Clipping is how they manage a possible overflow (by saturating to the min/max of the target data type)
  - I guess they use whatever the compiler/architecture implements as a standard rounding operation (I'm guessing most architectures round instead of truncating because the former avoids biasing the quantization)
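To make the clip-then-cast step concrete, here is a minimal pure-Python sketch. It is only an illustration of saturation versus a bare cast; the helper names `saturate_to_int8` and `wrap_to_int8` are made up for this example and are not part of TVM or VTA.

```python
INT8_MIN, INT8_MAX = -128, 127


def saturate_to_int8(value: int) -> int:
    """Clip a wider integer (e.g. an int32 accumulator) into the int8 range."""
    return max(INT8_MIN, min(INT8_MAX, value))


def wrap_to_int8(value: int) -> int:
    """Emulate a bare cast with no clip: keep the low 8 bits (two's complement)."""
    return ((value + 128) % 256) - 128


# Hypothetical accumulator values, some outside the int8 range.
accumulators = [-300, -128, -5, 0, 90, 127, 200]

clipped = [saturate_to_int8(v) for v in accumulators]
wrapped = [wrap_to_int8(v) for v in accumulators]

print(clipped)  # [-128, -128, -5, 0, 90, 127, 127]
print(wrapped)  # [-44, -128, -5, 0, 90, 127, -56]
```

Without the clip, 200 would wrap around to -56, which is why the clip comes before the cast in the graph.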
One assumption here (please someone correct me if I am wrong) is that the arithmetic operations will be done in full precision. In other words, when I multiply two int8 values the result is actually an int16, and if I have a MAC then the accumulator will be (at least) int16 and the product will NOT be quantized to int8 before the accumulation.
- Quantization only happens after all computations have been done in full precision (at least for one operator). This helps minimize the errors due to quantization.
- I suppose your QuantizedConv2D follows similar rules
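A small sketch of why the accumulation is kept wide, under the same assumption as above (products are not quantized before the accumulation). The function names are hypothetical and Python integers stand in for the int32 accumulator.

```python
def clip_int8(value: int) -> int:
    """Saturate to the int8 range [-128, 127]."""
    return max(-128, min(127, value))


def dot_wide_accumulator(inputs, weights):
    """Accumulate int8 x int8 products in a wide accumulator, quantize once at the end."""
    acc = 0  # stands in for an int32 accumulator
    for x, w in zip(inputs, weights):
        acc += x * w  # each product fits in int16; the sum is kept wide
    return clip_int8(acc)


def dot_quantize_every_step(inputs, weights):
    """Counter-example: saturate every product and every partial sum to int8."""
    acc = 0
    for x, w in zip(inputs, weights):
        acc = clip_int8(acc + clip_int8(x * w))
    return acc


inputs = [100, 100, -100, -100]
weights = [2, 2, 2, 2]

wide = dot_wide_accumulator(inputs, weights)
narrow = dot_quantize_every_step(inputs, weights)

print(wide)    # 0    (200 + 200 - 200 - 200)
print(narrow)  # -128 (early saturation loses information)
```

The exact result is 0, but saturating every step gives -128, so quantizing only once after the full-precision accumulation keeps the error much smaller.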

So now that we have the general way, let's see some specifics of the VTA.

To some extent VTA (and I mean here what is in the FPGA) does expect some elementwise operations fused at the outputs of conv2d.
You can tell by the way they define the schedule (which goes through a list of elementwise operators which NNVM has automatically fused to the conv2d) and the IR passes (here you see what type of operations they could have handled).
Some of these operations are min/max clipping from the original graph, but could have been others.
This means:

If you have elementwise operations at the output of a conv2d, then you have to make sure that either:
1. The VTA FPGA can handle them (i.e. they are part of the IR pass list and of the ISA), or
2. They are not fused to the conv2d, so that the VTA does not have to compute them
- I think somewhere in the source code they add “copy” operators to stop the automatic fusion of elementwise operations which they could not have mapped easily onto the FPGA. You could do something similar (for example, insert some kind of node in your graph to limit what can be fused).
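As a toy illustration of the “copy” barrier idea (this is not NNVM's actual fusion pass, and the op set and function name are made up for the example), imagine fusion greedily absorbing elementwise operators that follow a conv2d until it meets an operator it cannot fuse:

```python
# Hypothetical set of elementwise ops that a fusion pass would absorb.
ELEMENTWISE_OPS = {"clip", "cast", "relu", "add"}


def split_fused_group(op_chain):
    """Split a linear chain of op names into (fused group, remaining ops).

    The first op (e.g. conv2d) starts the group; following elementwise ops
    are absorbed until a non-elementwise op such as "copy" is reached.
    """
    fused = [op_chain[0]]
    for index, op in enumerate(op_chain[1:], start=1):
        if op in ELEMENTWISE_OPS:
            fused.append(op)
        else:
            return fused, op_chain[index:]
    return fused, []


# Without a barrier, relu ends up in the group scheduled with conv2d.
no_barrier = split_fused_group(["conv2d", "clip", "cast", "relu"])

# With a "copy" node inserted, relu stays outside the fused group.
with_barrier = split_fused_group(["conv2d", "clip", "cast", "copy", "relu"])

print(no_barrier)    # (['conv2d', 'clip', 'cast', 'relu'], [])
print(with_barrier)  # (['conv2d', 'clip', 'cast'], ['copy', 'relu'])
```

In this sketch everything in the fused group would have to be supported by the accelerator, while the remaining ops could run elsewhere (for example on the ARM host).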

You might also have to define the graph with the correct quantization for a specific implementation of the VTA accelerator.
This is, to the best of my understanding, not yet implemented: if you change the native data types of the VTA implementation, your model's data types are not automatically updated.

So you will need to match your VTA implementation (word widths of input, weight and accumulator) to your graph. So the resnet example graph will only work with the VTA implementation with int8 and int32 data types.
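Since nothing updates the graph automatically, a quick sanity check could scan the graph JSON for data types that your VTA build does not use. The sketch below assumes the nodes sit under a top-level `"nodes"` key and that data types appear in the `dtype` and `out_dtype` attributes, as in the snippet from the question; the third node and the function name are hypothetical.

```python
import json

# Assumption: the VTA build uses int8 inputs/weights and int32 accumulators.
SUPPORTED_DTYPES = {"int8", "int32"}

# Trimmed-down graph; the last node is invented to trigger a mismatch.
GRAPH_JSON = """
{"nodes": [
  {"op": "cast", "name": "resnetv20_conv0_weight_quantized_cast",
   "attrs": {"dtype": "int32"}, "inputs": [[22, 0, 0]]},
  {"op": "conv2d", "name": "conv2d0",
   "attrs": {"channels": "64", "out_dtype": "int32"},
   "inputs": [[17, 0, 0], [23, 0, 0]]},
  {"op": "cast", "name": "hypothetical_cast_int16",
   "attrs": {"dtype": "int16"}, "inputs": [[24, 0, 0]]}
]}
"""


def find_dtype_mismatches(graph, supported):
    """Return (node name, attribute, dtype) for every unsupported data type."""
    mismatches = []
    for node in graph["nodes"]:
        attrs = node.get("attrs", {})
        for key in ("dtype", "out_dtype"):
            if key in attrs and attrs[key] not in supported:
                mismatches.append((node["name"], key, attrs[key]))
    return mismatches


mismatches = find_dtype_mismatches(json.loads(GRAPH_JSON), SUPPORTED_DTYPES)
print(mismatches)  # [('hypothetical_cast_int16', 'dtype', 'int16')]
```

An empty list would mean every data type named in the graph is one the accelerator build is assumed to support.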

Other than that, you are also limited by what schedules are “lowerable” onto the VTA FPGA fabric. As of now, I only see the definition of the schedule of a fused conv2d and relu activation. So if you add other types of convs or activations, your program will most likely only run on the ARM host.

Hope this helps