Floating-point graph quantized to 8-bit and run on TVM

Hi,
I have searched a lot for an example that quantizes a floating-point-32 graph to an 8-bit graph and then runs it on TVM,
but could not find any except this one: https://github.com/vinx13/tvm-cuda-int8-benchmark/blob/master/run_tvm.py
and a discussion thread (which suggests this topic is a work in progress):
https://discuss.tvm.ai/t/int8-quantization-proposal/516

The first link shows how to use relay.quantize to quantize a model to 8 bit:

def bench(name, batch):
    sym, data_shape = get_network(name, batch)
    data_shape = data_shape[0][1]
    sym, _ = relay.frontend.from_mxnet(sym, {'data': data_shape})
    sym, params = tvm.relay.testing.create_workload(sym)
    with relay.quantize.qconfig(skip_k_conv=0, round_for_shift=True):
        sym = relay.quantize.quantize(sym, params)
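
For reference, the snippet assumes imports roughly like the following (my reconstruction, not quoted from the script):

import numpy as np
import tvm
import tvm.relay.testing
from tvm import relay, autotvm
from tvm.contrib import graph_runtime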

It then builds the quantized module with Relay as well:

with relay.build_module.build_config(opt_level=3):
    graph, lib, params = relay.build(sym, 'cuda', 'llvm', params=params)

And it creates a graph runtime with TVM:

m = graph_runtime.create(graph, lib, ctx)
x = np.random.uniform(size=data_shape)
data_tvm = tvm.nd.array(x.astype('float32'))
m.set_input("data", data_tvm)
m.set_input(**{k:tvm.nd.array(v, ctx) for k, v in params.items()})
m.run()
e = m.module.time_evaluator("run", ctx, number=2000, repeat=3)
t = e(data_tvm).results
t = np.array(t) * 1000
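
If you want a readable number out of the evaluator, the mean latency can be printed at the end, e.g. (my addition, not part of the original script):

print('%s, batch=%d: mean inference time: %.2f ms' % (name, batch, t.mean()))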

But then, finally, it runs the same benchmark with auto-tuning (AutoTVM) applied, or am I reading this wrongly?

def main():
    with tvm.target.cuda():
        with autotvm.apply_history_best(args.log_file):
            for batch in [1, 16]:
                for name in ['vgg-19', 'resnet-50', 'resnext-50', 'inception_v3', 'drn-c-26', 'dcn-resnet-101']:
                    bench(name, batch)

Can I modify this code for a simple use case of int8 quantization, without AutoTVM and tuning?
Or,
can there be an example that simply quantizes a model, then compiles its graph, and then runs the graph?

You can remove the with autotvm.apply_history_best(args.log_file) line if you don't want auto-tuning. Inference will be slower without it.
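
Concretely, dropping that line from main leaves a sketch like this (same bench as above; TVM will fall back to its default, untuned schedules):

def main():
    with tvm.target.cuda():
        for batch in [1, 16]:
            for name in ['vgg-19', 'resnet-50']:
                bench(name, batch)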

You can also perform auto-tuning on the quantized model (obtained from relay.quantize.quantize) following the tutorial https://docs.tvm.ai/tutorials/autotvm/tune_relay_cuda.html
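
Putting the pieces of this thread together, here is a minimal end-to-end sketch of quantize, build, and run, without any tuning. It assumes the same Relay API version as the snippets above and uses a resnet-18 workload from relay.testing so it is self-contained; any imported model would work the same way:

import numpy as np
import tvm
import tvm.relay.testing
from tvm import relay
from tvm.contrib import graph_runtime

# fp32 workload (stand-in for your own model)
sym, params = relay.testing.resnet.get_workload(num_layers=18, batch_size=1)

# quantize to int8
with relay.quantize.qconfig(skip_k_conv=0, round_for_shift=True):
    sym = relay.quantize.quantize(sym, params)

# compile without auto-tuning (fallback schedules)
target = 'cuda'  # use 'llvm' for a CPU backend
with relay.build_module.build_config(opt_level=3):
    graph, lib, params = relay.build(sym, target, params=params)

# run one inference
ctx = tvm.context(target)
m = graph_runtime.create(graph, lib, ctx)
x = np.random.uniform(size=(1, 3, 224, 224)).astype('float32')
m.set_input('data', tvm.nd.array(x, ctx))
m.set_input(**{k: tvm.nd.array(v, ctx) for k, v in params.items()})
m.run()
out = m.get_output(0)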

Okay,
thanks so much for your reply!

But I'm sorry, I forgot to mention that my system is a CPU backend (x86), so I can only use CPU tuning.
I did try another, similar tutorial for auto-tuning: https://docs.tvm.ai/tutorials/autotvm/tune_relay_x86.html

My one follow-up question is:

I have been able to successfully implement quantization of various graphs and the auto-tuning examples, but separately, on my x86 CPU machine.
Can I run the x86 auto-tuning tutorial on my system first, and then separately execute any of these tutorials (e.g. relay_quick_start (resnet), from_keras, etc.) that I have modified to execute quantized graphs? Will that work? Is the auto-tuning graph-specific, or the same for all graphs, tuning only for the hardware?

Auto-tuning is layer-specific (results can be shared across graphs if they have common layers), so you need to update the auto-tuning tutorial to tune the model you want.
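
To make "layer-specific" concrete: AutoTVM extracts one tuning task per operator workload (operator plus input shapes/dtypes) from a given graph. A rough sketch, assuming the extraction API used in the x86 tutorial of that era and the sym/params from the snippets above:

tasks = autotvm.task.extract_from_program(sym, target='llvm',
                                          params=params,
                                          ops=(relay.op.nn.conv2d,))
# each task is one conv2d workload; two graphs that contain the same
# workloads can reuse the same tuning log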

Okay, thanks!
So if I have auto-tuned on a CPU using a graph with CNN layers, can I assume that other graphs using a lot of CNN layers would already be tuned for, up to a certain extent?

Yes, if their inputs have the same shapes.

Okay @vinx13, thanks for your reply!
Also, is it possible to tune/improve the accuracy of the model using auto-tuning (auto-tuning does improve the performance of the model)?

No, auto-tuning is only related to speed.

Does TVM support int8 quantization on ARM?

There are still a few ops missing on ARM.