Hi,
I have searched a lot for an example that quantizes a floating-point 32 (FP32) graph to an 8-bit graph and then runs it on TVM,
but could not find any except: https://github.com/vinx13/tvm-cuda-int8-benchmark/blob/master/run_tvm.py
and a discussion thread (which suggests this topic is still work in progress):
https://discuss.tvm.ai/t/int8-quantization-proposal/516
The first link shows how to use relay.quantize to quantize a model to 8 bit:
```python
def bench(name, batch):
    sym, data_shape = get_network(name, batch)
    data_shape = data_shape[0][1]
    sym, _ = relay.frontend.from_mxnet(sym, {'data': data_shape})
    sym, params = tvm.relay.testing.create_workload(sym)
    with relay.quantize.qconfig(skip_k_conv=0, round_for_shift=True):
        sym = relay.quantize.quantize(sym, params)
```
It then builds the quantized module with Relay:
```python
with relay.build_module.build_config(opt_level=3):
    graph, lib, params = relay.build(sym, 'cuda', 'llvm', params=params)
```
and creates a graph runtime with TVM:
```python
m = graph_runtime.create(graph, lib, ctx)
x = np.random.uniform(size=data_shape)
data_tvm = tvm.nd.array(x.astype('float32'))
m.set_input("data", data_tvm)
m.set_input(**{k: tvm.nd.array(v, ctx) for k, v in params.items()})
m.run()
e = m.module.time_evaluator("run", ctx, number=2000, repeat=3)
t = e(data_tvm).results
t = np.array(t) * 1000
```
But in the end it runs all of this under auto-tuning (autotvm), or am I reading this wrongly?
```python
def main():
    with tvm.target.cuda():
        with autotvm.apply_history_best(args.log_file):
            for batch in [1, 16]:
                for name in ['vgg-19', 'resnet-50', 'resnext-50', 'inception_v3', 'drn-c-26', 'dcn-resnet-101']:
                    bench(name, batch)
```
Can I modify this code for a simple use case of quantization to int8, without autotvm and tuning?
Or,
can someone share an example that simply quantizes a model, then compiles its graph, and then runs the graph?