Inference time is longer when using relay.quantize


#1

In my case, tvm is used on an Intel x86 CPU.

With the quantization step, time_evaluator reports 10 ms; without quantization, it reports 6 ms.
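For reference, time_evaluator runs the compiled function several times and reports the mean latency. A minimal pure-Python stand-in (the names `measure_ms` and `dummy_workload` are mine, for illustration; this is not the tvm API) that measures in the same repeat-and-average style:

```python
import timeit

def measure_ms(fn, number=10, repeat=3):
    """Run fn `number` times per measurement, `repeat` measurements,
    and return the mean per-run latency of each measurement in ms."""
    runs = timeit.repeat(fn, number=number, repeat=repeat)
    return [t / number * 1000.0 for t in runs]

def dummy_workload():
    # placeholder for module.run() on the compiled graph
    sum(i * i for i in range(1000))

per_run_ms = measure_ms(dummy_workload)
mean_ms = sum(per_run_ms) / len(per_run_ms)
```

Comparing the quantized and unquantized builds this way (same input, same number of repeats) is what produces the 10 ms vs 6 ms numbers above.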

My code is as follows:

load model

sym, arg_params, aux_params = mx.model.load_checkpoint("./mobnetv2_1.0_224x224", 120)

frontend

dtype = {'data': 'float32'}
data_shape = {'data': (1, 3, 224, 224)}
sym, params = tvm.relay.frontend.from_mxnet(sym, shape=data_shape, dtype=dtype, arg_params=arg_params, aux_params=aux_params)

quantization

sym = tvm.relay.quantize.quantize(sym, params)

build

target = "llvm"
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build_module.build(sym, target=target, params=params)
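For context on where the extra latency can come from: relay.quantize inserts quantize/requantize ops around the conv layers, and those only pay off if the backend actually has fast int8 kernels for the target. A minimal pure-Python sketch of the per-tensor affine int8 scheme it applies (function names and the scale value are mine, for illustration only):

```python
def quantize_int8(values, scale):
    """Round floats to int8 with the given scale, clipping to [-127, 127]."""
    q = []
    for v in values:
        r = round(v / scale)
        q.append(max(-127, min(127, r)))
    return q

def dequantize(qvalues, scale):
    """Map int8 values back to approximate floats."""
    return [q * scale for q in qvalues]

weights = [0.5, -1.2, 0.03, 2.0]
scale = 2.0 / 127               # assume max |w| = 2.0
qw = quantize_int8(weights, scale)
approx = dequantize(qw, scale)  # lossy round trip
```

Each of these conversion steps is pure overhead at inference time, which is one plausible reason a quantized MobileNetV2 can come out slower than float32 when the int8 compute itself is not faster on the target.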


#2

Have you tried using AutoTVM on a quantized model? If you know how, please tell me, thanks!


#3

I have noticed the same behavior targeting NVIDIA GPUs and x86.

I am unsure whether improvements can be expected from quantization alone, or whether AutoTVM is required together with quantization. It would be nice if someone could clarify this.