The inference time is longer after int8 quantization

My hardware is a Tesla V100.
My inference code is:

def run_mxnet():
    # sym_mxnet, arg_params, aux_params, batch_shape and x are defined globally
    executor = sym_mxnet.simple_bind(ctx=mx.gpu(0), data=batch_shape,
                                     grad_req='null', force_rebind=True)
    executor.copy_params_from(arg_params, aux_params)
    print('Warming up MXNet')
    for i in range(10):
        y_gen = executor.forward(is_train=False, data=x)
        y_gen[0].wait_to_read()
    # Timing (wall clock, since the work happens on the GPU)
    print('Starting MXNet timed run')
    start = time.time()
    for i in range(1000):
        y_gen = executor.forward(is_train=False, data=x)
        y_gen[0].wait_to_read()
    end = time.time()
    print(end - start)

And my quantization code is:

def run_quantize():
    # sym_mxnet, batch_shape and x are defined globally; ctx is tvm.gpu(0)
    sym, _ = relay.frontend.from_mxnet(sym_mxnet, {'data': batch_shape})
    sym, params = testing.create_workload(sym['main'])
    # mod, params = relay.frontend.from_mxnet(sym, shape={'data': batch_shape}, arg_params=arg_params, aux_params=aux_params)
    with relay.quantize.qconfig(skip_k_conv=0, round_for_shift=True):
        net = relay.quantize.quantize(sym['main'], params=params)
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(net, 'cuda', 'llvm', params=params)
    m = graph_runtime.create(graph, lib, ctx)
    data_tvm = tvm.nd.array(x.astype('float32'))
    m.set_input(**{k: tvm.nd.array(v, ctx) for k, v in params.items()})
    print('Warming up TVM')
    for i in range(10):
        m.set_input("data", data_tvm)
        m.run()
        tvm_output = m.get_output(0)
    # Timing (wall clock)
    print('Starting TVM timed run')
    m.set_input("data", data_tvm)
    start = time.time()
    for i in range(1000):
        m.run()
    tvm_output = m.get_output(0)  # copy to host forces a device sync before stopping the clock
    end = time.time()
    print(end - start)

As a result, the total time for 1000 runs is 292 s with MXNet and 384 s with the quantized TVM model.
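As a side note, timing with a Python loop can be noisy. TVM's graph runtime exposes a time_evaluator that synchronizes the device and averages over repeated runs. A minimal sketch, assuming the m module and ctx from the code above (the number/repeat values are just examples):

    # Measure the "run" function 100 times per repeat; the device is synchronized,
    # so the reported time includes the GPU work.
    ftimer = m.module.time_evaluator("run", ctx, number=100, repeat=3)
    print("Mean inference time: %.2f ms" % (ftimer().mean * 1000))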


Same problem here on x86. I am currently testing a couple of models with quantization:

  • First model
    TVM FP32: 35.05 ms
    TVM int8 quantization: 80 ms
    TVM int8 quantization + AutoTVM: 46.87 ms

  • Second model
    TVM FP32: 72.85 ms
    TVM int8 quantization: 159.33 ms
    TVM int8 quantization + AutoTVM: 112.39 ms

As you can see, TVM with FP32 is faster than with int8 quantization, even when using AutoTVM.

It would be interesting if you could also measure TVM without quantization, i.e., remove the following:

    with relay.quantize.qconfig(skip_k_conv=0, round_for_shift=True):
        net = relay.quantize.quantize(sym['main'], params=params)

That would give you a better sense of the performance differences, since right now you are only comparing against the MXNet runtime (a sketch of the FP32 path is below). In addition, you should try AutoTVM. It would be nice if you could share the results once you have them.
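For reference, a minimal sketch of that FP32 baseline, assuming the same sym, params, ctx and x as in the quantization code above; it is the same build pipeline, just without the quantize pass:

    # FP32 baseline: build the unquantized Relay function directly
    with relay.build_config(opt_level=3):
        graph, lib, fp32_params = relay.build(sym['main'], 'cuda', 'llvm', params=params)
    m_fp32 = graph_runtime.create(graph, lib, ctx)
    m_fp32.set_input(**{k: tvm.nd.array(v, ctx) for k, v in fp32_params.items()})
    m_fp32.set_input("data", tvm.nd.array(x.astype('float32')))
    ftimer = m_fp32.module.time_evaluator("run", ctx, number=100, repeat=3)
    print("FP32 mean inference time: %.2f ms" % (ftimer().mean * 1000))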

Well, actually it seems that AutoTVM does help improve performance. In my experience, quantization alone does not bring much improvement and quite often results in a slowdown, so auto-tuning is a must to achieve decent performance.
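For anyone landing on this thread later, here is a rough sketch of the AutoTVM flow with the same Relay-era API used above. The log file name and the tuning options are placeholders, and net/params come from the quantization code earlier in the thread:

    # Sketch of the AutoTVM tuning flow; 'tuning.log' and the trial counts are placeholders
    from tvm import autotvm

    tasks = autotvm.task.extract_from_program(net, target='cuda',
                                              params=params, ops=(relay.op.nn.conv2d,))
    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(number=10, repeat=1, min_repeat_ms=100))
    for task in tasks:
        tuner = autotvm.tuner.XGBTuner(task)
        tuner.tune(n_trial=min(1000, len(task.config_space)),
                   measure_option=measure_option,
                   callbacks=[autotvm.callback.log_to_file('tuning.log')])

    # Rebuild with the best configs found during tuning
    with autotvm.apply_history_best('tuning.log'):
        with relay.build_config(opt_level=3):
            graph, lib, params = relay.build(net, 'cuda', 'llvm', params=params)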