Int8 GPU autotune performance issue on Tesla P4

Hi, we tried to autotune ResNet-50 on a Tesla P4 with the latest TVM, using this script: https://github.com/vinx13/tvm-cuda-int8-benchmark/blob/master/tune_relay_int8_cuda.py. The best latency we got from TVM is 2.75 ms, while TensorRT gets 1.75 ms. According to the blog post https://tvm.ai/2019/04/29/opt-cuda-quantized, TVM's ResNet-50 performance should be the same as or better than TensorRT's. Does anyone know what we could be doing wrong here?

Things you may need to check:

  1. How is your model quantized? In the blog we quantize all conv and dense layers.
  2. Is there a particular layer that is very slow?
  3. Your tuning settings (you can try increasing n_trial and early_stop).
  1. We quantized the model exactly as in tune_relay_int8_cuda.py: `with relay.quantize.qconfig(store_lowbit_output=False): mod['main'] = relay.quantize.quantize(mod['main'], params=params)`
  2. We didn't find any conv2d layer that is very slow. Is there any way to autotune the dense layers?
  3. We used n_trial=2000 and early_stop=600 (roughly the invocation sketched below).
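
For reference, a sketch of how we drive the tuner, following tune_relay_int8_cuda.py; `tasks` comes from `autotvm.task.extract_from_program`, and the `measure_option` numbers here are illustrative rather than our exact values:

```python
from tvm import autotvm

# Illustrative measurement settings; ours follow the script's defaults.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=10),
    runner=autotvm.LocalRunner(number=20, repeat=3, timeout=4, min_repeat_ms=150),
)

for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task, loss_type='rank')
    tuner.tune(
        n_trial=min(2000, len(task.config_space)),
        early_stopping=600,  # the tune() argument corresponding to early_stop above
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file('resnet50-int8.log')],
    )
```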

The default qconfig has changed a bit if you are using tvm master. You need to set skip_conv_layers so that the first conv layer is quantized too (it may cause some accuracy drop; it's a speed/accuracy trade-off). Dense layers are not quantized by default right now. To tune dense layers, include the dense op when you extract tasks; CUDA dense is tunable.
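
A minimal sketch of both changes, assuming `skip_conv_layers=[]` quantizes all conv layers and that `extract_from_program` accepts the ops tuple shown; check both against your TVM version:

```python
from tvm import relay, autotvm

# Quantize the first conv layer too (the default skip_conv_layers=[0]
# keeps it in fp32); this may cost some accuracy.
with relay.quantize.qconfig(store_lowbit_output=False, skip_conv_layers=[]):
    mod['main'] = relay.quantize.quantize(mod['main'], params=params)

# Extract tuning tasks for dense as well as conv2d.
tasks = autotvm.task.extract_from_program(
    mod['main'], target=target, params=params,
    ops=(relay.op.nn.conv2d, relay.op.nn.dense),
)
```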