Compiling the official MobileNetV2 ONNX model gives very slow performance

Hi,
I use the official ONNX MobileNetV2 model ([mobilenetv2-1.0.onnx](https://s3.amazonaws.com/onnx-model-zoo/mobilenet/mobilenetv2-1.0/mobilenetv2-1.0.onnx)) and follow the tutorial at https://docs.tvm.ai/tutorials/frontend/from_onnx.html, but I get very slow performance: each inference takes 8.9 s on an Nvidia P40.
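Roughly what I am doing, following the tutorial (a minimal sketch; the input name 'data' and the 1x3x224x224 shape are assumptions for this model, and the exact API may differ slightly between TVM versions):

```python
import numpy as np
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load('mobilenetv2-1.0.onnx')
target = 'cuda'
ctx = tvm.gpu(0)

# input name and shape are assumptions for this model
shape_dict = {'data': (1, 3, 224, 224)}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# build and run through create_executor, as in the tutorial
executor = relay.build_module.create_executor('graph', mod, ctx, target)
x = np.random.uniform(size=(1, 3, 224, 224)).astype('float32')
tvm_output = executor.evaluate()(tvm.nd.array(x, ctx), **params).asnumpy()
```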
When I compile the model, I get lots of warnings like:

WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('conv2d', (1, 96, 56, 56, 'float32'), (24, 96, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -model=unknown, workload=('depthwise_conv2d_nchw', (1, 96, 112, 112, 'float32'), (96, 1, 3, 3, 'float32'), (2, 2), (1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.

Has anyone run into this?

Those warnings come from AutoTVM; you can ignore them. They are just hints that no tuned schedule was found for your workloads and a fallback is being used. If you tune the model with AutoTVM, you can get much better performance.
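A rough sketch of tuning with AutoTVM and then rebuilding with the tuned log (the log file name is a placeholder, and the exact extract_from_program / tuner API varies between TVM versions):

```python
from tvm import autotvm, relay
from tvm.autotvm.tuner import XGBTuner

# assumes `mod`, `params` and `target` from the compile step above
tasks = autotvm.task.extract_from_program(mod['main'], target=target, params=params,
                                          ops=(relay.op.get('nn.conv2d'),))

log_file = 'mobilenetv2_cuda.log'  # placeholder path for the tuning log
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1, timeout=4))

for task in tasks:
    tuner = XGBTuner(task)
    tuner.tune(n_trial=min(1000, len(task.config_space)),
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file(log_file)])

# rebuild with the tuned schedules; the fallback warnings should go away
with autotvm.apply_history_best(log_file):
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(mod, target=target, params=params)
```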

I found that whether I set the target to 'llvm' or 'cuda', the inference time is the same. Has anyone seen this?

You can try setting target = 'cuda -libs=cudnn'; it is very fast.
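Something like this (a sketch, assuming mod and params from relay.frontend.from_onnx; TVM has to be built with cuDNN support for this to work):

```python
from tvm import relay

# assumes `mod` and `params` from relay.frontend.from_onnx
target = 'cuda -libs=cudnn'
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, params=params)
```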

It still takes 0.027 s. If I don't add '-libs=cudnn' it costs 0.04 s, and with 'llvm' it costs 0.049 s. I don't know why.
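For reference, this is roughly how I time it (a sketch; `run` here stands for the callable returned by create_executor(...).evaluate(), `params` comes from from_onnx, and ctx matches the cuda target):

```python
import time
import numpy as np
import tvm

ctx = tvm.gpu(0)
x = np.random.uniform(size=(1, 3, 224, 224)).astype('float32')

run(tvm.nd.array(x, ctx), **params)   # warm-up; the first call is not representative
start = time.time()
for _ in range(100):
    out = run(tvm.nd.array(x, ctx), **params)
ctx.sync()                            # wait for the GPU to finish before stopping the clock
print('mean inference time: %.4f s' % ((time.time() - start) / 100))
```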

I found that if I build the model and run it through graph_runtime.create() instead of create_executor(), it is very fast.
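Roughly the change (a sketch; the input name 'data' is assumed, and relay.build returning (graph, lib, params) matches the older API from the tutorial's TVM version):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

# assumes `mod` and `params` from relay.frontend.from_onnx
target = 'cuda'
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, params=params)

ctx = tvm.gpu(0)
module = graph_runtime.create(graph, lib, ctx)
module.set_input(**params)

x = np.random.uniform(size=(1, 3, 224, 224)).astype('float32')
module.set_input('data', tvm.nd.array(x))  # 'data' is the assumed input name
module.run()
out = module.get_output(0).asnumpy()
```

With this, compilation happens once in relay.build(), and each module.run() only executes the pre-built graph.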

It would indeed be interesting if someone could clarify the difference between graph_runtime.create() and create_executor() in terms of performance.

Here there is some discussion about the differences between build (graph_runtime.create) and create_executor, but no explanation of why one would be faster than the other.