[SOLVED] Auto-tuning CUDA: Poor Performance


Update: I figured out where the massive performance difference comes from. My timed region included the call to executor.evaluate(), not just the prediction itself. So this issue is solved. Feel free to delete this post. I’d do it myself, but I don’t seem to have permission to delete my own topic.
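For anyone who runs into the same thing, here is a minimal sketch of the pitfall in plain Python. It does not use TVM: `fake_evaluate` is a hypothetical stand-in that assumes executor.evaluate() does expensive one-time setup and returns a fast callable, and `mean_ms` is a helper name made up for this example. The sleep duration is arbitrary and only there to make the effect visible.

```python
import time


def fake_evaluate(setup_seconds=0.05):
    """Hypothetical stand-in for executor.evaluate().

    Assumption: the real call does slow one-time setup and returns
    a callable that is cheap to invoke afterwards.
    """
    time.sleep(setup_seconds)   # simulates the one-time setup cost
    return lambda x: x * 2      # simulates the fast prediction function


def mean_ms(fn, n_runs=10, warmup=1):
    """Average wall-clock time of fn() in milliseconds over n_runs calls."""
    for _ in range(warmup):     # untimed warm-up calls
        fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs * 1000.0


# Pitfall: the setup call sits inside the timed region, so it is
# repeated and measured on every run.
slow_ms = mean_ms(lambda: fake_evaluate()(3))

# Fix: do the setup once, outside the timed region, and time only the call.
run = fake_evaluate()
fast_ms = mean_ms(lambda: run(3))

print(f"setup inside timer:  {slow_ms:.3f} ms/call")
print(f"setup outside timer: {fast_ms:.3f} ms/call")
```

On a real GPU you would probably also want to make sure the device has finished its work before stopping the timer, since kernel launches are typically asynchronous.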


I’m trying to use TVM to auto-tune neural networks that I import from .onnx files.
I followed this tutorial on how to load ONNX files and this tutorial on how to extract and tune tasks. I set target='cuda', n_trial=2000 and early_stopping=600, just as in the tutorial.

I ran some benchmarks and noticed that even after tuning with n_trial=2000 (early_stopping=600), TVM is a lot slower than TensorFlow GPU. The times are also not much better than without any auto-tuning at all. Here are some numbers (average time per prediction in ms, over 10 runs):

    Model                   TensorFlow GPU  TVM (no tuning) TVM (tuning)
    AlexNet (CIFAR-10)      4.832           2778.06         1865.881
    ResNet50 (CIFAR-10)     6.190           3001.779        2540.002
    ResNet50 (ImageNet)     8.779           2832.737        2587.481
    WideResNet (CIFAR-10)   4.130           2690.224        2584.722
    WideResNet (ImageNet)   6.433           2737.025        2548.646

Does anyone have an idea why TVM is so much slower than TensorFlow and why auto-tuning doesn’t improve the performance very much?