TVM model warmup needs much more time than MXNet

Question

I modified tvm/app/benchmark/gpu_imagenet_bench.py and tutorial/from_mxnet.py to run inference in a loop of 1000 iterations, in order to measure the ResNet-18 speedup.

The average time, excluding the first few iterations, looks good, but the first few iterations are very slow.
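For reference, here is a minimal sketch of how the per-iteration times can be split into warmup and steady state. It is not the actual modified script: `run_once` and `sync` are hypothetical placeholders for one inference call and for a device synchronization call, and the default of two warmup iterations is an assumption taken from the results below.

```python
import time


def time_iterations(run_once, n_iters, sync=None):
    """Return a list of per-iteration wall-clock times in seconds.

    run_once: hypothetical callable that performs one inference.
    sync: optional hypothetical callable that blocks until queued
          device work has finished, so asynchronous GPU work is
          included in the measured time.
    """
    times = []
    for _ in range(n_iters):
        start = time.perf_counter()
        run_once()
        if sync is not None:
            sync()
        times.append(time.perf_counter() - start)
    return times


def summarize(times, n_warmup=2):
    """Split times into warmup iterations and a steady-state average."""
    if len(times) <= n_warmup:
        raise ValueError("need more iterations than warmup iterations")
    warmup = times[:n_warmup]
    steady = times[n_warmup:]
    return warmup, sum(steady) / len(steady)


# Stand-in workload so the sketch runs without a model or GPU.
demo_times = time_iterations(lambda: sum(range(1000)), n_iters=10)
demo_warmup, demo_average = summarize(demo_times)
print("warmup iterations (s):", demo_warmup)
print("steady-state average (s):", demo_average)
```

Reporting the first iterations separately keeps the one-off startup cost from being hidden in, or inflating, the average.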

Platform

i7 + 1080Ti
TVM with CUDA + cuDNN + cuBLAS
CUDA version: 8.0

Result

average

benchmark: 1.39 ms
from_mxnet: 1.4 ms
MXNet 1.4 + cuDNN: 10.49 ms

first two iterations

from_mxnet: 7.53 s, 18.7 ms
MXNet 1.4 + cuDNN: 0.097 s, 12 ms
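To put the warmup cost in perspective, the sketch below estimates after how many iterations the total TVM time drops below the total MXNet time. It assumes the averages above are the per-iteration cost of every iteration after the first two, which is my reading of the numbers and not something measured separately. The helper names are made up for this illustration.

```python
def cumulative_seconds(first_iters, steady, n):
    """Total time for n iterations: listed first iterations, then steady cost."""
    head = sum(first_iters[:n])
    tail = steady * max(0, n - len(first_iters))
    return head + tail


def break_even_iterations(slow_start, fast_start, max_iters=100000):
    """Smallest n at which slow_start's total time is below fast_start's.

    Each argument is (list of first-iteration times, steady per-iteration
    time), all in seconds. Returns None if no such n exists up to max_iters.
    """
    for n in range(1, max_iters + 1):
        if cumulative_seconds(*slow_start, n) < cumulative_seconds(*fast_start, n):
            return n
    return None


# Numbers from the results above, converted to seconds.
tvm = ([7.53, 0.0187], 0.0014)
mxnet = ([0.097, 0.012], 0.01049)

print(break_even_iterations(tvm, mxnet))  # prints 821
```

Under that assumption, TVM only comes out ahead in total time after roughly 820 iterations, so the warmup cost matters a lot for short-running jobs.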

Note

The first two warmup iterations in TVM take a very long time (7.53 s for the first one), while MXNet's warmup looks fine. Is this normal? If so, how can I reduce the warmup time, and is there anything in TVM itself that could be optimized here? Thanks a lot!