[auto tuning] Auto-tuning is really slow

I am trying to auto-tune a complex CNN on GPU. I registered 8 GPUs with the RPC server, but only one GPU is used during tuning, and its utilization is low, so the tuning speed is really slow. Any advice? Thanks.

@comaniac I followed the guide and tutorials to tune the model on four 1080 Ti GPUs. The GPU utilization is often low with occasional spikes; is that normal? Also, it turns out that tuning is not significantly faster than tuning on CPU. Is that expected?

You could follow the issue I referred to, since some people are already discussing multi-GPU tuning there.

I am not sure about the tuning speed you mentioned. Are you talking about the tuning time or the performance of the tuned kernel? The speedup should be obvious after tuning if you are talking about the performance; otherwise, you can try to increase the number of trials to 3,000 or 4,000 to make sure the search algorithm is able to find the best schedule.
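For reference, the trial count is set in the tutorial-style tuning options. Here is a minimal sketch modeled on the tutorial's tuning_option dictionary (the values are illustrative, and the real dictionary also carries a measure_option built with the autotvm API):

```python
# Illustrative tuning options modeled on the tutorial's tuning_option dict.
# Raising n_trial gives the search algorithm more chances per task to find
# a good schedule, at the cost of longer tuning time.
tuning_option = {
    "log_filename": "tune.log",
    "tuner": "xgb",           # XGBoost-based cost-model tuner
    "n_trial": 4000,          # raised from a smaller default
    "early_stopping": 600,    # stop a task early if no recent improvement
}
print(tuning_option["n_trial"])
```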

I mean that the tuning time itself is not sped up. Is that expected?

No, it’s not. The GPU is used for measuring kernel execution time during the tuning process; it is not used to accelerate the tuning process itself.
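To make that division of labor concrete, here is a schematic, TVM-free sketch (the names are illustrative, not the real autotvm API): the search algorithm runs on the host CPU and proposes candidate configurations, while the device is only invoked to time each candidate.

```python
import random

def measure_on_device(config):
    """Stand-in for compiling a candidate and timing it on the GPU.
    In real autotvm, this measurement is the only step that uses the GPU."""
    # Toy cost model: pretend a tile size of 16 is fastest.
    return abs(config["tile"] - 16) + random.random() * 0.01

def tune(n_trial=100):
    """The search itself runs on the host CPU; the device only measures."""
    space = [{"tile": t} for t in (1, 2, 4, 8, 16, 32, 64)]
    best_cfg, best_time = None, float("inf")
    for _ in range(n_trial):
        cfg = random.choice(space)     # search step, on the CPU
        cost = measure_on_device(cfg)  # measurement step, on the GPU
        if cost < best_time:
            best_cfg, best_time = cfg, cost
    return best_cfg

print(tune())
```

This is why adding GPUs speeds up measurement throughput at best, not the search itself.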

@comaniac Thank you very much. So is there any way to accelerate the tuning process? I tuned a detection model which contains more than 100 tasks, and each task costs about 2000 s, so the total time cost is huge. Any advice?
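For scale, the figures above work out to roughly 55 hours of end-to-end tuning:

```python
# Back-of-the-envelope total tuning time for the figures quoted above.
n_tasks = 100          # tasks in the detection model (roughly)
secs_per_task = 2000   # approximate tuning time per task, in seconds
total_hours = n_tasks * secs_per_task / 3600
print(round(total_hours, 1))  # about 55.6 hours
```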

Unfortunately, there’s no way to speed up this process in the current upstream. Although we are planning to solve this problem, it will take several months.

Meanwhile, if you are willing to try an experimental feature, I have filed an RFC that only needs to tune a small number of tasks in a model. You can patch the PR and give it a shot.

Thank you very much.
Best,
Edward

I tried your PR, but I cannot run it successfully. Is there a Docker image or Dockerfile?

You may need to pip3 install networkx if you run into a missing-package error.

I use the code in https://github.com/comaniac/tvm/tree/task_selector. First, I build ci-gpu with the Dockerfile at https://github.com/comaniac/tvm/blob/task_selector/docker/Dockerfile.ci_gpu. Then I build tvm:demo_gpu with the Dockerfile at https://github.com/comaniac/tvm/blob/task_selector/docker/Dockerfile.demo_gpu, except that I change https://github.com/comaniac/tvm/blob/task_selector/docker/install/install_tvm_gpu.sh#L24 to git clone --depth=1 https://github.com/comaniac/incubator-tvm --recursive, check out task-selector, and delete https://github.com/comaniac/tvm/blob/task_selector/docker/install/install_tvm_gpu.sh#L27.
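For anyone following along, the build sequence described above can be sketched roughly like this (the image tags are my own placeholders, and the edit to install_tvm_gpu.sh is done by hand as described above):

```shell
# Rough sketch of the build steps described above; image tags are illustrative.
git clone --depth=1 --recursive https://github.com/comaniac/tvm
cd tvm
git checkout task_selector

# Build the CI GPU image first.
docker build -t tvm.ci_gpu -f docker/Dockerfile.ci_gpu docker/

# Then build the demo GPU image on top of it, after hand-editing
# docker/install/install_tvm_gpu.sh as described above.
docker build -t tvm.demo_gpu -f docker/Dockerfile.demo_gpu docker/
```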

After the image was built, I tried your code:

autotvm.task.mark_depend(tasks)

But it didn’t work, with the error "no mark_depend in autotvm.task". Did I get it right? If not, how can I build the task_selector image? Thanks.

It seems you didn’t check out the branch correctly. The error says it cannot find mark_depend, which is a new function in my POC branch. After the Docker image is built, please make sure git branch shows the right branch, and that python/tvm/autotvm/task/__init__.py has the line from .select import mark_depend.

Please be reminded to use https://github.com/comaniac/tvm instead of https://github.com/comaniac/incubator-tvm.

Which branch should I use, select or task_selector?

You should use task_selector.

Just curious: PyTorch is installed with pip, so why would it contain the from .select import mark_depend line? Thank you very much.

I don’t think this has anything to do with PyTorch?

Sorry, my mistake. It is in python/tvm/autotvm/task/__init__.py; I was wrong.

Hi,

In the "Auto-tuning a convolutional network for NVIDIA GPU" tutorial (https://docs.tvm.ai/tutorials/autotvm/tune_relay_cuda.html), it is mentioned that the total running time of the script is 0 minutes 0.155 seconds.

Is that true? By using the RPC Tracker, can the total running time of about 4 hours be reduced to only 0.155 seconds?

Thank you very much!

# tune_and_evaluate(tuning_option)

The tuning statement is commented out so that actual tuning does not run when the web page is built.