CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES despite correct target architecture set for auto-tuning

Issue description

Inference on NVIDIA Tesla T4 with GluonCV model mobilenetv2_1.0 auto-tuned with set_cuda_target_arch('sm_75') for batch size 10 and compiled at opt_level=1 fails with CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES.

Steps to reproduce the issue

  1. Prepare hardware and environment that meet the requirements for TVM auto-tuning on an NVIDIA Tesla T4
  2. Set target architecture to that of Tesla T4 by executing tvm.autotvm.measure.measure_methods.set_cuda_target_arch('sm_75') before auto-tuning
  3. Execute auto-tuning for batch size 10 of the GluonCV 0.7.0 classification model mobilenetv2_1.0 according to the tutorial for NVIDIA GPU (https://docs.tvm.ai/tutorials/autotvm/tune_relay_cuda.html), in the environment prepared in step 1, with target architecture set as in step 2
  4. Compile the tuned model at opt_level=1
  5. Execute inference with the tuned and compiled model on batches of size 10 of COCO image data

What’s the expected result?

  • Inference succeeds without errors

What’s the actual result?

  • Inference fails with CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

Additional details

Suggested solutions

  • Fix TVM so that the correct target architecture setting yields expected results

How many tuning steps did you run in step 3? And did AutoTVM tuning return reasonable results (for example, positive GFLOPS in the output)?

In my experience, current TVM does have a problem here and does not handle CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES well. But with enough tuning steps, it should find a schedule that runs successfully.

Actually, there is an IR pass, VerifyGPUCode, that tries to detect such invalid schedules, but it seems it is not enabled by default.

Thanks for responding and for your advice.

If by tuning steps you mean the parameter n_trial (the number of configurations tried during tuning), I used the settings in the tutorial, i.e., n_trial = 2000 and early_stopping = 600 (stop trying if finding nothing better after 600 tried configurations). I used XGBTuner.
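To make the interplay of the two settings concrete, here is a minimal pure-Python sketch of how I understand n_trial and early_stopping to interact. It is not AutoTVM code: the function name tune and the list of GFLOPS values are made up for illustration, and the real tuner measures configurations in batches, so the exact stopping point may differ.

```python
from itertools import islice


def tune(measurements, n_trial, early_stopping):
    """Hypothetical stand-in for a tuning loop.

    measurements   : iterable of GFLOPS values, one per tried configuration
    n_trial        : hard cap on the number of configurations tried
    early_stopping : stop once this many trials pass without a new best
    Returns (best_gflops, trials_run).
    """
    best_gflops = 0.0
    best_trial = 0
    trials_run = 0
    # islice enforces the n_trial cap; enumerate counts trials from 1.
    for trial, gflops in enumerate(islice(measurements, n_trial), start=1):
        trials_run = trial
        if gflops > best_gflops:
            best_gflops, best_trial = gflops, trial
        elif trial - best_trial >= early_stopping:
            # Nothing better was found for early_stopping trials in a row.
            break
    return best_gflops, trials_run


if __name__ == "__main__":
    fake_gflops = [1.0, 5.0, 3.0, 2.0, 4.0, 9.0]
    print(tune(fake_gflops, n_trial=10, early_stopping=3))
```

In this toy run the loop stops before reaching the 9.0 entry, which is the kind of case where raising n_trial alone does not help unless early_stopping is raised as well.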

The output (log) of the tuning looks fine in my opinion - there are a few segfaults now and then, but GFLOPS are all positive and all tuning tasks reach completion. This output is similar to cases where inference at opt_level 1 is successful after tuning.

I will try higher values of n_trial, as I think you are suggesting.

I was not aware of the pass VerifyGPUCode before you pointed it out to me, so thank you for that. Looking at its API, it seems you need to explicitly pass it the hardware constraints of your particular GPU, and it then checks that the memory usage and the number of threads in each thread block satisfy those constraints. It seems to me that my “Suggested solution” in this topic would involve making that check implicit, if it is not already, when specifying a tuning target like “sm_75”.
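As a rough illustration of the kind of check I mean, here is a plain-Python sketch. It does not call TVM and is not the VerifyGPUCode implementation; the names GpuLimits, T4_LIMITS and check_launch are hypothetical, and the limit values are what I assume for a Tesla T4 (compute capability 7.5), so they should be confirmed on the actual device. It also ignores register usage, which as far as I know can trigger the same error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLimits:
    """Hypothetical container for per-block hardware limits."""
    max_threads_per_block: int
    max_shared_memory_per_block: int  # bytes
    max_thread_x: int
    max_thread_y: int
    max_thread_z: int


# Assumed values for a Tesla T4 (sm_75); query the real device to confirm.
T4_LIMITS = GpuLimits(
    max_threads_per_block=1024,
    max_shared_memory_per_block=48 * 1024,
    max_thread_x=1024,
    max_thread_y=1024,
    max_thread_z=64,
)


def check_launch(block_dim, shared_bytes, limits):
    """Return a list of violated constraints; an empty list means the launch fits."""
    x, y, z = block_dim
    problems = []
    if x * y * z > limits.max_threads_per_block:
        problems.append("too many threads per block")
    if x > limits.max_thread_x:
        problems.append("thread x extent too large")
    if y > limits.max_thread_y:
        problems.append("thread y extent too large")
    if z > limits.max_thread_z:
        problems.append("thread z extent too large")
    if shared_bytes > limits.max_shared_memory_per_block:
        problems.append("too much shared memory per block")
    return problems


if __name__ == "__main__":
    # 64 * 32 = 2048 threads exceeds the assumed 1024-thread limit.
    print(check_launch((64, 32, 1), 0, T4_LIMITS))
```

The idea behind the suggestion is that a table like T4_LIMITS could be derived from the target (for example "sm_75") instead of being passed by hand, so that schedules failing such a check are rejected before they end up in the tuning log.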