[AutoTVM] During tuning, cuModuleLoadData always raises CUDA_ERROR_UNKNOWN for some users

ake · July 1, 2019, 4:10pm

When I run the auto-tuning example, I always see RuntimeErrors. For example, this is the output of the “Tuning High Performance Convolution on NVIDIA GPUs” script:

Traceback (most recent call last):
  File "./run/tune_conv2d_cuda.py", line 152, in <module>
    func(a_tvm, w_tvm, c_tvm)
  File "/usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/_ffi/function.py", line 128, in __call__
    return f(*args)
  File "/usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/_ffi/_ctypes/function.py", line 185, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/usr/local/lib/python3.5/dist-packages/tvm-0.5.dev0-py3.5-linux-x86_64.egg/tvm/_ffi/base.py", line 72, in check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
tvm._ffi.base.TVMError: [08:36:36] /usr/local/tvm/src/runtime/module_util.cc:53: Check failed: ret == 0 (-1 vs. 0) [08:36:36] /usr/local/tvm/src/runtime/cuda/cuda_module.cc:91: CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_UNKNOWN

However, a teammate using the same container does not see these RuntimeErrors. Additionally, I can run simple PyTorch scripts on the GPU without a problem.
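Since the container image is the same for both users, the difference is presumably in per-user state. Below is a minimal sketch for capturing that state so the two outputs can be compared side by side. The helper name `user_environment_snapshot` and the choice of variables are assumptions made for illustration, not anything TVM defines, and the script only reports values without diagnosing anything.

```python
import os

# Environment variables that commonly influence how CUDA libraries are found
# or configured. This list is an assumption; extend it as needed.
CUDA_ENV_VARS = (
    "CUDA_VISIBLE_DEVICES",
    "CUDA_CACHE_PATH",
    "CUDA_HOME",
    "LD_LIBRARY_PATH",
    "PATH",
)


def user_environment_snapshot():
    """Collect per-user settings that may differ between two users of one container."""
    home = os.path.expanduser("~")
    snapshot = {
        "uid": os.getuid(),
        "gid": os.getgid(),
        "home": home,
        # Whether the current user can write to their home directory.
        "home_writable": os.access(home, os.W_OK),
    }
    for name in CUDA_ENV_VARS:
        # None means the variable is not set for this user.
        snapshot[name] = os.environ.get(name)
    return snapshot


if __name__ == "__main__":
    for key, value in user_environment_snapshot().items():
        print("{} = {}".format(key, value))
```

Running this as each user and diffing the two outputs should show whether the accounts differ in IDs, home directory permissions, or library paths.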

I found some related issues on the forum, but they didn’t give any hints as to how the problem was resolved:

TVMError CUDA_UNKOWN_ERROR, Cuda_error_unknown.

Any ideas on what could cause this, even if the error comes from the CUDA side? CUDA_ERROR_UNKNOWN is not very descriptive, but perhaps you’ve seen similar issues before.
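To help separate a driver-level problem from a TVM-level one, the CUDA driver API can be called directly, bypassing TVM. The following is a minimal sketch, assuming a Linux host where the driver library can be loaded as `libcuda`. `probe_cuda_driver` is a hypothetical helper written for this post. It does not reproduce the `cuModuleLoadData` call itself. It only checks that the driver initializes and reports a version and device count for the current user.

```python
import ctypes
import ctypes.util


def probe_cuda_driver():
    """Call a few CUDA driver API entry points directly and report the results."""
    report = {
        "library": None,
        "cuInit": None,          # CUresult code; 0 means CUDA_SUCCESS
        "driver_version": None,  # e.g. 10010 for CUDA 10.1
        "device_count": None,
        "error": None,
    }
    # Fall back to the usual Linux soname if the lookup finds nothing.
    name = ctypes.util.find_library("cuda") or "libcuda.so.1"
    try:
        lib = ctypes.CDLL(name)
    except OSError as err:
        report["error"] = str(err)
        return report
    report["library"] = name

    # cuInit must succeed before most other driver calls are valid.
    report["cuInit"] = lib.cuInit(0)
    if report["cuInit"] != 0:
        return report

    version = ctypes.c_int(0)
    if lib.cuDriverGetVersion(ctypes.byref(version)) == 0:
        report["driver_version"] = version.value

    count = ctypes.c_int(0)
    if lib.cuDeviceGetCount(ctypes.byref(count)) == 0:
        report["device_count"] = count.value
    return report


if __name__ == "__main__":
    for key, value in probe_cuda_driver().items():
        print("{} = {}".format(key, value))
```

If this reports a nonzero `cuInit` code or a load error for one user but not the other, the difference likely lies below TVM. If both users get identical results, the module load step is the next place to look.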