CUDA Error cuModuleUnload failed with CUDA_ERROR_ILLEGAL_ADDRESS

Env:

GPU: GTX 1080 Ti
Ubuntu 16.04
CUDA 10.0
cuDNN 7.4

Source code:
https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_cuda.html#sphx-glr-tutorials-autotvm-tune-nnvm-cuda-py

Error:

Extract tasks...
Tuning...
[Task  1/12]  Current/Best:  262.32/3720.19 GFLOPS | Progress: (1296/2000) | 1852.64 s Done.
[Task  2/12]  Current/Best:  207.12/ 892.76 GFLOPS | Progress: (792/2000) | 1024.41 s Done.
[Task  3/12]  Current/Best:   57.75/1229.26 GFLOPS | Progress: (936/2000) | 1169.83 s Done.
[Task  4/12]  Current/Best: 2407.01/4956.75 GFLOPS | Progress: (144/2000) | 181.88 sterminate called after throwing an instance of 'dmlc::Error'
  what():  [00:52:33] /home/wxf/tvm_prj/tvm/src/runtime/cuda/cuda_module.cc:41: CUDAError: cuModuleUnload(module_[i]) failed with error: CUDA_ERROR_ILLEGAL_ADDRESS

Stack trace returned 10 entries:
[bt] (0) /home/wxf/tvm_prj/tvm/build/libtvm.so(dmlc::StackTrace[abi:cxx11](unsigned long)+0x70) [0x7fd234b4d440]
[bt] (1) /home/wxf/tvm_prj/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x2d) [0x7fd234b4d00d]
[bt] (2) /home/wxf/tvm_prj/tvm/build/libtvm.so(tvm::runtime::CUDAModuleNode::~CUDAModuleNode()+0x182) [0x7fd235259752]
[bt] (3) /home/wxf/tvm_prj/tvm/build/libtvm.so(std::_Sp_counted_ptr_inplace<tvm::runtime::DSOModuleNode, std::allocator<tvm::runtime::DSOModuleNode>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x99) [0x7fd2352277d9]
[bt] (4) /home/wxf/tvm_prj/tvm/build/libtvm.so(+0x104b749) [0x7fd23522d749]
[bt] (5) /home/wxf/tvm_prj/tvm/build/libtvm.so(+0x10604b1) [0x7fd2352424b1]
[bt] (6) /home/wxf/tvm_prj/tvm/build/libtvm.so(tvm::runtime::RPCFreeFunc(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x3f) [0x7fd23524009f]
[bt] (7) /home/wxf/tvm_prj/tvm/build/libtvm.so(void tvm::runtime::RPCSession::EventHandler::CallHandler<void (*)(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)>(void (*)(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*))+0x5f) [0x7fd23524385f]
[bt] (8) /home/wxf/tvm_prj/tvm/build/libtvm.so(tvm::runtime::RPCSession::EventHandler::HandlePackedCall()+0x328) [0x7fd235241758]
[bt] (9) /home/wxf/tvm_prj/tvm/build/libtvm.so(tvm::runtime::RPCSession::EventHandler::SwitchToState(tvm::runtime::RPCSession::EventHandler::State)+0x297) [0x7fd2352443b7]


[Task  4/12]  Current/Best: 1131.32/4956.75 GFLOPS | Progress: (360/2000) | 453.93 sterminate called after throwing an instance of 'dmlc::Error'
  what():  [00:56:44] /home/wxf/tvm_prj/tvm/src/runtime/cuda/cuda_module.cc:41: CUDAError: cuModuleUnload(module_[i]) failed with error: CUDA_ERROR_ILLEGAL_ADDRESS

Stack trace returned 10 entries:
[bt] (0) /home/wxf/tvm_prj/tvm/build/libtvm.so(dmlc::StackTrace[abi:cxx11](unsigned long)+0x70) [0x7fd234b4d440]
[bt] (1) /home/wxf/tvm_prj/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x2d) [0x7fd234b4d00d]
[bt] (2) /home/wxf/tvm_prj/tvm/build/libtvm.so(tvm::runtime::CUDAModuleNode::~CUDAModuleNode()+0x182) [0x7fd235259752]
[bt] (3) /home/wxf/tvm_prj/tvm/build/libtvm.so(std::_Sp_counted_ptr_inplace<tvm::runtime::DSOModuleNode, std::allocator<tvm::runtime::DSOModuleNode>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x99) [0x7fd2352277d9]
[bt] (4) /home/wxf/tvm_prj/tvm/build/libtvm.so(+0x104b749) [0x7fd23522d749]
[bt] (5) /home/wxf/tvm_prj/tvm/build/libtvm.so(+0x10604b1) [0x7fd2352424b1]
[bt] (6) /home/wxf/tvm_prj/tvm/build/libtvm.so(tvm::runtime::RPCFreeFunc(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x3f) [0x7fd23524009f]
[bt] (7) /home/wxf/tvm_prj/tvm/build/libtvm.so(void tvm::runtime::RPCSession::EventHandler::CallHandler<void (*)(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)>(void (*)(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*))+0x5f) [0x7fd23524385f]
[bt] (8) /home/wxf/tvm_prj/tvm/build/libtvm.so(tvm::runtime::RPCSession::EventHandler::HandlePackedCall()+0x328) [0x7fd235241758]
[bt] (9) /home/wxf/tvm_prj/tvm/build/libtvm.so(tvm::runtime::RPCSession::EventHandler::SwitchToState(tvm::runtime::RPCSession::EventHandler::State)+0x297) [0x7fd2352443b7]


terminate called after throwing an instance of 'dmlc::Error'
  what():  [00:57:01] /home/wxf/tvm_prj/tvm/src/runtime/cuda/cuda_module.cc:41: CUDAError: cuModuleUnload(module_[i]) failed with error: CUDA_ERROR_ILLEGAL_ADDRESS

Stack trace returned 10 entries:
[bt] (0) /home/wxf/tvm_prj/tvm/build/libtvm.so(dmlc::StackTrace[abi:cxx11](unsigned long)+0x70) [0x7fd234b4d440]
[bt] (1) /home/wxf/tvm_prj/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x2d) [0x7fd234b4d00d]
[bt] (2) /home/wxf/tvm_prj/tvm/build/libtvm.so(tvm::runtime::CUDAModuleNode::~CUDAModuleNode()+0x182) [0x7fd235259752]
[bt] (3) /home/wxf/tvm_prj/tvm/build/libtvm.so(std::_Sp_counted_ptr_inplace<tvm::runtime::DSOModuleNode, std::allocator<tvm::runtime::DSOModuleNode>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x99) [0x7fd2352277d9]
[bt] (4) /home/wxf/tvm_prj/tvm/build/libtvm.so(+0x104b749) [0x7fd23522d749]
[bt] (5) /home/wxf/tvm_prj/tvm/build/libtvm.so(+0x10604b1) [0x7fd2352424b1]
[bt] (6) /home/wxf/tvm_prj/tvm/build/libtvm.so(tvm::runtime::RPCFreeFunc(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)+0x3f) [0x7fd23524009f]
[bt] (7) /home/wxf/tvm_prj/tvm/build/libtvm.so(void tvm::runtime::RPCSession::EventHandler::CallHandler<void (*)(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)>(void (*)(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*))+0x5f) [0x7fd23524385f]
[bt] (8) /home/wxf/tvm_prj/tvm/build/libtvm.so(tvm::runtime::RPCSession::EventHandler::HandlePackedCall()+0x328) [0x7fd235241758]
[bt] (9) /home/wxf/tvm_prj/tvm/build/libtvm.so(tvm::runtime::RPCSession::EventHandler::SwitchToState(tvm::runtime::RPCSession::EventHandler::State)+0x297) [0x7fd2352443b7]

SRC Code

src version, commit id: 242daeea0b7d9b8d8943e5feb6dd0bea555508f8

I’m getting a similar error at the same place. I can reproduce it with the PTX below: cuModuleLoadData, cuModuleGetFunction, and cuLaunchKernel all return CUDA_SUCCESS, and only cuModuleUnload returns CUDA_ERROR_ILLEGAL_ADDRESS. I suspect a CUDA bug.

(The PTX is in a Google doc because it is too big to post here.)
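
For anyone who wants to narrow down a failure like this, here is a minimal sketch in Python (via ctypes) of the call sequence described above, with one extra step: a cuCtxSynchronize between the launch and the unload. Kernel launches are asynchronous, so a fault inside a kernel is usually reported by a later call rather than by cuLaunchKernel itself. Synchronizing first shows whether the error is already pending before cuModuleUnload runs. The helper names `check` and `run_once`, the single-thread launch configuration, and the assumption that the kernel takes no parameters are illustrative only and are not taken from the original repro.

```python
import ctypes

CUDA_SUCCESS = 0
# Only the code seen in this thread is named; others are printed numerically.
ERROR_NAMES = {700: "CUDA_ERROR_ILLEGAL_ADDRESS"}


def check(name, status):
    """Raise if a driver call returned anything other than CUDA_SUCCESS."""
    if status != CUDA_SUCCESS:
        label = ERROR_NAMES.get(status, f"CUresult {status}")
        raise RuntimeError(f"{name} failed with {label}")


def run_once(driver, ptx, kernel_name):
    """Load a PTX module, launch one kernel, synchronize, then unload.

    `driver` is expected to expose the CUDA driver API entry points, for
    example a ctypes handle to libcuda. `ptx` and `kernel_name` are bytes.
    Assumption: the kernel takes no parameters.
    """
    dev = ctypes.c_int()
    ctx = ctypes.c_void_p()
    mod = ctypes.c_void_p()
    fn = ctypes.c_void_p()

    check("cuInit", driver.cuInit(0))
    check("cuDeviceGet", driver.cuDeviceGet(ctypes.byref(dev), 0))
    check("cuCtxCreate", driver.cuCtxCreate_v2(ctypes.byref(ctx), 0, dev))
    check("cuModuleLoadData", driver.cuModuleLoadData(ctypes.byref(mod), ptx))
    check("cuModuleGetFunction",
          driver.cuModuleGetFunction(ctypes.byref(fn), mod, kernel_name))
    # Grid 1x1x1, block 1x1x1, no shared memory, default stream, no arguments.
    check("cuLaunchKernel",
          driver.cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, None, None, None))
    # Extra step: wait for the kernel so a pending fault is reported here
    # instead of by cuModuleUnload.
    check("cuCtxSynchronize", driver.cuCtxSynchronize())
    check("cuModuleUnload", driver.cuModuleUnload(mod))
    check("cuCtxDestroy", driver.cuCtxDestroy_v2(ctx))
```

On a Linux machine with the NVIDIA driver installed, this could be called as `run_once(ctypes.CDLL("libcuda.so.1"), ptx_bytes, b"kernel_name")`, where `ptx_bytes` and `kernel_name` are placeholders for the actual PTX and entry point. If cuCtxSynchronize already reports CUDA_ERROR_ILLEGAL_ADDRESS, the fault happened while the kernel ran. If only cuModuleUnload fails, that points more toward the driver.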

Update: Reinstalling the video driver seems to have resolved the issue, so it was probably a CUDA driver bug.