Autotune Relay Mobile GPU - CUDA_ERROR_INVALID_PTX on Nano and TX2

I just followed the tutorial at https://docs.tvm.ai/tutorials/autotvm/tune_relay_mobile_gpu.html and set the following:

target_host = 'llvm -target=aarch64-linux-gnu'
target = tvm.target.cuda()
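
The rest of the script follows the tutorial. For reference, the compile step looks roughly like this (a sketch of the 0.5-era API; log_file, net, and params are defined earlier in the tutorial script, nothing there was changed by me):

from tvm import autotvm, relay

# compile with the best records found during tuning (tutorial's compile step)
with autotvm.apply_history_best(log_file):
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(
            net, target=target, params=params, target_host=target_host)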

RPC tuning works just fine, but the final RPC execution fails with:

 File "/home/ubuntu/tvm/python/tvm/module.py", line 194, in evaluator
    blob = feval(*args)

  File "tvm/_ffi/_cython/./function.pxi", line 310, in tvm._ffi._cy3.core.FunctionBase.__call__

  File "tvm/_ffi/_cython/./function.pxi", line 245, in tvm._ffi._cy3.core.FuncCall

  File "tvm/_ffi/_cython/./function.pxi", line 234, in tvm._ffi._cy3.core.FuncCall3

  File "tvm/_ffi/_cython/./base.pxi", line 171, in tvm._ffi._cy3.core.CALL

tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (3) /home/schadem/tvm/build/libtvm.so(TVMFuncCall+0x70) [0x7f79bc0188]
  [bt] (2) /home/schadem/tvm/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::detail::PackFuncVoidAddr_<4, tvm::runtime::CUDAWrappedFunc>(tvm::runtime::CUDAWrappedFunc, std::vector<tvm::runtime::detail::ArgConvertCode, std::allocator<tvm::runtime::detail::ArgConvertCode> > const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0xe8) [0x7f79c33818]
  [bt] (1) /home/schadem/tvm/build/libtvm.so(tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const+0x6cc) [0x7f79c3364c]
  [bt] (0) /home/schadem/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4c) [0x7f7946413c]
  File "/home/schadem/tvm/src/runtime/cuda/cuda_module.cc", line 111
TVMError: Except caught from RPC call: [21:28:14] /home/schadem/tvm/src/runtime/module_util.cc:73: Check failed: ret == 0 (-1 vs. 0) : CUDAError: cuModuleLoadData(&(module_[device_id]),
data_.c_str()) failed with error: CUDA_ERROR_INVALID_PTX

During tuning, everything seems fine on both ends. (I intentionally cut the task runs and n_trial down because I just wanted to check whether execution succeeds; the result is the same with n_trial=1000 or n_trial=2000.)
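
The trimmed tuning options look like this (a sketch based on the tutorial; the device key 'tx2' and the tracker address are placeholders for my setup):

from tvm import autotvm

tuning_option = {
    'log_filename': log_file,
    'tuner': 'xgb',
    'n_trial': 10,            # cut down from the tutorial's default just to test the flow
    'early_stopping': None,
    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func='default'),
        runner=autotvm.RPCRunner('tx2', host='0.0.0.0', port=9190,
                                 number=5, timeout=10),
    ),
}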

Tuning...
[Task  1/16]  Current/Best:   29.47/  36.49 GFLOPS | Progress: (10/10) | 30.43 s Done.
[Task  2/16]  Current/Best:   14.97/  33.41 GFLOPS | Progress: (10/10) | 23.04 s Done.
[Task  3/16]  Current/Best:    1.22/  34.52 GFLOPS | Progress: (10/10) | 28.27 s Done.
[Task  4/16]  Current/Best:    6.79/  14.22 GFLOPS | Progress: (10/10) | 15.76 s Done.
[Task  5/16]  Current/Best:    0.00/   1.29 GFLOPS | Progress: (10/10) | 11.90 s Done.
[Task  6/16]  Current/Best:    1.65/   1.91 GFLOPS | Progress: (10/10) | 14.02 s Done.
[Task  7/16]  Current/Best:    0.00/   1.36 GFLOPS | Progress: (10/10) | 27.68 s Done.
[Task  8/16]  Current/Best:    0.00/   1.43 GFLOPS | Progress: (10/10) | 11.05 s Done.
[Task  9/16]  Current/Best:    0.00/   7.84 GFLOPS | Progress: (10/10) | 20.29 s Done.
[Task 10/16]  Current/Best:    1.44/  32.46 GFLOPS | Progress: (10/10) | 23.22 s Done.
[Task 11/16]  Current/Best:    0.00/   9.29 GFLOPS | Progress: (10/10) | 23.47 s Done.
[Task 12/16]  Current/Best:    1.82/   3.38 GFLOPS | Progress: (10/10) | 22.83 s Done.
[Task 13/16]  Current/Best:    0.00/  11.17 GFLOPS | Progress: (10/10) | 17.54 s Done.
[Task 14/16]  Current/Best:   12.57/  19.03 GFLOPS | Progress: (10/10) | 17.68 s Done.
[Task 15/16]  Current/Best:    4.57/  15.66 GFLOPS | Progress: (10/10) | 27.96 s Done.
[Task 16/16]  Current/Best:    0.00/   4.36 GFLOPS | Progress: (10/10) | 19.94 s Done.
Compile...
Upload...
Evaluate inference time cost...

I'm using TVM branch 0.5, CUDA 10.0, and LLVM 6.0.0, on both an EC2 P3 instance and a Jetson Nano (same result with a Jetson TX2).
It's weird that RPC works and autotuning completes without any errors, yet it complains only when it tries to execute the final lib.

How can I fix this? Where am I going wrong?

Thx,
Martin

Are all the configs used for compilation valid? You can print the configs when compiling to check this.
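
For example, something like this reads the tuning log back and prints every config that was actually measured successfully (a sketch; 'tune.log' is a placeholder for your log file):

from tvm import autotvm

# print every config recorded in the tuning log, skipping failed trials
for inp, res in autotvm.record.load_from_file('tune.log'):
    if res.error_no == 0:  # 0 means the measurement on the device succeeded
        print(inp.task.name, inp.config)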

Thank you for your answer. Are you referring to the TVM build configuration?

That seems to be fine on all devices (the P3, the TX2s, and the Nanos). I made sure they all use the same settings for CUDA, USE_LLVM, and so on.

I just tried with

target = tvm.target.cuda(model="tx2")

but got the same result…

This setting works:

target = tvm.target.cuda(model="tx2")
from tvm.autotvm.measure.measure_methods import set_cuda_target_arch
set_cuda_target_arch('sm_62')

I found that setting in this thread: “Got error on Jetson TX2 with resnet50_v2 CUDA OUT_OF_RESOURCES”.
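
For anyone who hits this later: without that call, the host (here a P3, so a V100 with sm_70) presumably compiles the PTX for its own GPU architecture, which the Jetson's driver can't load, hence CUDA_ERROR_INVALID_PTX. On a Jetson Nano the analogous fix should be sm_53, since its Maxwell GPU has compute capability 5.3:

from tvm.autotvm.measure.measure_methods import set_cuda_target_arch

# same fix for a Jetson Nano (Maxwell GPU, compute capability 5.3)
set_cuda_target_arch('sm_53')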