Autotune Relay Mobile GPU - CUDA_ERROR_INVALID_PTX on Nano and TX2

I just followed the tutorial at https://docs.tvm.ai/tutorials/autotvm/tune_relay_mobile_gpu.html and set the following:

target_host = 'llvm -target=aarch64-linux-gnu'
target = tvm.target.cuda()
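
The rest of the script follows the tutorial. For reference, the compile step looks roughly like this (a sketch of the 0.5-era API; log_file, net, and params are defined earlier in the tutorial script, nothing there was changed by me):

from tvm import autotvm, relay

# compile with the best records found during tuning (tutorial's compile step)
with autotvm.apply_history_best(log_file):
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(
            net, target=target, params=params, target_host=target_host)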

RPC tuning works just fine, but the final RPC execution fails with:

 File "/home/ubuntu/tvm/python/tvm/module.py", line 194, in evaluator
    blob = feval(*args)

  File "tvm/_ffi/_cython/./function.pxi", line 310, in tvm._ffi._cy3.core.FunctionBase.__call__

  File "tvm/_ffi/_cython/./function.pxi", line 245, in tvm._ffi._cy3.core.FuncCall

  File "tvm/_ffi/_cython/./function.pxi", line 234, in tvm._ffi._cy3.core.FuncCall3

  File "tvm/_ffi/_cython/./base.pxi", line 171, in tvm._ffi._cy3.core.CALL

tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (3) /home/schadem/tvm/build/libtvm.so(TVMFuncCall+0x70) [0x7f79bc0188]
  [bt] (2) /home/schadem/tvm/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::detail::PackFuncVoidAddr_<4, tvm::runtime::CUDAWrappedFunc>(tvm::runtime::CUDAWrappedFunc, std::vector<tvm::runtime::detail::ArgConvertCode, std::allocator<tvm::runtime::detail::ArgConvertCode> > const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0xe8) [0x7f79c33818]
  [bt] (1) /home/schadem/tvm/build/libtvm.so(tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const+0x6cc) [0x7f79c3364c]
  [bt] (0) /home/schadem/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4c) [0x7f7946413c]
  File "/home/schadem/tvm/src/runtime/cuda/cuda_module.cc", line 111
TVMError: Except caught from RPC call: [21:28:14] /home/schadem/tvm/src/runtime/module_util.cc:73: Check failed: ret == 0 (-1 vs. 0) : CUDAError: cuModuleLoadData(&(module_[device_id]),
data_.c_str()) failed with error: CUDA_ERROR_INVALID_PTX

During tuning, everything seems fine on both ends. (I intentionally cut the task runs and n_trial down because I just wanted to check whether execution succeeds; the result is the same with n_trial=1000 or n_trial=2000.)
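
The trimmed tuning options look like this (a sketch based on the tutorial; the device key 'tx2' and the tracker address are placeholders for my setup):

from tvm import autotvm

tuning_option = {
    'log_filename': log_file,
    'tuner': 'xgb',
    'n_trial': 10,            # cut down from the tutorial's default just to test the flow
    'early_stopping': None,
    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func='default'),
        runner=autotvm.RPCRunner('tx2', host='0.0.0.0', port=9190,
                                 number=5, timeout=10),
    ),
}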

Tuning...
[Task  1/16]  Current/Best:   29.47/  36.49 GFLOPS | Progress: (10/10) | 30.43 s Done.
[Task  2/16]  Current/Best:   14.97/  33.41 GFLOPS | Progress: (10/10) | 23.04 s Done.
[Task  3/16]  Current/Best:    1.22/  34.52 GFLOPS | Progress: (10/10) | 28.27 s Done.
[Task  4/16]  Current/Best:    6.79/  14.22 GFLOPS | Progress: (10/10) | 15.76 s Done.
[Task  5/16]  Current/Best:    0.00/   1.29 GFLOPS | Progress: (10/10) | 11.90 s Done.
[Task  6/16]  Current/Best:    1.65/   1.91 GFLOPS | Progress: (10/10) | 14.02 s Done.
[Task  7/16]  Current/Best:    0.00/   1.36 GFLOPS | Progress: (10/10) | 27.68 s Done.
[Task  8/16]  Current/Best:    0.00/   1.43 GFLOPS | Progress: (10/10) | 11.05 s Done.
[Task  9/16]  Current/Best:    0.00/   7.84 GFLOPS | Progress: (10/10) | 20.29 s Done.
[Task 10/16]  Current/Best:    1.44/  32.46 GFLOPS | Progress: (10/10) | 23.22 s Done.
[Task 11/16]  Current/Best:    0.00/   9.29 GFLOPS | Progress: (10/10) | 23.47 s Done.
[Task 12/16]  Current/Best:    1.82/   3.38 GFLOPS | Progress: (10/10) | 22.83 s Done.
[Task 13/16]  Current/Best:    0.00/  11.17 GFLOPS | Progress: (10/10) | 17.54 s Done.
[Task 14/16]  Current/Best:   12.57/  19.03 GFLOPS | Progress: (10/10) | 17.68 s Done.
[Task 15/16]  Current/Best:    4.57/  15.66 GFLOPS | Progress: (10/10) | 27.96 s Done.
[Task 16/16]  Current/Best:    0.00/   4.36 GFLOPS | Progress: (10/10) | 19.94 s Done.
Compile...
Upload...
Evaluate inference time cost...

I'm using TVM branch 0.5, CUDA 10.0, and LLVM 6.0.0, on both an EC2 P3 instance and a Jetson Nano (same result with a Jetson TX2).
It's weird that RPC works and autotuning completes without any errors, yet it complains only when it tries to execute the final lib.

How can I fix this? Where am I going wrong?

Thx,
Martin

Are all the configs used for compilation valid? You can print the configs when compiling to check this.
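
For example, something like this reads the tuning log back and prints every config that was actually measured successfully (a sketch; 'tune.log' is a placeholder for your log file):

from tvm import autotvm

# print every config recorded in the tuning log, skipping failed trials
for inp, res in autotvm.record.load_from_file('tune.log'):
    if res.error_no == 0:  # 0 means the measurement on the device succeeded
        print(inp.task.name, inp.config)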

Thank you for your answer. Are you referring to the TVM build configuration?

That seems to be fine on all devices (the P3, the TX2s, and the Nanos). I made sure they all use the same settings for CUDA, USE_LLVM, and so on.

I just tried with

target = tvm.target.cuda(model="tx2")

but got the same result…

This setting works:

target = tvm.target.cuda(model="tx2")
from tvm.autotvm.measure.measure_methods import set_cuda_target_arch
set_cuda_target_arch('sm_62')

I found that setting in this thread: “Got error on Jetson TX2 with resnet50_v2 CUDA OUT_OF_RESOURCES”.
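
For anyone who hits this later: without that call, the host (here a P3, so a V100 with sm_70) presumably compiles the PTX for its own GPU architecture, which the Jetson's driver can't load, hence CUDA_ERROR_INVALID_PTX. On a Jetson Nano the analogous fix should be sm_53, since its Maxwell GPU has compute capability 5.3:

from tvm.autotvm.measure.measure_methods import set_cuda_target_arch

# same fix for a Jetson Nano (Maxwell GPU, compute capability 5.3)
set_cuda_target_arch('sm_53')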