I followed the tutorial at https://docs.tvm.ai/tutorials/autotvm/tune_relay_mobile_gpu.html and set
target_host = 'llvm -target=aarch64-linux-gnu'
target = tvm.target.cuda()
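For context, the relevant part of my script looks roughly like this (a sketch following the tutorial's structure; `net` and `params` come from the tutorial's network loader, and the variable names are the tutorial's):

```python
import tvm
from tvm import relay

# Cross-compile: host-side code for the 64-bit ARM CPU,
# device kernels for the CUDA GPU
target_host = 'llvm -target=aarch64-linux-gnu'
target = tvm.target.cuda()

# Build the tuned graph (with the best-history context applied,
# as in the tutorial)
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(
        net, target=target, target_host=target_host, params=params)
```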
RPC training works just fine, but the final RPC execution fails with:
File "/home/ubuntu/tvm/python/tvm/module.py", line 194, in evaluator
blob = feval(*args)
File "tvm/_ffi/_cython/./function.pxi", line 310, in tvm._ffi._cy3.core.FunctionBase.__call__
File "tvm/_ffi/_cython/./function.pxi", line 245, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./function.pxi", line 234, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 171, in tvm._ffi._cy3.core.CALL
tvm._ffi.base.TVMError: Traceback (most recent call last):
[bt] (3) /home/schadem/tvm/build/libtvm.so(TVMFuncCall+0x70) [0x7f79bc0188]
[bt] (2) /home/schadem/tvm/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::detail::PackFuncVoidAddr_<4, tvm::runtime::CUDAWrappedFunc>(tvm::runtime::CUDAWrappedFunc, std::vector<tvm::runtime::detail::ArgConvertCode, std::allocator<tvm::runtime::detail::ArgConvertCode> > const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0xe8) [0x7f79c33818]
[bt] (1) /home/schadem/tvm/build/libtvm.so(tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const+0x6cc) [0x7f79c3364c]
[bt] (0) /home/schadem/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4c) [0x7f7946413c]
File "/home/schadem/tvm/src/runtime/cuda/cuda_module.cc", line 111
TVMError: Except caught from RPC call: [21:28:14] /home/schadem/tvm/src/runtime/module_util.cc:73: Check failed: ret == 0 (-1 vs. 0) : CUDAError: cuModuleLoadData(&(module_[device_id]),
data_.c_str()) failed with error: CUDA_ERROR_INVALID_PTX
During tuning everything seems fine on both ends. (I intentionally cut the task runs and n_trial down because I only wanted to check whether execution succeeds; the result is the same with n_trial=1000 or n_trial=2000.)
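The tuning options are the tutorial's defaults apart from the reduced trial count; sketched below (the device key, tracker host/port, and log filename are placeholders for my actual setup):

```python
from tvm import autotvm

tuning_option = {
    'log_filename': 'net.log',  # placeholder log file name
    'tuner': 'xgb',
    'n_trial': 10,              # deliberately cut down; same failure with 1000/2000
    'early_stopping': None,
    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(timeout=10),
        # key/host/port below are placeholders for my RPC tracker setup
        runner=autotvm.RPCRunner('nano', host='0.0.0.0', port=9190,
                                 number=20, repeat=3, timeout=4),
    ),
}
```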
Tuning...
[Task 1/16] Current/Best: 29.47/ 36.49 GFLOPS | Progress: (10/10) | 30.43 s Done.
[Task 2/16] Current/Best: 14.97/ 33.41 GFLOPS | Progress: (10/10) | 23.04 s Done.
[Task 3/16] Current/Best: 1.22/ 34.52 GFLOPS | Progress: (10/10) | 28.27 s Done.
[Task 4/16] Current/Best: 6.79/ 14.22 GFLOPS | Progress: (10/10) | 15.76 s Done.
[Task 5/16] Current/Best: 0.00/ 1.29 GFLOPS | Progress: (10/10) | 11.90 s Done.
[Task 6/16] Current/Best: 1.65/ 1.91 GFLOPS | Progress: (10/10) | 14.02 s Done.
[Task 7/16] Current/Best: 0.00/ 1.36 GFLOPS | Progress: (10/10) | 27.68 s Done.
[Task 8/16] Current/Best: 0.00/ 1.43 GFLOPS | Progress: (10/10) | 11.05 s Done.
[Task 9/16] Current/Best: 0.00/ 7.84 GFLOPS | Progress: (10/10) | 20.29 s Done.
[Task 10/16] Current/Best: 1.44/ 32.46 GFLOPS | Progress: (10/10) | 23.22 s Done.
[Task 11/16] Current/Best: 0.00/ 9.29 GFLOPS | Progress: (10/10) | 23.47 s Done.
[Task 12/16] Current/Best: 1.82/ 3.38 GFLOPS | Progress: (10/10) | 22.83 s Done.
[Task 13/16] Current/Best: 0.00/ 11.17 GFLOPS | Progress: (10/10) | 17.54 s Done.
[Task 14/16] Current/Best: 12.57/ 19.03 GFLOPS | Progress: (10/10) | 17.68 s Done.
[Task 15/16] Current/Best: 4.57/ 15.66 GFLOPS | Progress: (10/10) | 27.96 s Done.
[Task 16/16] Current/Best: 0.00/ 4.36 GFLOPS | Progress: (10/10) | 19.94 s Done.
Compile...
Upload...
Evaluate inference time cost...
I'm using TVM branch 0.5, CUDA 10.0, and LLVM 6.0.0, on an EC2 P3 instance and a Jetson Nano (same result with a Jetson TX2).
It's odd that RPC works and autotuning runs without complaint - no errors - but then it fails when executing the final lib.
How can I fix this? Where am I going wrong?
Thx,
Martin