Can't tune GPU on TX2 remotely

I got the following error while tuning the GPU on a TX2 remotely:

DEBUG:autotvm:No: 158   GFLOPS: 0.00/0.00       result: MeasureResult(costs=(RuntimeError('Except caught from RPC call: [13:18:50] /home/nvidia/src/tvm/src/runtime/module_util.cc:53: Check failed: ret == 0 (-1 vs. 0) [13:18:50] /home/nvidia/src/tvm/src/runtime/cuda/cuda_module.cc:91: CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_INVALID_PTX\n\n'),), error_no=4, all_cost=4.549376487731934, timestamp=1544188730.826613) [('tile_f', [1, 2, 2, 4]), ('tile_y', [1, 1, 1, 1]), ('tile_x', [1, 1, 1, 1]), ('tile_rc', [64, 2]), ('tile_ry', [3, 1]), ('tile_rx', [3, 1]), ('auto_unroll_max_step', 512), ('unroll_explicit', 0)],direct,None,1184
DEBUG:autotvm:No: 159   GFLOPS: 0.00/0.00       result: MeasureResult(costs=(RuntimeError('Except caught from RPC call: [13:18:51] /home/nvidia/src/tvm/src/runtime/module_util.cc:53: Check failed: ret == 0 (-1 vs. 0) [13:18:51] /home/nvidia/src/tvm/src/runtime/cuda/cuda_module.cc:91: CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_INVALID_PTX\n\n'),), error_no=4, all_cost=5.119001150131226, timestamp=1544188731.4662838) [('tile_f', [2, 1, 8, 1]), ('tile_y', [1, 1, 1, 1]), ('tile_x', [1, 1, 1, 1]), ('tile_rc', [16, 8]), ('tile_ry', [1, 3]), ('tile_rx', [1, 3]), ('auto_unroll_max_step', 512), ('unroll_explicit', 1)],direct,None,5437
DEBUG:autotvm:No: 160   GFLOPS: 0.00/0.00       result: MeasureResult(costs=(RuntimeError('Except caught from RPC call: [13:18:52] /home/nvidia/src/tvm/src/runtime/module_util.cc:53: Check failed: ret == 0 (-1 vs. 0) [13:18:52] /home/nvidia/src/tvm/src/runtime/cuda/cuda_module.cc:91: CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_INVALID_PTX\n\n'),), error_no=4, all_cost=5.690924644470215, timestamp=1544188732.0810251) [('tile_f', [8, 2, 1, 1]), ('tile_y', [1, 1, 1, 1]), ('tile_x', [1, 1, 1, 1]), ('tile_rc', [32, 4]), ('tile_ry', [1, 3]), ('tile_rx', [1, 3]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],direct,None,4271
WARNING:autotvm:Too many errors happen in the tuning. Now is in debug mode

I set target and target_host as follows:

target = 'cuda'
target_host = 'llvm -target=aarch64-linux-gnu'
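
For reference, the rest of the measure setup is roughly the following (a sketch; the tracker address and the device key 'tx2' are placeholders, not my exact values):

import tvm
from tvm import autotvm

# Kernels are built locally on the x86 host and measured on the TX2
# through the RPC tracker (key/host/port below are placeholders).
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=10),
    runner=autotvm.RPCRunner('tx2', host='0.0.0.0', port=9190,
                             number=10, repeat=1, timeout=10))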

What I have tried:

  • Tune CPU remotely: worked

  • Tune GPU locally on the TX2: it ran without errors at the beginning, but tuning eventually failed because the TX2 ran out of memory


Error 2

  • Compile module on x86 PC with CPU and copy the tar file to the TX2
    target = 'llvm -target=aarch64-linux-gnu'
    target_host = 'llvm -target=aarch64-linux-gnu'

    Worked

  • Compile module on x86 PC with GPU and copy the tar file to the TX2

    target = 'cuda'
    target_host = 'llvm -target=aarch64-linux-gnu'

    but got the following error:


Traceback (most recent call last):
  File "script/run_yolov3_tx2.py", line 82, in <module>
    m.run()
  File "/home/nvidia/src/tvm/python/tvm/contrib/graph_runtime.py", line 155, in run                                                                          
    self._run()
  File "/home/nvidia/src/tvm/python/tvm/_ffi/_ctypes/function.py", line 185, in __call__                                                                     
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/home/nvidia/src/tvm/python/tvm/_ffi/base.py", line 72, in check_call
    raise TVMError(py_str(_LIB.TVMGetLastError()))
tvm._ffi.base.TVMError: [14:19:51] /home/nvidia/src/tvm/src/runtime/module_util.cc:53: Check failed: ret == 0 (-1 vs. 0) [14:19:51] /home/nvidia/src/tvm/src/runtime/cuda/cuda_module.cc:91: CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_INVALID_PTX                   

Stack trace returned 10 entries:
[bt] (0) /home/nvidia/src/tvm/build/libtvm.so(dmlc::StackTrace[abi:cxx11]()+0x118) [0x7f98123cc0]                                                            
[bt] (1) /home/nvidia/src/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x44) [0x7f981248ec]                                                 
[bt] (2) /home/nvidia/src/tvm/build/libtvm.so(tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const+0x6d8) [0x7f98768e80]
[bt] (3) /home/nvidia/src/tvm/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::detail::PackFuncVoidAddr_<8, tvm::runtime::CUDAWrappedFunc>(tvm::runtime::CUDAWrappedFunc, std::vector<tvm::runtime::detail::ArgConvertCode, std::allocator<tvm::runtime::detail::ArgConvertCode> > const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0xd8) [0x7f98769168]
[bt] (4) /home/nvidia/src/tvm/build/libtvm.so(TVMFuncCall+0x70) [0x7f98717a70]
[bt] (5) so/yolov3.tx2.gpu.tvm.tar.so(+0x1ce4) [0x7f7dbe1ce4]
[bt] (6) so/yolov3.tx2.gpu.tvm.tar.so(fuse_conv2d_broadcast_mul_broadcast_add_leaky_relu+0x4e0) [0x7f7dbe17d0]                                               
[bt] (7) /home/nvidia/src/tvm/build/libtvm.so(+0x8a02b4) [0x7f9871e2b4]
[bt] (8) /home/nvidia/src/tvm/build/libtvm.so(+0x8d4ff4) [0x7f98752ff4]
[bt] (9) /home/nvidia/src/tvm/build/libtvm.so(tvm::runtime::GraphRuntime::Run()+0x40) [0x7f98751408]
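
For reference, the build/deploy flow is roughly the following (a sketch: the YOLOv3 model construction, the input shapes, and shipping the graph JSON/params to the device are elided; sym and shape_dict below are placeholders):

# On the x86 PC: build host code for aarch64 and device code for CUDA,
# then export the tar that gets copied to the TX2.
import nnvm.compiler
graph, lib, params = nnvm.compiler.build(
    sym, target='cuda', target_host='llvm -target=aarch64-linux-gnu',
    shape=shape_dict, params=params)
lib.export_library('yolov3.tx2.gpu.tvm.tar')

# On the TX2: load the copied module and run it on the GPU.
import tvm
from tvm.contrib import graph_runtime
lib = tvm.module.load('yolov3.tx2.gpu.tvm.tar')
m = graph_runtime.create(graph, lib, tvm.gpu(0))
m.set_input(**params)
m.run()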

Do some configurations work, or do all of them fail? If only some configurations fail, this behavior is expected, as some configurations may produce invalid GPU code (e.g., by using too many resources).

If all configurations fail, check whether the CUDA version on your compilation/tuning machine matches the one on the TX2.
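
For example, to confirm what the device itself reports, something like this (a quick sketch; compute_version may not exist in older TVM builds):

import tvm

# Run on the TX2 (or through an RPC session): the PTX emitted on the
# host must match this compute capability ('6.2' on a TX2).
ctx = tvm.gpu(0)
print(ctx.compute_version)

# Also compare the output of `nvcc --version` on both machines.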

I met the same problem on a TX1. Have you solved it?

@eqy I met the same problem on both a TX2 and a Xavier. I checked the CUDA version and everything is the same.

I fixed it by adding the following code:

from tvm.autotvm.measure.measure_methods import set_cuda_target_arch
set_cuda_target_arch('sm_50')

Give it a try.
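
A note on this workaround: the TX2's compute capability is 6.2, so sm_62 is the nominally matching architecture; either way, the call must come before tuning starts so the cross-compiling host emits PTX the device can load. A minimal sketch:

from tvm.autotvm.measure.measure_methods import set_cuda_target_arch

# TX2 is compute capability 6.2; set this before calling tuner.tune()
# so the host-side compiler targets the device's architecture.
set_cuda_target_arch('sm_62')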

@nopattern Hi, thanks for the suggestion, but I tried it and failed with the same 'too many errors' message. I tested sm_50, sm_62, and sm_72.

What is the output when you try to run a standard CUDA example (without any tuning) on the device?
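
For example, something along these lines exercises the same cross-compile-and-load path without any tuning (a sketch; the RPC address is a placeholder):

import numpy as np
import tvm
from tvm import rpc
from tvm.contrib import util

# Trivial "add one" kernel, cross-compiled for the TX2.
n = 1024
A = tvm.placeholder((n,), name='A')
B = tvm.compute((n,), lambda i: A[i] + 1.0, name='B')
s = tvm.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=64)
s[B].bind(bx, tvm.thread_axis('blockIdx.x'))
s[B].bind(tx, tvm.thread_axis('threadIdx.x'))
f = tvm.build(s, [A, B], target='cuda',
              target_host='llvm -target=aarch64-linux-gnu')

# Ship it to the device over RPC and run it (address is a placeholder).
remote = rpc.connect('192.168.1.100', 9090)
tmp = util.tempdir()
f.export_library(tmp.relpath('add.tar'))
remote.upload(tmp.relpath('add.tar'))
rfunc = remote.load_module('add.tar')

ctx = remote.gpu(0)
a = tvm.nd.array(np.random.uniform(size=n).astype('float32'), ctx)
b = tvm.nd.array(np.zeros(n, dtype='float32'), ctx)
rfunc(a, b)
print(b.asnumpy()[:4])  # should be a[:4] + 1.0 if the GPU path works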