Error_no=1 and error_no=4

Hi, everyone.
Here is what happened.
I have two devices: A, with an x86 CPU and a 1080 Ti, and B (a Xavier), with an ARM CPU and an NVIDIA GPU.
Each device has TVM fully installed, and I can autotune my model on each device individually.
Now I want to autotune my model on B from A over RPC, because tuning directly on device B is slow.
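For reference, my tuning setup on A looks roughly like this (a minimal sketch of the standard autotvm RPC flow; the device key "xavier", the tracker host/port, and `task` stand in for my actual values):

```python
from tvm import autotvm

# build candidate kernels locally on A, measure them on device B via the RPC tracker
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=10),
    runner=autotvm.RPCRunner(
        "xavier",        # device key B was registered with (placeholder)
        host="0.0.0.0",  # RPC tracker host (placeholder)
        port=9190,
        number=20, repeat=3, timeout=4,
    ),
)

tuner = autotvm.tuner.XGBTuner(task)  # `task` is the task being tuned
tuner.tune(n_trial=1000,
           measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file("tune.log")])
```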
With this setup, these errors occur:

DEBUG:autotvm:No: 503   GFLOPS: 0.00/0.00       result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):
  [bt] (1) /home/pzq/tvm_cuda10/build/libtvm.so(TVMFuncCall+0x61) [0x7fc72c9d8521]
  [bt] (0) /home/pzq/tvm_cuda10/build/libtvm.so(+0x122b75b) [0x7fc72c9d375b]
  File "/home/pzq/tvm_cuda10/python/tvm/_ffi/_ctypes/function.py", line 72, in cfun
    rv = local_pyfunc(*pyargs)
  File "/home/pzq/tvm_cuda10/python/tvm/autotvm/measure/measure_methods.py", line 607, in verify_pass
    raise InstantiationError("Skipped because of invalid gpu kernel")
tvm.autotvm.task.space.InstantiationError: **Skipped because of invalid gpu kernel**',),), **error_no=1**, all_cost=0.2937588691711426, timestamp=1565772783.6429822) [('tile_b', [36, 1, 1, 1]), ('tile_y', [1, 1, 4, 1]), ('tile_x', [1, 1, 1600, 6]), ('tile_rc', [1, 4]), ('auto_unroll_max_step', 1500), ('unroll_explicit', 0)],winograd,None,406375

DEBUG:autotvm:No: 504   GFLOPS: 0.00/0.00       result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):
  [bt] (3) /mnt/nvme/pzq/Desktop/tvm/build/libtvm.so(TVMFuncCall+0x70) [0x7f8437a7a0]
  [bt] (2) /mnt/nvme/pzq/Desktop/tvm/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::detail::PackFuncVoidAddr_<4, tvm::runtime::CUDAWrappedFunc>(tvm::runtime::CUDAWrappedFunc, std::vector<tvm::runtime::detail::ArgConvertCode, std::allocator<tvm::runtime::detail::ArgConvertCode> > const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0xe8) [0x7f843ed3e0]
  [bt] (1) /mnt/nvme/pzq/Desktop/tvm/build/libtvm.so(tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const+0x6cc) [0x7f843ed214]
  [bt] (0) /mnt/nvme/pzq/Desktop/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4c) [0x7f83c9108c]
  File "/home/pzq/Desktop/tvm/src/runtime/cuda/cuda_module.cc", line 111
TVMErr',),), **error_no=4**, all_cost=4.153692960739136, timestamp=1565772786.2338314)   [('tile_b', [36, 1, 1, 1]), ('tile_y', [1, 2, 2, 1]), ('tile_x', [200, 16, 1, 3]), ('tile_rc', [1, 4]), ('auto_unroll_max_step', 0), ('unroll_explicit', 0)],winograd,None,107584

I have noticed that GFLOPS is always 0.00/0.00 and error_no is 1 or 4.
What could the problem be?
What should I do?
I have CUDA 10.1 on A and CUDA 10.0 on B.
Does the CUDA version on tuning machine A have to be exactly the same as on target device B?

Problem solved!
The CUDA versions on A and B must match!

I have the same setup and get the same errors, but the CUDA versions on my two devices are the same (both 10.0: one is 10.0.130 and the other 10.0.326; does the patch level matter?). My device B is a Jetson Nano board.

How did you solve it? Thanks!

error_no=1 usually happens when the generated CUDA code fails the GPU kernel verification pass, which uses device limits such as memory size to estimate whether the kernel can fit on the device. If your target device is a small GPU with much less memory, you may see lots of error_no=1 results.
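If you want to see the limits the verification pass compares against, you can query them from device B over RPC (a minimal sketch; the tracker host/port and device key are placeholders for your setup):

```python
from tvm import rpc

# request a session on device B through the RPC tracker (placeholder host/port/key)
tracker = rpc.connect_tracker("0.0.0.0", 9190)
remote = tracker.request("xavier")
ctx = remote.gpu(0)

# hardware limits that candidate GPU kernels are verified against
print("device name:                ", ctx.device_name)
print("compute version:            ", ctx.compute_version)
print("max threads per block:      ", ctx.max_threads_per_block)
print("max shared memory per block:", ctx.max_shared_memory_per_block)
print("warp size:                  ", ctx.warp_size)
```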

error_no=4 is a runtime error, which usually happens when a CUDA kernel compiled on A cannot be executed on B. A CUDA version mismatch is one of the most common reasons. To debug this error, you can compile the model on A with a config that encountered error_no=4, copy it to B, and see what error shows up.
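The manual reproduction could look something like this (a minimal sketch; `task` is the autotvm task being tuned, 107584 is the config index taken from the error_no=4 log line above, and the tracker host/port, device key, and target_host triple are placeholders for your setup):

```python
import numpy as np
import tvm
from tvm import autotvm, rpc
from topi.util import get_const_tuple

# rebuild the exact schedule that produced error_no=4 (config index from the log)
config = task.config_space.get(107584)
with task.target:
    s, args = task.instantiate(config)
    func = tvm.build(s, args, target_host="llvm -target=aarch64-linux-gnu")

# ship the compiled kernel to device B over RPC and try to run it there
func.export_library("debug_kernel.tar")
tracker = rpc.connect_tracker("0.0.0.0", 9190)  # placeholder tracker address
remote = tracker.request("xavier")              # placeholder device key
remote.upload("debug_kernel.tar")
rfunc = remote.load_module("debug_kernel.tar")

# run once with random inputs; the underlying CUDA error should surface here
ctx = remote.gpu(0)
arrays = [tvm.nd.array(np.random.uniform(size=get_const_tuple(a.shape)).astype(a.dtype), ctx)
          for a in args]
rfunc(*arrays)
ctx.sync()
```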

Thanks for your reply; I will check it.