[Relay] CUDA_ERROR_INVALID_VALUE occurs when testing conv2d_grad

Hi, I want to test conv2d_grad on GPU, but I got an error.

I added this test case in "def test_conv2d_grad()" in /tvm/tests/python/relay/test_op_grad_level2.py:
verify_conv2d_grad((1, 736, 17, 17), (128, 736, 1, 1), [1, 1], [0, 0], [1, 1])

When I ran it on GPU, an error occurred. This is the error log:

Cannot find config for target=cuda, workload=('conv2d', (1, 736, 17, 17, 'float32'), (128, 736, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda, workload=('conv2d_transpose_nchw', (1, 128, 17, 17, 'float32'), (128, 736, 1, 1, 'float32'), (1, 1), (0, 0), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda, workload=('group_conv2d_nchw', (1, 736, 17, 17, 'float32'), (94208, 1, 17, 17, 'float32'), (1, 1), (0, 0), (1, 1), 736, 'float32'). A fallback configuration is used, which may bring great performance regression.
Traceback (most recent call last):
  File "/tvm/tests/python/relay/test_op_grad_level2.py", line 132, in <module>
    test_conv2d_grad()
  File "/tvm/tests/python/relay/test_op_grad_level2.py", line 125, in test_conv2d_grad
    verify_conv2d_grad((1, 736, 17, 17), (128, 736, 1, 1), [1, 1], [0, 0], [1, 1])
  File "/tvm/tests/python/relay/test_op_grad_level2.py", line 113, in verify_conv2d_grad
    op_res, (grad_input, grad_weight) = intrp.evaluate(bwd_func)(data, weight)
  File "/tvm/python/tvm/relay/backend/interpreter.py", line 316, in _interp_wrapper
    return _intrp(opt_expr)
  File "/tvm/python/tvm/_ffi/_ctypes/function.py", line 210, in __call__
    raise get_last_ffi_error()

tvm._ffi.base.TVMError: Traceback (most recent call last):
[bt] (6) /tvm/build/libtvm.so(TVMFuncCall+0x95) [0x7f5ddc183d8a]
[bt] (5) /tvm/build/libtvm.so(tvm::runtime::PackedFunc::CallPacked(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x30) [0x7f5ddb810ecc]
[bt] (4) /tvm/build/libtvm.so(std::function<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)>::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x5a) [0x7f5ddb6a82c2]
[bt] (3) /tvm/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::detail::PackFuncVoidAddr<4, tvm::runtime::CUDAWrappedFunc>(tvm::runtime::CUDAWrappedFunc, std::vector<tvm::runtime::detail::ArgConvertCode, std::allocator<tvm::runtime::detail::ArgConvertCode> > const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0x58) [0x7f5ddc2051d2]
[bt] (2) /tvm/build/libtvm.so(tvm::runtime::detail::PackFuncVoidAddr<4, tvm::runtime::CUDAWrappedFunc>(tvm::runtime::CUDAWrappedFunc, std::vector<tvm::runtime::detail::ArgConvertCode, std::allocator<tvm::runtime::detail::ArgConvertCode> > const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x23d) [0x7f5ddc2030db]
[bt] (1) /tvm/build/libtvm.so(tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const+0x5ae) [0x7f5ddc20167c]
[bt] (0) /tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x34) [0x7f5ddb62af2c]
File "/tvm/src/runtime/cuda/cuda_module.cc", line 215
File "/tvm/src/runtime/module_util.cc", line 73
TVMError: Check failed: ret == 0 (-1 vs. 0) : CUDALaunch Error: CUDA_ERROR_INVALID_VALUE
grid=(1,94208,1), block=(1,1,1)
// func_name=fused_nn_conv2d_1_kernel0
// CUDA Source
// -----------
extern "C" __global__ void fused_nn_conv2d_1_kernel0( float* __restrict__ placeholder,  float* __restrict__ placeholder1,  float* __restrict__ compute) {
   float compute_local[1];
  __shared__ float pad_temp_shared[1];
  __shared__ float placeholder_shared[1];
  compute_local[0] = 0.000000e+00f;
  for (int ry_outer = 0; ry_outer < 17; ++ry_outer) {
    for (int rx_outer = 0; rx_outer < 17; ++rx_outer) {
      pad_temp_shared[0] = placeholder[((((((int)blockIdx.y) / 128) * 289) + (ry_outer * 17)) + rx_outer)];
      placeholder_shared[0] = placeholder1[(((((int)blockIdx.y) * 289) + (ry_outer * 17)) + rx_outer)];
      compute_local[0] = (compute_local[0] + (pad_temp_shared[0] * placeholder_shared[0]));
    }
  }
  compute[((int)blockIdx.y)] = compute_local[0];
}
Process finished with exit code 1

It works fine on CPU.
@vinx13 Could you give me some suggestions? Or could anyone else give me some suggestions? Thanks very much.

The default config for group conv is invalid because it launches too many blocks: the failing launch above is grid=(1, 94208, 1), and 94208 exceeds CUDA's gridDim.y limit of 65535. You can tune with AutoTVM to get valid configs for these operators.
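For reference, here is a rough sketch of that AutoTVM flow: extract the tasks that conv2d_grad lowers to (conv2d, conv2d_transpose, and the group conv with 94208 = 128 × 736 filters shown in the log), tune them, then apply the logged configs when running the gradient. The API calls follow the AutoTVM Relay tuning tutorials for this TVM version, and bwd_func, data, weight are assumed to be built the same way as in verify_conv2d_grad, so adjust the names to your setup.

import tvm
from tvm import autotvm, relay

# Extract tunable tasks from the gradient function (assumption: bwd_func is
# the gradient function constructed as in verify_conv2d_grad).
tasks = autotvm.task.extract_from_program(
    bwd_func, target='cuda', params={},
    ops=(relay.op.nn.conv2d, relay.op.nn.conv2d_transpose))

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=5, timeout=10))

log_file = 'conv2d_grad_cuda.log'
for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(n_trial=min(200, len(task.config_space)),
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file(log_file)])

# Apply the best configs found so the interpreter compiles valid kernels.
with autotvm.apply_history_best(log_file):
    intrp = relay.create_executor(ctx=tvm.gpu(0), target='cuda')
    op_res, (grad_input, grad_weight) = intrp.evaluate(bwd_func)(data, weight)

Note that the tuning log is keyed by the exact workload (shapes, strides, padding, groups), so a different batch size is a different workload and needs its own log entries.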

Thanks for your reply, it does work.
One problem is that if the parameters of the conv change even a little, such as only the batch size, we need to tune a valid schedule for the new workload again, or the error may occur.
Do you have any plan to optimize the implementation of conv2d_grad or the default schedule for group conv?

Here are some example workloads from MXNet InceptionV3; the num_filter parameter of the group conv grows very large in these cases, and the default config doesn't work for them:

verify_conv2d_grad((10, 80, 73, 73), (192, 80, 3, 3), (1, 1), (0, 0), (1, 1))
verify_conv2d_grad((10, 288, 35, 35), (64, 288, 1, 1), (1, 1), (0, 0), (1, 1))
verify_conv2d_grad((10, 288, 35, 35), (384, 288, 3, 3), (2, 2), (0, 0), (1, 1))
verify_conv2d_grad((10, 768, 17, 17), (192, 768, 1, 1), (1, 1), (0, 0), (1, 1))
verify_conv2d_grad((10, 128, 17, 17), (128, 128, 1, 7), (1, 1), (0, 3), (1, 1))
verify_conv2d_grad((10, 160, 17, 17), (192, 160, 1, 7), (1, 1), (0, 3), (1, 1))
verify_conv2d_grad((10, 1280, 8, 8), (384, 1280, 1, 1), (1, 1), (0, 0), (1, 1))
verify_conv2d_grad((10, 1280, 8, 8), (448, 1280, 1, 1), (1, 1), (0, 0), (1, 1))
verify_conv2d_grad((10, 448, 8, 8), (384, 448, 3, 3), (1, 1), (1, 1), (1, 1))

Hope this will help, thanks.

We can add a fallback config, something like https://github.com/dmlc/tvm/blob/master/topi/python/topi/x86/conv2d.py#L39
This will make sure the default config is valid so you can run with it, but it is difficult to achieve good performance that way.
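For illustration, a minimal sketch of what such a fallback could look like for the CUDA group conv schedule, following the pattern in the x86 file linked above. The function name and the knob names (tile_f, tile_y, tile_x) are assumptions about the group_conv2d_nchw config space, and the factor values are placeholders, so check the actual schedule under topi/python/topi/cuda and how it binds axes to blockIdx before wiring this in.

from tvm.autotvm.task.space import SplitEntity

def _fallback_group_conv2d_nchw(cfg, num_filter, out_height, out_width):
    """Hypothetical fallback: pick conservative splits instead of the
    one-output-element-per-block default so the launched grid stays small."""
    if not cfg.is_fallback:
        return
    # Pack several output channels into each thread block; -1 lets the
    # remaining (outermost) factor be derived from num_filter.
    cfg["tile_f"] = SplitEntity([-1, 1, 8, 1])
    # Keep a whole output row/column inside one block along y/x.
    cfg["tile_y"] = SplitEntity([-1, 1, 1, out_height])
    cfg["tile_x"] = SplitEntity([-1, 1, 1, out_width])

The idea is the same as the x86 fallback: hand-pick SplitEntity values that are always legal for the workload, so the untuned path compiles and runs, even if it is slow.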

OK, I will try that, thank you.

Does auto-tune work for you? It seems that I have a similar error.