[Relay] CUDA_ERROR_INVALID_VALUE occurs when testing conv2d_grad


#1

Hi, I want to test conv2d_grad on GPU, but got some error.

I added this test case in "def test_conv2d_grad()" in /tvm/tests/python/relay/test_op_grad_level2.py:
verify_conv2d_grad((1, 736, 17, 17), (128, 736, 1, 1), [1, 1], [0, 0], [1, 1])

when I ran it on GPU, error occurred, this is the error log

Cannot find config for target=cuda, workload=('conv2d', (1, 736, 17, 17, 'float32'), (128, 736, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda, workload=('conv2d_transpose_nchw', (1, 128, 17, 17, 'float32'), (128, 736, 1, 1, 'float32'), (1, 1), (0, 0), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=cuda, workload=('group_conv2d_nchw', (1, 736, 17, 17, 'float32'), (94208, 1, 17, 17, 'float32'), (1, 1), (0, 0), (1, 1), 736, 'float32'). A fallback configuration is used, which may bring great performance regression.
Traceback (most recent call last):

File "/tvm/tests/python/relay/test_op_grad_level2.py", line 132, in <module>
test_conv2d_grad()

File "/tvm/tests/python/relay/test_op_grad_level2.py", line 125, in test_conv2d_grad
verify_conv2d_grad((1, 736, 17, 17), (128, 736, 1, 1), [1, 1], [0, 0], [1, 1])

File "/tvm/tests/python/relay/test_op_grad_level2.py", line 113, in verify_conv2d_grad
op_res, (grad_input, grad_weight) = intrp.evaluate(bwd_func)(data, weight)

File "/tvm/python/tvm/relay/backend/interpreter.py", line 316, in _interp_wrapper
return _intrp(opt_expr)

File "/tvm/python/tvm/_ffi/_ctypes/function.py", line 210, in __call__
raise get_last_ffi_error()

tvm._ffi.base.TVMError: Traceback (most recent call last):
[bt] (6) /tvm/build/libtvm.so(TVMFuncCall+0x95) [0x7f5ddc183d8a]
[bt] (5) /tvm/build/libtvm.so(tvm::runtime::PackedFunc::CallPacked(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x30) [0x7f5ddb810ecc]
[bt] (4) /tvm/build/libtvm.so(std::function<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)>::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x5a) [0x7f5ddb6a82c2]
[bt] (3) /tvm/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::detail::PackFuncVoidAddr<4, tvm::runtime::CUDAWrappedFunc>(tvm::runtime::CUDAWrappedFunc, std::vector<tvm::runtime::detail::ArgConvertCode, std::allocator<tvm::runtime::detail::ArgConvertCode> > const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0x58) [0x7f5ddc2051d2]
[bt] (2) /tvm/build/libtvm.so(tvm::runtime::detail::PackFuncVoidAddr<4, tvm::runtime::CUDAWrappedFunc>(tvm::runtime::CUDAWrappedFunc, std::vector<tvm::runtime::detail::ArgConvertCode, std::allocator<tvm::runtime::detail::ArgConvertCode> > const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x23d) [0x7f5ddc2030db]
[bt] (1) /tvm/build/libtvm.so(tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const+0x5ae) [0x7f5ddc20167c]
[bt] (0) /tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x34) [0x7f5ddb62af2c]
File "/tvm/src/runtime/cuda/cuda_module.cc", line 215
File "/tvm/src/runtime/module_util.cc", line 73
TVMError: Check failed: ret == 0 (-1 vs. 0) : CUDALaunch Error: CUDA_ERROR_INVALID_VALUE
grid=(1,94208,1), block=(1,1,1)
// func_name=fused_nn_conv2d_1_kernel0
// CUDA Source
// -----------
extern "C" __global__ void fused_nn_conv2d_1_kernel0(float* __restrict__ placeholder, float* __restrict__ placeholder1, float* __restrict__ compute) {
  float compute_local[1];
  __shared__ float pad_temp_shared[1];
  __shared__ float placeholder_shared[1];
  compute_local[0] = 0.000000e+00f;
  for (int ry_outer = 0; ry_outer < 17; ++ry_outer) {
    for (int rx_outer = 0; rx_outer < 17; ++rx_outer) {
      pad_temp_shared[0] = placeholder[((((((int)blockIdx.y) / 128) * 289) + (ry_outer * 17)) + rx_outer)];
      placeholder_shared[0] = placeholder1[(((((int)blockIdx.y) * 289) + (ry_outer * 17)) + rx_outer)];
      compute_local[0] = (compute_local[0] + (pad_temp_shared[0] * placeholder_shared[0]));
    }
  }
  compute[((int)blockIdx.y)] = compute_local[0];
}
Process finished with exit code 1

It works fine on CPU.
@vinx13 Could you give me some suggestions? Or could anyone else? Thanks very much.


#2

The default config for group conv is invalid because it launches too many blocks. You can tune the operators with AutoTVM to get valid configs.
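For context on why "too many blocks" produces CUDA_ERROR_INVALID_VALUE here (an inference from the log, not stated in it): CUDA caps gridDim.y and gridDim.z at 65535, and the fallback schedule launched grid=(1,94208,1), so the launch is rejected before the kernel ever runs. A minimal check under that assumption:

```python
# Sketch: validate a launch configuration against CUDA's grid limits.
# gridDim.y and gridDim.z are capped at 65535; the fallback config in the
# error log above used grid=(1, 94208, 1), which exceeds the y cap.
MAX_GRID_YZ = 65535

def launch_is_valid(grid):
    """Return True if (grid.x, grid.y, grid.z) respects the gridDim.y/z cap."""
    _, gy, gz = grid
    return gy <= MAX_GRID_YZ and gz <= MAX_GRID_YZ

print(launch_is_valid((1, 94208, 1)))  # False: the fallback grid is too big
print(launch_is_valid((1, 65535, 1)))  # True: at the limit, still valid
```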


#3

Thanks for your reply, it does work.
One problem is that if the conv parameters change even a little, such as only the batch size, we need to tune a valid schedule for the new workload again, or the error may recur.
Do you have plans to optimize the implementation of conv2d_grad, or the default schedule for group conv?

Here are some example workloads from MXNet InceptionV3; the num_filter parameter of the group conv grows very large in these cases, and the default config doesn't work for them:

verify_conv2d_grad((10, 80, 73, 73), (192, 80, 3, 3), (1, 1), (0, 0), (1, 1))
verify_conv2d_grad((10, 288, 35, 35), (64, 288, 1, 1), (1, 1), (0, 0), (1, 1))
verify_conv2d_grad((10, 288, 35, 35), (384, 288, 3, 3), (2, 2), (0, 0), (1, 1))
verify_conv2d_grad((10, 768, 17, 17), (192, 768, 1, 1), (1, 1), (0, 0), (1, 1))
verify_conv2d_grad((10, 128, 17, 17), (128, 128, 1, 7), (1, 1), (0, 3), (1, 1))
verify_conv2d_grad((10, 160, 17, 17), (192, 160, 1, 7), (1, 1), (0, 3), (1, 1))
verify_conv2d_grad((10, 1280, 8, 8), (384, 1280, 1, 1), (1, 1), (0, 0), (1, 1))
verify_conv2d_grad((10, 1280, 8, 8), (448, 1280, 1, 1), (1, 1), (0, 0), (1, 1))
verify_conv2d_grad((10, 448, 8, 8), (384, 448, 3, 3), (1, 1), (1, 1), (1, 1))

Hope this helps, thanks.
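For reference on where the huge num_filter comes from: in the failing workload from the log in post #1, the weight-gradient group conv has weight shape (94208, 1, 17, 17) with groups=736, and 94208 is exactly in_channels × out_channels of the forward conv. That relationship is inferred from the logged workload (batch size 1), not from TVM source, but it shows why the block count scales multiplicatively with both channel counts:

```python
# The group conv workload in the error log was
#   (94208, 1, 17, 17) weights, groups=736,
# and 94208 matches in_channels * out_channels of the forward conv
# (inferred from the log; forward conv was (128, 736, 1, 1)).
in_channels, out_channels = 736, 128
num_filter = in_channels * out_channels
print(num_filter)  # 94208, one block per filter in the fallback schedule
```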


#4

We can add a fallback config, something like https://github.com/dmlc/tvm/blob/master/topi/python/topi/x86/conv2d.py#L39
This will make sure the default config is valid so you can run with it, but it is difficult to achieve good performance that way.
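One way to think about such a fallback (a hypothetical sketch of the idea, not TVM's actual config API): pick the smallest tile factor that keeps the block count along one grid dimension under CUDA's 65535 cap, so the default launch is at least valid:

```python
# Illustration of a block-count-safe fallback split.
# fallback_tile is a hypothetical helper, not part of TVM.
MAX_GRID_Y = 65535  # CUDA's gridDim.y limit

def fallback_tile(extent, limit=MAX_GRID_Y):
    """Smallest tile size such that ceil(extent / tile) <= limit."""
    tile = 1
    while (extent + tile - 1) // tile > limit:
        tile += 1
    return tile

tile = fallback_tile(94208)              # the failing workload's 94208 filters
print(tile, (94208 + tile - 1) // tile)  # tile=2 gives 47104 blocks, a valid launch
```

Performance would still be poor, as noted above, but the kernel would at least launch without tuning.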


#5

OK, I will try that, thank you.