Error when Autotuning Yolo on CUDA

I am trying to compile yolo v3 tiny for a cuda target. However, I run into issues when trying to run inference.

WARNING:autotvm:Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 13, 13, 512), 'float32'), ('TENSOR', (1, 1, 512, 425), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 15, 15, 1024), 'float32'), ('TENSOR', (3, 3, 1024, 512), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 15, 15, 512), 'float32'), ('TENSOR', (3, 3, 512, 1024), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 15, 15, 256), 'float32'), ('TENSOR', (3, 3, 256, 512), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 28, 28, 128), 'float32'), ('TENSOR', (3, 3, 128, 256), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 54, 54, 64), 'float32'), ('TENSOR', (3, 3, 64, 128), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 106, 106, 32), 'float32'), ('TENSOR', (3, 3, 32, 64), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 210, 210, 16), 'float32'), ('TENSOR', (3, 3, 16, 32), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 418, 418, 3), 'float32'), ('TENSOR', (3, 3, 3, 16), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Traceback (most recent call last):
  File "./scripts/tvm_cli/tune_cuda.py", line 261, in <module>
    compile(info)
  File "./scripts/tvm_cli/tune_cuda.py", line 188, in compile
    module.run()
  File "/usr/local/lib/python3.6/dist-packages/tvm-0.7.dev1-py3.6-linux-x86_64.egg/tvm/contrib/graph_runtime.py", line 182, in run
    self._run()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 321, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 256, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 245, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 160, in tvm._ffi._cy3.core.CALL
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (3) /usr/local/lib/python3.6/dist-packages/tvm-0.7.dev1-py3.6-linux-x86_64.egg/tvm/libtvm.so(TVMFuncCall+0x65) [0x7efb72d3c985]
  [bt] (2) /usr/local/lib/python3.6/dist-packages/tvm-0.7.dev1-py3.6-linux-x86_64.egg/tvm/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::detail::PackFuncVoidAddr_<8, tvm::runtime::CUDAWrappedFunc>(tvm::runtime::CUDAWrappedFunc, std::vector<tvm::runtime::detail::ArgConvertCode, std::allocator<tvm::runtime::detail::ArgConvertCode> > const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0xb6) [0x7efb72de8836]
  [bt] (1) /usr/local/lib/python3.6/dist-packages/tvm-0.7.dev1-py3.6-linux-x86_64.egg/tvm/libtvm.so(tvm::runtime::CUDAWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*, void**) const+0x567) [0x7efb72de83c7]
  [bt] (0) /usr/local/lib/python3.6/dist-packages/tvm-0.7.dev1-py3.6-linux-x86_64.egg/tvm/libtvm.so(+0x1043a02) [0x7efb72de4a02]
  File "../src/runtime/cuda/cuda_module.cc", line 190
  File "../src/runtime/library_module.cc", line 78
TVMError: Check failed: ret == 0 (-1 vs. 0) : CUDALaunch Error: CUDA_ERROR_INVALID_VALUE
 grid=(2,1,173056),  block=(4,4,1)
// func_name=fused_nn_conv2d_subtract_divide_multiply_add_multiply_maximum_7_kernel0
// CUDA Source
// -----------
extern "C" __global__ void fused_nn_pad_1_kernel0(void* __restrict__ T_pad, void* __restrict__ placeholder) {
  ((float*)T_pad)[(((((int)blockIdx.x) * 1024) + ((int)threadIdx.x)))] = (((((15 <= ((int)blockIdx.x)) && (((int)blockIdx.x) < 210)) && (1 <= (((int)blockIdx.x) % 15))) && ((((int)blockIdx.x) % 15) < 14)) ? ((float*)placeholder)[((((((((int)blockIdx.x) / 15) * 13312) + ((((int)blockIdx.x) % 15) * 1024)) + ((int)threadIdx.x)) - 14336))] : 0.000000e+00f);
}

extern "C" __global__ void fused_nn_max_pool2d_kernel0(void* __restrict__ placeholder, void* __restrict__ tensor) {
  float tensor_local[1];
  tensor_local[(0)] = -3.402823e+38f;
  for (int rv = 0; rv < 2; ++rv) {
    for (int rv1 = 0; rv1 < 2; ++rv1) {
...

From various sources, it is suggested that I should run autotvm to find a runnable configuration. So I followed the tutorial here: https://tvm.apache.org/docs/tutorials/autotvm/tune_relay_cuda.html. However the autotuning process runs into issues.

Tuning...
[Task  1/17]  Current/Best:  112.75/ 330.70 GFLOPS | Progress: (100/100) | 251.02 s Done.
[Task  2/17]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (100/100) | 83.30 s Done.
[Task  3/17]  Current/Best: 1215.25/1867.56 GFLOPS | Progress: (100/100) | 108.53 s Done.
[Task  4/17]  Current/Best:  348.43/ 575.82 GFLOPS | Progress: (100/100) | 171.86 s Done.
Traceback (most recent call last):
  File "scripts/tvm_cli/tune_relay_cuda.py", line 380, in <module>
    tune_and_evaluate(tuning_option)
  File "scripts/tvm_cli/tune_relay_cuda.py", line 237, in tune_and_evaluate
    tune_tasks(tasks, **tuning_opt)
  File "scripts/tvm_cli/tune_relay_cuda.py", line 207, in tune_tasks
    tuner_obj.load_history(autotvm.record.load_from_file(tmp_log_file))
  File "/usr/local/lib/python3.6/dist-packages/tvm-0.7.dev1-py3.6-linux-x86_64.egg/tvm/autotvm/tuner/model_based_tuner.py", line 288, in load_history
    success = base_model.fit_log(data_set, self.plan_size)
  File "/usr/local/lib/python3.6/dist-packages/tvm-0.7.dev1-py3.6-linux-x86_64.egg/tvm/autotvm/tuner/xgboost_cost_model.py", line 244, in fit_log
    for x, y in res:
TypeError: 'NoneType' object is not iterable

Any Idea how to debug this?

Looks like an issue in XGBTuner. You can try to use random tuner to work around first.

cc @merrymercy

@comaniac Do you have any idea why the default config would not work out of the box?

Which GPU are you running? Default config is just a heuristic config without any consideration, so it might not fit the GPU you’re running.

I have GTX2080 with 8GB gram. So I hoped this network would fit, because this is the tiny version of yolo.

I am able to get past where it went wrong before with the random tuner. Thank you for your suggestion. However, I am getting some 0GFLOPS in some of the tasks. Is this of any concern? What are these tasks actually doing?

[Task  1/17]  Current/Best:  299.09/ 302.91 GFLOPS | Progress: (100/100) | 86.70 s Done.
[Task  2/17]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (100/100) | 73.79 s Done.
[Task  3/17]  Current/Best: 1483.78/1704.48 GFLOPS | Progress: (100/100) | 25.36 s Done.
[Task  4/17]  Current/Best:  303.86/ 571.06 GFLOPS | Progress: (100/100) | 146.23 s Done.
[Task  5/17]  Current/Best:    0.00/ 486.94 GFLOPS | Progress: (100/100) | 11.36 s Done.
[Task  6/17]  Current/Best:  384.00/ 547.98 GFLOPS | Progress: (100/100) | 158.49 s Done.
[Task  7/17]  Current/Best:   48.97/4226.84 GFLOPS | Progress: (100/100) | 132.57 s Done.
[Task  8/17]  Current/Best:  124.16/ 623.38 GFLOPS | Progress: (100/100) | 171.35 s Done.
[Task  9/17]  Current/Best: 1582.20/2292.12 GFLOPS | Progress: (100/100) | 86.78 s Done.
[Task 10/17]  Current/Best:  163.28/ 517.78 GFLOPS | Progress: (100/100) | 194.39 s Done.

The network size is not the most important issue you have encountered so far. It’s possible that even one conv2d cannot fit to the GPU due to the share memory and wrap size limitation. Those issues would be much less if you use V100, for example.

In your log, it means that all 100 trials were failed for task 2. You can check the tuning log for the error no. For tuning on GPU, we recommend at least 3,000 or 4,000 trials for each task to achieve reasonable performance. If you just want to get it work first regardless the performance, you may retune some tasks (e.g., task 2) for more trials until AutoTVM found at least one valid config.

I turned up the iterations. on task 2. In fact i set it to the same number as the entire config_space which is not that large ~1200. However not a single configuration worked.

How can I debug this further? what does the tasks correspond to? Do they correspond to different operators? Nodes in the graph? Knobs?

To answer my own question, I think 1 task correspond to conv_2d of a particular size in the graph.

I can see in error message of Task 2

TVMError: CUDALaunch Error: CUDA_ERROR_INVALID_VALUE
 grid=(1,1,173056),  block=(16,8,1)

Task 2 havs a small exploration space because the third dimension of grid is always stuck at 173056. while the first dimension of grid only flips between 1 and 2.

What does this grid mean?

No idea about grid and block. What’s the task name of task 2?

Task 2 did not find any solution did not seem to affect the autotvm process. it successfully generated a log file. I was able to run inference as well. (but it was slower than the gpu)

This is the name of the task 2:

Task(func_name=conv2d_nhwc.cuda, args=(('TENSOR', (1, 15, 15, 1024), 'float32'), ('TENSOR', (3, 3, 1024, 512), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nhwc.cuda', ('TENSOR', (1, 15, 15, 1024), 'float32'), ('TENSOR', (3, 3, 1024, 512), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'))

And this is the final generated log:

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc_winograd_direct.cuda", [["TENSOR", [1, 418, 418, 3], "float32"], ["TENSOR", [3, 3, 3, 16], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 9569422382, "code_hash": null, "entity": [["thread_num_inverse", "ot", 64], ["thread_num_data", "ot", 32], ["thread_num_kernel", "ot", 1], ["offset_inverse", "ot", 4], ["offset_data", "ot", 4], ["offset_kernel", "ot", 2], ["inverse_in_vector", "ot", 2], ["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 2, 8, 2]], ["tile_x", "sp", [-1, 4, 2, 1]], ["tile_rc", "sp", [-1, 3]], ["offset_bgemm", "ot", 8], ["vector_bgemm", "ot", 1]]}, "result": [[0.00045494442771084335], 0, 1.6892707347869873, 1596808669.6243942], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc_winograd_tensorcore.cuda", [["TENSOR", [1, 210, 210, 16], "float32"], ["TENSOR", [3, 3, 16, 32], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 2322597763, "code_hash": null, "entity": [["thread_num_inverse", "ot", 128], ["thread_num_data", "ot", 64], ["thread_num_kernel", "ot", 1], ["offset_inverse", "ot", 0], ["offset_data", "ot", 2], ["offset_kernel", "ot", 2], ["inverse_in_vector", "ot", 1], ["block_row_warps", "ot", 1], ["block_col_warps", "ot", 1], ["warp_row_tiles", "ot", 2], ["warp_col_tiles", "ot", 1], ["chunk", "ot", 1], ["offset", "ot", 8], ["offsetCS", "ot", 8], ["vec", "ot", 1], ["wmma_m", "ot", 32]]}, "result": [[0.00019930246134868421], 0, 2.1464030742645264, 1596808766.680427], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc_winograd_tensorcore.cuda", [["TENSOR", [1, 106, 106, 32], "float32"], ["TENSOR", [3, 3, 32, 64], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 44075756, "code_hash": null, "entity": [["thread_num_inverse", "ot", 32], ["thread_num_data", "ot", 32], ["thread_num_kernel", "ot", 1], ["offset_inverse", "ot", 2], ["offset_data", "ot", 4], ["offset_kernel", "ot", 4], ["inverse_in_vector", "ot", 2], ["block_row_warps", "ot", 1], ["block_col_warps", "ot", 1], ["warp_row_tiles", "ot", 1], ["warp_col_tiles", "ot", 1], ["chunk", "ot", 2], ["offset", "ot", 8], ["offsetCS", "ot", 0], ["vec", "ot", 1], ["wmma_m", "ot", 16]]}, "result": [[0.00010162571786690976], 0, 1.9622037410736084, 1596808931.107104], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc_winograd_direct.cuda", [["TENSOR", [1, 54, 54, 64], "float32"], ["TENSOR", [3, 3, 64, 128], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 11164870461, "code_hash": null, "entity": [["thread_num_inverse", "ot", 32], ["thread_num_data", "ot", 64], ["thread_num_kernel", "ot", 128], ["offset_inverse", "ot", 4], ["offset_data", "ot", 4], ["offset_kernel", "ot", 2], ["inverse_in_vector", "ot", 2], ["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 4, 13]], ["tile_x", "sp", [-1, 1, 16, 4]], ["tile_rc", "sp", [-1, 4]], ["offset_bgemm", "ot", 2], ["vector_bgemm", "ot", 2]]}, "result": [[7.327778189550426e-05], 0, 2.0689899921417236, 1596809138.3889282], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc_winograd_direct.cuda", [["TENSOR", [1, 28, 28, 128], "float32"], ["TENSOR", [3, 3, 128, 256], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 844580911, "code_hash": null, "entity": [["thread_num_inverse", "ot", 32], ["thread_num_data", "ot", 64], ["thread_num_kernel", "ot", 32], ["offset_inverse", "ot", 4], ["offset_data", "ot", 4], ["offset_kernel", "ot", 1], ["inverse_in_vector", "ot", 1], ["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 13, 1, 1]], ["tile_x", "sp", [-1, 2, 16, 2]], ["tile_rc", "sp", [-1, 16]], ["offset_bgemm", "ot", 4], ["vector_bgemm", "ot", 1]]}, "result": [[9.385109207459207e-05], 0, 2.265333890914917, 1596809475.2130105], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc.cuda", [["TENSOR", [1, 28, 28, 128], "float32"], ["TENSOR", [3, 3, 128, 256], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 230, "code_hash": null, "entity": [["tile_n", "ot", 8], ["tile_c", "ot", 4], ["num_thread_n", "ot", 8], ["num_thread_c", "ot", 16], ["vthread_n", "ot", 1], ["vthread_c", "ot", 2], ["step", "ot", 16]]}, "result": [[0.0007314728575757575], 0, 2.0213303565979004, 1596809609.9690342], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc.cuda", [["TENSOR", [1, 15, 15, 256], "float32"], ["TENSOR", [3, 3, 256, 512], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 678, "code_hash": null, "entity": [["tile_n", "ot", 2], ["tile_c", "ot", 4], ["num_thread_n", "ot", 4], ["num_thread_c", "ot", 8], ["vthread_n", "ot", 1], ["vthread_c", "ot", 1], ["step", "ot", 32]]}, "result": [[0.0007752656313131312], 0, 1.6025142669677734, 1596809818.1569047], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc_winograd_direct.cuda", [["TENSOR", [1, 15, 15, 512], "float32"], ["TENSOR", [3, 3, 512, 1024], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 7369000834, "code_hash": null, "entity": [["thread_num_inverse", "ot", 256], ["thread_num_data", "ot", 32], ["thread_num_kernel", "ot", 128], ["offset_inverse", "ot", 4], ["offset_data", "ot", 0], ["offset_kernel", "ot", 4], ["inverse_in_vector", "ot", 4], ["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 7, 7]], ["tile_x", "sp", [-1, 4, 16, 1]], ["tile_rc", "sp", [-1, 8]], ["offset_bgemm", "ot", 8], ["vector_bgemm", "ot", 4]]}, "result": [[0.00031195939096774196], 0, 2.1854963302612305, 1596810081.897409], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc_winograd_direct.cuda", [["TENSOR", [1, 15, 15, 1024], "float32"], ["TENSOR", [3, 3, 1024, 512], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 92423074, "code_hash": null, "entity": [["thread_num_inverse", "ot", 256], ["thread_num_data", "ot", 256], ["thread_num_kernel", "ot", 64], ["offset_inverse", "ot", 2], ["offset_data", "ot", 1], ["offset_kernel", "ot", 4], ["inverse_in_vector", "ot", 2], ["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 7, 7, 1]], ["tile_x", "sp", [-1, 2, 4, 2]], ["tile_rc", "sp", [-1, 4]], ["offset_bgemm", "ot", 0], ["vector_bgemm", "ot", 1]]}, "result": [[0.0003508235021770682], 0, 2.213620901107788, 1596810365.3167224], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc.cuda", [["TENSOR", [1, 15, 15, 1024], "float32"], ["TENSOR", [3, 3, 1024, 512], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 202, "code_hash": null, "entity": [["tile_n", "ot", 4], ["tile_c", "ot", 4], ["num_thread_n", "ot", 8], ["num_thread_c", "ot", 8], ["vthread_n", "ot", 1], ["vthread_c", "ot", 2], ["step", "ot", 16]]}, "result": [[0.003418024622222222], 0, 1.7397112846374512, 1596810564.309492], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc.cuda", [["TENSOR", [1, 210, 210, 16], "float32"], ["TENSOR", [3, 3, 16, 32], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 274, "code_hash": null, "entity": [["tile_n", "ot", 4], ["tile_c", "ot", 4], ["num_thread_n", "ot", 4], ["num_thread_c", "ot", 8], ["vthread_n", "ot", 2], ["vthread_c", "ot", 2], ["step", "ot", 16]]}, "result": [[0.0006982166502890174], 0, 2.1854822635650635, 1596752251.0586443], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc.cuda", [["TENSOR", [1, 106, 106, 32], "float32"], ["TENSOR", [3, 3, 32, 64], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 111, "code_hash": null, "entity": [["tile_n", "ot", 2], ["tile_c", "ot", 4], ["num_thread_n", "ot", 4], ["num_thread_c", "ot", 8], ["vthread_n", "ot", 2], ["vthread_c", "ot", 1], ["step", "ot", 16]]}, "result": [[0.0007276232560386474], 0, 1.6302690505981445, 1596752496.2037885], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc.cuda", [["TENSOR", [1, 54, 54, 64], "float32"], ["TENSOR", [3, 3, 64, 128], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 301, "code_hash": null, "entity": [["tile_n", "ot", 4], ["tile_c", "ot", 4], ["num_thread_n", "ot", 4], ["num_thread_c", "ot", 16], ["vthread_n", "ot", 2], ["vthread_c", "ot", 2], ["step", "ot", 16]]}, "result": [[0.0006396164261603376], 0, 1.6311135292053223, 1596752778.9645839], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc_winograd_direct.cuda", [["TENSOR", [1, 15, 15, 256], "float32"], ["TENSOR", [3, 3, 256, 512], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 2661384769, "code_hash": null, "entity": [["thread_num_inverse", "ot", 256], ["thread_num_data", "ot", 128], ["thread_num_kernel", "ot", 1], ["offset_inverse", "ot", 0], ["offset_data", "ot", 2], ["offset_kernel", "ot", 0], ["inverse_in_vector", "ot", 4], ["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 7, 7, 1]], ["tile_x", "sp", [-1, 1, 64, 1]], ["tile_rc", "sp", [-1, 16]], ["offset_bgemm", "ot", 2], ["vector_bgemm", "ot", 2]]}, "result": [[8.728379057971015e-05], 0, 2.2499608993530273, 1596753135.8903747], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc.cuda", [["TENSOR", [1, 15, 15, 512], "float32"], ["TENSOR", [3, 3, 512, 1024], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 113, "code_hash": null, "entity": [["tile_n", "ot", 8], ["tile_c", "ot", 4], ["num_thread_n", "ot", 4], ["num_thread_c", "ot", 8], ["vthread_n", "ot", 2], ["vthread_c", "ot", 1], ["step", "ot", 16]]}, "result": [[0.003101637875], 0, 3.969357490539551, 1596753684.8480952], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nhwc.cuda", [["TENSOR", [1, 13, 13, 512], "float32"], ["TENSOR", [1, 1, 512, 425], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 760, "code_hash": null, "entity": [["tile_n", "ot", 4], ["tile_c", "ot", 4], ["num_thread_n", "ot", 4], ["num_thread_c", "ot", 8], ["vthread_n", "ot", 2], ["vthread_c", "ot", 1], ["step", "ot", 32]]}, "result": [[0.00023400437094281296], 0, 1.702641248703003, 1596754126.4415483], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw.cuda", [["TENSOR", [1, 256, 14, 14], "float32"], ["TENSOR", [512, 256, 1, 1], "float32"], [2, 2], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 143261, "code_hash": null, "entity": [["tile_f", "sp", [-1, 2, 32, 1]], ["tile_y", "sp", [-1, 1, 1, 7]], ["tile_x", "sp", [-1, 1, 7, 1]], ["tile_rc", "sp", [-1, 16]], ["tile_ry", "sp", [-1, 1]], ["tile_rx", "sp", [-1, 1]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 1]]}, "result": [[3.086108868265017e-05], 0, 2.1766304969787598, 1596638736.6176426], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw.cuda", [["TENSOR", [1, 128, 28, 28], "float32"], ["TENSOR", [256, 128, 1, 1], "float32"], [2, 2], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 1116793, "code_hash": null, "entity": [["tile_f", "sp", [-1, 4, 16, 2]], ["tile_y", "sp", [-1, 1, 1, 1]], ["tile_x", "sp", [-1, 2, 7, 1]], ["tile_rc", "sp", [-1, 4]], ["tile_ry", "sp", [-1, 1]], ["tile_rx", "sp", [-1, 1]], ["auto_unroll_max_step", "ot", 0], ["unroll_explicit", "ot", 1]]}, "result": [[2.1558027118334548e-05], 0, 2.02054500579834, 1596638875.932826], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw.cuda", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [128, 64, 1, 1], "float32"], [2, 2], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 1099263, "code_hash": null, "entity": [["tile_f", "sp", [-1, 1, 64, 2]], ["tile_y", "sp", [-1, 1, 1, 1]], ["tile_x", "sp", [-1, 1, 7, 4]], ["tile_rc", "sp", [-1, 32]], ["tile_ry", "sp", [-1, 1]], ["tile_rx", "sp", [-1, 1]], ["auto_unroll_max_step", "ot", 0], ["unroll_explicit", "ot", 0]]}, "result": [[1.8940856249498918e-05], 0, 2.005221366882324, 1596639012.0224953], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw.cuda", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [64, 64, 1, 1], "float32"], [1, 1], [0, 0, 0, 0], [1, 1], "float32"], {}], "config": {"index": 20053817, "code_hash": null, "entity": [["tile_f", "sp", [-1, 1, 2, 16]], ["tile_y", "sp", [-1, 2, 4, 1]], ["tile_x", "sp", [-1, 1, 14, 1]], ["tile_rc", "sp", [-1, 4]], ["tile_ry", "sp", [-1, 1]], ["tile_rx", "sp", [-1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}, "result": [[1.3704667724308127e-05], 0, 2.1790435314178467, 1596639247.948683], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw.cuda", [["TENSOR", [1, 3, 224, 224], "float32"], ["TENSOR", [64, 3, 7, 7], "float32"], [2, 2], [3, 3, 3, 3], [1, 1], "float32"], {}], "config": {"index": 63835487, "code_hash": null, "entity": [["tile_f", "sp", [-1, 2, 16, 1]], ["tile_y", "sp", [-1, 4, 7, 1]], ["tile_x", "sp", [-1, 1, 1, 8]], ["tile_rc", "sp", [-1, 1]], ["tile_ry", "sp", [-1, 7]], ["tile_rx", "sp", [-1, 7]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 1]]}, "result": [[8.191113981042654e-05], 0, 2.6882503032684326, 1596639396.3446727], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw_winograd.cuda", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [64, 64, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "float32"], {}], "config": {"index": 97438, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 2, 32]], ["tile_x", "sp", [-1, 1, 49, 4]], ["tile_rc", "sp", [-1, 16]], ["auto_unroll_max_step", "ot", 128], ["unroll_explicit", "ot", 0]]}, "result": [[3.224812635135135e-05], 0, 4.217549562454224, 1596639608.6664226], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw.cuda", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [64, 64, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "float32"], {}], "config": {"index": 57167252, "code_hash": null, "entity": [["tile_f", "sp", [-1, 2, 8, 2]], ["tile_y", "sp", [-1, 4, 1, 1]], ["tile_x", "sp", [-1, 1, 28, 1]], ["tile_rc", "sp", [-1, 2]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 3]], ["auto_unroll_max_step", "ot", 0], ["unroll_explicit", "ot", 1]]}, "result": [[6.48782944519621e-05], 0, 2.0824203491210938, 1596639986.995053], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw.cuda", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [128, 64, 3, 3], "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "float32"], {}], "config": {"index": 25992895, "code_hash": null, "entity": [["tile_f", "sp", [-1, 2, 8, 2]], ["tile_y", "sp", [-1, 2, 2, 1]], ["tile_x", "sp", [-1, 1, 14, 1]], ["tile_rc", "sp", [-1, 4]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 3]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 1]]}, "result": [[6.261991542678152e-05], 0, 2.3594276905059814, 1596640424.8481562], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw_winograd.cuda", [["TENSOR", [1, 128, 28, 28], "float32"], ["TENSOR", [128, 128, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "float32"], {}], "config": {"index": 125218, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 16, 2]], ["tile_x", "sp", [-1, 7, 2, 2]], ["tile_rc", "sp", [-1, 4]], ["auto_unroll_max_step", "ot", 128], ["unroll_explicit", "ot", 0]]}, "result": [[3.978000149055979e-05], 0, 4.92205023765564, 1596640735.21119], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw.cuda", [["TENSOR", [1, 128, 28, 28], "float32"], ["TENSOR", [128, 128, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "float32"], {}], "config": {"index": 29448294, "code_hash": null, "entity": [["tile_f", "sp", [-1, 1, 8, 2]], ["tile_y", "sp", [-1, 4, 1, 1]], ["tile_x", "sp", [-1, 1, 14, 1]], ["tile_rc", "sp", [-1, 2]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 3]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 1]]}, "result": [[7.648591985951469e-05], 0, 2.093278408050537, 1596641115.4107127], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw.cuda", [["TENSOR", [1, 128, 28, 28], "float32"], ["TENSOR", [256, 128, 3, 3], "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "float32"], {}], "config": {"index": 7875316, "code_hash": null, "entity": [["tile_f", "sp", [-1, 2, 16, 1]], ["tile_y", "sp", [-1, 2, 1, 1]], ["tile_x", "sp", [-1, 2, 7, 1]], ["tile_rc", "sp", [-1, 4]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 3]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}, "result": [[7.978286595531844e-05], 0, 2.290926694869995, 1596641311.853063], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw_winograd.cuda", [["TENSOR", [1, 256, 14, 14], "float32"], ["TENSOR", [256, 256, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "float32"], {}], "config": {"index": 81723, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 8, 1, 2]], ["tile_x", "sp", [-1, 1, 49, 1]], ["tile_rc", "sp", [-1, 16]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}, "result": [[5.0846598899936535e-05], 0, 2.4042413234710693, 1596641706.4003792], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw.cuda", [["TENSOR", [1, 256, 14, 14], "float32"], ["TENSOR", [256, 256, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "float32"], {}], "config": {"index": 8799805, "code_hash": null, "entity": [["tile_f", "sp", [-1, 2, 8, 1]], ["tile_y", "sp", [-1, 1, 2, 1]], ["tile_x", "sp", [-1, 7, 2, 1]], ["tile_rc", "sp", [-1, 2]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 3]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}, "result": [[0.00011983135475834579], 0, 2.174240827560425, 1596641921.8496149], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw.cuda", [["TENSOR", [1, 256, 14, 14], "float32"], ["TENSOR", [512, 256, 3, 3], "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "float32"], {}], "config": {"index": 608554, "code_hash": null, "entity": [["tile_f", "sp", [-1, 1, 16, 1]], ["tile_y", "sp", [-1, 1, 7, 1]], ["tile_x", "sp", [-1, 1, 1, 7]], ["tile_rc", "sp", [-1, 2]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 3]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 1]]}, "result": [[0.00017544716376811595], 0, 2.2342605590820312, 1596642145.242544], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw_winograd.cuda", [["TENSOR", [1, 512, 7, 7], "float32"], ["TENSOR", [512, 512, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "float32"], {}], "config": {"index": 414729, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 4, 8, 1]], ["tile_x", "sp", [-1, 1, 4, 4]], ["tile_rc", "sp", [-1, 8]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}, "result": [[5.793594312796208e-05], 0, 2.0294864177703857, 1596642299.7055194], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32", "conv2d_nchw.cuda", [["TENSOR", [1, 512, 7, 7], "float32"], ["TENSOR", [512, 512, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "float32"], {}], "config": {"index": 818654, "code_hash": null, "entity": [["tile_f", "sp", [-1, 1, 16, 1]], ["tile_y", "sp", [-1, 7, 1, 1]], ["tile_x", "sp", [-1, 1, 7, 1]], ["tile_rc", "sp", [-1, 4]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 3]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}, "result": [[0.00017657052846715328], 0, 2.1720330715179443, 1596642561.9635131], "version": 0.2, "tvm_version": "0.7.dev1"}

I can see the result of task 2 in the final log. This is strange as task 2 did not find any valid configuration during autotvm. So the optimization algorithm just picked something even though nothing worked? and it happened to be a actually valid configuration?

I think I found some answer:

This is task 3

Task(func_name=conv2d_nhwc_winograd_direct.cuda, args=(('TENSOR', (1, 15, 15, 1024), 'float32'), ('TENSOR', (3, 3, 1024, 512), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'), kwargs={}, workload=('conv2d_nhwc_winograd_direct.cuda', ('TENSOR', (1, 15, 15, 1024), 'float32'), ('TENSOR', (3, 3, 1024, 512), 'float32'), (1, 1), (0, 0, 0, 0), (1, 1), 'float32'))

This is the same size and shape operation as Task 2 but with winograd_direct optimisation. Hence during inference, this winograd_direct implementation will always be favoured over the non-optimized operator in Task 2. Hence inference always succeeds.

If this is the case, then it would be wrong to put the optimisation result of task 2 into the final log as it does not represent a valid configuration. @comaniac do you think this is a bug?

Secondly conv2d_nhwc.cuda does not seem to be able to handle this particular size and shape, is this an bug?

Your investigations are correct. Based on my understanding, TVM currently doesn’t optimize the NHWC layout and it might be the reason of all issues you have encountered. You can try to use a Relay pass ConvertLayout to change the layout to NCHW before running AutoTVM and build pipeline. This pass will insert two layout transform operators in the beginning and the end of the model to make all computations on NCHW.