[AutoTuning] How to debug when all trials are failing on GPU


#1

I’m trying to auto-tune models running on GPUs. However, the GFLOPS of all trials is 0.0 and each trial reports different errors.

To check whether the issue is model-specific, I also tried the tutorial example tutorials/autotvm/tune_conv2d_cuda.py, with n_trial set to 2000 and timeout set to 100.

My question is: how should I interpret the result that all trials are failing, and how can I debug the cause? I noticed topi/recipe/gemm/gemm_int8.py, so perhaps I can check whether the same issue happens with just GEMM. Is there an auto-tuning GEMM example for float32 on GPUs, or can you provide some guidance on how to modify gemm_int8 to work with float32?


measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(repeat=3, min_repeat_ms=100, timeout=100)
)

tuner = autotvm.tuner.XGBTuner(task)
tuner.tune(n_trial=2000,
           measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file('conv2d.log')])

Trace log:
~/workspace/TVM/tutorials/autotvm$ python3 tune_conv2d_cuda.py
ConfigSpace (len=10454400, space_map=
0 tile_f: Split(policy=all, product=512, num_outputs=4) len=220
1 tile_y: Split(policy=all, product=7, num_outputs=4) len=4
2 tile_x: Split(policy=all, product=7, num_outputs=4) len=4
3 tile_rc: Split(policy=all, product=512, num_outputs=3) len=55
4 tile_ry: Split(policy=all, product=3, num_outputs=3) len=3
5 tile_rx: Split(policy=all, product=3, num_outputs=3) len=3
6 auto_unroll_max_step: OtherOption([0, 512, 1500]) len=3
7 unroll_explicit: OtherOption([0, 1]) len=2
)
Get devices for measurement successfully!
No: 1 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):\n [bt] (8) /home/minjiaz/workspace/tvm_public/tvm/build/libtvm.so(+0x101a1b8) [0x7fdf113ba1b8]\n [bt] (7) /home/minjiaz/workspace/tvm_public/tvm/build/libtvm.so(+0x1019f18) [0x7fdf113b9f18]\n [bt] (6) /home/minjiaz/workspace/tvm_public/tvm/build/libtvm.so(+0x1020848) [0x7fdf113c0848]\n [bt] (5) /home/minjiaz/workspace/tvm_public/tvm/build/libtvm.so(+0x1020061) [0x7fdf113c0061]\n [bt] (4) /home/minjiaz/workspace/tvm_public/tvm/build/libtvm.so(+0x101ec55) [0x7fdf113bec55]\n [bt] (3) /home/minjiaz/workspace/tvm_public/tvm/build/libtvm.so(+0x10182b7) [0x7fdf113b82b7]\n [bt] (2) /home/minjiaz/workspace/tvm_public/tvm/build/libtvm.so(+0x101e590) [0x7fdf113be590]\n [bt] (1) /home/minjiaz/workspace/tvm_public/tvm/build/libtvm.so(+0x1019259) [0x7fdf113b9259]\n [bt] (0) /home/minjiaz/workspace/tvm_public/tvm/build/libtvm.so(+0xff816b) [0x7fdf1139816b]\n File “tvm/_ffi/_cython/./function.pxi”, line 56, in tvm._ffi._cy3.core.tvm_callback\n File "/home/minjiaz/workspace/tvm_public/tvm/py’,),), error_no=4, all_cost=0.8545262813568115, timestamp=1559694901.713304) [(‘tile_f’, [2, 2, 4, 32]), (‘tile_y’, [1, 7, 1, 1]), (‘tile_x’, [7, 1, 1, 1]), (‘tile_rc’, [128, 1, 4]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [1, 3, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],None,5875295
No: 3 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n [bt] (1) /home/minjiaz/workspace/tvm_public/tvm/build/libtvm.so(TVMFuncCall+0x61) [0x7f2496145ef1]\n [bt] (0) /home/minjiaz/workspace/tvm_public/tvm/build/libtvm.so(+0xff816b) [0x7f249614116b]\n File “tvm/_ffi/_cython/./function.pxi”, line 56, in tvm._ffi._cy3.core.tvm_callback\n File “/home/minjiaz/workspace/tvm_public/tvm/python/tvm/autotvm/measure/measure_methods.py”, line 596, in verify_pass\n raise InstantiationError(“Skipped because of invalid gpu kernel”)\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel’,),), error_no=1, all_cost=0.09589171409606934, timestamp=1559694893.6797783) [(‘tile_f’, [1, 2, 128, 2]), (‘tile_y’, [7, 1, 1, 1]), (‘tile_x’, [1, 7, 1, 1]), (‘tile_rc’, [8, 4, 16]), (‘tile_ry’, [3, 1, 1]), (‘tile_rx’, [1, 1, 3]), (‘auto_unroll_max_step’, 1500), (‘unroll_explicit’, 1)],None,10001298


Too many errors happen in the tuning. Now is in debug mode
Too many errors happen in the tuning. Now is in debug mode
WARNING:autotvm:Too many errors happen in the tuning. Now is in debug mode

DEBUG:autotvm:Finish loading 2000 records
WARNING:autotvm:Cannot find config for target=cuda, workload=('conv2d_no_batching', 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)). A fallback configuration is used, which may bring great performance regression.

Best config:
,None,None
Finish loading 2000 records
Finish loading 2000 records
Time cost of this operator: 0.003547

conv2d.log:
{"r": [[1000000000.0], 4, 0.8545262813568115, 1559694901.713304], "v": 0.1, "i": ["cuda", "conv2d_no_batching", [1, 7, 7, 512, 512, 3, 3, [1, 1], [1, 1]], {}, ["conv2d_no_batching", 1, 7, 7, 512, 512, 3, 3, [1, 1], [1, 1]], {"t": "", "c": null, "e": [["tile_f", "sp", [2, 2, 4, 32]], ["tile_y", "sp", [1, 7, 1, 1]], ["tile_x", "sp", [7, 1, 1, 1]], ["tile_rc", "sp", [128, 1, 4]], ["tile_ry", "sp", [3, 1, 1]], ["tile_rx", "sp", [1, 3, 1]], ["auto_unroll_max_step", "ot", 0], ["unroll_explicit", "ot", 1]], "i": 5875295}]}
{"r": [[1000000000.0], 4, 2.5904312133789062, 1559694902.1608183], "v": 0.1, "i": ["cuda", "conv2d_no_batching", [1, 7, 7, 512, 512, 3, 3, [1, 1], [1, 1]], {}, ["conv2d_no_batching", 1, 7, 7, 512, 512, 3, 3, [1, 1], [1, 1]], {"t": "", "c": null, "e": [["tile_f", "sp", [4, 2, 8, 8]], ["tile_y", "sp", [1, 7, 1, 1]], ["tile_x", "sp", [7, 1, 1, 1]], ["tile_rc", "sp", [128, 2, 2]], ["tile_ry", "sp", [1, 1, 3]], ["tile_rx", "sp", [1, 3, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]], "i": 9719095}]}
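
As a first triage step, it may help to count which error code dominates in conv2d.log rather than reading individual tracebacks. The sketch below assumes the file on disk is plain JSON, one record per line, with the "r" field laid out as [costs, error_no, all_cost, timestamp] as in the records above. The code-to-name table is my recollection of autotvm's MeasureErrorNo enum and should be checked against your TVM checkout; summarize_errors is a hypothetical helper name, not part of TVM.

```python
import json
from collections import Counter

# Assumed mapping, recalled from autotvm's MeasureErrorNo enum.
# Verify it against the TVM version you are actually running.
ERROR_NAMES = {
    0: "NO_ERROR",
    1: "INSTANTIATION_ERROR",
    2: "COMPILE_HOST",
    3: "COMPILE_DEVICE",
    4: "RUNTIME_DEVICE",
    5: "WRONG_ANSWER",
    6: "BUILD_TIMEOUT",
    7: "RUN_TIMEOUT",
    8: "UNKNOWN_ERROR",
}


def summarize_errors(lines):
    """Count tuning records per error code (hypothetical helper).

    `lines` is any iterable of JSON strings, e.g. an open log file.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        error_no = record["r"][1]  # second entry of "r" is the error code
        counts[ERROR_NAMES.get(error_no, "code %d" % error_no)] += 1
    return counts
```

Passing the open log file, as in summarize_errors(open("conv2d.log")), returns a Counter keyed by error name. If almost everything falls under one runtime or compile code instead of being spread across instantiation errors, the environment (driver, toolchain, LLVM build) is a more likely suspect than the schedule space.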


#2

Can you check whether a standalone function runs correctly on the GPU? For example, try a tutorial like https://docs.tvm.ai/tutorials/optimize/opt_conv_cuda.html


#3

Thanks, eqy. The issue was caused by an incompatibility with LLVM 7.0. Switching to an older LLVM version solved it.


#4

Which LLVM version was working in the end?