[AutoTuning] How to debug an autotvm failure?

Any clues on troubleshooting the following AutoTuning failure?

The scenario is as follows:

  1. I performed the following setup, running everything in a docker container (https://hub.docker.com/r/mevermeulen/rocm-tvm:3.0):
     a. One process runs rpc_tracker.
     b. One process runs rpc_server.
     c. One process runs tutorials/autotvm/tune_relay_cuda.py with three changes: (i) use the rocm target, (ii) uncomment the tune_and_evaluate call, and (iii) also tune the dense node. (A sketch of these edits is below.)
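
For concreteness, here is a minimal sketch of those three edits, assuming the layout of the v0.6 tune_relay_cuda.py tutorial (mod and params come from the tutorial's get_network helper; tuning_option and tune_and_evaluate are also defined in that script, so this is not self-contained new code):

import tvm
from tvm import relay, autotvm

# (i) use the ROCm target instead of CUDA
target = tvm.target.rocm()

# (iii) inside tune_and_evaluate, extract dense tasks in addition to conv2d
tasks = autotvm.task.extract_from_program(
    mod["main"], target=target, params=params,
    ops=(relay.op.nn.conv2d, relay.op.nn.dense))

# (ii) uncomment the driver call at the bottom of the script
tune_and_evaluate(tuning_option)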

  2. This scenario starts off working fine, and I get the following progress markers as the tasks are tuned:

Extract tasks...
...68%, 0.08 MB, 196 KB/s, 0 seconds passed
...100%, 0.12 MB, 319 KB/s, 0 seconds passed
Tuning...
[Task  1/13]  Current/Best:  134.59/ 236.82 GFLOPS | Progress: (256/2000) | 326.89 s
[Task  1/13]  Current/Best:  130.54/ 389.97 GFLOPS | Progress: (512/2000) | 994.32 s
[Task  1/13]  Current/Best:  110.37/ 389.97 GFLOPS | Progress: (832/2000) | 1677.04 s
[Task  1/13]  Current/Best:   83.88/ 389.97 GFLOPS | Progress: (1088/2000) | 2155.99 s Done.
[Task  2/13]  Current/Best:  159.34/ 602.14 GFLOPS | Progress: (256/2000) | 421.93 s
[Task  2/13]  Current/Best:  267.45/ 602.14 GFLOPS | Progress: (512/2000) | 1255.12 s
[Task  2/13]  Current/Best:  297.13/ 602.14 GFLOPS | Progress: (704/2000) | 1843.77 s Done.
[Task  3/13]  Current/Best:  242.40/ 631.49 GFLOPS | Progress: (256/2000) | 585.56 s
[Task  3/13]  Current/Best:    0.00/ 675.91 GFLOPS | Progress: (512/2000) | 1518.75 s
  3. At the next line something goes wrong, and I get the following output (see the note after this log):
[Task  3/13]  Current/Best:    0.00/ 675.91 GFLOPS | Progress: (704/2000) | 1928.00 s
WARNING:autotvm:Too many errors happen in the tuning. Now is in debug mode
DEBUG:autotvm:No: 705  GFLOPS: 0.00/675.91  result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1578441854.4124289)  [('tile_f', [-1, 1, 64, 1]), ('tile_y', [-1, 2, 1, 2]), ('tile_x', [-1, 1, 4, 1]), ('tile_rc', [-1, 4]), ('tile_ry', [-1, 1]), ('tile_rx', [-1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],direct,None,434313
DEBUG:autotvm:No: 706  GFLOPS: 0.00/675.91  result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1578441854.4726968)  [('tile_f', [-1, 1, 128, 1]), ('tile_y', [-1, 1, 1, 4]), ('tile_x', [-1, 1, 4, 1]), ('tile_rc', [-1, 4]), ('tile_ry', [-1, 1]), ('tile_rx', [-1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],direct,None,435275
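
For reference (and please correct me if I'm misreading the sources), the error_no field in these MeasureResult lines is one of the codes defined in MeasureErrorNo in tvm/autotvm/measure/measure.py; dumping that enum makes the 6 above decodable without guessing:

from tvm.autotvm.measure import MeasureErrorNo

# Print the numeric code for each error type so the error_no values in the
# DEBUG lines above can be matched to a failure category (timeouts, compile
# errors on host/device, wrong answers, ...).
for name in dir(MeasureErrorNo):
    if not name.startswith('_'):
        print(name, getattr(MeasureErrorNo, name))
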
  4. This continues for a while, one entry per measured config, until autotvm crashes:
Traceback (most recent call last):

  File "tune_relay_rocm.py", line 260, in <module>
    tune_and_evaluate(tuning_option)

  File "tune_relay_rocm.py", line 229, in tune_and_evaluate
    tune_tasks(tasks, **tuning_opt)

  File "tune_relay_rocm.py", line 210, in tune_tasks
    autotvm.callback.log_to_file(tmp_log_file)])

  File "/usr/local/lib/python3.6/dist-packages/tvm-0.6.0-py3.6-linux-x86_64.egg/tvm/autotvm/tuner/xgboost_tuner.py", line 86, in tune
    super(XGBTuner, self).tune(*args, **kwargs)

  File "/usr/local/lib/python3.6/dist-packages/tvm-0.6.0-py3.6-linux-x86_64.egg/tvm/autotvm/tuner/tuner.py", line 108, in tune
    measure_batch = create_measure_batch(self.task, measure_option)

  File "/usr/local/lib/python3.6/dist-packages/tvm-0.6.0-py3.6-linux-x86_64.egg/tvm/autotvm/measure/measure.py", line 252, in create_measure_batch
    attach_objects = runner.set_task(task)

  File "/usr/local/lib/python3.6/dist-packages/tvm-0.6.0-py3.6-linux-x86_64.egg/tvm/autotvm/measure/measure_methods.py", line 211, in set_task
    raise RuntimeError("Cannot get remote devices from the tracker. "

RuntimeError: Cannot get remote devices from the tracker. Please check the status of tracker by 'python -m tvm.exec.query_rpc_tracker --port [THE PORT YOU USE]' and make sure you have free devices on the queue status.
  5. At this point I can still query the tracker, and it looks like it responds (a programmatic check follows the output below):
prompt% python3 -m tvm.exec.query_rpc_tracker --host 0.0.0.0 --port 9190
Tracker address 0.0.0.0:9190

Server List
----------------------------
server-address	key
----------------------------
127.0.0.1:60858	server:1080ti
----------------------------

Queue Status
------------------------------
key      total  free  pending
------------------------------
1080ti   1      1     0      
------------------------------
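
A check that gets closer to what the failing set_task call actually does would be to request a session directly from the tracker. A sketch, assuming the public tvm.rpc tracker API; host, port and device key match my setup above:

import tvm
from tvm import rpc

# Connect to the tracker and request a session on the '1080ti' key, which is
# roughly what autotvm's RPCRunner does before measuring. If this hangs or
# times out, it reproduces the "Cannot get remote devices" failure directly.
tracker = rpc.connect_tracker('0.0.0.0', 9190)
print(tracker.text_summary())

remote = tracker.request('1080ti', priority=1, session_timeout=60)
ctx = remote.context('rocm', 0)
print('rocm device present:', ctx.exist)
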
  6. However, the system is also not quite right:
prompt% uptime
 16:38:22 up 19:22,  0 users,  load average: 692.85, 693.10, 693.03

prompt% ps aux | grep tune_relay_cuda.py | wc -l
634

The symptoms appear as if the following happened: (a) autotuning works fine for an hour and a half; (b) some unknown condition or set of failures throws autotvm into debug mode; (c) once in debug mode, some recursive loop in the Python calls causes the process to be forked repeatedly; (d) at some point this leaves the script unable to contact the tracker/server. The tracker/server does still seem to be present, but perhaps the extra processes or something else has confused things.
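
For completeness, this is roughly the measure_option block the tutorial uses (values from memory, so treat it as a sketch); as far as I understand, the builder/runner timeouts and n_parallel are what bound how many measurement worker processes are forked at any one time:

from tvm import autotvm

# The builder forks local compile workers; the runner requests remote sessions
# via the tracker. n_parallel caps how many of these workers run concurrently
# (it has library-chosen defaults when left unset, as here).
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=10),
    runner=autotvm.RPCRunner(
        '1080ti',            # device key registered with the tracker
        '0.0.0.0', 9190,     # tracker host and port
        number=20, repeat=3,
        timeout=4,
        min_repeat_ms=150),
)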

I’m not quite sure how to troubleshoot this one. For example, is there a place to look to find those “too many autotvm failures” that seem to throw things into a different mode?
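
One thing I can think of: the log_to_file callback in tune_tasks appends every MeasureResult, including the failed ones, to the temporary log file, so counting records by error_no there should show what kind of failures are piling up. A sketch, assuming the tutorial's default temporary log name:

from collections import Counter
from tvm import autotvm

# Tally the per-measurement records written so far by autotvm.callback.log_to_file.
# error_no == 0 means a successful measurement; anything else is a failure code
# from MeasureErrorNo (see the note earlier in this post).
errors = Counter()
for inp, res in autotvm.record.load_from_file('resnet-18.log.tmp'):
    errors[res.error_no] += 1
print(errors)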

One additional data point: I ran this on a different GPU configuration, using the identical docker image and script. This time:

  1. Task 3/13 was tuned properly and did not go into debug mode.
  2. Task 6/13 showed similar initial symptoms: progress checkpoints reporting 0.00/0.00 GFLOPS, followed by a switch to debug mode for the rest of the run.
  3. Debug mode continued to the end of the run, and I received a set of tuning results.

At this point, I assume there are two different issues described here:

  1. The conv2d schedule I am using for topi_nn_conv2d is invalid for configuration 3/13 on this hardware, so I need to look at the definitions of those schedules. Switching to debug mode is a symptom of those invalid schedules. The next step is to look further at the schedules (see the sketch below for matching task 3/13 to its workload).
  2. There is an independent issue triggered when the 3/13 run switches to debug mode. Fixing the schedule might sidestep it, and it doesn’t happen every time, as evidenced by the 6/13 switch to debug mode on the other hardware, which did run to completion.
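
To start on the first item, here is a sketch for matching task 3/13 to its conv2d workload and config space, reusing the tasks list from the extraction sketch near the top of this post. Note that the tutorial's tune_tasks iterates over reversed(tasks), if I'm reading it right, so [Task 3/13] is likely the third entry in that reversed order:

# Dump each task's schedule template name, workload (shapes, strides, padding,
# dtype) and config-space size, in the same order the tuner reports them.
for i, tsk in enumerate(reversed(tasks), 1):
    print('Task %2d: %s' % (i, tsk.name))
    print('  workload:', tsk.workload)
    print('  config space size:', len(tsk.config_space))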