Any clues on troubleshooting the following AutoTuning failure?
The scenario is as follows:
- I performed the following setup, running everything in a Docker container (https://hub.docker.com/r/mevermeulen/rocm-tvm:3.0):
  a. one process runs rpc_tracker
  b. one process runs rpc_server
  c. one process runs tutorials/autotvm/tune_relay_cuda.py with three changes: (i) use the rocm target, (ii) uncomment the tune_and_evaluate call, and (iii) also tune the dense node
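For reference, the three processes were started roughly like this (the host, port, and device key are from my setup and are assumptions; the key has to match what the tuning script requests):

```shell
# 1. Start the tracker (port 9190 is the one queried later in this report).
python3 -m tvm.exec.rpc_tracker --host 0.0.0.0 --port 9190 &

# 2. Start the server and register it with the tracker under key "1080ti".
python3 -m tvm.exec.rpc_server --tracker 127.0.0.1:9190 --key 1080ti &

# 3. Run the modified tutorial script in a third process.
python3 tutorials/autotvm/tune_relay_cuda.py
```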
- This scenario starts out working fine, and I get the following progress markers as things are tuned:
Extract tasks...
...68%, 0.08 MB, 196 KB/s, 0 seconds passed
...100%, 0.12 MB, 319 KB/s, 0 seconds passed
Tuning...
[Task 1/13] Current/Best: 134.59/ 236.82 GFLOPS | Progress: (256/2000) | 326.89 s
[Task 1/13] Current/Best: 130.54/ 389.97 GFLOPS | Progress: (512/2000) | 994.32 s
[Task 1/13] Current/Best: 110.37/ 389.97 GFLOPS | Progress: (832/2000) | 1677.04 s
[Task 1/13] Current/Best: 83.88/ 389.97 GFLOPS | Progress: (1088/2000) | 2155.99 s Done.
[Task 2/13] Current/Best: 159.34/ 602.14 GFLOPS | Progress: (256/2000) | 421.93 s
[Task 2/13] Current/Best: 267.45/ 602.14 GFLOPS | Progress: (512/2000) | 1255.12 s
[Task 2/13] Current/Best: 297.13/ 602.14 GFLOPS | Progress: (704/2000) | 1843.77 s Done.
[Task 3/13] Current/Best: 242.40/ 631.49 GFLOPS | Progress: (256/2000) | 585.56 s
[Task 3/13] Current/Best: 0.00/ 675.91 GFLOPS | Progress: (512/2000) | 1518.75 s
- At the next line, something has gone wrong, and I get the following output:
[Task 3/13] Current/Best: 0.00/ 675.91 GFLOPS | Progress: (704/2000) | 1928.00 s
WARNING:autotvm:Too many errors happen in the tuning. Now is in debug mode
DEBUG:autotvm:No: 705 GFLOPS: 0.00/675.91 result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1578441854.4124289) [('tile_f', [-1, 1, 64, 1]), ('tile_y', [-1, 2, 1, 2]), ('tile_x', [-1, 1, 4, 1]), ('tile_rc', [-1, 4]), ('tile_ry', [-1, 1]), ('tile_rx', [-1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],direct,None,434313
DEBUG:autotvm:No: 706 GFLOPS: 0.00/675.91 result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1578441854.4726968) [('tile_f', [-1, 1, 128, 1]), ('tile_y', [-1, 1, 1, 4]), ('tile_x', [-1, 1, 4, 1]), ('tile_rc', [-1, 4]), ('tile_ry', [-1, 1]), ('tile_rx', [-1, 1]), ('auto_unroll_max_step', 0), ('unroll_explicit', 1)],direct,None,435275
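For what it's worth, the "Too many errors" warning appears to come from a consecutive-error counter in the tuner loop that switches the logger to DEBUG once a threshold is crossed. A minimal sketch of that mechanism; the threshold of 150, the reset-on-success behavior, and all names here are my guesses, not TVM's actual code:

```python
import logging

logger = logging.getLogger("autotvm-sketch")

# Hypothetical threshold; the real constant lives somewhere in
# tvm/autotvm/tuner/tuner.py.
DEBUG_ERROR_THRESHOLD = 150

def update_error_count(error_ct, error_no):
    """Mimic the tuner's bookkeeping: error_no == 0 is a good measurement,
    anything else (e.g. 6 = timeout, as in the log above) is a failure."""
    if error_no == 0:
        return 0  # assume a success resets the consecutive-error counter
    return error_ct + 1

error_ct = 0
# One good measurement followed by a long run of timeouts, like the log shows.
for error_no in [0] + [6] * 151:
    error_ct = update_error_count(error_ct, error_no)
    if error_ct > DEBUG_ERROR_THRESHOLD:
        logger.warning("Too many errors happen in the tuning. Now is in debug mode")
        logger.setLevel(logging.DEBUG)
        break

print(error_ct)
```

If this is what happens, the burst of `error_no=6` timeouts above is enough to trip the switch; it does not by itself explain the forking, though.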
- This continues for a while, one config at a time, until autotvm crashes:
Traceback (most recent call last):
File "tune_relay_rocm.py", line 260, in <module>
tune_and_evaluate(tuning_option)
File "tune_relay_rocm.py", line 229, in tune_and_evaluate
tune_tasks(tasks, **tuning_opt)
File "tune_relay_rocm.py", line 210, in tune_tasks
autotvm.callback.log_to_file(tmp_log_file)])
File "/usr/local/lib/python3.6/dist-packages/tvm-0.6.0-py3.6-linux-x86_64.egg/tvm/autotvm/tuner/xgboost_tuner.py", line 86, in tune
super(XGBTuner, self).tune(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tvm-0.6.0-py3.6-linux-x86_64.egg/tvm/autotvm/tuner/tuner.py", line 108, in tune
measure_batch = create_measure_batch(self.task, measure_option)
File "/usr/local/lib/python3.6/dist-packages/tvm-0.6.0-py3.6-linux-x86_64.egg/tvm/autotvm/measure/measure.py", line 252, in create_measure_batch
attach_objects = runner.set_task(task)
File "/usr/local/lib/python3.6/dist-packages/tvm-0.6.0-py3.6-linux-x86_64.egg/tvm/autotvm/measure/measure_methods.py", line 211, in set_task
raise RuntimeError("Cannot get remote devices from the tracker. "
RuntimeError: Cannot get remote devices from the tracker. Please check the status of tracker by 'python -m tvm.exec.query_rpc_tracker --port [THE PORT YOU USE]' and make sure you have free devices on the queue status.
- At this point, I can query the tracker and it looks like it responds:
prompt% python3 -m tvm.exec.query_rpc_tracker --host 0.0.0.0 --port 9190
Tracker address 0.0.0.0:9190
Server List
----------------------------
server-address key
----------------------------
127.0.0.1:60858 server:1080ti
----------------------------
Queue Status
------------------------------
key total free pending
------------------------------
1080ti 1 1 0
------------------------------
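Since query_rpc_tracker answers, the tracker socket is clearly alive; what the crash complains about is one step further, actually leasing a device. As a first way to split the two, here is a plain TCP reachability check (stdlib only; the host and port are from my setup):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds.

    This only proves the tracker process is accepting connections; it does
    not prove a device can be leased, which is what set_task needs.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

On the failing host I would expect `can_connect("127.0.0.1", 9190)` to return True, matching the query output above. A stronger check, if I remember the API correctly, is `tvm.rpc.connect_tracker("127.0.0.1", 9190).request("1080ti")`, which exercises the same request path that `set_task` fails on.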
- However, the system is also not quite right:
prompt% uptime
16:38:22 up 19:22, 0 users, load average: 692.85, 693.10, 693.03
prompt% ps aux | grep tune_relay_cuda.py | wc -l
634
The symptoms appear as if the following happened: (a) autotuning works fine for about an hour and a half; (b) some unknown condition or set of failures threw autotvm into debug mode; (c) once in debug mode, some recursive loop in the Python calls repeatedly forks the process; (d) at some point this left the script unable to contact the tracker/server. The tracker/server does seem to be present, but perhaps the extra processes have confused things.
I’m not quite sure how to troubleshoot this one. For example, is there a place to look to find the “too many errors” condition that throws autotvm into debug mode?