[solved][AutoTVM] Cannot get remote devices from the tracker


#1

Hi, I am running the AutoTVM example for mobile GPU on Android. It runs for a while, but at some point I get a runtime error (I have turned on the debug log output and set n_trial = 20):

Extract tasks...
Tuning...
INFO:autotvm:Get devices for measurement successfully!
DEBUG:autotvm:No: 1	GFLOPS: 6.20/6.20	result: MeasureResult(costs=(0.0372740654,), error_no=0, all_cost=3.4236087799072266, timestamp=1558606325.1270492)	[('tile_bna', 16), ('tile_bnb', 2), ('tile_t1', [16, 4]), ('tile_t2', [16, 4]), ('c_unroll', [16, 4]), ('yt', 8)],winograd,None,17559
DEBUG:autotvm:No: 2	GFLOPS: 0.00/6.20	result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):\n  [bt] (3) /home/SERILOCAL/n.perto/Documents/tvm/build/libtvm.so(TVMFuncCall+0x61) [0x7fe0569097d1]\n  [bt] (2) /home/SERILOCAL/n.perto/Documents/tvm/build/libtvm.so(+0x9e486b) [0x7fe05694c86b]\n  [bt] (1) /home/SERILOCAL/n.perto/Documents/tvm/build/libtvm.so(+0x9d9db7) [0x7fe056941db7]\n  [bt] (0) /home/SERILOCAL/n.perto/Documents/tvm/build/libtvm.so(+0x172ab2) [0x7fe0560daab2]\n  File "/home/SERILOCAL/n.perto/Documents/tvm/src/runtime/rpc/rpc_session.cc", line 962\nTVMError: Check failed: code == RPCCode::kReturn: code=4',),), error_no=4, all_cost=9.246337175369263, timestamp=1558606330.5984135)	[('tile_bna', 16), ('tile_bnb', 16), ('tile_t1', [4, 16]), ('tile_t2', [32, 2]), ('c_unroll', [16, 4]), ('yt', 2)],winograd,None,7649
DEBUG:autotvm:No: 3	GFLOPS: 0.00/6.20	result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):\n  [bt] (3) /home/SERILOCAL/n.perto/Documents/tvm/build/libtvm.so(TVMFuncCall+0x61) [0x7fe0569097d1]\n  [bt] (2) /home/SERILOCAL/n.perto/Documents/tvm/build/libtvm.so(+0x9e486b) [0x7fe05694c86b]\n  [bt] (1) /home/SERILOCAL/n.perto/Documents/tvm/build/libtvm.so(+0x9d9db7) [0x7fe056941db7]\n  [bt] (0) /home/SERILOCAL/n.perto/Documents/tvm/build/libtvm.so(+0x172ab2) [0x7fe0560daab2]\n  File "/home/SERILOCAL/n.perto/Documents/tvm/src/runtime/rpc/rpc_session.cc", line 962\nTVMError: Check failed: code == RPCCode::kReturn: code=4',),), error_no=4, all_cost=19.06636929512024, timestamp=1558606340.6348014)	[('tile_bna', 2), ('tile_bnb', 16), ('tile_t1', [4, 16]), ('tile_t2', [1, 64]), ('c_unroll', [64, 1]), ('yt', 8)],winograd,None,15871
DEBUG:autotvm:No: 4	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606442.982189)	[('tile_bna', 2), ('tile_bnb', 8), ('tile_t1', [1, 64]), ('tile_t2', [16, 4]), ('c_unroll', [8, 8]), ('yt', 16)],winograd,None,23791
DEBUG:autotvm:No: 5	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606443.0251107)	[('tile_bna', 2), ('tile_bnb', 2), ('tile_t1', [8, 8]), ('tile_t2', [1, 64]), ('c_unroll', [32, 2]), ('yt', 2)],winograd,None,7256
DEBUG:autotvm:No: 6	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606443.0252697)	[('tile_bna', 8), ('tile_bnb', 1), ('tile_t1', [64, 1]), ('tile_t2', [8, 8]), ('c_unroll', [16, 4]), ('yt', 2)],winograd,None,7878
DEBUG:autotvm:No: 7	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606443.0372736)	[('tile_bna', 2), ('tile_bnb', 8), ('tile_t1', [1, 64]), ('tile_t2', [32, 2]), ('c_unroll', [16, 4]), ('yt', 16)],winograd,None,22391
DEBUG:autotvm:No: 8	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606443.0462408)	[('tile_bna', 16), ('tile_bnb', 1), ('tile_t1', [4, 16]), ('tile_t2', [32, 2]), ('c_unroll', [32, 2]), ('yt', 16)],winograd,None,21104
DEBUG:autotvm:No: 9	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606443.0463283)	[('tile_bna', 16), ('tile_bnb', 16), ('tile_t1', [32, 2]), ('tile_t2', [8, 8]), ('c_unroll', [16, 4]), ('yt', 1)],winograd,None,3024
DEBUG:autotvm:No: 10	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606443.0531096)	[('tile_bna', 8), ('tile_bnb', 4), ('tile_t1', [16, 4]), ('tile_t2', [16, 4]), ('c_unroll', [64, 1]), ('yt', 4)],winograd,None,10213
DEBUG:autotvm:No: 11	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606443.0574272)	[('tile_bna', 8), ('tile_bnb', 8), ('tile_t1', [32, 2]), ('tile_t2', [32, 2]), ('c_unroll', [32, 2]), ('yt', 16)],winograd,None,21043
DEBUG:autotvm:No: 12	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606443.0574942)	[('tile_bna', 8), ('tile_bnb', 4), ('tile_t1', [8, 8]), ('tile_t2', [64, 1]), ('c_unroll', [32, 2]), ('yt', 2)],winograd,None,6213
DEBUG:autotvm:No: 13	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606564.5806856)	[('tile_bna', 16), ('tile_bnb', 2), ('tile_t1', [32, 2]), ('tile_t2', [8, 8]), ('c_unroll', [16, 4]), ('yt', 4)],winograd,None,12809
DEBUG:autotvm:No: 14	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606564.592348)	[('tile_bna', 16), ('tile_bnb', 16), ('tile_t1', [2, 32]), ('tile_t2', [1, 64]), ('c_unroll', [8, 8]), ('yt', 1)],winograd,None,4874
DEBUG:autotvm:No: 15	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606564.5987117)	[('tile_bna', 1), ('tile_bnb', 16), ('tile_t1', [64, 1]), ('tile_t2', [32, 2]), ('c_unroll', [16, 4]), ('yt', 2)],winograd,None,7545
DEBUG:autotvm:No: 16	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606564.5987988)	[('tile_bna', 2), ('tile_bnb', 8), ('tile_t1', [8, 8]), ('tile_t2', [2, 32]), ('c_unroll', [32, 2]), ('yt', 1)],winograd,None,2191
DEBUG:autotvm:No: 17	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606564.6042001)	[('tile_bna', 8), ('tile_bnb', 1), ('tile_t1', [64, 1]), ('tile_t2', [2, 32]), ('c_unroll', [64, 1]), ('yt', 4)],winograd,None,10678
DEBUG:autotvm:No: 18	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606564.6101496)	[('tile_bna', 8), ('tile_bnb', 2), ('tile_t1', [1, 64]), ('tile_t2', [8, 8]), ('c_unroll', [64, 1]), ('yt', 8)],winograd,None,15383
DEBUG:autotvm:No: 19	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606564.6102304)	[('tile_bna', 2), ('tile_bnb', 1), ('tile_t1', [1, 64]), ('tile_t2', [16, 4]), ('c_unroll', [32, 2]), ('yt', 16)],winograd,None,21326
DEBUG:autotvm:No: 20	GFLOPS: 0.00/6.20	result: MeasureResult(costs=('',), error_no=7, all_cost=5, timestamp=1558606564.6102855)	[('tile_bna', 4), ('tile_bnb', 16), ('tile_t1', [1, 64]), ('tile_t2', [1, 64]), ('c_unroll', [8, 8]), ('yt', 8)],winograd,None,19597
DEBUG:autotvm:XGB load 20 entries from history log file
Traceback (most recent call last):
  File "tutorials/autotvm/tune_relay_mobile_gpu.py", line 358, in <module>
    tune_and_evaluate(tuning_option)
  File "tutorials/autotvm/tune_relay_mobile_gpu.py", line 314, in tune_and_evaluate
    tune_tasks(tasks, **tuning_opt)
  File "tutorials/autotvm/tune_relay_mobile_gpu.py", line 295, in tune_tasks
    autotvm.callback.log_to_file(tmp_log_file)])
  File "/home/SERILOCAL/n.perto/Documents/tvm/python/tvm/autotvm/tuner/xgboost_tuner.py", line 86, in tune
    super(XGBTuner, self).tune(*args, **kwargs)
  File "/home/SERILOCAL/n.perto/Documents/tvm/python/tvm/autotvm/tuner/tuner.py", line 108, in tune
    measure_batch = create_measure_batch(self.task, measure_option)
  File "/home/SERILOCAL/n.perto/Documents/tvm/python/tvm/autotvm/measure/measure.py", line 252, in create_measure_batch
    attach_objects = runner.set_task(task)
  File "/home/SERILOCAL/n.perto/Documents/tvm/python/tvm/autotvm/measure/measure_methods.py", line 212, in set_task
    raise RuntimeError("Cannot get remote devices from the tracker. "
RuntimeError: Cannot get remote devices from the tracker. Please check the status of tracker by 'python -m tvm.exec.query_rpc_tracker --port [THE PORT YOU USE]' and make sure you have free devices on the queue status.

I am running it locally with port forwarding, as mentioned here. I cannot connect the devices on the same network, so that is the only option for me.

Querying the tracker just after the crash outputs:

Tracker address localhost:9190

Server List
----------------------------
server-address	key
----------------------------
----------------------------

Queue Status
---------------------------
key   total  free  pending
---------------------------
   0      0     18     
---------------------------

And after a little while:

Tracker address localhost:9190

Server List
----------------------------
server-address	key
----------------------------
127.0.0.1:43289	server:
----------------------------

Queue Status
---------------------------
key   total  free  pending
---------------------------
   1      1     0      
---------------------------

Do you have any idea what could be causing the problem and what I can do to solve it?
Thanks


#2

This problem is likely caused by the device going offline temporarily after some bad configs. You can try increasing the timeout in the check_remote call here to see if that fixes the issue.
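
To illustrate why a longer timeout can help, here is a minimal sketch of a timeout-bounded check: a probe runs in a background thread, and the check passes only if the probe finishes within `timeout` seconds. The names `finishes_within` and `device_back_after` are hypothetical and exist only for this illustration. This is not TVM's code, and I am only assuming that check_remote follows a broadly similar pattern.

```python
import threading
import time


def finishes_within(probe, timeout):
    """Return True if probe() returns within `timeout` seconds.

    `probe` is a hypothetical stand-in for "request the device and wait
    until it responds"; it is not a TVM function.
    """
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)          # wait at most `timeout` seconds
    return not worker.is_alive()  # still running means the check timed out


def device_back_after(seconds):
    """Build a fake probe simulating a device that recovers after `seconds`."""
    def probe():
        time.sleep(seconds)
    return probe


if __name__ == "__main__":
    # The simulated device needs 0.3 s to come back online.
    print(finishes_within(device_back_after(0.3), timeout=0.1))  # timeout too short
    print(finishes_within(device_back_after(0.3), timeout=1.0))  # timeout long enough
```

The point of the sketch is that the check can only succeed if the timeout is longer than the time the device needs to reappear, so a device that stays offline longer than the timeout will always fail the check.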


#3

Hi,

I am trying to start the rpc_tracker. When I enter the following command,

python -m tvm.exec.rpc_proxy

it says

/usr/bin/python: No module named tvm.exec

Any help? Thanks!


#4

Do you have tvm in your PYTHONPATH?

In the future, please start a separate thread for questions unrelated to the thread topic.


#5

I have installed tvm using pip.

tvm is in /usr/local/lib/python3.6/dist-packages/

Do I have to add this to PATH?

EDIT: I haven’t set up PYTHONPATH, doing it now.


#6

Unfortunately, the tvm package on pip is unrelated to this project. You should install tvm from source: https://docs.tvm.ai/install/from_source.html
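
If it is unclear which tvm Python is picking up, a small standard-library check along these lines may help. It is only a sketch: `locate` is a hypothetical helper name, and it relies on `importlib.util.find_spec`, which imports the parent package when given a dotted name.

```python
import importlib.util


def locate(module_name):
    """Return the file a module would be loaded from, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ImportError:
        # Raised when a parent (e.g. `tvm`) is missing or is not a package.
        return None
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    for name in ("tvm", "tvm.exec", "tvm.exec.rpc_tracker"):
        print(name, "->", locate(name))
```

With a source build on PYTHONPATH, the reported paths should point into the checkout's python/tvm directory, like the paths in the traceback at the top of this thread. If `tvm` resolves but `tvm.exec` does not, a different package named tvm is probably being imported.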


#7

:joy: but with that version of TVM installed with pip, you’ll be able to check out how much interest you’ll get in 5 years in your savings account with:

tvm --pv=10000 --rate=5 --freq=12 --years=5 fv

#8

Thank you very much! Sorry for not following the instructions properly.


#9

What do you mean by bad configs?
Unfortunately, I have just tried increasing the timeout up to 5 minutes, without success.


#10

During the tuning process, the tuner will try many schedule configurations, some of which may crash the runtime on the device or time out. How long does the device take to come back up after going down? The timeout I linked to should be at least as long as this time.
You can also try bypassing the remote check as a temporary hack.
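
To estimate how long the device stays offline, one option is to poll the tracker status and time how long it takes for a server to reappear. The sketch below assumes the text layout printed by query_rpc_tracker earlier in this thread. `count_servers` and `seconds_until` are hypothetical helper names, and how you capture the status text (for example, from the query command's output) is left to you.

```python
import time


def count_servers(status_text):
    """Count entries in the 'Server List' section of the tracker status text."""
    lines = status_text.splitlines()
    try:
        start = next(i for i, l in enumerate(lines) if l.strip() == "Server List")
    except StopIteration:
        return 0
    rules_seen = 0
    count = 0
    for line in lines[start + 1:]:
        s = line.strip()
        if s and set(s) == {"-"}:      # a dashed separator line
            rules_seen += 1
            if rules_seen == 3:        # third rule closes the section
                break
        elif rules_seen == 2 and s:    # rows sit between the 2nd and 3rd rule
            count += 1
    return count


def seconds_until(condition, max_wait, interval=1.0):
    """Poll condition() until it is true; return elapsed seconds, or None on timeout."""
    start = time.monotonic()
    while True:
        if condition():
            return time.monotonic() - start
        if time.monotonic() - start >= max_wait:
            return None
        time.sleep(interval)
```

For example, `seconds_until(lambda: count_servers(fetch_status()) > 0, max_wait=600)` would give a rough recovery time, where `fetch_status` is a placeholder for whatever returns the current status text. The timeout should then be set comfortably above that value.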


#11

Removing the remote check seems to work for me. I now have a new issue at inference time, but that is the subject of a new post.
Thank you.