Issue when autotuning a network imported from TensorFlow

Hi, I’m trying to autotune a network defined in TensorFlow and ran into an issue with the local runner.

To test whether I can autotune a model defined in TensorFlow, I combined two tutorials: the autotuning function from “tune_relay_cuda.py” is added to “from_tensorflow.py”. Here are the tuning options I’m using.

tuning_option = {
    'log_filename': log_file,

    'tuner': 'xgb',
    'n_trial': 50,
    'early_stopping': 600,

    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(timeout=10),
        runner=autotvm.LocalRunner(number=20, repeat=3, timeout=4, min_repeat_ms=150),
        #runner=autotvm.RPCRunner(
        #    '1080ti',  # change the device key to your key
        #    '0.0.0.0', 9190,
        #    number=20, repeat=3, timeout=4, min_repeat_ms=150)
    ),
}

This is the output I’m getting. After this error, the tuning process hangs without making any progress.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[Task 34/50]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/50) | 0.00 s[New Thread 0x7ffe5fffb700 (LWP 4660)]
[Thread 0x7ffe5fffb700 (LWP 4660) exited]
[Task 34/50]  Current/Best:    0.00/ 248.20 GFLOPS | Progress: (48/50) | 125.58 sProcess Process-1:
Traceback (most recent call last):
  File "/home/sung/tvm/python/tvm/rpc/base.py", line 167, in connect_with_retry
    sock.connect(addr)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/sung/tvm/python/tvm/rpc/server.py", line 195, in _listen_loop
    raise exc
  File "/home/sung/tvm/python/tvm/rpc/server.py", line 175, in _listen_loop
    tracker_conn = base.connect_with_retry(tracker_addr)
  File "/home/sung/tvm/python/tvm/rpc/base.py", line 175, in connect_with_retry
    "Failed to connect to server %s" % str(addr))
RuntimeError: Failed to connect to server ('0.0.0.0', 9000)
Traceback (most recent call last):

  File "tune_tf.py", line 219, in <module>
    tune_tasks(tasks, **tuning_option)

  File "tune_tf.py", line 150, in tune_tasks
    autotvm.callback.log_to_file(tmp_log_file)

  File "/home/sung/tvm/python/tvm/autotvm/tuner/xgboost_tuner.py", line 90, in tune
    super(XGBTuner, self).tune(*args, **kwargs)

  File "/home/sung/tvm/python/tvm/autotvm/tuner/tuner.py", line 131, in tune
    results = measure_batch(inputs)

  File "/home/sung/tvm/python/tvm/autotvm/measure/measure.py", line 262, in measure_batch
    results = runner.run(measure_inputs, build_results)

  File "/home/sung/tvm/python/tvm/autotvm/measure/measure_methods.py", line 278, in run
    raise Exception(f'encountered exception during measurement: {results}')

Exception: encountered exception during measurement: [MeasureResult(costs=("Failed to connect to server ('0.0.0.0', 9000)",), error_no=7, all_cost=4, timestamp=1590363941.0128846)]
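
The first exception in the log is a ConnectionRefusedError, and the address that cannot be reached is ('0.0.0.0', 9000). One way to check whether anything is listening on that port is a plain TCP connect using only the Python standard library. This is just a diagnostic sketch: `port_is_open` is a hypothetical helper and not part of TVM, and probing 127.0.0.1 instead of 0.0.0.0 is an assumption about where the tracker would be reachable.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper for diagnosis only; it is not a TVM API.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing accepts connections on the given address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_is_open("127.0.0.1", 9000)` while tuning is running would show whether the port from the error message accepts connections at that moment.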

Here is the gdb backtrace after pressing Ctrl+C:

  (gdb) bt
    #0  0x00007ffff77d56d6 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7fff66fea000) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
    #1  do_futex_wait (sem=sem@entry=0x7fff66fea000, abstime=0x0) at sem_waitcommon.c:111
    #2  0x00007ffff77d57c8 in __new_sem_wait_slow (sem=0x7fff66fea000, abstime=0x0) at sem_waitcommon.c:181
    #3  0x00007fff72462d58 in ?? () from /usr/lib/python3.6/lib-dynload/_multiprocessing.cpython-36m-x86_64-linux-gnu.so
    #4  0x000000000050a635 in ?? ()
    #5  0x000000000050bfb4 in _PyEval_EvalFrameDefault ()
    #6  0x0000000000509758 in ?? ()
    #7  0x000000000050a48d in ?? ()
    #8  0x000000000050bfb4 in _PyEval_EvalFrameDefault ()
    #9  0x0000000000508e55 in _PyFunction_FastCallDict ()
    #10 0x0000000000594931 in ?? ()
    #11 0x000000000059fc4e in PyObject_Call ()
    #12 0x000000000050d356 in _PyEval_EvalFrameDefault ()
    #13 0x0000000000507d64 in ?? ()
    #14 0x00000000005090b7 in _PyFunction_FastCallDict ()
    #15 0x0000000000594931 in ?? ()
    #16 0x000000000054a941 in ?? ()
    #17 0x00000000005a9cbc in _PyObject_FastCallKeywords ()
    #18 0x000000000050a5c3 in ?? ()
    #19 0x000000000050bfb4 in _PyEval_EvalFrameDefault ()
    #20 0x0000000000507d64 in ?? ()
    #21 0x0000000000509a90 in ?? ()
    #22 0x000000000050a48d in ?? ()
    #23 0x000000000050bfb4 in _PyEval_EvalFrameDefault ()
    #24 0x0000000000507d64 in ?? ()
    #25 0x0000000000588dcd in ?? ()
    #26 0x000000000059fc4e in PyObject_Call ()
    #27 0x00000000005de47d in ?? ()
    #28 0x0000000000637df4 in Py_FinalizeEx ()
    #29 0x0000000000638e95 in Py_Main ()
    #30 0x00000000004b0d00 in main ()

Any thoughts or advice would be greatly appreciated. Thank you!