I am autotuning the TVM Testing MobileNet. The main application (autotuning loop and building) and the RPC tracker run on one server, while multiple RPC servers for remote execution and measurement run on two other physical servers with the same GPU model.
Occasionally I get the following error, which causes the current task to fail:
    ...
      File "/usr/tvm/python/tvm/autotvm/measure/measure_methods.py", line 235, in get_build_kwargs
        remote = request_remote(self.key, self.host, self.port)
      File "/usr/tvm/python/tvm/autotvm/measure/measure_methods.py", line 535, in request_remote
        session_timeout=timeout)
      File "/usr/tvm/python/tvm/rpc/client.py", line 329, in request
        key, max_retry, str(last_err)))
    RuntimeError: Cannot request k80 after 5 retry, last_error:Traceback (most recent call last):
      [bt] (4) /usr/tvm/build/libtvm.so(TVMFuncCall+0x65) [0x7f721515d935]
      [bt] (3) /usr/tvm/build/libtvm.so(+0x9b5eb4) [0x7f72151beeb4]
      [bt] (2) /usr/tvm/build/libtvm.so(+0x9b3677) [0x7f72151bc677]
      [bt] (1) /usr/tvm/build/libtvm.so(+0x9aeb74) [0x7f72151b7b74]
      [bt] (0) /usr/tvm/build/libtvm.so(+0x153863) [0x7f721495c863]
      File "/usr/tvm/src/runtime/rpc/rpc_socket_impl.cc", line 80
    TVMError: URL server:9104 cannot find server that matches key=client:k80:0.7114652791680667 -timeout=60
I also get a lot of these messages in the log of the RPC server:
    mismatch key from ('10.0.0.15', 47672)
    no incoming connections, regenerate key ...
However, none of this happens when I only use RPC servers on a single physical server. Maybe all of these messages are related.
Does anyone know what I might do to fix this?
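One stopgap I can think of is to retry the failing request on the client side with a growing delay, so that a single mismatched key does not kill the whole task. The following is only a minimal sketch of that idea: call_with_retry is a hypothetical helper, not part of TVM, and it retries only RuntimeError because that is the exception type in the traceback above.

```python
import time


def call_with_retry(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or the attempts are exhausted.

    Hypothetical helper, not part of TVM. Only RuntimeError is retried,
    since that is the exception type raised in the traceback.
    """
    last_err = None
    for attempt in range(attempts):
        try:
            return fn()
        except RuntimeError as err:
            last_err = err
            # Back off before the next attempt: base_delay, 2x, 4x, ...
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: surface the last error unchanged.
    raise last_err
```

Here fn would be whatever callable ends up requesting the remote session, for example a wrapper around the request_remote call shown in the traceback. This only hides the symptom; it does not explain why the keys stop matching in the first place.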
What is the reason that match keys expire? Is it to prevent clients from hogging servers that they are not actually using? If so, could the unmatch_timeout of the server be made configurable?
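To make clear what I suspect is happening, here is a toy model of the race. It is not TVM code and is based only on my reading of the log messages: I assume the server replaces its match key whenever no client has connected within unmatch_timeout, so a client that received the old key from the tracker but connects too late gets rejected. The class name, the key format, and the 4-second timeout are all made up for illustration.

```python
class ToyServer:
    """Toy model of an RPC server that rotates its match key when idle.

    Not TVM code. Times are plain floats in seconds.
    """

    def __init__(self, unmatch_timeout):
        self.unmatch_timeout = unmatch_timeout
        self.generation = 0       # stands in for the random key suffix
        self.key_issued_at = 0.0  # time at which the current key was created

    def current_key(self, now):
        # Regenerate the key once for every full timeout that passed unclaimed.
        while now - self.key_issued_at > self.unmatch_timeout:
            self.key_issued_at += self.unmatch_timeout
            self.generation += 1
        return f"client:k80:{self.generation}"

    def accept(self, key, now):
        # A connection is accepted only if it presents the current key.
        return key == self.current_key(now)


server = ToyServer(unmatch_timeout=4.0)
key = server.current_key(now=0.0)      # key handed out via the tracker
fast = server.accept(key, now=3.0)     # client connects within the timeout
slow = server.accept(key, now=5.0)     # client connects after a key rotation
print(fast, slow)
```

If this model is roughly right, anything that delays the client between receiving the key and connecting (for example extra network latency to a second physical server) would produce exactly the mismatch messages above, and a larger unmatch_timeout would make them rarer.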