Lots of key mismatches when autotuning with RPC

Hi,

I am autotuning the TVM testing MobileNet. The main application (autotuning loop and building) and the RPC tracker run on one server, while multiple RPC servers for remote execution and measurement run on two other physical servers with the same GPU model.

Occasionally I get this error, which then makes the current task fail:

  ...
  File "/usr/tvm/python/tvm/autotvm/measure/measure_methods.py", line 235, in get_build_kwargs
    remote = request_remote(self.key, self.host, self.port)
  File "/usr/tvm/python/tvm/autotvm/measure/measure_methods.py", line 535, in request_remote
    session_timeout=timeout)
  File "/usr/tvm/python/tvm/rpc/client.py", line 329, in request
    key, max_retry, str(last_err)))
RuntimeError: Cannot request k80 after 5 retry, last_error:Traceback (most recent call last):
  [bt] (4) /usr/tvm/build/libtvm.so(TVMFuncCall+0x65) [0x7f721515d935]
  [bt] (3) /usr/tvm/build/libtvm.so(+0x9b5eb4) [0x7f72151beeb4]
  [bt] (2) /usr/tvm/build/libtvm.so(+0x9b3677) [0x7f72151bc677]
  [bt] (1) /usr/tvm/build/libtvm.so(+0x9aeb74) [0x7f72151b7b74]
  [bt] (0) /usr/tvm/build/libtvm.so(+0x153863) [0x7f721495c863]
  File "/usr/tvm/src/runtime/rpc/rpc_socket_impl.cc", line 80
TVMError: URL server:9104 cannot find server that matches key=client:k80:0.7114652791680667 -timeout=60

I get a lot of these messages in the log of the RPC server:

mismatch key from ('10.0.0.15', 47672)
no incoming connections, regenerate key ...
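
To check whether the mismatches all come from one client address, a small script along these lines could tally the RPC server log. This is only a sketch: the regular expression assumes the exact message format shown above, `summarize_rpc_log` is a hypothetical helper name (not part of TVM), and the second mismatch line in the sample is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as: mismatch key from ('10.0.0.15', 47672)
MISMATCH_RE = re.compile(r"mismatch key from \('([^']+)', (\d+)\)")


def summarize_rpc_log(lines):
    """Return (mismatch count per client IP, number of key regenerations)."""
    mismatches = Counter()
    regenerations = 0
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            mismatches[match.group(1)] += 1
        elif "regenerate key" in line:
            regenerations += 1
    return mismatches, regenerations


# Sample input: the first two lines are from the log above,
# the third is invented to show the per-IP counting.
sample = [
    "mismatch key from ('10.0.0.15', 47672)",
    "no incoming connections, regenerate key ...",
    "mismatch key from ('10.0.0.15', 47680)",
]
per_ip, regenerated = summarize_rpc_log(sample)
print(dict(per_ip), regenerated)
```

Passing an open log file instead of `sample` works the same way, since the function only iterates over lines.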

However, this does not happen when I only use RPC servers on one physical server. Maybe all of these messages are related.

Does anyone know what I might do to fix this?

What is the reason that match keys expire? To prevent clients from hogging servers even if they’re not using them? If so, could the unmatch_timeout of the server be made configurable?

Has anyone had the issue before and knows how to fix it?

  1. Use a larger timeout.
  2. Connect the server and client device directly, without passing through the exchanger.
  I am not sure whether these will help, but you can try them.
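
For the first suggestion, one way to tolerate an occasional failed request in your own script might be to retry with a growing delay. This is a rough sketch, not TVM code: `request_with_backoff` is a hypothetical helper, and it only assumes that the failing call raises `RuntimeError`, as in the traceback above. It would not change how autotvm itself requests sessions internally.

```python
import time


def request_with_backoff(request_fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call request_fn() until it succeeds, doubling the delay after each failure."""
    last_err = None
    for attempt in range(attempts):
        try:
            return request_fn()
        except RuntimeError as err:  # error type taken from the traceback above
            last_err = err
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up after {attempts} attempts: {last_err}")


# Hypothetical usage with the tracker client (not executed here;
# tracker_host and tracker_port are placeholders for your setup):
#   from tvm import rpc
#   tracker = rpc.connect_tracker(tracker_host, tracker_port)
#   remote = request_with_backoff(
#       lambda: tracker.request("k80", session_timeout=120))
```

The `sleep` parameter is injectable only so the helper can be exercised without actually waiting.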
  1. Which timeout are you referring to? I think this is an issue with the server/tracker infrastructure, and neither tvm.rpc.tracker.Tracker nor tvm.rpc.server.Server has a timeout option. I am running everything using Docker, and the containers are connected using an overlay network.

  2. What do you mean by exchanger?

Is there any new information on this issue?

It seems this was an issue with the Docker overlay network connecting all containers in the swarm. When each RPC server runs as an individual service, as opposed to as replicas of the same service, the issue is resolved.
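
For anyone wanting to try the same layout, the idea of one service per RPC server could be scripted roughly as below. This is a sketch under assumptions: the service names, image name, network name and tracker address are all placeholders, not the values from my setup, and GPU access for the containers is not handled here. The script only prints the commands, it does not run them.

```python
import shlex


def rpc_service_commands(count, image, network, tracker, key):
    """Build one `docker service create` command per RPC server (one replica each)."""
    commands = []
    for index in range(count):
        args = [
            "docker", "service", "create",
            "--name", f"tvm-rpc-{index}",   # placeholder naming scheme
            "--network", network,
            "--replicas", "1",              # a separate service instead of N replicas
            image,
            "python3", "-m", "tvm.exec.rpc_server",
            f"--tracker={tracker}",
            f"--key={key}",
        ]
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands


# All argument values below are placeholders.
for command in rpc_service_commands(2, "my-tvm-image", "tvm-overlay", "tracker:9190", "k80"):
    print(command)
```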