Error while loading params to target in RPC session

TVMError: Socket SockChannel::Recv Error:Connection reset by peer

/home/kalyan/libraries/tvm/python/tvm/contrib/graph_runtime.py(164)set_input() -> self._get_input(k).copyfrom(params[k])

The remote terminal where the RPC server is running shows:

INFO:RPCServer:connection from ('192.168.1.29', 46908)

INFO:RPCServer:load_module /tmp/tmptb25g636/resnet50_3df.tar

corrupted size vs. prev_size

Hi All,

If I run the same code without module.set_input(**params), it runs without any crash, but the results are all zero because the weight param buffers are zero. I see the following comment in graph_runtime.py: "upload big arrays first to avoid memory issue in rpc mode". Is there a known memory issue when copying params in RPC mode?
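For context, the comment quoted above is about the order in which arrays are uploaded. Below is a minimal sketch of a largest-first ordering, assuming "big" means element count. The parameter names and shapes are hypothetical stand-ins, and this is not the graph_runtime.py code itself:

```python
import math

# Hypothetical parameter names mapped to their shapes (stand-ins for real arrays).
param_shapes = {
    "fc.bias": (1000,),
    "conv1.weight": (64, 3, 7, 7),
    "fc.weight": (1000, 2048),
}


def upload_order(shapes):
    """Return parameter names sorted so the largest arrays come first."""
    return sorted(shapes, key=lambda name: -math.prod(shapes[name]))


order = upload_order(param_shapes)
for name in order:
    print(name, math.prod(param_shapes[name]))
```

This only changes the upload order. It does not reduce the total amount of data sent over the RPC connection.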

I am stuck on this and cannot proceed further. Could you please help?

thanks

I’m also experiencing a similar issue but instead with the BYOC flow. Unlike your example, my params are bound to the graph as constants which my 3rd party codegen library will serialize as part of the module. I get memory corruption when I try to create a graph runtime with my compiled module on the remote.

I did a bit of digging into this and the issue is because the RPC server uses a ring buffer to read/write from/to the remote device. If the data you are writing is larger than the capacity of the buffer, it will overwrite previous data, causing memory corruption.
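To illustrate the failure mode I mean, here is a toy sketch, assuming a fixed-capacity ring buffer whose write path does not check for free space. The class and method names are made up for illustration and this is not TVM's actual implementation:

```python
class ToyRingBuffer:
    """Fixed-capacity byte ring that does NOT check for free space on write."""

    def __init__(self, capacity):
        self.data = bytearray(capacity)
        self.capacity = capacity
        self.head = 0  # next write position

    def write_unchecked(self, payload):
        # Each byte lands at head, which wraps around at capacity.
        for b in payload:
            self.data[self.head] = b
            self.head = (self.head + 1) % self.capacity


ring = ToyRingBuffer(capacity=8)
ring.write_unchecked(b"ABCDEFGH")  # exactly fills the buffer
intact = bytes(ring.data)
ring.write_unchecked(b"xyz")       # wraps around and clobbers the unread "ABC"
clobbered = bytes(ring.data)
print(intact, clobbered)
```

If nothing has consumed the first eight bytes before the second write, the reader sees corrupted data. This matches the pattern of small payloads working and large ones failing.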

I wonder if the capacity of this buffer can be increased manually?

cc @tqchen @comaniac @zhiics

I think this issue is not directly related to BYOC but the RPC mechanism.

cc @FrozenGene

Try setting a larger timeout (e.g. 1000). The default timeout is too short for this to complete.

Hi,

I tried 5000 (rpc.connect(host, port, session_timeout=5000)) and I am still getting the same error.

This also didn’t work for me. Is there another timeout somewhere else that should be set? session_timeout seems unrelated.

Thanks, @lhutton1. I also quickly tried the latest master, where I see that a couple of updates went into the RPC module, and I am still seeing the same issue.

Thanks, @FrozenGene. Could you please point me to the code in the TVM stack where I need to make the modification you suggested?

The timeout is set correctly. Hmm… I faced this issue before, and just setting the timeout solved it. Maybe @lhutton1’s investigation is correct.

This issue seems related: https://github.com/apache/incubator-tvm/issues/5514

@kalyan do you still see the issue using this commit? 9a8ed5b

It looks like there may be a potential fix: https://github.com/apache/incubator-tvm/pull/5516

Thanks @lhutton1, commit 9a8ed5b works fine for me.

I am having this issue too. I found that a very small model works, but the error appears once the model size is increased. I agree with Kalyan: commit 9a8ed5b seems to work for me. Thanks!

Please see if https://github.com/apache/incubator-tvm/pull/5516 fixes the problem.

Thanks! It works. Just make sure to restart your RPC tracker.

Hi All, thanks a lot for your support on this issue.

May I ask if TVM12.0 still encounters this error? How should I fix it?
TVMError:Socket SockChannel::Recv Error :Connect reset by peer. I am running on Android. In outputs = [module.get_output(i).asnumpy() for i in range(module.get_num_outputs())], only get_output(0) runs without an error; get_output(1) raises this error.