[Android RPC] Stability Issues

I’ve been working on tuning models on a Galaxy S10 using the RPC but am having some difficulties keeping the app alive. Although everything goes smoothly initially and I can tune 1 or 2 layers, eventually the app crashes, sometimes even causing the phone to restart. I have quite a few workloads that need to be tuned so constantly babysitting the app isn’t a great option. Has anyone seen this behavior before or have any ideas for workarounds?

I’m still unable to run the RPC server consistently on Galaxy phones (for both CPU and GPU tuning), although it works fine on Pixels. Unfortunately, Pixel phones don’t have OpenCL, which is the target I’m interested in testing. @yzhliu, what OpenCL-enabled phone did you use when testing the Android RPC server?

cc @eqy. As far as I know, we do have some fault-tolerance schemes.

In my experience, Android RPC stability is a difficult beast to tame. Our current RPC tuning app uses a watchdog and separate process + activity (as a “workhorse”) for each kernel configuration to be tuned. However, this doesn’t seem to be 100% perfect in terms of crash isolation, especially when other components like the OpenCL driver are involved.

One workaround I was considering implementing is yet another level of watchdog. If you have a machine with a spare USB port and adb or similar tools installed, it’s probably feasible to write a small watchdog that periodically checks whether the RPC app has crashed on the phone and restarts it.
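A minimal sketch of such a host-side watchdog, under stated assumptions: the package and activity names below are placeholders for whatever the installed RPC app actually uses, the device shell is assumed to provide `pidof`, and relaunching the activity may or may not be enough to bring the RPC server back, depending on how the app restores its state. The helper names are hypothetical.

```python
import subprocess
import time

# Assumptions: replace these with the package/activity of the installed RPC app.
PACKAGE = "org.apache.tvm.tvmrpc"
ACTIVITY = PACKAGE + "/.MainActivity"


def adb(args, runner=subprocess.run):
    """Run an adb command on the host and return its stdout as text."""
    result = runner(["adb"] + list(args), capture_output=True, text=True)
    return result.stdout


def app_is_running(package, runner=subprocess.run):
    """Return True if `pidof` on the device reports a process for the package."""
    # pidof prints nothing when no process matches.
    return adb(["shell", "pidof", package], runner).strip() != ""


def check_and_restart(package, activity, runner=subprocess.run):
    """Restart the app if it is not running. Return True if a restart was issued."""
    if app_is_running(package, runner):
        return False
    adb(["shell", "am", "start", "-n", activity], runner)
    return True


def watch(package=PACKAGE, activity=ACTIVITY, interval_s=10.0):
    """Poll the device forever, restarting the app whenever it has gone away."""
    while True:
        if check_and_restart(package, activity):
            print("RPC app was not running; restart issued")
        time.sleep(interval_s)

# Usage (runs until interrupted): watch()
```

The `runner` parameter exists only so the logic can be exercised without a phone attached. This sketch does not handle the case where the phone itself reboots and adb loses the device.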

Maybe we need C++ API RPC.

Hi @jwfromm, I have encountered the same problem on a Samsung S8. Have you solved this?

Unfortunately I never resolved these stability issues. Hopefully we can do some refactoring of the RPC app soon.

We already have a C++ RPC. I think you could try that. I have used it successfully on an embedded Android system without a UI.

@FrozenGene @jwfromm Thanks for your replies. I will try the C++ RPC later.

Hi @FrozenGene, I tried the C++ RPC but got some errors.

First, I cross-compiled tvm_rpc and libtvm_runtime.so on my PC. Then I pushed everything to the Android device and tested tvm_rpc as follows:

Step 1

adb push tvm_rpc /data/local/tmp
adb push libtvm_runtime.so /data/local/tmp
adb shell
cd /data/local/tmp
./tvm_rpc server --host=0.0.0.0 --port=9000 --port-end=9090 --tracker=10.220.109.215:9190 --key=s8 

Step 2

python -m tvm.exec.rpc_tracker --port 9190
INFO:root:If you are running ROCM/Metal, fork will cause compiler internal error. Try to launch with arg ```--no-fork```
INFO:RPCTracker:bind to 0.0.0.0:9190

python -m tvm.exec.query_rpc_tracker --port 9190
Tracker address localhost:9190

Server List
----------------------------
server-address	key
----------------------------
10.220.109.157:60174	server:s8
----------------------------

Queue Status
---------------------------
key   total  free  pending
---------------------------
s8   1      1     0      
---------------------------

Step 3

python tune_relay_mobile_gpu.py

At first everything goes well, but after about 20 s I get the following error:

RuntimeError: Cannot get remote devices from the tracker. Please check the status of tracker by 'python -m tvm.exec.query_rpc_tracker --port [THE PORT YOU USE]' and make sure you have free devices on the queue status.

Step 4: I double-checked the tracker status

python -m tvm.exec.query_rpc_tracker --port 9190

Tracker address localhost:9190

Server List
----------------------------
server-address	key
----------------------------
10.220.109.157:60174	server:s8
----------------------------

Queue Status
---------------------------
key   total  free  pending
---------------------------
s8   1      0     0      
---------------------------
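For scripting this check before a tuning run, here is a small sketch that parses the text printed by `python -m tvm.exec.query_rpc_tracker` and reports whether a key has a free device. It assumes the table layout shown in this thread; the function names are hypothetical and not part of TVM. Note that `free` dropping to 0 while a tuning job holds the device is expected, so the check is most useful before the job starts.

```python
def parse_queue_status(text):
    """Parse the 'Queue Status' table into {key: {total, free, pending}}."""
    queues = {}
    in_queue = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Queue Status":
            in_queue = True
            continue
        # Skip everything before the table, blank lines, and dashed rules.
        if not in_queue or not stripped or set(stripped) == {"-"}:
            continue
        parts = stripped.split()
        # Skip the header row and anything that is not a 4-column data row.
        if len(parts) != 4 or parts[0] == "key":
            continue
        key, total, free, pending = parts
        queues[key] = {
            "total": int(total),
            "free": int(free),
            "pending": int(pending),
        }
    return queues


def has_free_device(text, key):
    """Return True if the tracker output lists at least one free device for key."""
    return parse_queue_status(text).get(key, {}).get("free", 0) > 0
```

The tracker output can be captured with `subprocess.run(..., capture_output=True, text=True)` and passed in as `text`.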

It seems that your tracker and device cannot establish a connection. Could you try restarting the tracker and the device’s RPC server and watch the result?

Hi @FrozenGene, I restarted the tracker and the device’s RPC server. Here is the result.

Before I ran tune_relay_mobile_gpu.py, the available-device query printed:

Tracker address localhost:9191

Server List
----------------------------
server-address	key
----------------------------
10.220.109.157:51874	server:s8
----------------------------

Queue Status
---------------------------
key   total  free  pending
---------------------------
s8   1      1     0      
---------------------------

After I ran tune_relay_mobile_gpu.py, the available-device query printed:

Tracker address localhost:9191

Server List
----------------------------
server-address	key
----------------------------
10.220.109.157:51874	server:s8
----------------------------

Queue Status
---------------------------
key   total  free  pending
---------------------------
s8   1      0     0      
---------------------------

While tuning is running, it is expected that the device is in use. However, the problem shows up in your screen capture, which contains many errors. I think we could do the following to narrow it down.

  1. Tune for the ARM CPU instead of OpenCL, which could help rule out an OpenCL problem.

  2. We could consider using C++11 std::future here: https://github.com/apache/incubator-tvm/blob/master/apps/cpp_rpc/rpc_server.cc#L197. I have noticed that for one team at my company, it only works well on some Android platforms when using libcxx and C++11 std::future.

@FrozenGene Thanks for your tips. I tried running tune_relay_arm.py, and it works well.

So, do I have to cross-compile the TVM runtime for Android with USE_OPENCL?

Yes. libtvm_runtime.so should contain OpenCL support if your target is OpenCL.

@FrozenGene Thanks again. Sorry to bother you so much.

I am trying to cross-compile the Android TVM runtime, but when I enable USE_OPENCL, CMake finds the host library instead:

Found OpenCL: /usr/lib/x86_64-linux-gnu/libOpenCL.so (found version "2.0") 

How can I make CMake find the Android libOpenCL.so? Is there any documentation on how to cross-compile with OpenCL?

Try this: https://cmake.org/cmake/help/v3.1/module/FindOpenCL.html
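The FindOpenCL module linked above documents two cache variables, `OpenCL_INCLUDE_DIR` and `OpenCL_LIBRARY`. Setting them explicitly should stop CMake from picking up the host library, assuming the build locates OpenCL through that module. A small sketch that assembles the extra arguments for a scripted build follows; the helper name and the example paths are hypothetical, and USE_OPENCL still has to be enabled however your build already does it.

```python
import shlex


def opencl_cmake_args(include_dir, library):
    """Build -D arguments that point FindOpenCL at a specific OpenCL copy."""
    return [
        f"-DOpenCL_INCLUDE_DIR={include_dir}",
        f"-DOpenCL_LIBRARY={library}",
    ]


# Hypothetical paths: headers on the host, and a libOpenCL.so built for
# (or pulled from) the Android device.
args = opencl_cmake_args("/path/to/opencl/include", "/path/to/android/libOpenCL.so")
print("cmake .. " + " ".join(shlex.quote(a) for a in args))
```

After reconfiguring, the "Found OpenCL:" line in the CMake output should show the Android library path rather than the one under /usr/lib/x86_64-linux-gnu.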

@FrozenGene OK, thank you very much! :+1::grinning:

Hi @FrozenGene, in order to use the master-branch TVM runtime (with asymmetric padding, #4511), I tried to build the latest cpp_rpc (master branch), but it failed:

main.cc:34:10: fatal error: '../../src/common/util.h' file not found
#include "../../src/common/util.h"
         ^~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
rpc_env.cc:39:10: fatal error: '../../src/common/util.h' file not found
#include "../../src/common/util.h"
         ^~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
In file included from rpc_server.cc:39:
./rpc_tracker_client.h:35:10: fatal error: '../../src/common/socket.h' file not found
#include "../../src/common/socket.h"
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Makefile:49: recipe for target 'tvm_rpc' failed
make: *** [tvm_rpc] Error 1

After copying src/common from TVM 0.6.0 into master and building again:

rpc_env.cc:167:28: error: use of undeclared identifier 'support'
    auto executed_status = support::Execute(cmd, &err_msg);
                           ^
rpc_env.cc:198:7: error: use of undeclared identifier 'support'
  if (support::EndsWith(file, ".so")) {
      ^
rpc_env.cc:204:9: error: use of undeclared identifier 'support'
    if (support::EndsWith(file, ".o")) {
        ^
rpc_env.cc:208:16: error: use of undeclared identifier 'support'
    } else if (support::EndsWith(file, ".tar")) {
               ^
rpc_env.cc:213:29: error: use of undeclared identifier 'support'
      int executed_status = support::Execute(cmd, &err_msg);
                            ^
5 errors generated.
In file included from rpc_server.cc:39:
./rpc_tracker_client.h:135:35: error: no type named 'TCPSocket' in namespace 'tvm::support'; did you mean 'common::TCPSocket'?
  void WaitConnectionAndUpdateKey(support::TCPSocket listen_sock,
                                  ^~~~~~~~~~~~~~~~~~
                                  common::TCPSocket
./../../src/common/socket.h:374:7: note: 'common::TCPSocket' declared here
class TCPSocket : public Socket {
      ^
In file included from rpc_server.cc:39:
./rpc_tracker_client.h:192:3: error: no type named 'TCPSocket' in namespace 'tvm::support'; did you mean 'common::TCPSocket'?
  support::TCPSocket ConnectWithRetry(int timeout = 60, int retry_period = 5) {
  ^~~~~~~~~~~~~~~~~~
  common::TCPSocket
./../../src/common/socket.h:374:7: note: 'common::TCPSocket' declared here
class TCPSocket : public Socket {
      ^
In file included from rpc_server.cc:39:
./rpc_tracker_client.h:238:3: error: no type named 'TCPSocket' in namespace 'tvm::support'; did you mean 'common::TCPSocket'?
  support::TCPSocket tracker_sock_;
  ^~~~~~~~~~~~~~~~~~
  common::TCPSocket
./../../src/common/socket.h:374:7: note: 'common::TCPSocket' declared here
class TCPSocket : public Socket {
      ^
In file included from rpc_server.cc:39:
./rpc_tracker_client.h:143:29: error: use of undeclared identifier 'poller'; did you mean 'poll'?
        support::PollHelper poller;
                            ^~~~~~
                            poll
/home/xj-zjd/work_space/self_work/tvm_code/android-toolchain-arm64-r18b/bin/../sysroot/usr/include/poll.h:41:5: note: 'poll' declared here
int poll(struct pollfd* __fds, nfds_t __count, int __timeout_ms);
    ^
In file included from rpc_server.cc:39:
./rpc_tracker_client.h:143:28: error: expected ';' after expression
        support::PollHelper poller;
                           ^
                           ;
./rpc_tracker_client.h:143:18: error: no member named 'PollHelper' in namespace 'tvm::support'
        support::PollHelper poller;
        ~~~~~~~~~^
./rpc_tracker_client.h:144:9: error: use of undeclared identifier 'poller'; did you mean 'poll'?
        poller.WatchRead(listen_sock.sockfd);
        ^~~~~~
        poll
/home/xj-zjd/work_space/self_work/tvm_code/android-toolchain-arm64-r18b/bin/../sysroot/usr/include/poll.h:41:5: note: 'poll' declared here
int poll(struct pollfd* __fds, nfds_t __count, int __timeout_ms);
    ^
In file included from rpc_server.cc:39:
./rpc_tracker_client.h:144:15: error: member reference base type 'int (struct pollfd *, nfds_t, int)' (aka 'int (pollfd *, unsigned int, int)') is not a structure or union
        poller.WatchRead(listen_sock.sockfd);
        ~~~~~~^~~~~~~~~~
./rpc_tracker_client.h:145:9: error: use of undeclared identifier 'poller'; did you mean 'poll'?
        poller.Poll(ping_period * 1000);
        ^~~~~~
        poll
/home/xj-zjd/work_space/self_work/tvm_code/android-toolchain-arm64-r18b/bin/../sysroot/usr/include/poll.h:41:5: note: 'poll' declared here
int poll(struct pollfd* __fds, nfds_t __count, int __timeout_ms);
    ^
In file included from rpc_server.cc:39:
./rpc_tracker_client.h:145:15: error: member reference base type 'int (struct pollfd *, nfds_t, int)' (aka 'int (pollfd *, unsigned int, int)') is not a structure or union
        poller.Poll(ping_period * 1000);
        ~~~~~~^~~~~
./rpc_tracker_client.h:146:14: error: use of undeclared identifier 'poller'
        if (!poller.CheckRead(listen_sock.sockfd)) {
             ^
./rpc_tracker_client.h:143:29: warning: expression result unused [-Wunused-value]
        support::PollHelper poller;
                            ^~~~~~
./rpc_tracker_client.h:195:24: error: expected ';' after expression
      support::SockAddr addr(tracker_addr_);
                       ^
                       ;
./rpc_tracker_client.h:195:16: error: no member named 'SockAddr' in namespace 'tvm::support'
      support::SockAddr addr(tracker_addr_);
      ~~~~~~~~~^
./rpc_tracker_client.h:195:25: error: use of undeclared identifier 'addr'
      support::SockAddr addr(tracker_addr_);
                        ^
./rpc_tracker_client.h:196:7: error: no type named 'TCPSocket' in namespace 'tvm::support'; did you mean 'common::TCPSocket'?
      support::TCPSocket sock;
      ^~~~~~~~~~~~~~~~~~
      common::TCPSocket
./../../src/common/socket.h:374:7: note: 'common::TCPSocket' declared here
class TCPSocket : public Socket {
      ^
In file included from rpc_server.cc:39:
./rpc_tracker_client.h:198:48: error: use of undeclared identifier 'addr'
      LOG(INFO) << "Tracker connecting to " << addr.AsString();
                                               ^
./rpc_tracker_client.h:199:24: error: use of undeclared identifier 'addr'
      if (sock.Connect(addr)) {
                       ^
./rpc_tracker_client.h:205:67: error: use of undeclared identifier 'addr'
      CHECK(period < timeout) << "Failed to connect to server" << addr.AsString();
                                                                  ^
./rpc_tracker_client.h:206:55: error: use of undeclared identifier 'addr'
      LOG(WARNING) << "Cannot connect to tracker " << addr.AsString()
                                                      ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
1 warning and 20 errors generated.
Makefile:49: recipe for target 'tvm_rpc' failed
make: *** [tvm_rpc] Error 1

When will you update the cpp_rpc code to fix this?

@zjd1988 try this: https://github.com/apache/incubator-tvm/pull/4725