How to profile speed in each layer with RPC?

Hi,

I am trying to profile my model on my Android phone using the TVM debugger over RPC.

But it gets stuck in graph_runtime.create and reports that a function was not found:
(AttributeError: Module has no function 'tvm.graph_runtime_debug.remote_create')

So does the TVM debugger currently not support RPC?

Is there any way to profile speed layer by layer on my Android device?

Hello, I had a similar problem, not on Android but on an embedded ARM/Linux board. Here is how I got it working; maybe a similar solution works for you too:

  • If you want to be 100% sure to avoid inconsistencies, build the SAME TVM (same source tree) for the host and the embedded target.
  • Building for x86_64/Linux: follow this tutorial
  • Building for ARM/Linux: I’ve used a cross-compilation approach:
    • Copy your main tvm folder into something like tvm-arm
    • Clear your build directory, but perhaps leave your config.cmake
    • It is important to have the following options enabled for the embedded target:
set(USE_RPC ON)
set(USE_GRAPH_RUNTIME ON)
set(USE_GRAPH_RUNTIME_DEBUG ON)
  • Set the path to your cross-compiler in the shell you are going to invoke cmake and make from:
(Please adapt paths to correct location)
export CC=/my/path/to/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc
export CXX=/my/path/to/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu/bin/aarch64-linux-gnu-g++
cd build
cmake ..
make -j 8
  • Copy the whole tvm-arm directory, or at least the build directory (and the shared objects within), to your embedded system
  • Set the following environment variables on the embedded device in the shell you are going to start the TVM RPC from:
(Please adapt paths to correct location)
export TVM_HOME=/home/root/tvm-arm
export PYTHONPATH=$TVM_HOME/python:$TVM_HOME/topi/python:${PYTHONPATH}
export LD_LIBRARY_PATH=/usr/lib:$LD_LIBRARY_PATH
  • Now you can start an RPC server on your embedded device as follows:
python -m tvm.exec.rpc_server --host 0.0.0.0 --port=9090 --no-fork # <- use the same port in the target application later
  • From your TVM host application, you need to import and use the debug runtime:
from tvm.contrib.debugger import debug_runtime as graph_runtime
  • Whenever you create the graph runtime, invoke it as follows:
...
rtmodule = graph_runtime.create(graph, rlib, ctx, dump_root='/tmp/tvmdbg/')
...
  • Now whenever you invoke run(), the debug runtime will give you some profiling information from the embedded device, e.g.:
Node Name               Ops                                                                  Time(us)   Time(%)  Start Time       End Time         Shape                Inputs  Outputs
---------               ---                                                                  --------   -------  ----------       --------         -----                ------  -------
1_NCHW1c                fuse___layout_transform___4                                          56.52      0.02     15:24:44.177475  15:24:44.177534  (1, 1, 224, 224)     1       1
_contrib_conv2d_nchwc0  fuse__contrib_conv2d_NCHWc                                           12436.11   3.4      15:24:44.177549  15:24:44.189993  (1, 1, 224, 224, 1)  2       1
relu0_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    4375.43    1.2      15:24:44.190027  15:24:44.194410  (8, 1, 5, 5, 1, 8)   2       1
_contrib_conv2d_nchwc1  fuse__contrib_conv2d_NCHWc_1                                         213108.6   58.28    15:24:44.194440  15:24:44.407558  (1, 8, 224, 224, 8)  2       1
relu1_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    2265.57    0.62     15:24:44.407600  15:24:44.409874  (64, 1, 1)           2       1
_contrib_conv2d_nchwc2  fuse__contrib_conv2d_NCHWc_2                                         104623.15  28.61    15:24:44.409905  15:24:44.514535  (1, 8, 224, 224, 8)  2       1
relu2_NCHW2c            fuse___layout_transform___broadcast_add_relu___layout_transform___1  2004.77    0.55     15:24:44.514567  15:24:44.516582  (8, 8, 3, 3, 8, 8)   2       1
_contrib_conv2d_nchwc3  fuse__contrib_conv2d_NCHWc_3                                         25218.4    6.9      15:24:44.516628  15:24:44.541856  (1, 8, 224, 224, 8)  2       1
reshape1                fuse___layout_transform___broadcast_add_reshape_transpose_reshape    1554.25    0.43     15:24:44.541893  15:24:44.543452  (64, 1, 1)           2       1
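If you capture that printed table as text, a few lines of Python are enough to rank the layers by time. This is only a minimal sketch: it assumes the columns are separated by at least two spaces, as in the output above, and the exact layout may differ between TVM versions. The names parse_profile and slowest are made up for this example and are not part of TVM.

```python
import re

# Three rows copied from the output above, plus the header and separator lines.
SAMPLE = """\
Node Name               Ops                                                                  Time(us)   Time(%)  Start Time       End Time         Shape                Inputs  Outputs
---------               ---                                                                  --------   -------  ----------       --------         -----                ------  -------
1_NCHW1c                fuse___layout_transform___4                                          56.52      0.02     15:24:44.177475  15:24:44.177534  (1, 1, 224, 224)     1       1
_contrib_conv2d_nchwc1  fuse__contrib_conv2d_NCHWc_1                                         213108.6   58.28    15:24:44.194440  15:24:44.407558  (1, 8, 224, 224, 8)  2       1
_contrib_conv2d_nchwc2  fuse__contrib_conv2d_NCHWc_2                                         104623.15  28.61    15:24:44.409905  15:24:44.514535  (1, 8, 224, 224, 8)  2       1
"""


def parse_profile(text):
    """Turn the printed profiling table into a list of dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        # Assumption: columns are separated by two or more spaces.
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) < 4:
            continue
        try:
            time_us = float(cols[2])
            time_pct = float(cols[3])
        except ValueError:
            continue  # header or separator line, not a data row
        rows.append({"node": cols[0], "op": cols[1],
                     "time_us": time_us, "time_pct": time_pct})
    return rows


def slowest(rows, n=3):
    """Return the n rows with the largest Time(us), slowest first."""
    return sorted(rows, key=lambda r: r["time_us"], reverse=True)[:n]


if __name__ == "__main__":
    for row in slowest(parse_profile(SAMPLE), n=2):
        print(row["node"], row["time_us"], row["time_pct"])
```

Sorting by Time(us) rather than reading the table top to bottom makes it easier to see which fused ops dominate the run.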

Maybe it won’t work for you exactly like this, but the steps should be similar! Good luck! Cheers, Robert
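P.S. For reference, the host-side steps (connect to the RPC server, upload the compiled library, create the debug runtime, run) could be put together roughly as below. Treat it as a sketch under assumptions: it targets the TVM API of this era (graph_runtime / debug_runtime; later releases renamed these modules, so adjust the imports), it assumes the model library was already cross-compiled and exported as a shared object, and profile_remote, lib_path and the host address are hypothetical names for this example.

```python
import os


def profile_remote(graph, lib_path, params, host, port=9090,
                   dump_root="/tmp/tvmdbg/"):
    """Run one profiled inference on a remote CPU via RPC (hypothetical helper).

    graph    -- graph JSON string produced by the TVM build step
    lib_path -- local path of the cross-compiled shared library
    params   -- dict of parameter name -> array, as returned by the build step
    """
    # Imported lazily so this file can be loaded on machines without TVM.
    from tvm import rpc
    from tvm.contrib.debugger import debug_runtime as graph_runtime

    # Connect to the rpc_server started on the device (same port as there).
    remote = rpc.connect(host, port)

    # Upload the compiled library and load it on the device.
    remote.upload(lib_path)
    rlib = remote.load_module(os.path.basename(lib_path))

    # Assumption: profiling on the device CPU.
    ctx = remote.cpu(0)

    # Create the debug runtime instead of the plain graph runtime.
    rtmodule = graph_runtime.create(graph, rlib, ctx, dump_root=dump_root)
    rtmodule.set_input(**params)

    # run() prints the per-node timing table.
    rtmodule.run()
    return rtmodule


# Example call (placeholders, adapt to your setup):
# profile_remote(graph, "net.so", params, host="192.168.0.10", port=9090)
```

If create() still fails with the "remote_create" error, the runtime on the device was most likely built without USE_GRAPH_RUNTIME_DEBUG.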

I also had the same problem. Check "Profiling Report C++" for the solution presented there (both C++ and Python).