How to profile speed in each layer with RPC?

haroldyag977 · February 15, 2019, 1:56am

Hi,

I'm trying to profile my model on my Android phone using the TVM debugger over RPC.

But it gets stuck in graph_runtime.create, which reports that a function was not found:
(AttributeError: Module has no function 'tvm.graph_runtime_debug.remote_create')

So does the TVM debugger currently not support RPC?

Is there any way to profile speed layer by layer on my Android device?

robeastbme · February 21, 2020, 1:28pm

Hello, I had a similar problem, although not on Android but on an embedded ARM/Linux board. Here is how it worked for me; maybe a similar solution works for you too:

If you want to be 100% sure to avoid inconsistencies, build the SAME TVM for the host and embedded target.
- Building for x86_64/Linux: follow this tutorial
- Building for ARM/Linux: I’ve used a cross-compilation approach:
  - Copy your main tvm folder into something like tvm-arm
  - Clear your build directory, but perhaps leave your config.cmake
  - Important to have the following options enabled for the embedded target:

set(USE_RPC ON)
set(USE_GRAPH_RUNTIME ON)
set(USE_GRAPH_RUNTIME_DEBUG ON)

Set the path to your cross-compiler in the shell you are going to invoke make from:

(Please adapt paths to correct location)
export CC=/my/path/to/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc
export CXX=/my/path/to/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu/bin/aarch64-linux-gnu-g++
cd build
cmake ..
make -j 8
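
Before kicking off the build, it can be worth double-checking that the three options listed above are really switched ON in the config.cmake you are building from. The following is only a minimal sketch: the helper name missing_options is made up for illustration, and it only inspects the text of config.cmake, so it will not notice options overridden on the cmake command line or in the CMake cache.

```python
import re

# Options this post says must be ON for the embedded target.
REQUIRED_OPTIONS = ("USE_RPC", "USE_GRAPH_RUNTIME", "USE_GRAPH_RUNTIME_DEBUG")

# Matches simple lines such as: set(USE_RPC ON)
SET_RE = re.compile(r"^\s*set\(\s*(\w+)\s+([^\s)]+)\s*\)")


def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are not set to ON in config.cmake text.

    Hypothetical helper for illustration; if an option is set more than once,
    the last assignment wins.
    """
    values = {}
    for line in config_text.splitlines():
        code = line.split("#", 1)[0]  # drop CMake comments
        match = SET_RE.match(code)
        if match:
            values[match.group(1)] = match.group(2).upper()
    return [name for name in required if values.get(name) != "ON"]


sample = """
set(USE_RPC ON)
set(USE_GRAPH_RUNTIME ON)
set(USE_GRAPH_RUNTIME_DEBUG OFF)  # not enabled yet
"""
print(missing_options(sample))  # ['USE_GRAPH_RUNTIME_DEBUG']
```

On a real tree you would pass in the contents of build/config.cmake (or wherever your config.cmake lives) instead of the sample string.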

Copy the whole tvm-arm directory, or at least the build directory (and the shared objects within), over to your embedded system.
Set the following environment variables on the embedded device in the shell you are going to start the TVM RPC server from:

(Please adapt paths to correct location)
export TVM_HOME=/home/root/tvm-arm
export PYTHONPATH=$TVM_HOME/python:$TVM_HOME/topi/python:${PYTHONPATH}
export LD_LIBRARY_PATH=/usr/lib:$LD_LIBRARY_PATH

Now you can start an RPC server on your embedded device as follows:

python -m tvm.exec.rpc_server --host 0.0.0.0 --port=9090 --no-fork # <- use the same port in your host application later
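
Once the server is running, a quick way to rule out plain network problems is to check from the host that the port accepts TCP connections at all. This is a rough sketch using only the Python standard library; the function name rpc_port_reachable is made up for illustration, and a successful connection only shows that something is listening on that port, not that it is a working TVM RPC server.

```python
import socket


def rpc_port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it does not speak the TVM RPC protocol, it only
    checks that the port accepts connections.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

From the host you would call it with the address of your board and the port passed to the server, e.g. rpc_port_reachable("192.168.1.50", 9090), where the IP address is just a placeholder.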

From your TVM host application, you need to import and use the debug runtime:

from tvm.contrib.debugger import debug_runtime as graph_runtime

Whenever you build the graph runtime, you invoke it as follows:

...
rtmodule = graph_runtime.create(graph, rlib, ctx, dump_root='/tmp/tvmdbg/')
...
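
For context, the host side around that call could look roughly like the sketch below. It assumes that graph, lib and params come from relay.build on the host, that the RPC server from above is reachable at host:port, and that the library is exported as a .tar archive (which, as far as I know, needs a compiler on the device to link it; adapt this to your setup). The helper name create_remote_debug_module and the file name net.tar are made up for illustration. The module names are the ones from the TVM version used in this post; newer releases may have renamed them.

```python
def create_remote_debug_module(graph, lib, params, host, port, lib_name="net.tar"):
    """Upload a compiled library to a TVM RPC server and wrap it in the debug runtime.

    Sketch only. Assumptions: graph, lib and params are the outputs of
    relay.build on the host, and an RPC server is listening at host:port.
    """
    # Imported lazily so the sketch can be read/imported without TVM installed.
    import os
    import tempfile

    from tvm import rpc
    from tvm.contrib.debugger import debug_runtime as graph_runtime

    # Export the cross-compiled library to a temporary file on the host.
    lib_path = os.path.join(tempfile.mkdtemp(), lib_name)
    lib.export_library(lib_path)

    # Connect to the device, upload the library and load it remotely.
    remote = rpc.connect(host, port)
    remote.upload(lib_path)
    rlib = remote.load_module(lib_name)
    ctx = remote.cpu(0)  # assumption: the model runs on the device CPU

    # Same call as in the post; dump_root is a directory on the host.
    rtmodule = graph_runtime.create(graph, rlib, ctx, dump_root="/tmp/tvmdbg/")
    rtmodule.set_input(**params)
    return rtmodule
```

After setting your actual input on the returned module, calling run() on it should produce the per-node report. If the device-side TVM was built without USE_GRAPH_RUNTIME_DEBUG, I would expect the create call to fail with a "has no function" error like the one in the first post.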

Now whenever you invoke run(), the debug runtime will give you some profiling information from the embedded device, e.g.:

Node Name               Ops                                                                  Time(us)   Time(%)  Start Time       End Time         Shape                Inputs  Outputs
---------               ---                                                                  --------   -------  ----------       --------         -----                ------  -------
1_NCHW1c                fuse___layout_transform___4                                          56.52      0.02     15:24:44.177475  15:24:44.177534  (1, 1, 224, 224)     1       1
_contrib_conv2d_nchwc0  fuse__contrib_conv2d_NCHWc                                           12436.11   3.4      15:24:44.177549  15:24:44.189993  (1, 1, 224, 224, 1)  2       1
relu0_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    4375.43    1.2      15:24:44.190027  15:24:44.194410  (8, 1, 5, 5, 1, 8)   2       1
_contrib_conv2d_nchwc1  fuse__contrib_conv2d_NCHWc_1                                         213108.6   58.28    15:24:44.194440  15:24:44.407558  (1, 8, 224, 224, 8)  2       1
relu1_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    2265.57    0.62     15:24:44.407600  15:24:44.409874  (64, 1, 1)           2       1
_contrib_conv2d_nchwc2  fuse__contrib_conv2d_NCHWc_2                                         104623.15  28.61    15:24:44.409905  15:24:44.514535  (1, 8, 224, 224, 8)  2       1
relu2_NCHW2c            fuse___layout_transform___broadcast_add_relu___layout_transform___1  2004.77    0.55     15:24:44.514567  15:24:44.516582  (8, 8, 3, 3, 8, 8)   2       1
_contrib_conv2d_nchwc3  fuse__contrib_conv2d_NCHWc_3                                         25218.4    6.9      15:24:44.516628  15:24:44.541856  (1, 8, 224, 224, 8)  2       1
reshape1                fuse___layout_transform___broadcast_add_reshape_transpose_reshape    1554.25    0.43     15:24:44.541893  15:24:44.543452  (64, 1, 1)           2       1
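
If you want to rank the nodes by time instead of reading the table by eye, the text output can be post-processed with a few lines of Python. This is only a sketch: it assumes the layout shown above (columns separated by two or more spaces, a dashed line under the header), which may differ between TVM versions, and the function names parse_profile and slowest are made up for illustration. The sample uses three rows copied from the table above, with the column padding shortened.

```python
import re


def parse_profile(report):
    """Parse the debug runtime's text table into a list of dicts, one per node.

    Assumes columns are separated by two or more spaces, as in the output above.
    """
    lines = [ln for ln in report.splitlines() if ln.strip()]
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = []
    for line in lines[1:]:
        fields = re.split(r"\s{2,}", line.strip())
        if set(fields[0]) == {"-"}:
            continue  # dashed separator under the header
        if len(fields) != len(header):
            continue  # skip anything that does not look like a table row
        row = dict(zip(header, fields))
        row["Time(us)"] = float(row["Time(us)"])
        rows.append(row)
    return rows


def slowest(rows, n=3):
    """Return the n rows with the largest Time(us), slowest first."""
    return sorted(rows, key=lambda r: r["Time(us)"], reverse=True)[:n]


sample = """
Node Name               Ops                           Time(us)   Time(%)  Start Time       End Time         Shape                Inputs  Outputs
---------               ---                           --------   -------  ----------       --------         -----                ------  -------
_contrib_conv2d_nchwc0  fuse__contrib_conv2d_NCHWc    12436.11   3.4      15:24:44.177549  15:24:44.189993  (1, 1, 224, 224, 1)  2       1
_contrib_conv2d_nchwc1  fuse__contrib_conv2d_NCHWc_1  213108.6   58.28    15:24:44.194440  15:24:44.407558  (1, 8, 224, 224, 8)  2       1
_contrib_conv2d_nchwc2  fuse__contrib_conv2d_NCHWc_2  104623.15  28.61    15:24:44.409905  15:24:44.514535  (1, 8, 224, 224, 8)  2       1
"""

for row in slowest(parse_profile(sample), n=2):
    print(row["Node Name"], row["Time(us)"])
```

For the sample rows this lists _contrib_conv2d_nchwc1 first and _contrib_conv2d_nchwc2 second, which matches the 58.28% and 28.61% shares in the table.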

Maybe it won’t work for you exactly like this, but the steps should be similar! Good luck! Cheers, Robert

misterBart · August 10, 2021, 8:57am

I also had the same problem. Check "Profiling Report C++" for the solution presented there (for both C++ and Python).