TVM's get_output function is time-consuming with Mali openCL on RK3399

I am running into very long processing times for the get_output function when I deploy a MobileNet-SSD model on an RK3399 board with the OpenCL target. Here is the relevant part of my script:

import time
import tvm
import nnvm.compiler
from tvm.contrib import graph_runtime

m = graph_runtime.create(graph, lib, ctx)
m.set_input(**params)

begin = time.time()
m.set_input('data', tvm.nd.array(x.astype(dtype)))
m.run()
end = time.time()
print('Run time: ' + str(end - begin))

_, outshape = nnvm.compiler.graph_util.infer_shape(graph, shape={"data": dshape})

begin2 = time.time()
tvm_output_0 = m.get_output(0, tvm.nd.empty(tuple(outshape[0]), 'float32'))  # shape: (1, 21, 1917)
tvm_output_1 = m.get_output(1, tvm.nd.empty(tuple(outshape[1]), 'float32'))  # shape: (1, 7668)
tvm_output_2 = m.get_output(2, tvm.nd.empty(tuple(outshape[2]), 'float32'))  # shape: (1, 1917, 4)
end2 = time.time()
print('get output time: ' + str(end2 - begin2))

begin3 = time.time()
final_output = get_multibox_detection_output_tvm(tvm_output_0, tvm_output_1, tvm_output_2, 1, num_anchor, 21)
end3 = time.time()
print('get_multibox_detection_output_tvm time: ' + str(end3 - begin3))

And the output:

  • Run time: 0.02s
  • get output time: 1.95s
  • get_multibox_detection_output_tvm time: 1s

This was reported here (979). Is there any update on this issue, or is there another way to get results from TVM?

OpenCL kernels are lazily executed, so your timing of run alone tells you nothing.
We only sync the stream when calling get_output.

To measure time correctly, we have to time run and get_output together, and do some warm-up runs first.
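For example, here is a minimal sketch of that kind of measurement, reusing m, x, dtype, and outshape from the script above (the warm-up count of 5 is an arbitrary choice):

import time
import tvm

m.set_input('data', tvm.nd.array(x.astype(dtype)))

# Warm-up runs so that one-time costs (kernel compilation, clock ramp-up)
# do not pollute the measurement.
for _ in range(5):
    m.run()
    m.get_output(0, tvm.nd.empty(tuple(outshape[0]), 'float32'))

begin = time.time()
m.run()
out0 = m.get_output(0, tvm.nd.empty(tuple(outshape[0]), 'float32'))
end = time.time()
# get_output forces the host to wait for all queued OpenCL kernels,
# so this interval covers the real GPU execution time.
print('run + get_output time: ' + str(end - begin))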


Actually I tested with multiple images, and this was the result after several loops.
I also tested this code on my PC with an NVIDIA GTX 1080; there the get_output time after several loops is only 0.0007s.

I believe that MobileNet-SSD cannot run in 50ms on the RK3399. You should time both run and get_output; that is the real time cost, which is 0.05 + 1.95s in your case.

By the way, if you use RPC, you should also exclude the time cost due to the network.

I checked again and the run time is indeed about 0.02s. Please note that this run time doesn't include the multibox_detection part, which runs on the CPU. So the real time cost is: run time + get_output time + multibox_detection time. You can check my full script here.

0.02s is impossible. You can see my benchmark results here: https://github.com/dmlc/tvm/wiki/Benchmark#mobile-gpu. Mobilenet with highly tuned kernels takes about 80ms on an RK3399 board.
So we can be sure that run is lazily executed.

In your script, you use 'opencl' as the target. Then TVM will use the kernels optimized for the CUDA backend.
For a Mali GPU, you should use "opencl -device=mali" as the target; we have a different kernel implementation optimized for Mali. Finally, to get the best performance, you have to follow the auto-tuning tutorial (https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_mobile_gpu.html).
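For reference, here is a minimal compilation sketch along these lines; it assumes an NNVM symbol net, the shape dict and params from your script, and cross-compiled host code for the RK3399's 64-bit ARM CPU (it is not your exact build script):

import tvm
import nnvm.compiler

target = tvm.target.mali()                        # equivalent to "opencl -device=mali"
target_host = "llvm -target=aarch64-linux-gnu"    # host-side code for the RK3399 CPU

with nnvm.compiler.build_config(opt_level=3):
    graph, lib, params = nnvm.compiler.build(
        net, target,
        shape={"data": dshape},
        params=params,
        target_host=target_host)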


Sorry, that script is for running on my PC. I use "target=tvm.target.mali, ctx = tvm.cl(0)" when deploying on the RK3399. I changed it to target="opencl -device=mali" but the time is the same. You can see the log here:

model compiled.

[11:11:47] /home/firefly/tvm_openCL_support_priorBox/src/pass/vectorize_loop.cc:303: Detect vector condition in Vectorized Loop, scalarizing…
[11:11:47] /home/firefly/tvm_openCL_support_priorBox/src/pass/vectorize_loop.cc:303: Detect vector condition in Vectorized Loop, scalarizing…
[11:11:47] /home/firefly/tvm_openCL_support_priorBox/src/pass/vectorize_loop.cc:303: Detect vector condition in Vectorized Loop, scalarizing…
[11:11:47] /home/firefly/tvm_openCL_support_priorBox/src/pass/vectorize_loop.cc:303: Detect vector condition in Vectorized Loop, scalarizing…
[11:11:47] /home/firefly/tvm_openCL_support_priorBox/src/pass/vectorize_loop.cc:303: Detect vector condition in Vectorized Loop, scalarizing…
[11:11:47] /home/firefly/tvm_openCL_support_priorBox/src/pass/vectorize_loop.cc:303: Detect vector condition in Vectorized Loop, scalarizing…
[11:11:47] /home/firefly/tvm_openCL_support_priorBox/src/pass/vectorize_loop.cc:303: Detect vector condition in Vectorized Loop, scalarizing…
[11:11:47] /home/firefly/tvm_openCL_support_priorBox/src/pass/vectorize_loop.cc:303: Detect vector condition in Vectorized Loop, scalarizing…

=======================================================

=======================================================
[11:11:54] /home/firefly/tvm_openCL_support_priorBox/src/runtime/opencl/opencl_device_api.cc:242: Initialize OpenCL platform 'ARM Platform'
[11:11:54] /home/firefly/tvm_openCL_support_priorBox/src/runtime/opencl/opencl_device_api.cc:251: No OpenCL device any device matched given the options: accelerator mode
[11:11:54] /home/firefly/tvm_openCL_support_priorBox/src/runtime/opencl/opencl_device_api.cc:269: opencl(0)='Mali-T860' cl_device_id=0x7f93d40dd8
run time: 46.999, getout_time: 2.03895, detection_time: 0.687, total time: 49.725
run time: 0.018, getout_time: 1.91659, detection_time: 1.128, total time: 3.063
run time: 0.02, getout_time: 1.93148, detection_time: 1.118, total time: 3.069
run time: 0.02, getout_time: 1.91593, detection_time: 1.093, total time: 3.029
run time: 0.019, getout_time: 1.93582, detection_time: 1.14, total time: 3.095
run time: 0.018, getout_time: 1.93465, detection_time: 1.119, total time: 3.072
run time: 0.018, getout_time: 1.97412, detection_time: 1.143, total time: 3.135
run time: 0.019, getout_time: 1.92758, detection_time: 1.077, total time: 3.024
run time: 0.019, getout_time: 1.92943, detection_time: 1.156, total time: 3.104
run time: 0.019, getout_time: 1.9279, detection_time: 1.135, total time: 3.082
run time: 0.02, getout_time: 1.93719, detection_time: 1.136, total time: 3.093
run time: 0.019, getout_time: 1.97357, detection_time: 1.145, total time: 3.138
run time: 0.018, getout_time: 1.92685, detection_time: 1.138, total time: 3.083
run time: 0.02, getout_time: 1.96257, detection_time: 1.133, total time: 3.116
run time: 0.019, getout_time: 1.95402, detection_time: 1.134, total time: 3.107
run time: 0.019, getout_time: 1.92661, detection_time: 1.147, total time: 3.093
run time: 0.02, getout_time: 1.9345, detection_time: 1.141, total time: 3.095
run time: 0.018, getout_time: 1.92304, detection_time: 1.135, total time: 3.076
run time: 0.018, getout_time: 1.9267, detection_time: 1.139, total time: 3.084
run time: 0.018, getout_time: 1.92407, detection_time: 1.135, total time: 3.077
run time: 0.02, getout_time: 1.93869, detection_time: 1.116, total time: 3.075
run time: 0.018, getout_time: 1.91643, detection_time: 1.153, total time: 3.087
run time: 0.019, getout_time: 1.92286, detection_time: 1.101, total time: 3.043
run time: 0.019, getout_time: 1.95793, detection_time: 1.113, total time: 3.09
run time: 0.018, getout_time: 1.92309, detection_time: 1.074, total time: 3.015
run time: 0.019, getout_time: 1.91833, detection_time: 1.079, total time: 3.016
run time: 0.02, getout_time: 1.92622, detection_time: 1.097, total time: 3.043
run time: 0.018, getout_time: 1.93256, detection_time: 1.096, total time: 3.047
run time: 0.018, getout_time: 1.91949, detection_time: 1.086, total time: 3.023
run time: 0.018, getout_time: 1.94478, detection_time: 1.083, total time: 3.046
run time: 0.018, getout_time: 1.96931, detection_time: 1.076, total time: 3.063
run time: 0.021, getout_time: 1.91958, detection_time: 1.111, total time: 3.052
run time: 0.018, getout_time: 1.99231, detection_time: 1.07, total time: 3.08
run time: 0.019, getout_time: 1.92992, detection_time: 1.071, total time: 3.02
run time: 0.018, getout_time: 2.04414, detection_time: 1.067, total time: 3.129
run time: 0.018, getout_time: 1.92568, detection_time: 1.083, total time: 3.027
run time: 0.018, getout_time: 1.9227, detection_time: 1.082, total time: 3.023
run time: 0.019, getout_time: 1.92214, detection_time: 1.095, total time: 3.036
run time: 0.018, getout_time: 1.95153, detection_time: 1.066, total time: 3.036
run time: 0.019, getout_time: 1.92528, detection_time: 1.075, total time: 3.019
run time: 0.019, getout_time: 1.93487, detection_time: 1.062, total time: 3.016
run time: 0.02, getout_time: 1.92637, detection_time: 1.077, total time: 3.023
run time: 0.02, getout_time: 1.93431, detection_time: 1.066, total time: 3.02
run time: 0.02, getout_time: 1.93129, detection_time: 1.054, total time: 3.005
run time: 0.022, getout_time: 1.93616, detection_time: 1.064, total time: 3.022
run time: 0.022, getout_time: 1.90587, detection_time: 1.088, total time: 3.016
run time: 0.019, getout_time: 1.92704, detection_time: 1.078, total time: 3.024
run time: 0.021, getout_time: 1.97368, detection_time: 1.069, total time: 3.064
run time: 0.019, getout_time: 1.9386, detection_time: 1.087, total time: 3.045
run time: 0.02, getout_time: 1.91252, detection_time: 1.051, total time: 2.984
run time: 0.018, getout_time: 1.94096, detection_time: 1.061, total time: 3.02
run time: 0.02, getout_time: 1.93779, detection_time: 1.053, total time: 3.011
run time: 0.019, getout_time: 1.97773, detection_time: 1.07, total time: 3.067
run time: 0.02, getout_time: 1.93013, detection_time: 1.069, total time: 3.019
run time: 0.018, getout_time: 1.94176, detection_time: 1.101, total time: 3.061
run time: 0.019, getout_time: 1.92462, detection_time: 1.079, total time: 3.023
run time: 0.022, getout_time: 1.9222, detection_time: 1.076, total time: 3.02
run time: 0.021, getout_time: 1.91133, detection_time: 1.048, total time: 2.98
run time: 0.019, getout_time: 1.91965, detection_time: 1.067, total time: 3.006
run time: 0.021, getout_time: 1.92235, detection_time: 1.105, total time: 3.048
run time: 0.021, getout_time: 1.92866, detection_time: 1.161, total time: 3.111
run time: 0.019, getout_time: 1.91241, detection_time: 1.086, total time: 3.017
run time: 0.02, getout_time: 1.93633, detection_time: 1.052, total time: 3.008
run time: 0.019, getout_time: 1.91661, detection_time: 1.073, total time: 3.009
run time: 0.019, getout_time: 1.90531, detection_time: 1.059, total time: 2.983
run time: 0.021, getout_time: 1.92044, detection_time: 1.067, total time: 3.008
run time: 0.038, getout_time: 1.92036, detection_time: 1.042, total time: 3.0
run time: 0.018, getout_time: 1.91986, detection_time: 1.055, total time: 2.993
run time: 0.019, getout_time: 1.95779, detection_time: 1.087, total time: 3.064
run time: 0.019, getout_time: 1.92522, detection_time: 1.068, total time: 3.012
run time: 0.02, getout_time: 1.93092, detection_time: 1.045, total time: 2.996
run time: 0.02, getout_time: 1.91828, detection_time: 1.059, total time: 2.997
run time: 0.019, getout_time: 1.9287, detection_time: 1.081, total time: 3.029
run time: 0.02, getout_time: 1.92668, detection_time: 1.068, total time: 3.015
run time: 0.02, getout_time: 1.94282, detection_time: 1.109, total time: 3.072
run time: 0.02, getout_time: 1.92563, detection_time: 1.064, total time: 3.01
run time: 0.038, getout_time: 1.92589, detection_time: 1.049, total time: 3.013
run time: 0.034, getout_time: 1.92938, detection_time: 1.051, total time: 3.014
run time: 0.018, getout_time: 1.977, detection_time: 1.078, total time: 3.073
run time: 0.018, getout_time: 1.931, detection_time: 1.053, total time: 3.002
run time: 0.02, getout_time: 1.93045, detection_time: 1.049, total time: 2.999
run time: 0.019, getout_time: 1.91125, detection_time: 1.068, total time: 2.998
run time: 0.018, getout_time: 1.93437, detection_time: 1.049, total time: 3.001
run time: 0.031, getout_time: 1.91328, detection_time: 1.065, total time: 3.009
run time: 0.019, getout_time: 1.9211, detection_time: 1.049, total time: 2.989
run time: 0.041, getout_time: 1.90842, detection_time: 1.058, total time: 3.007
run time: 0.036, getout_time: 1.92608, detection_time: 1.055, total time: 3.017
run time: 0.022, getout_time: 1.92604, detection_time: 1.047, total time: 2.995
run time: 0.045, getout_time: 1.92545, detection_time: 1.043, total time: 3.013
run time: 0.035, getout_time: 1.91599, detection_time: 1.026, total time: 2.977
run time: 0.021, getout_time: 1.96068, detection_time: 1.067, total time: 3.049
run time: 0.028, getout_time: 1.92769, detection_time: 1.066, total time: 3.022
run time: 0.035, getout_time: 1.94109, detection_time: 1.036, total time: 3.012
run time: 0.019, getout_time: 1.92071, detection_time: 1.051, total time: 2.991
run time: 0.019, getout_time: 1.92616, detection_time: 1.084, total time: 3.029
run time: 0.028, getout_time: 1.92764, detection_time: 1.061, total time: 3.017
run time: 0.018, getout_time: 1.93644, detection_time: 1.068, total time: 3.022
run time: 0.02, getout_time: 1.95426, detection_time: 1.045, total time: 3.019
run time: 0.045, getout_time: 1.91813, detection_time: 1.029, total time: 2.992
run time: 0.018, getout_time: 1.92311, detection_time: 1.041, total time: 2.982
run time: 0.018, getout_time: 1.91577, detection_time: 1.046, total time: 2.98
run time: 0.038, getout_time: 1.92422, detection_time: 1.048, total time: 3.01
run time: 0.021, getout_time: 1.97752, detection_time: 1.072, total time: 3.071
run time: 0.018, getout_time: 1.9339, detection_time: 1.045, total time: 2.997
run time: 0.02, getout_time: 1.93187, detection_time: 1.046, total time: 2.998
run time: 0.019, getout_time: 1.92066, detection_time: 1.075, total time: 3.015
run time: 0.019, getout_time: 1.92475, detection_time: 1.055, total time: 2.999
run time: 0.043, getout_time: 1.90995, detection_time: 1.047, total time: 3.0
run time: 0.03, getout_time: 1.91711, detection_time: 1.097, total time: 3.044
run time: 0.045, getout_time: 1.9019, detection_time: 1.061, total time: 3.008
run time: 0.037, getout_time: 1.91884, detection_time: 1.081, total time: 3.037
run time: 0.021, getout_time: 1.89945, detection_time: 1.075, total time: 2.995
run time: 0.036, getout_time: 1.92068, detection_time: 1.048, total time: 3.005
run time: 0.029, getout_time: 1.9183, detection_time: 1.053, total time: 3.0
run time: 0.019, getout_time: 1.94357, detection_time: 1.08, total time: 3.043
run time: 0.018, getout_time: 1.90703, detection_time: 1.04, total time: 2.965
run time: 0.018, getout_time: 1.95019, detection_time: 1.06, total time: 3.028
run time: 0.019, getout_time: 1.92096, detection_time: 1.078, total time: 3.018
run time: 0.019, getout_time: 1.92293, detection_time: 1.084, total time: 3.026
run time: 0.02, getout_time: 1.92731, detection_time: 1.099, total time: 3.046
run time: 0.018, getout_time: 1.96959, detection_time: 1.075, total time: 3.063
run time: 0.019, getout_time: 1.92001, detection_time: 1.088, total time: 3.027
run time: 0.018, getout_time: 1.97402, detection_time: 1.12, total time: 3.112
run time: 0.019, getout_time: 1.9416, detection_time: 1.05, total time: 3.011
run time: 0.021, getout_time: 1.92424, detection_time: 1.124, total time: 3.069

According to the output, the run time is mostly stable from the second loop onward, at about 0.02s. What is the input size of your mobilenet model? @merrymercy

"tvm.target.mali" is okay. It is the same as "opencl -device=mali".

As the contributor of the Mali backend, I am pretty sure you are wrong…

run only launches the kernels; it does not sync the stream.
We only sync the stream when we want to copy data from the GPU to the CPU (i.e. get_output into a CPU TVM array, or calling asnumpy on a GPU TVM array).
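To illustrate, here is a small sketch of the two cases that trigger a sync, using m and outshape from the script above:

# Either of these forces a device-to-host copy, which waits for all queued kernels:
out_cpu = m.get_output(0, tvm.nd.empty(tuple(outshape[0]), 'float32'))  # copy into a CPU array
out_np = m.get_output(0).asnumpy()                                      # or convert the device array to numpy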


Yes, I know you are a big contributor to TVM. I have already referenced your documents and scripts; something went wrong here…
Could you measure it yourself with my script? I already uploaded my models here.
Anyway, what is the input size of your mobilenet model?

The output of your script is correct and reasonable.
But the time of run does not reflect the cost of running all the OpenCL kernels.
So your conclusion that get_output is time-consuming is not correct.

When you call run, the CPU puts kernels into a queue but does not wait for them to finish.
Meanwhile, the GPU fetches commands from the queue and executes them.
When you want to get results back from the GPU (which is where you call get_output), you wait until all commands in that queue have finished. So the correct measurement should include run + get_output; this way we measure the running cost of all kernels.

The benchmark uses a 224x224x3 ImageNet input.


Try adding ctx.sync() after run().
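For instance, a sketch of where the call would go, assuming ctx is the tvm.cl(0) context used to create the runtime:

begin = time.time()
m.run()
ctx.sync()  # block until every queued OpenCL kernel has finished
end = time.time()
print('run time: ' + str(end - begin))  # now reflects actual kernel execution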

Thanks for your clear answer!
@masahi I added this; now the run time is about 1.95s and the get_output time is 0.005s.

Out of curiosity, is there a reason the time evaluator is not being used here?

That is the typical way we do performance measurement, as it allows customizing the number of trials/repetitions and also uses high-precision timers.
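Here is a minimal sketch of how it could be used with the graph runtime module m above (number=10 is an arbitrary choice, and this only times the graph execution, not the CPU-side multibox_detection post-processing):

ftimer = m.module.time_evaluator("run", ctx, number=10)
prof = ftimer()  # runs the graph 10 times and synchronizes internally
print('mean inference time: %g ms' % (prof.mean * 1000))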


@eqy Yes! time_evaluator is exactly what I need. Thank you!

@merrymercy Can we sync the stream before transferring data from the GPU to the CPU? This seems necessary for heterogeneous execution. Thanks.

Does ctx.sync() work for you?

I deployed my model (mobilenet_ssd_300x300) with RPC on the RK3399, but the processing time is still very bad (1884.47ms, std dev=6.09ms) compared to the recorded result (78.39ms, std dev=1.33ms). Do you know the reason? @merrymercy

  1. First make sure you can reproduce our benchmark results.
     We lock the frequency of the Mali GPU and close the GUI by:

sudo /etc/init.d/lightdm stop
sudo -i
echo performance > /sys/class/misc/mali0/device/devfreq/ff9a0000.gpu/governor

  2. Then you need auto-tuning for your own network.

@merrymercy Yes, I successfully reproduced your benchmark results (84.47ms for mobilenet and 181.75ms for resnet-18).
The model I deployed here doesn't include the multibox_detection part. I removed the multibox_prior part of this model and the result seems more reasonable (230.95ms, std dev=8.69ms). So maybe the current OpenCL schedule for multibox_prior is not good?

I think this is probably not a copy issue. I would suggest you measure each workload/op and see where the timing problem is. My problem was caused by inappropriate handling of a certain convolution shape; it was largely solved by PR #1709.
