Benchmark on HiKey 960 Mali GPU


#1

I am trying to run the mobile_gpu_imagenet_bench.py script on a HiKey 960 with a Mali GPU.

I am able to run the script; however, the performance is worse than when running the arm_cpu_imagenet_bench script. For instance, squeezenet takes only ~25 ms on CPU but 232 ms on GPU.
What might be the cause of this?

Thanks,
Omer


#2

If your time measurement is right, you likely need to figure out how to make the GPU-to-CPU memory copy faster.


#3

@ehsanmok, I am using the published benchmark, so I don't see how the measurement can be wrong, nor why I would have CPU-to-GPU copy issues, assuming the benchmark is right of course.


#4

I am not sure we have pretuned schedules for the GPU on the HiKey 960 board, so it may be using fallback schedules, which can cause a large performance regression (unless you have done tuning on your own). On the other hand, I think the ARM CPU is the same as on some other boards (e.g., the rpi3b), which would hit pretuned schedules.

If you are hitting the fallback schedules, I would be happy to walk you through the tuning process.


#5

@eqy, it would be great to receive your help with this issue. Thanks, Omer


#6

Can you confirm if fallback schedules are being used or not? AutoTVM is very explicit and should display many warning messages in this case.


#7

I am not sure how to test this. When I run the benchmark script on my host computer, I do not see any warnings. Should I also check for warnings on the HiKey? If so, where? Thanks.

Adding prints from host:
python3 mobile_gpu_imagenet_bench.py --rpc-key hikey --dtype float16
Network Name Mean Inference Time (std dev)
squeezenet_v1.1 93.12 ms (6.10 ms)
mobilenet 84.94 ms (4.12 ms)
resnet-18 259.28 ms (6.66 ms)

From device:
server --tracker=10.200.2.123:9190 --key=hikey
INFO:root:If you are running ROCM/Metal, fork will cause compiler internal error. Try to launch with arg --no-fork
INFO:RPCServer:bind to 0.0.0.0:9090
INFO:RPCServer:connection from ('10.200.2.123', 42034)
INFO:RPCServer:load_module /tmp/tmpjv7vse6k/squeezenet_v1.1.tar
INFO:RPCServer:Finish serving ('10.200.2.123', 42034)
INFO:RPCServer:connection from ('10.200.2.123', 42038)
INFO:RPCServer:load_module /tmp/tmp_n06g3tm/mobilenet.tar
INFO:RPCServer:Finish serving ('10.200.2.123', 42038)
INFO:RPCServer:connection from ('10.200.2.123', 42042)
INFO:RPCServer:load_module /tmp/tmpsvzycigd/resnet-18.tar
INFO:RPCServer:Finish serving ('10.200.2.123', 42042)
INFO:RPCServer:connection from ('10.200.2.123', 42048)
INFO:RPCServer:load_module /tmp/tmp1yb0g1qt/vgg-16.tar


#8

Okay, this makes sense, as the benchmark script targets the rk3399 by default if you do not specify a model. Since the rk3399 has many pretuned workloads, you probably won't see a warning.

Try changing the target to something like hikey960, and the warnings should appear. You can try this script to autotune for your board.
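For reference, this is roughly what the distinction looks like; a minimal sketch, assuming TVM-style space-separated target strings (the hikey960 model string, the aarch64 triple, and the helper below are illustrative assumptions, not the benchmark script's actual code):

```python
# Hypothetical target strings for the HiKey 960's Mali GPU. The device
# target selects the Mali OpenCL schedules and the tuned-config lookup
# key, while target_host must match the board's 64-bit ARM CPU so the
# generated host code can be linked on the device itself.
target = "opencl -device=mali -model=hikey960"   # assumed model string
target_host = "llvm -target=aarch64-linux-gnu"   # HiKey 960 runs aarch64 Linux

def target_model(target_str, default="rk3399"):
    """Return the -model=... value from a TVM-style target string,
    falling back to the default the benchmark script would use."""
    for tok in target_str.split():
        if tok.startswith("-model="):
            return tok.split("=", 1)[1]
    return default
```

With no `-model` flag, the tuned-config lookup falls back to the default model, which is why the rk3399 entries can be silently used on a different board.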


#9

Hi, in the script there was no option for any model other than rk3399, so I added hikey and hikey960 to the script and tested them, but both fail. Here is what I get:

python3 mobile_gpu_imagenet_bench.py --model hikey960 --rpc-key hikey
Network Name Mean Inference Time (std dev)
Traceback (most recent call last):
File "mobile_gpu_imagenet_bench.py", line 88, in <module>
evaluate_network(network, target, target_host, args.dtype, args.repeat)
File "mobile_gpu_imagenet_bench.py", line 44, in evaluate_network
rlib = remote.load_module(filename)
File "/home/omer/devel/git/ext_repos/tvm/python/tvm/rpc/client.py", line 132, in load_module
return base._LoadRemoteModule(self._sess, path)
File "/home/omer/devel/git/ext_repos/tvm/python/tvm/_ffi/_ctypes/function.py", line 185, in __call__
ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
File "/home/omer/devel/git/ext_repos/tvm/python/tvm/_ffi/base.py", line 72, in check_call
raise TVMError(py_str(_LIB.TVMGetLastError()))
tvm._ffi.base.TVMError: Except caught from RPC call: TVMCall CFunc Error:
Traceback (most recent call last):
File "/home/linaro/devel/git/tvm/python/tvm/_ffi/_ctypes/function.py", line 55, in cfun
rv = local_pyfunc(*pyargs)
File "/home/linaro/devel/git/tvm/python/tvm/rpc/server.py", line 50, in load_module
m = _load_module(path)
File "/home/linaro/devel/git/tvm/python/tvm/module.py", line 241, in load
_cc.create_shared(path + ".so", files)
File "/home/linaro/devel/git/tvm/python/tvm/contrib/cc.py", line 33, in create_shared
_linux_shared(output, objects, options, cc)
File "/home/linaro/devel/git/tvm/python/tvm/contrib/cc.py", line 58, in _linux_shared
raise RuntimeError(msg)
RuntimeError: Compilation error:
/usr/bin/ld: /tmp/tmpw7m7jho8/lib.o: Relocations in generic ELF (EM: 62)
/usr/bin/ld: /tmp/tmpw7m7jho8/lib.o: Relocations in generic ELF (EM: 62)
/tmp/tmpw7m7jho8/lib.o: error adding symbols: File in wrong format
collect2: error: ld returned 1 exit status
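For what it's worth, `Relocations in generic ELF (EM: 62)` means the board's linker was handed an object compiled for x86-64 (ELF machine type 62), while the HiKey's aarch64 `ld` expects type 183; in other words, the host-side code was not cross-compiled for the board. The usual fix in TVM is to pass a `target_host` matching the board (e.g. an aarch64 LLVM triple) so the object is built for the device rather than the x86-64 host. A small standalone check of an object file's ELF machine type (offsets per the ELF specification; the `.o` path is only an example):

```python
import struct

# ELF e_machine values from the ELF specification.
EM_X86_64 = 62    # what the error message says the .o was built for
EM_AARCH64 = 183  # what the HiKey 960's /usr/bin/ld expects

def elf_machine(path):
    """Read the e_machine field of a little-endian ELF file: a u16 at
    byte offset 18, right after the 16-byte e_ident and 2-byte e_type."""
    with open(path, "rb") as f:
        header = f.read(20)
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    return struct.unpack_from("<H", header, 18)[0]

# e.g. elf_machine("/tmp/tmpw7m7jho8/lib.o") == EM_X86_64 would confirm
# the object was generated for the x86-64 host rather than the board.
```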


#10

Also, adding some info:
the link you sent isn't working, but I found this -

and ran it (this is with n_trials=100). This is what I get:

key is hikey
Extract tasks…
will work on mobilenet
Tuning…
[Task 1/20] Current/Best: 2.74/ 2.80 GFLOPS | Progress: (100/100) | 205.28 s Done.
[Task 2/20] Current/Best: 11.22/ 11.47 GFLOPS | Progress: (100/100) | 392.94 s Done.
[Task 3/20] Current/Best: 1.65/ 2.52 GFLOPS | Progress: (100/100) | 100.83 s Done.
[Task 4/20] Current/Best: 12.11/ 16.14 GFLOPS | Progress: (100/100) | 242.84 s Done.
[Task 5/20] Current/Best: 0.23/ 1.84 GFLOPS | Progress: (100/100) | 107.47 s Done.
[Task 6/20] Current/Best: 4.36/ 7.65 GFLOPS | Progress: (100/100) | 456.12 s Done.
[Task 7/20] Current/Best: 0.93/ 3.49 GFLOPS | Progress: (100/100) | 175.89 s Done.
[Task 8/20] Current/Best: 5.49/ 9.59 GFLOPS | Progress: (100/100) | 385.73 s Done.
[Task 9/20] Current/Best: 0.67/ 1.72 GFLOPS | Progress: (100/100) | 155.50 s Done.
[Task 10/20] Current/Best: 5.14/ 18.34 GFLOPS | Progress: (100/100) | 491.44 s Done.
[Task 11/20] Current/Best: 3.36/ 3.94 GFLOPS | Progress: (100/100) | 206.38 s Done.
[Task 12/20] Current/Best: 12.02/ 15.60 GFLOPS | Progress: (100/100) | 283.43 s Done.
[Task 13/20] Current/Best: 1.25/ 1.79 GFLOPS | Progress: (100/100) | 169.96 s Done.
[Task 14/20] Current/Best: 4.71/ 15.23 GFLOPS | Progress: (100/100) | 277.51 s Done.
[Task 15/20] Current/Best: 2.54/ 5.50 GFLOPS | Progress: (100/100) | 223.71 s Done.
[Task 16/20] Current/Best: 9.27/ 21.75 GFLOPS | Progress: (100/100) | 274.30 s Done.
[Task 17/20] Current/Best: 2.02/ 2.20 GFLOPS | Progress: (100/100) | 228.14 s Done.
[Task 18/20] Current/Best: 12.11/ 18.86 GFLOPS | Progress: (100/100) | 299.72 s Done.
[Task 19/20] Current/Best: 2.74/ 4.31 GFLOPS | Progress: (100/100) | 272.48 s Done.
[Task 20/20] Current/Best: 2.66/ 13.36 GFLOPS | Progress: (100/100) | 322.90 s Done.
Compile…
Upload…
Evaluate inference time cost…
Mean inference time (std dev): 118.70 ms (1.99 ms)

Still slow. I am trying to run with more trials, but the HiKey crashes in the middle.
Once I get a good setting, how do I save it on the HiKey so that it will be used by default?

Thank you


#11

Yes, sorry, that was the link that I was meaning to send.
After tuning, you can append your logs to mali_v0.04.log in the tophub directory under ~/.tvm. After this, the best configuration will be used as long as you specify the same target that you did your tuning with.
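
Since AutoTVM logs are plain text with one record per line, appending them can be sketched as follows (the file names in the comment come from this thread; the helper function itself is hypothetical, not part of TVM):

```python
import os

def append_tuning_log(src_path, tophub_path):
    """Append AutoTVM tuning records (one record per line) to a TopHub
    log file so they are found as pretuned schedules on later runs."""
    with open(src_path) as src, open(tophub_path, "a") as dst:
        for line in src:
            line = line.strip()
            if line:  # skip blank lines between records
                dst.write(line + "\n")

# e.g., for the Mali log mentioned above:
# append_tuning_log("hikey960.tune.log",
#                   os.path.expanduser("~/.tvm/tophub/mali_v0.04.log"))
```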

One of the issues that may affect performance on the HiKey 960 is thermal throttling. Do you have an active cooling solution for the board?