Remote time_evaluator output way too large

Hi there, I am trying to benchmark multiple deep learning compilers on the Raspberry Pi 4. For most of them I compiled the models into standalone bundles, but I was unable to do the same with TVM after auto-tuning or quantization (I do not know why).

Instead, I was told that I should use the remote runtime module's time_evaluator function. The problem is that its output is almost 10x higher than what I measured previously.

In the standalone bundles only the raw inference time is measured, and from the TVM documentation it looks like that should be the case for time_evaluator as well. But I get 6 ms instead of 0.8 ms per inference.
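
For reference, this is roughly how I call it (a minimal sketch along the lines of the arm_cpu tuning tutorial, not my exact script; the address, port, file name, input name and number/repeat values are placeholders, and graph/params come from the relay.build step):

    import numpy as np
    import tvm
    from tvm import rpc
    from tvm.contrib import graph_runtime

    # connect to the RPC server running on the Pi (address/port are placeholders)
    remote = rpc.connect("192.168.0.10", 9090)
    remote.upload("net.tar")                 # library exported with lib.export_library()
    rlib = remote.load_module("net.tar")

    # graph and params come from relay.build() on the host
    ctx = remote.cpu(0)
    module = graph_runtime.create(graph, rlib, ctx)
    module.set_input("data", tvm.nd.array(np.random.rand(1, 1, 28, 28).astype("float32")))
    module.set_input(**params)

    # time_evaluator runs the model number * repeat times and returns per-repeat means in seconds
    ftimer = module.module.time_evaluator("run", ctx, number=10, repeat=3)
    print("mean inference time: %.2f ms" % (np.mean(ftimer().results) * 1000))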

I found an issue on GitHub that seems to address wrong time_evaluator outputs, but according to that thread ARM CPUs should already be fixed, and the problem there was outputs that were too small. (https://github.com/apache/incubator-tvm/issues/5215)

Why is the result so different? And is there a way to deploy a standalone bundle without the full runtime? (see "AutoTVM or Quantization for Raspberry Pi fails")

I don't fully understand the 6 ms vs. 0.8 ms. Could you explain a bit more clearly and paste the time log (e.g. the auto-tuning time log or a per-layer time log; you can get the latter by turning on the debug graph runtime)?
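
(For the per-layer log, something like this sketch with the debug graph runtime should work; graph / rlib / ctx / data_tvm / params are whatever you already use with the normal graph runtime, and the dump_root path is just an example:)

    # rough sketch: swap the normal graph runtime for the debug one to get per-op timings
    from tvm.contrib.debugger import debug_runtime

    m = debug_runtime.create(graph, rlib, ctx, dump_root="/tmp/tvmdbg")
    m.set_input("data", data_tvm)
    m.set_input(**params)
    m.run()   # prints a per-layer time table and dumps traces under dump_root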

The 0.8 ms was measured with a non-auto-tuned floating point version of my network inside a TVM standalone bundle (see Bundle Deploy on GitHub).

As I was unable to use the same approach with a quantized or auto-tuned version of the network (see the other issue regarding this problem), I was recommended to use the RPC API for my benchmarking (GitHub issue).

I tried the RPC time_evaluator function and got 6 ms for the same network.

(I will add the time logs as soon as I can)

{"input": ["llvm -device=arm_cpu -target=arm-linux-gnu", "conv2d_NCHWc.x86", [["TENSOR", [1, 1, 28, 28], "float32"], ["TENSOR", [8, 1, 5, 5], "float32"], [1, 1], [2, 2, 2, 2], [1, 1], "NCHW", "NCHW", "float32"], {}], "config": {"index": 0, "code_hash": null, "entity": [["tile_ic", "sp", [-1, 1]], ["tile_oc", "sp", [-1, 1]], ["tile_ow", "sp", [-1, 1]], ["unroll_kw", "ot", true]]}, "result": [[0.0015573943999999998], 0, 3.8097798824310303, 1595230936.2957578], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["llvm -device=arm_cpu -target=arm-linux-gnu", "conv2d_nchw_spatial_pack.arm_cpu", [["TENSOR", [1, 1, 28, 28], "float32"], ["TENSOR", [8, 1, 5, 5], "float32"], [1, 1], [2, 2, 2, 2], [1, 1], "float32"], {}], "config": {"index": 2445, "code_hash": null, "entity": [["tile_co", "sp", [-1, 2]], ["tile_oh", "sp", [-1, 28]], ["tile_ow", "sp", [-1, 28]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["none", "none"]], ["ann_spatial", "an", ["none", "none", "vec"]]]}, "result": [[0.0028908516], 0, 3.4126908779144287, 1595231548.4120507], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["llvm -device=arm_cpu -target=arm-linux-gnu", "conv2d_NCHWc.x86", [["TENSOR", [1, 8, 14, 14], "float32"], ["TENSOR", [16, 8, 5, 5], "float32"], [1, 1], [2, 2, 2, 2], [1, 1], "NCHW", "NCHW", "float32"], {}], "config": {"index": 185, "code_hash": null, "entity": [["tile_ic", "sp", [-1, 2]], ["tile_oc", "sp", [-1, 2]], ["tile_ow", "sp", [-1, 7]], ["unroll_kw", "ot", false]]}, "result": [[0.0033298216], 0, 0.5387725830078125, 1595232164.824161], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["llvm -device=arm_cpu -target=arm-linux-gnu", "conv2d_nchw_spatial_pack.arm_cpu", [["TENSOR", [1, 8, 14, 14], "float32"], ["TENSOR", [16, 8, 5, 5], "float32"], [1, 1], [2, 2, 2, 2], [1, 1], "float32"], {}], "config": {"index": 631, "code_hash": null, "entity": [["tile_co", "sp", [-1, 2]], ["tile_oh", "sp", [-1, 7]], ["tile_ow", "sp", [-1, 14]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 9, 7, 8]], ["ann_reduce", "an", ["unroll", "unroll"]], ["ann_spatial", "an", ["none", "none", "none"]]]}, "result": [[0.0031675936], 0, 12.180986404418945, 1594992948.6131058], "version": 0.2, "tvm_version": "0.7.dev1"}

(The lines above are the missing log file for the basic AutoTVM run.)

According to your tuning log, the timing is not measured wrongly. Look at the result column, e.g. "result": [[0.0015573943999999998], ...], which says this workload executes in about 1.5 ms (the costs are recorded in seconds). Is the network you compared with the same one? We can't compare a floating point network against a quantized network. Another question: how did you get the 0.8 ms?
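
(If you want to double-check the numbers without reading the JSON by hand, something along these lines should work; "tuning.log" is a placeholder file name:)

    from tvm import autotvm

    # costs in the tuning log are seconds per run; print them in milliseconds
    for inp, res in autotvm.record.load_from_file("tuning.log"):
        if res.error_no == 0:
            print(inp.task.name, ["%.2f ms" % (c * 1000) for c in res.costs])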


OK, thank you. Now I understand the output a bit better, but why does the log say 1.5 ms here while time_evaluator returns 6 ms?

I got the 0.8 ms by checking the system time before and after executing one inference. The C function should be called getdatetime (or something similar). I can attach the C code as well if you want to see it.

OK, my test runs with quantization + AutoTVM and with quantization but without auto-tuning just finished, and it gets even stranger:

  • float, AutoTVM, time_evaluator = 6 ms

  • float, bundle, C time measurement = 0.8 ms

  • quant, AutoTVM, time_evaluator = 0.41 ms

  • quant, non-AutoTVM, time_evaluator = 0.47 ms

  • float, non-AutoTVM, time_evaluator = still evaluating (running for a couple of minutes now at 100,000 runs, while quant took ~60 seconds; see the note below the list)

It is obvious that I have made a mistake somewhere, but I am not sure where…

(times in ms; the leftmost results are measured in the C bundle)
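
(Note on the last run: as far as I understand, time_evaluator executes the model roughly number * repeat times, so the total run count is controlled by those two arguments; the values below are placeholders:)

    # placeholder values; total runs are roughly number * repeat (plus warm-up)
    ftimer = module.module.time_evaluator("run", ctx, number=100, repeat=10)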

1.5 ms is just this layer's performance: [["TENSOR", [1, 1, 28, 28], "float32"], ["TENSOR", [8, 1, 5, 5], "float32"]]. Keep in mind this is not the whole network's performance; for instance, adding up the best record for each of the two conv workloads in your log already gives roughly 1.6 ms + 3.2 ms ≈ 4.7 ms. Besides that, there may be layout transforms and other ops.


I am curious how the C bundle measures the time.

I am using the gettimeofday function provided by <sys/time.h>:

    struct timeval t_start, t_end;   /* from <sys/time.h> */

    gettimeofday(&t_start, 0);
    tvm_runtime_run(handle);         /* one inference through the bundled runtime */
    gettimeofday(&t_end, 0);

I am still confused about my AutoTVM results in the post above…

Maybe you have calculated the time duration wrongly? For example, maybe it is 8 ms, not 0.8 ms?

Well, I do it the same way as in the example, but timing only the inference call.

The calculation is:

    /* elapsed time in milliseconds */
    (t_end.tv_sec - t_start.tv_sec) * 1000 + (t_end.tv_usec - t_start.tv_usec) / 1000.f

And the result fits very well with the results of the other compilers I have tested.