Very high CPU utilization of TVM inference vs. TensorFlow Lite

Hi All,

I am observing 100% CPU utilization on my ARM device when I run inference generated with TVM. On the other hand, I get only 50% CPU utilization when I run the same inference with TFLite.

Both TVM and TFLite achieve the same FPS. However, since TVM uses more CPU, I would expect it to be faster.

Any suggestions on how to reduce TVM's CPU utilization while still matching TFLite's speed?

How many threads are being used in both cases?

Both are running on the same device, and both use 3 threads.
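
For TVM I pin the thread count via the environment. A minimal sketch of how I do it (as far as I know, the TVM runtime thread pool reads the TVM_NUM_THREADS environment variable, so it has to be set before the runtime starts its workers):

```python
import os

# The TVM thread pool reads TVM_NUM_THREADS when it spins up its workers,
# so set it before the TVM runtime is initialized.
os.environ["TVM_NUM_THREADS"] = "3"

import tvm  # imported only after the variable is set
```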

Not sure if I understand the problem correctly, but if you are using “top” to measure the utilization, it might not be a good metric. You could have 100% utilization and still not be using your vector units efficiently.

What do you use to measure utilization? If it is “top”, 100% may simply mean that different threads are sharing the same CPU.

Thank you Janimesh.

I am using “top”, and all my cores are 100% utilized.

  1. Is it possible that TVM generates code that does not EFFICIENTLY use SIMD units?
  2. If so, how can I debug it?
  3. Is it possible to know which stage of my network takes the most time and CPU resources? What tools are available?

Thanks in advance.

  1. Is it possible that TVM generates code that does not EFFICIENTLY use SIMD units?

Highly possible :slight_smile: We write schedules that LLVM then compiles to machine code, and there are many places to mess up :slight_smile:

If so, how can I debug it?

The typical way to debug any performance bug is to first understand the limits (peak throughput) of your device, work out which instructions should show up in the assembly, and then double-check that the assembly looks reasonable. At least that's how I debug.
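
As a starting point, on recent TVM versions you can dump the assembly that LLVM generates and eyeball it. A minimal sketch for a toy vectorized kernel (adjust the target triple and -mattr flags for your board; your LLVM build must include the ARM backend, and `get_source("asm")` is available for LLVM targets):

```python
import tvm
from tvm import te

# Toy workload: a vectorizable elementwise add.
n = 1024
A = te.placeholder((n,), name="A", dtype="float32")
B = te.placeholder((n,), name="B", dtype="float32")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

s = te.create_schedule(C.op)
xo, xi = s[C].split(C.op.axis[0], factor=4)
s[C].vectorize(xi)

# Cross-compile for a 32-bit ARM CPU with NEON enabled.
target = "llvm -mtriple=armv7l-linux-gnueabihf -mattr=+neon"
lib = tvm.build(s, [A, B, C], target=target, name="vadd")

# Inspect the generated assembly; on a NEON-capable target you would
# hope to see q-register instructions such as `vadd.f32 q0, q1, q2`.
print(lib.get_source("asm"))
```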

Is it possible to know which stage of my network takes the most time and CPU resources? What tools are available?

Yes, there is the debug_runtime, which can give you an op-by-op breakdown.
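
A minimal sketch of its use, assuming `graph`, `lib`, and `params` come from an earlier `relay.build`, `input_data` is your input tensor, and TVM was built with the debug runtime enabled (`USE_GRAPH_RUNTIME_DEBUG ON` in config.cmake):

```python
import tvm
from tvm.contrib.debugger import debug_runtime as graph_runtime

# Drop-in replacement for the normal graph runtime; per-op timing and
# tensor dumps are written under dump_root.
ctx = tvm.cpu(0)
m = graph_runtime.create(graph, lib, ctx, dump_root="/tmp/tvmdbg")
m.set_input("data", input_data)  # "data" stands in for your input's name
m.set_input(**params)
m.run()  # prints an op-by-op time breakdown
out = m.get_output(0)
```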

Thank you janimesh.

Just two clarifications:

  1. I guess this (https://docs.tvm.ai/dev/debugger.html#how-to-use-debugger) is the tutorial on how to use debug_runtime?

  2. I apologize for my questions; I am new to TVM. At which stage of the TVM+LLVM compilation pipeline can I inspect the generated instructions to verify that they use ARM NEON intrinsics? Are there any tutorials for this? All I want to do is verify that the TVM-generated code uses ARM NEON instructions.
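
(A sketch of one way to check this, assuming a module built with an LLVM ARM target as in the example above, where `lib` is the result of `tvm.build`; the pattern below matches AArch32 NEON instructions, which operate on q registers:)

```python
import re

# Dump the assembly LLVM generated for the module and count NEON
# q-register instructions (e.g. `vadd.f32 q0, q1, q2`).
asm = lib.get_source("asm")
neon = [l for l in asm.splitlines()
        if re.search(r"\bv[a-z0-9]+\.[a-z0-9]+\s+q\d", l)]
print(f"{len(neon)} NEON q-register instructions found")

# Alternatively, export the compiled module and disassemble it offline:
#   lib.export_library("net.so", cc="arm-linux-gnueabihf-gcc")
#   arm-linux-gnueabihf-objdump -d net.so | grep -E ' q[0-9]+'
```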