I need to accelerate inference of a CNN (similar to LeNet) on an octa-core Cortex-A73 CPU.
The same inference is invoked about a million times, once for each 4x4 pixel group in a 16-megapixel image. I plan to apply the following optimization techniques manually:
- Perform inter-layer optimizations by localizing producer-consumer operations within tiles of the image. The network structure is fixed.
- Create 8 threads, each processing 1/8 of the million inferences.
- Merge multiple inferences into one batch to get maximum throughput from NEON.
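
To make the second and third techniques concrete, here is a rough sketch of what I have in mind. It is only an illustration under several assumptions: the names `chunk_for_thread`, `infer_batch` and `run_all` are hypothetical, the "network" is a stand-in single 16-input neuron rather than the real LeNet-like model, and the batch width of 4 assumes 32-bit floats in 128-bit NEON registers. The inner loop is plain C++ that a compiler may auto-vectorize; hand-written NEON intrinsics would replace it in a real implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Assumptions: 4 float lanes per 128-bit NEON register, 8 worker threads,
// 16 input pixels (one 4x4 group) per inference.
constexpr std::size_t kLanes = 4;
constexpr std::size_t kThreads = 8;
constexpr std::size_t kPixels = 16;

// Half-open range [begin, end) of inference indices owned by thread t.
// The chunk size is rounded up to a multiple of kLanes so that a batch
// never straddles two threads.
std::pair<std::size_t, std::size_t> chunk_for_thread(std::size_t total, std::size_t t) {
    std::size_t per = (total + kThreads - 1) / kThreads;
    per = (per + kLanes - 1) / kLanes * kLanes;
    const std::size_t begin = std::min(total, t * per);
    const std::size_t end = std::min(total, begin + per);
    return {begin, end};
}

// Stand-in for the real network: one 16-input neuron evaluated for kLanes
// inferences at once. Inputs are interleaved (pixel-major, lane-minor) so
// the innermost loop runs across lanes, one multiply-add per lane.
void infer_batch(const float* in, const float* w, float* out) {
    float acc[kLanes] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t p = 0; p < kPixels; ++p)
        for (std::size_t l = 0; l < kLanes; ++l)
            acc[l] += w[p] * in[p * kLanes + l];
    for (std::size_t l = 0; l < kLanes; ++l) out[l] = acc[l];
}

// Runs all inferences: each thread walks its own chunk in batches of kLanes.
// `in` holds total * kPixels floats, laid out batch after batch.
std::vector<float> run_all(const std::vector<float>& in,
                           const std::vector<float>& w,
                           std::size_t total) {
    assert(total % kLanes == 0);
    assert(in.size() == total * kPixels && w.size() == kPixels);
    std::vector<float> out(total);
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < kThreads; ++t) {
        const auto range = chunk_for_thread(total, t);
        const std::size_t begin = range.first;
        const std::size_t end = range.second;
        pool.emplace_back([&in, &w, &out, begin, end] {
            for (std::size_t i = begin; i < end; i += kLanes)
                infer_batch(&in[i * kPixels], w.data(), &out[i]);
        });
    }
    for (auto& th : pool) th.join();
    return out;
}
```

With 1,000,000 inferences this gives each thread a contiguous chunk of 125,000, and every call to `infer_batch` handles four pixel groups together. Threads write to disjoint slices of `out`, so no locking is needed.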
TVM can handle the first optimization very efficiently, but for the other two techniques, manual optimization seems the better option.
I am new to TVM and unsure whether to choose TVM or manual optimization for this specific application.
My only priority at present is the lowest possible inference time for the entire image. Any guidance would be appreciated.