TVM vs Manual optimization of a fixed CNN network

I need to accelerate the inference of a CNN (similar to LeNet) on an octa-core Cortex-A73 CPU.

The same inference is invoked a million times, once for each 4x4 pixel group in a 16-megapixel image. I plan to apply the following optimization techniques manually:

  1. Perform inter-layer optimizations by localizing producer-consumer operations within image tiles. The network structure is fixed.
  2. Create 8 threads, each processing 1/8 of the million inferences.
  3. Merge multiple inferences to extract maximum throughput from NEON.

TVM can handle the first optimization very efficiently, but for the other two techniques manual optimization seems the better option.

I am new to TVM and unsure whether to pick TVM or manual optimization for this specific application.
My only priority at present is to minimize the total inference time for the entire image. Please guide!

Have you tried autotuning the schedule with AutoTVM?
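Something along these lines would be the usual starting point (a rough sketch; `mod`, `params`, and the log-file name are placeholders for your own model, and it assumes you tune on the device itself rather than over RPC):

```python
from tvm import autotvm, relay

# Assumes `mod` and `params` come from a Relay importer for your network.
target = "llvm -mtriple=aarch64-linux-gnu -mattr=+neon"
tasks = autotvm.task.extract_from_program(
    mod["main"], target=target, params=params,
    ops=(relay.op.get("nn.conv2d"), relay.op.get("nn.dense")),
)

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=3),
)

# Tune each conv2d/dense workload and append the best configs to a log file.
for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(
        n_trial=1000,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("a73_tuning.log")],
    )

# Afterwards, build with the best configs found during tuning:
# with autotvm.apply_history_best("a73_tuning.log"):
#     graph, lib, params = relay.build(mod, target=target, params=params)
```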

What do you mean by “invoked a million times”? Are you referring to a convolution over a single image, or repeating the operations across multiple images?

Right. I need to invoke the inference over the entire image with significantly overlapping receptive fields. So “intra-inference” optimization is one aspect, but I also need to merge many inferences together to make better use of the cache.

So my initial thought is to group the layers whose parameters can sit in L1 together with an image tile (with overlapping receptive fields), and traverse the whole tile before moving on to the next set of layers for the same tile.
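From the tutorials, I think that idea maps onto `tile` plus `compute_at` in TVM's tensor expressions. A toy sketch I put together with a stand-in producer/consumer pair (not my real layers):

```python
import tvm
from tvm import te

# Two chained elementwise stages standing in for a producer/consumer layer pair.
A = te.placeholder((4096, 4096), name="A")
B = te.compute((4096, 4096), lambda i, j: A[i, j] * 2.0, name="B")  # "layer k"
C = te.compute((4096, 4096), lambda i, j: B[i, j] + 1.0, name="C")  # "layer k+1"

s = te.create_schedule(C.op)
# Tile the consumer so one tile's working set fits in L1 ...
io, jo, ii, ji = s[C].tile(C.op.axis[0], C.op.axis[1], x_factor=32, y_factor=32)
# ... and compute the producer inside each tile, so B never leaves the cache.
s[B].compute_at(s[C], jo)

print(tvm.lower(s, [A, C], simple_mode=True))
```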

Have you looked at the existing schedule templates in TVM? Many of them may already use variations of the strategies that you bring up here. If you have a graph declaration of your network (e.g., in terms of convolution operations), it will likely be possible to optimize this task automatically.
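For example, your points 2 and 3 map directly onto the `parallel` and `vectorize` schedule primitives that the ARM templates already apply (another toy sketch with made-up shapes):

```python
import tvm
from tvm import te

A = te.placeholder((4096, 4096), name="A")
B = te.compute((4096, 4096), lambda i, j: A[i, j] * 2.0, name="B")

s = te.create_schedule(B.op)
io, jo, ii, ji = s[B].tile(B.op.axis[0], B.op.axis[1], x_factor=32, y_factor=32)
s[B].parallel(io)    # spread outer tiles across the 8 cores (your point 2)
s[B].vectorize(ji)   # map the innermost loop onto NEON lanes (your point 3)

print(tvm.lower(s, [A, B], simple_mode=True))
```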

Is there any example where multiple inferences are fused together? If that is supported in TVM, it would be of great use to me.

Can you describe in more detail what multiple inferences mean here? E.g., you would expect about 1 million different 4x4 pixel regions for a stride-4 4x4 convolution over a 16 MPixel image anyway.

Let me explain a bit more about the implementation. I trained a network which reads a 16x16 pixel patch from a 16 MP image and predicts 4x4 output pixel values, so one inference generates a 4x4 output (16 pixels). The network is made of 4 conv layers and 1 FC layer. To generate the full-resolution image, the inference must therefore be invoked a million times, producing the 16M output pixels.

Of course, the 16x16 input patches of two neighboring 4x4 blocks overlap. So I want to merge the inference calculations of these two neighbors, as their inputs overlap heavily. You could also call it inter-inference optimization.

I think this may be possible in the existing framework if you can fuse the different inferences into a single model. It seems like you may be able to do this with the correct stride settings for each of the layers.
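As a sketch of that idea (channel counts and kernel sizes are made up, since I don't know your exact layers, and it assumes the conv layers use "same" padding): run the conv layers over the whole image and re-express the FC layer as a 16x16 convolution with stride 4, so each output position is exactly one of your original patch inferences:

```python
from tvm import relay

# Full 16 MP image instead of a single 16x16 patch; channel counts are made up.
data = relay.var("data", shape=(1, 1, 4096, 4096), dtype="float32")

x = data
for i in range(4):  # the four conv layers, assumed 3x3 / stride 1 / "same" padding
    w = relay.var("w%d" % i, shape=(8, 8 if i else 1, 3, 3))
    x = relay.nn.relu(relay.nn.conv2d(x, w, padding=(1, 1)))

# The FC layer re-expressed as a 16x16 convolution with stride 4: each output
# position corresponds to one original patch inference, and its 16 channels
# are that patch's 4x4 output block.
w_fc = relay.var("w_fc", shape=(16, 8, 16, 16))
out = relay.nn.conv2d(x, w_fc, strides=(4, 4), padding=(6, 6))  # (1, 16, 1024, 1024)

func = relay.Function(relay.analysis.free_vars(out), out)
```

A depth-to-space rearrangement of the (1, 16, 1024, 1024) output then reassembles the full 4096x4096 image. All of the overlapping conv work between neighboring patches is shared automatically, and the trainable parameters are unchanged, since the FC weights are just reshaped into the final conv kernel. (If your conv layers shrink the patch instead of padding it, the final kernel size and padding change accordingly.)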

I see. Let me check how I can do that without much of an increase in trainable parameters.

Thanks so much, egy!

If you use the C++ API, you can create multiple threads to run multiple inferences in parallel.
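The same pattern also works from Python, since calls into the TVM runtime release the GIL; each thread just needs its own runtime module. A rough sketch (model loading and tile extraction are left as placeholders):

```python
import threading
import numpy as np
import tvm
from tvm.contrib import graph_runtime

NUM_THREADS = 8  # also set TVM_NUM_THREADS=1 so the runtime's own
                 # thread pool does not oversubscribe the 8 cores

def worker(graph_json, lib, params, tiles, out, indices):
    # One module per thread: a single GraphModule is not thread-safe,
    # but independent modules can run concurrently.
    module = graph_runtime.create(graph_json, lib, tvm.cpu(0))
    module.set_input(**params)
    for idx, tile in zip(indices, tiles):          # tile: (1, 1, 16, 16) patch
        module.set_input("data", tvm.nd.array(tile))
        module.run()
        out[idx] = module.get_output(0).asnumpy()  # one 4x4 prediction

# graph_json, lib, params = ...  # the compiled model (placeholders)
# tiles = ...                    # the million 16x16 patches
# out = [None] * len(tiles)
# chunks = np.array_split(np.arange(len(tiles)), NUM_THREADS)
# threads = [threading.Thread(target=worker,
#                             args=(graph_json, lib, params,
#                                   [tiles[i] for i in c], out, c))
#            for c in chunks]
# for t in threads: t.start()
# for t in threads: t.join()
```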

@masahi I want to make better use of the L1 cache across the overlapping input patches; multithreading won't help with that.