TFLite and TVM comparison for quantized models

Just want to share the performance of TFLite and TVM on TFLite pre-quantized models (models that have already been quantized using TensorFlow/TFLite). TFLite is faster than TVM for now. This thread can be used to discuss possible improvements, and we can keep updating the numbers as we come up with better schedules or Relay optimizations.

Setup: Raspberry Pi 4B (Rasp4b). Both TVM and TFLite are running with 4 threads. TVM kernels have already been tuned.

Image classification models

| Model | TFLite int8 (ms) | TVM int8 (ms) | Speedup over TFLite |
|---|---|---|---|
| inception_v1 | 99.97 | 106.52 | 0.93851 |
| inception_v2 | 132 | 146.48 | 0.90115 |
| inception_v3 | 333 | 412.72 | 0.80684 |
| inception_v4 | 700 | 896.02 | 0.78123 |
| mobilenet_v1 | 35.59 | 46.96 | 0.75788 |
| mobilenet_v2 | 33.08 | 34.42 | 0.96107 |

Object detection models

| Model | TFLite int8 (ms) | TVM int8 (ms) | Speedup over TFLite |
|---|---|---|---|
| ssd_coco_quantized | 56 | 99.61 | 0.56219 |
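For anyone who wants to re-derive the last column when the numbers get updated: it appears to be simply the TFLite latency divided by the TVM latency, so a value below 1 means TVM is slower. A minimal sketch using the latencies above (the helper name `speedup_over_tflite` is just for illustration):

```python
# Latencies in ms copied from the tables above: (TFLite int8, TVM int8).
latencies_ms = {
    "inception_v1": (99.97, 106.52),
    "inception_v2": (132, 146.48),
    "inception_v3": (333, 412.72),
    "inception_v4": (700, 896.02),
    "mobilenet_v1": (35.59, 46.96),
    "mobilenet_v2": (33.08, 34.42),
    "ssd_coco_quantized": (56, 99.61),
}


def speedup_over_tflite(tflite_ms, tvm_ms):
    """Ratio > 1 means TVM is faster, ratio < 1 means TVM is slower."""
    return tflite_ms / tvm_ms


speedups = {
    name: speedup_over_tflite(tflite_ms, tvm_ms)
    for name, (tflite_ms, tvm_ms) in latencies_ms.items()
}

for name, ratio in speedups.items():
    print(f"{name:20s} {ratio:.5f}")
```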

@FrozenGene @giuseros @ramana-arm @tqchen @jwfromm @thierry

If we are slow on 4 threads, I think we will be even slower compared with TFLite on 1 thread.

I think the reason has been discussed several times before, especially in @jackwish’s answer here: TF Lite quantized conv2d operator conversion. Your performance numbers are almost the same as our initial quantized performance (although we only recorded MobileNet V1 / V2). What @jackwish shared is our development experience of how to improve performance. I am almost sure the cause is the intermediate memory access. As @jackwish shared:

If we break these two steps into two operators, the first reads INT8 and writes INT32 into memory, and the second reads INT32 and writes INT8. In our test, this approach showed a significant performance drop. (I’d like to make it clear that we have tensorized step 1, which may prevent step 2 from fusion.) As soon as we merged them into one in the tensorize micro kernel, we got basically the same performance as QNNPACK. The difference here is whether there is INT32 intermediate memory access in the operator: if the computation is merged, the INT32 intermediate result (the accumulated result) can stay in registers.

So, to summarize, regarding @janimesh’s proposals, I think option 1 may get performance similar to TFLite, while option 2 is more capable of enabling powerful tensorize design.

This is why we created a single operator named q_conv2d to do all of the work.
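To make the memory-access argument concrete, here is a rough sketch in plain Python of a 1x1 quantized convolution done as two operators versus one merged operator. This is not our q_conv2d implementation; the function names, the single per-tensor `scale`, and the float-based `requantize` are all simplifying assumptions for illustration.

```python
def requantize(acc, scale, zero_point):
    """Map one INT32 accumulator to INT8 (assumed per-tensor float scale)."""
    q = int(round(acc * scale)) + zero_point
    return max(-128, min(127, q))


def conv1x1_two_ops(pixels, weights, scale, zero_point):
    # Operator 1: reads INT8, writes every INT32 accumulator to a buffer.
    acc_buffer = [sum(x * w for x, w in zip(pixel, weights)) for pixel in pixels]
    # Operator 2: reads the INT32 buffer back and writes INT8.
    out = [requantize(acc, scale, zero_point) for acc in acc_buffer]
    int32_bytes_written = 4 * len(acc_buffer)  # extra traffic vs. the merged version
    return out, int32_bytes_written


def conv1x1_merged(pixels, weights, scale, zero_point):
    out = []
    for pixel in pixels:
        acc = 0  # in a real micro kernel this accumulator would live in registers
        for x, w in zip(pixel, weights):
            acc += x * w
        out.append(requantize(acc, scale, zero_point))  # only INT8 leaves the loop
    return out


pixels = [[10, -20, 30], [127, 127, 127], [-128, 0, 5]]  # 3 pixels, 3 input channels
weights = [3, -2, 1]                                      # one output channel
two_ops_out, extra_bytes = conv1x1_two_ops(pixels, weights, 0.05, 0)
merged_out = conv1x1_merged(pixels, weights, 0.05, 0)
print(two_ops_out == merged_out, extra_bytes)
```

Both versions compute the same INT8 values; the only difference is whether the INT32 accumulators make a round trip through memory, which is where I believe the gap comes from.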

To reduce cache misses, we also changed the layout to NHWC, which we have talked about several times. This is really better for CPU compared with NCHW. (talked in

However, I haven’t upstreamed our implementation, because I hope our upcoming auto scheduler (very soon) can help us complete some of the work (like layout) and we can contribute our implementation on top of it. I don’t want you guys to think that we don’t want to contribute, so I wanted to explain.

Thanks @FrozenGene, I agree. We need better schedules. Currently, I am using NCHWc, which is better than NCHW, but might be slower than NHWC. Another major improvement should come from tensorization. Currently we are relying on LLVM.
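For readers less familiar with the three layouts, a small sketch of how each one maps a logical element (n, c, h, w) to a flat memory offset. The function names and the inner block size of 4 are assumptions for illustration; the point is only how far apart neighbouring channels of the same pixel end up in memory.

```python
def offset_nchw(n, c, h, w, C, H, W):
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    return ((n * H + h) * W + w) * C + c


def offset_nchwc(n, c, h, w, C, H, W, block):
    # Channels are split into C // block outer blocks of `block` inner channels.
    outer, inner = divmod(c, block)
    return (((n * (C // block) + outer) * H + h) * W + w) * block + inner


C, H, W, BLOCK = 8, 4, 4, 4  # assumed toy shape; C must be divisible by BLOCK

# Distance in elements between channel c and channel c + 1 of the same pixel.
nchw_step = offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
nhwc_step = offset_nhwc(0, 1, 0, 0, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W)
nchwc_inner_step = (offset_nchwc(0, 1, 0, 0, C, H, W, BLOCK)
                    - offset_nchwc(0, 0, 0, 0, C, H, W, BLOCK))
nchwc_block_step = (offset_nchwc(0, 4, 0, 0, C, H, W, BLOCK)
                    - offset_nchwc(0, 3, 0, 0, C, H, W, BLOCK))

print(nchw_step, nhwc_step, nchwc_inner_step, nchwc_block_step)
```

In NCHW the channels of one pixel are H*W elements apart, in NHWC they are adjacent, and in NCHWc they are adjacent only within a block, with a jump whenever the reduction crosses into the next block.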

For the INT8/INT32 memory bandwidth issue, the merging is already happening because of Relay fusion. Currently conv is fused with the 7-8 ops after it, which basically fuses conv2d + requantize. I think we can further micro-optimize this, but the structure is already there, so we would not need any new Relay/TVM feature.
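To give a feel for why requantize turns into a chain of small elementwise ops after the conv, here is a sketch of a generic fixed-point requantization on one INT32 accumulator. This is an assumption about what such a chain can look like (Q31 multiplier, round-half-up, no input zero point), not the exact list of ops that Relay emits; `quantize_multiplier` and `requantize_steps` are hypothetical names.

```python
import math


def quantize_multiplier(real_scale):
    """Split a positive real scale into a Q31 integer multiplier and a right shift."""
    mantissa, exponent = math.frexp(real_scale)  # real_scale == mantissa * 2**exponent
    multiplier = int(round(mantissa * (1 << 31)))
    if multiplier == (1 << 31):  # rounding pushed the mantissa up to 1.0
        multiplier //= 2
        exponent += 1
    return multiplier, -exponent


def requantize_steps(acc, multiplier, shift, out_zero_point):
    total_shift = 31 + shift
    x = acc * multiplier               # 1. widen and multiply (int64 in a real kernel)
    x = x + (1 << (total_shift - 1))   # 2. add the rounding constant
    x = x >> total_shift               # 3. arithmetic right shift
    x = x + out_zero_point             # 4. add the output zero point
    x = max(x, -128)                   # 5. clip to the INT8 lower bound
    x = min(x, 127)                    # 6. clip to the INT8 upper bound
    return x                           # 7. a real kernel would cast to int8 here


multiplier, shift = quantize_multiplier(0.05)
results = [requantize_steps(acc, multiplier, shift, 0) for acc in (100, 254, -379, 10000)]
print(results)
```

Each numbered step is cheap on its own, so once all of them are fused into the conv loop the INT32 accumulator never has to be written out, which is the structure I mean above.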

Looking forward to your auto-scheduler work. And I hope it can help int8 schedules as well.