TFLite and TVM comparison for quantized models

Just want to share the performance of TFLite and TVM for TFLite pre-quantized models (models that have already been quantized using TensorFlow/TFLite). TFLite is faster than TVM for now. This thread can be used to discuss possible improvements, and we can keep updating the numbers as we come up with better schedules or Relay optimizations.

Setup - Raspberry Pi 4B - Both TVM and TFLite run with 4 threads. TVM kernels have already been tuned.

Image classification models

| Model | TFLite int8 (ms) | TVM int8 (ms) | Speedup over TFLite |
|--------------|--------|--------|---------|
| inception_v1 | 99.97  | 106.52 | 0.93851 |
| inception_v2 | 132    | 146.48 | 0.90115 |
| inception_v3 | 333    | 412.72 | 0.80684 |
| inception_v4 | 700    | 896.02 | 0.78123 |
| mobilenet_v1 | 35.59  | 46.96  | 0.75788 |
| mobilenet_v2 | 33.08  | 34.42  | 0.96107 |

Object detection models

| Model | TFLite int8 (ms) | TVM int8 (ms) | Speedup over TFLite |
|--------------------|-------|-------|---------|
| ssd_coco_quantized | 56    | 99.61 | 0.56219 |
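
For anyone trying to reproduce the TVM numbers, here is a minimal measurement sketch using the graph runtime's `time_evaluator`. It assumes `graph`, `lib`, `params` come from `relay.build`, that the model's input tensor is named `"input"` with an NHWC uint8 shape (both assumptions), and that the thread count is controlled via the `TVM_NUM_THREADS` environment variable.

```python
# Measurement sketch (not the exact script used for the table above).
import numpy as np
import tvm
from tvm.contrib import graph_runtime

ctx = tvm.cpu(0)  # run with TVM_NUM_THREADS=4 in the environment
m = graph_runtime.create(graph, lib, ctx)
m.set_input(**params)
# "input" and the NHWC shape are assumptions; adjust to the actual model.
m.set_input("input", np.random.uniform(0, 255, size=(1, 224, 224, 3)).astype("uint8"))

# Average latency over repeated runs.
ftimer = m.module.time_evaluator("run", ctx, number=10, repeat=3)
print("mean latency (ms):", np.mean(ftimer().results) * 1000)
```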

@FrozenGene @giuseros @ramana-arm @tqchen @jwfromm @thierry

If we are slow on 4 threads, I think on 1 thread we will be even slower compared with TFLite.

I think the reason has been discussed a few times before, especially in @jackwish's answer in TF Lite quantized conv2d operator conversion. Your performance numbers are almost the same as our initial quantized performance (although we only recorded MobileNet V1 / V2). @jackwish's post shares our development experience of how to improve performance. I am almost certain the reason is the intermediate memory access. As @jackwish shared:

If we break these two steps into two operators, the first reads INT8 and writes INT32 into memory, and the second reads INT32 and writes INT8. In our tests, this approach showed a significant performance drop. (I'd like to make it clear that we have tensorized step 1, which may prevent step 2 from being fused.) As soon as we merged them into one tensorized micro kernel, we got basically the same performance as QNNPACK. The difference is whether there is INT32 intermediate memory access in the operator: if the computation is merged, the INT32 intermediate result (the accumulator) can stay in registers.

So, to summarize, regarding @janimesh's proposals, I think option 1 may get performance similar to TFLite, while option 2 is more capable of enabling a powerful tensorize design.

This is why we created a single operator, named q_conv2d, to do all of the work.
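
To illustrate the idea (this is only a simplified sketch, not the actual q_conv2d implementation, and it uses a toy int8 matmul with made-up requantize constants instead of a real convolution), the point is that the int32 accumulation and the requantize step live in one compute pipeline, so the int32 intermediate never round-trips through memory:

```python
# Sketch: keep the int32 accumulator and the requantize step in one fused
# computation so the int32 intermediate stays in registers.
import tvm
from tvm import te

M, N, K = 64, 64, 64
A = te.placeholder((M, K), dtype="int8", name="A")
W = te.placeholder((K, N), dtype="int8", name="W")
k = te.reduce_axis((0, K), name="k")

# int8 x int8 -> int32 accumulation
Acc = te.compute(
    (M, N),
    lambda i, j: te.sum(A[i, k].astype("int32") * W[k, j].astype("int32"), axis=k),
    name="Acc",
)

# Requantize back to int8 (simplified: plain integer scale-and-shift instead of
# TFLite's rounding fixed-point multiply; the constants are for illustration only).
out = te.compute(
    (M, N),
    lambda i, j: te.max(
        te.min((Acc[i, j] * 3) >> 10, tvm.tir.const(127, "int32")),
        tvm.tir.const(-128, "int32"),
    ).astype("int8"),
    name="out",
)

s = te.create_schedule(out.op)
s[Acc].compute_at(s[out], out.op.axis[1])  # keep the int32 result local to the output point
```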

To reduce cache misses, we also changed the layout to NHWC, which we have discussed a few times. It is really better for CPU compared with NCHW (discussed in https://github.com/apache/incubator-tvm-site/pull/2/).
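
For reference, in recent TVM the convolution layout can be chosen at the Relay level with the ConvertLayout pass; a minimal sketch, assuming `mod` is a Relay module imported from a quantized TFLite model and NHWC is the desired layout:

```python
# Sketch: ask Relay to use NHWC for (quantized) convolutions before compiling.
import tvm
from tvm import relay

desired_layouts = {
    "qnn.conv2d": ["NHWC", "default"],
    "nn.conv2d": ["NHWC", "default"],
}
seq = tvm.transform.Sequential([
    relay.transform.RemoveUnusedFunctions(),
    relay.transform.ConvertLayout(desired_layouts),
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)
```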

However, I haven't upstreamed our implementation because I hope our upcoming auto-scheduler (coming very soon) can take over some of this work (like layout), and then we could contribute our implementation on top of it. I don't want you to think that we don't want to contribute, so I wanted to explain.

Thanks @FrozenGene, I agree. We need better schedules. Currently I am using NCHWc, which is better than NCHW but might be slower than NHWC. Another major improvement should come from tensorization; currently we are relying on LLVM.

For the int8/int32 memory bandwidth issue, the merging is already happening because of Relay fusion. Currently the conv is fused with the 7-8 ops that follow it, essentially fusing conv2d + requantize. I think we can micro-optimize this further, but the structure is already there; we would not need any new Relay/TVM feature.
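
To make the pattern concrete, here is a minimal sketch of the kind of graph involved: a `qnn.conv2d` producing int32 followed by a `qnn.requantize` back to int8. Shapes and quantization parameters are made up for illustration.

```python
# Sketch of the conv2d + requantize pattern that Relay fuses.
import tvm
from tvm import relay

data = relay.var("data", shape=(1, 28, 28, 16), dtype="int8")
weight = relay.var("weight", shape=(3, 3, 16, 32), dtype="int8")

conv = relay.qnn.op.conv2d(
    data, weight,
    input_zero_point=relay.const(0), kernel_zero_point=relay.const(0),
    input_scale=relay.const(0.05), kernel_scale=relay.const(0.02),
    kernel_size=(3, 3), channels=32,
    data_layout="NHWC", kernel_layout="HWIO", out_dtype="int32",
)
out = relay.qnn.op.requantize(
    conv,
    input_scale=relay.const(0.001), input_zero_point=relay.const(0),
    output_scale=relay.const(0.1), output_zero_point=relay.const(0),
    out_dtype="int8",
)
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))
# After the QNN ops are lowered and FuseOps runs, the requantize arithmetic
# ends up in the same fused function as the convolution.
```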

Looking forward to your auto-scheduler work. And I hope it can help int8 schedules as well.

@anijain2305 As we have submitted our work to OSDI and are rebasing our code onto the latest master (we hope to start bringing it in this month), I want to share one quick data point with you. On a Raspberry Pi 3B+, for the quantized MobileNet V2 model, TFLite 2.1 takes 53.839 ms (a big improvement compared with TFLite 1.14), AutoTVM takes 76.08 ms, but the auto-scheduler takes 43.53 ms, i.e. 1.2x faster than TFLite. In fact, we still have room to improve (reducing load instructions), but I think it is a good start.

This is very good news. Looking forward to your work in TVM codebase.

How generic is the auto-scheduler? Is it mainly for conv2d or for arbitrary TVM ops?

Arbitrary TVM ops. If you have a tvm.compute definition, we will generate a high-performance schedule for you automatically.
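
For later readers: the auto-scheduler was eventually upstreamed as `tvm.auto_scheduler` (Ansor), whose interface may differ from what is being described above. A minimal sketch of that upstream API on a toy tvm.compute workload (the workload, trial count, and target triple are made up):

```python
# Sketch of the tvm.auto_scheduler API as it later landed upstream.
import tvm
from tvm import te, auto_scheduler

@auto_scheduler.register_workload
def int8_matmul(M, N, K):
    A = te.placeholder((M, K), dtype="int8", name="A")
    W = te.placeholder((K, N), dtype="int8", name="W")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute(
        (M, N),
        lambda i, j: te.sum(A[i, k].astype("int32") * W[k, j].astype("int32"), axis=k),
        name="C",
    )
    return [A, W, C]

target = tvm.target.Target("llvm -mtriple=aarch64-linux-gnu")
task = auto_scheduler.SearchTask(func=int8_matmul, args=(64, 64, 64), target=target)
task.tune(auto_scheduler.TuningOptions(
    num_measure_trials=200,
    measure_callbacks=[auto_scheduler.RecordToFile("int8_matmul.json")],
))
sch, args = task.apply_best("int8_matmul.json")  # best schedule found so far
```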

Sounds awesome! Looking forward to the PR :slight_smile:

Hi, have you found that TFLite quantized models give different results between running in TFLite and running in TVM?

models from https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet

@anijain2305

@henry099 Yes, we have seen TFLite and TVM outputs differ slightly. This is due to differences between TFLite's and TVM's compute (rounding and maybe some other differences). However, we have observed that these minor differences have minimal effect on application accuracy (Top1/Top5).

@anijain2305 Thanks for replying. As a TVM beginner, I thought a quantized model should use only integer compute, so where does the rounding error come from? I tried replacing TVM's multiplier to match TFLite's, but the results are still different. Any other clues to try?

I tried models like mobilenet_0.25_128/96, and the Top1/Top5 accuracy is affected more.

I have not tried mobilenet_0.25. I tried the original MobileNet V1 and V2 and got good results.

Yes, the quantized convolutions use integer datatypes, but we have to call the requantize operator frequently to adjust the quantization parameters. Requantize requires a fixed-point multiplication, and hence a rounding step.
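
To see where rounding enters, here is a toy Python illustration (not TFLite's or TVM's actual implementation): the float rescale factor is represented as a 31-bit fixed-point multiplier plus a shift, and the shifted product must be rounded back to int8; different rounding conventions can move a value by 1.

```python
# Toy requantize: int32 accumulator -> int8 output via fixed-point multiply.
def quantize_multiplier(real_scale):
    # Represent real_scale as multiplier * 2**shift with a 31-bit multiplier.
    shift = 0
    while real_scale < 0.5:
        real_scale *= 2.0
        shift -= 1
    while real_scale >= 1.0:
        real_scale /= 2.0
        shift += 1
    return int(round(real_scale * (1 << 31))), shift

def requantize(acc_int32, input_scale, output_scale, output_zero_point=0):
    multiplier, shift = quantize_multiplier(input_scale / output_scale)
    total_shift = 31 - shift
    prod = acc_int32 * multiplier
    # Rounding right shift ("round half up"); other modes may differ by 1.
    rounded = (prod + (1 << (total_shift - 1))) >> total_shift
    return max(-128, min(127, rounded + output_zero_point))

# 117 * (0.05 / 0.1) = 58.5, which rounds to 59 here but may round differently
# under another rounding mode.
print(requantize(117, input_scale=0.05, output_scale=0.1))
```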

Hi folks, has the performance diff been addressed? OctoML has the capability to compile models and measure latency with both TVM and TFLite. Do we have a similar effort/capability?