If we are already slower than TFLite on 4 threads, I think we will be even slower relative to TFLite on 1 thread.
I think the reason has been discussed a few times before, especially in @jackwish's answer in: TF Lite quantized conv2d operator conversion. Your performance numbers are almost the same as our initial quantized performance (although we only recorded MobileNet V1 / V2). What @jackwish shared there is our development experience of how we improved performance. I am fairly sure the cause is the intermediate memory access. As @jackwish shared:
If we break these two steps into two operators, the first reads INT8 and writes INT32 to memory, and the second reads INT32 and writes INT8. In our test, this approach showed a significant performance drop. (To be clear, we had tensorized step 1, which may prevent step 2 from being fused.) As soon as we merged them into one tensorized micro kernel, we got basically the same performance as QNNPACK. The difference is whether there is INT32 intermediate memory access inside the operator: when the computation is merged, the INT32 intermediate results (the accumulators) can stay in registers.
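To illustrate the point about intermediate memory traffic, here is a minimal sketch (not our actual kernel; the GEMM view, shapes, names, and the simplified float-based requantization are my assumptions) contrasting the two-operator approach with the fused one:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-tensor requantization: scale the INT32 accumulator back
 * to INT8. Real kernels use fixed-point multipliers and rounding; a float
 * multiply keeps the sketch short. */
static inline int8_t requantize(int32_t acc, float scale, int32_t zero_point) {
    int32_t v = (int32_t)(acc * scale) + zero_point;
    if (v > 127) v = 127;
    if (v < -128) v = -128;
    return (int8_t)v;
}

/* Two-operator approach: the INT32 accumulators make a round trip through
 * memory (stored by the compute step, loaded again by the requantize step). */
void gemm_int8_to_int32(const int8_t *a, const int8_t *b, int32_t *c,
                        size_t m, size_t n, size_t k) {
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            int32_t acc = 0;
            for (size_t p = 0; p < k; ++p)
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
            c[i * n + j] = acc;               /* INT32 store to memory */
        }
}

void requantize_int32_to_int8(const int32_t *c, int8_t *out, size_t len,
                              float scale, int32_t zero_point) {
    for (size_t i = 0; i < len; ++i)
        out[i] = requantize(c[i], scale, zero_point);  /* INT32 load again */
}

/* Fused approach: the accumulator lives in a register for the whole inner
 * loop, and only the final INT8 value is written out. */
void gemm_int8_fused(const int8_t *a, const int8_t *b, int8_t *out,
                     size_t m, size_t n, size_t k,
                     float scale, int32_t zero_point) {
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            int32_t acc = 0;                  /* stays in a register */
            for (size_t p = 0; p < k; ++p)
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
            out[i * n + j] = requantize(acc, scale, zero_point);
        }
}
```

The fused version never materializes the INT32 tensor, which is exactly the memory traffic the two-operator version pays for.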
So, to summarize, regarding @janimesh's proposals, I think option 1 may get performance similar to TFLite, while option 2 is better suited to enabling a powerful tensorize design.
This is why we created a single operator named q_conv2d that completes all the work in one pass.
To reduce cache misses, we also changed the layout to NHWC, which we have discussed a few times. It is noticeably better than NCHW on CPU (discussed in https://github.com/apache/incubator-tvm-site/pull/2/).
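As a rough illustration of why NHWC helps on CPU (a sketch with assumed shapes and a 1x1 convolution, not our actual schedule): in NHWC the input channels are innermost and contiguous, so the reduction loop walks memory sequentially and vectorizes easily, whereas in NCHW the same reduction strides by H*W elements on every iteration.

```c
#include <stdint.h>
#include <stddef.h>

/* Accumulate one output pixel of a 1x1 convolution over C input channels.
 * The 1x1 case and the shapes are assumptions chosen to keep the sketch small. */

/* NHWC: data[n][h][w][c] -- the c loop reads consecutive bytes,
 * which is cache- and SIMD-friendly. */
int32_t dot_nhwc(const int8_t *data, const int8_t *weight,
                 size_t n, size_t h, size_t w,
                 size_t H, size_t W, size_t C) {
    const int8_t *pixel = data + ((n * H + h) * W + w) * C;
    int32_t acc = 0;
    for (size_t c = 0; c < C; ++c)
        acc += (int32_t)pixel[c] * (int32_t)weight[c];    /* stride-1 reads */
    return acc;
}

/* NCHW: data[n][c][h][w] -- the same reduction strides by H*W elements,
 * touching a different cache line on almost every iteration. */
int32_t dot_nchw(const int8_t *data, const int8_t *weight,
                 size_t n, size_t h, size_t w,
                 size_t H, size_t W, size_t C) {
    int32_t acc = 0;
    for (size_t c = 0; c < C; ++c)
        acc += (int32_t)data[((n * C + c) * H + h) * W + w]
             * (int32_t)weight[c];                        /* strided reads */
    return acc;
}
```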
However, I have not upstreamed our implementation yet, because I hope our upcoming auto scheduler (coming very soon) can help us complete some of this work (such as layout), and then we can contribute our implementation on top of it. I don't want you to think that we don't want to contribute, so I wanted to explain.