I am trying to evaluate convolution performance for int8 on arm devices.
After a lot of tinkering, I was able to:
- Build a tflite model with a single quantized convolution
- Compile and run the model in TVM
- Run the model in tflite on an arm device
Next step for me is to tune my single-op network through the auto-tuner. I was able to start the whole pipeline (tracker + devices) and to generate a log file with the configurations.
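As a side note, the tuning log is line-delimited JSON, so it is easy to inspect which configurations were measured and which one is currently best. A minimal sketch of that inspection (the records below are synthetic and heavily abbreviated; real autotvm log entries carry full task and config descriptions):

```python
import json

# Synthetic, abbreviated stand-ins for lines of an autotvm tuning log.
log_lines = [
    '{"task": "conv2d_nchw_spatial_pack.arm_cpu", "cost": 0.0042}',
    '{"task": "conv2d_nchw_winograd.arm_cpu", "cost": 0.0031}',
]

records = [json.loads(line) for line in log_lines]

# Pick the record with the lowest measured cost (seconds per run).
best = min(records, key=lambda r: r["cost"])
print(best["task"], best["cost"])  # -> conv2d_nchw_winograd.arm_cpu 0.0031
```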
However, when I try to build the binary through TVM, I receive an error:
`assert kernel_layout == "HWIO"`
This error stems from the `_alter_conv2d_layout` specialization for `arm_cpu`, which indeed forces a new kernel layout, OHWI16o, on lines 83:96. Is this the expected behaviour? Because this then hits the aforementioned assert in `strategy/arm_cpu.py:93` with `kernel_layout == OHWI16o`.
If I simply return None from `_alter_conv2d_layout`, everything works fine, but I don’t know whether this affects the performance or even the correctness of the result.
It will be difficult to debug w/o the tuning script and some more information. But I have some pointers.
Can you print the Relay module/graph? It seems you are using the NHWC data layout to start with, which is understandable as TFLite is NHWC by default. For the NHWC data layout the kernel layout is HWIO, so you should not have seen that error (`assert kernel_layout == "HWIO"`). We can print out the Relay module and check what the kernel layout is.
I have seen NCHW perform better on ARM edge devices, so you might benefit from converting the graph to NCHW first using the ConvertLayout pass. But this is for getting better performance; maybe we should first resolve the error above.
By the way, it would be really useful to have this as a test in the TVM codebase for the TFLite frontend parser. We already do this for many other ops, but conv2d was missing. I tried once earlier, but I couldn’t make it work.
If I use my tuner file, when the build goes through the layout alteration pass (`conv2d_alter_op.py`) it changes the kernel layout from HWIO to OHWIxo (where x in my case is 16), hitting the assert.
Another experiment I did was to remove the assert from `strategy/arm_cpu.py`. But, as I expected, I am then hitting the one in `conv2d_spatial_pack_nhwc`:
`assert len(kernel.shape) == 4, "AlterOpLayout not enabled for NHWC yet"`
This surprises me, because I thought tflite uses NHWC exactly for performance reasons. Is this because tflite uses gemmlowp while TVM uses a spatial convolution?
Anyway, I will try to force the NCHW layout to see if at least I can gather some initial performance results.
Finally, if you wish, I can share the python code to generate the single quantized tflite convolution. Once you have the tflite file, it is simply a matter of going through the TVM compilation process.
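For what it’s worth, a sketch of one way to do it with the TF2 converter (post-training full-integer quantization of a single Conv2D; the layer shapes, dataset, and file name are arbitrary placeholders, not the exact script I used):

```python
import numpy as np
import tensorflow as tf

# A Keras model with a single conv layer (arbitrary shapes).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, padding="same", input_shape=(28, 28, 32)),
])

# Representative data drives the calibration of quantization ranges.
def representative_dataset():
    for _ in range(10):
        yield [np.random.rand(1, 28, 28, 32).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full int8 kernels, int8 inputs/outputs.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("single_conv_int8.tflite", "wb") as f:
    f.write(tflite_model)
```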
Hi again,
I ran some experiments in NCHW (as suggested by @anijain2305). For the particular convolution I am testing, I actually see a 2x speed-up over tflite (same outputs) with no tuning. Is this expected? I am running one of the conv layers from inception_v3.
I will try to run the whole network and come up with a performance break-down.
However, with tuning I am not able to get correct results in TVM. What surprises me is that, with tuning enabled, TVM selects a winograd compute to run the convolution, which I am not sure is supported for quantized networks. I will try to fiddle around a bit more, then I might start another thread.
For the NHWC issue, I was able to reproduce the error on my end.
The fix is in the above PR. We made a major refactor around a month ago, and I think we missed this change. For now, I disabled the conversion altogether. There might be a way to improve the NHWC schedule and use alter_op_layout.
Re NCHW, I am surprised to see a 2x performance improvement w/o tuning, but it might be possible for a single conv layer. Yes, let’s do a full-network evaluation. I am also working on pre-quantized models on ARM, so we can join hands.
Regarding winograd, this is interesting; I did not know about the accuracy issue. If it helps, we can first disable winograd and get a complete performance picture.
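One cheap way to disable it without touching the strategy code might be to drop the winograd records from the tuning log before building, so the dispatcher falls back to the non-winograd schedule. A sketch with synthetic, abbreviated records (real autotvm log lines carry the task name inside a larger JSON structure):

```python
import json

# Synthetic, abbreviated stand-ins for autotvm log lines.
log_lines = [
    '{"task": "conv2d_nchw_winograd.arm_cpu", "cost": 0.0031}',
    '{"task": "conv2d_nchw_spatial_pack.arm_cpu", "cost": 0.0042}',
]

# Keep only the records whose task is not a winograd variant.
kept = [ln for ln in log_lines if "winograd" not in json.loads(ln)["task"]]
print(len(kept))  # -> 1
```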
About the NCHW performance: I think that the tophub files are quite good for known conv shapes (which is the case for inception_v3).
About the end-to-end performance evaluation: I ran a quantized version of inception_v3. With TVM I am getting a 16% speed-up over tflite. Is this expected?
Finally, about winograd: in theory winograd is not supported for quantized models, right? At least not unless you convert everything to 32-bit (or float) and then requantize to int8 (but then performance might be bad). Disabling winograd gives the correct result, but slightly slower than the fallback case (although I only ran the tuner for 50 iterations).
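To make the int8 pipeline concrete: the convolution accumulates int8 x int8 products into an int32 accumulator, which is then requantized back to int8 with the combined scale. A toy sketch of that last step (the scales, zero point, and accumulator values are made up; real implementations use fixed-point multipliers rather than float):

```python
import numpy as np

# Made-up quantization parameters: input scale, weight scale, output scale.
s_in, s_w, s_out = 0.05, 0.02, 0.1
zp_out = 0  # made-up output zero point

# Pretend int32 conv accumulators (sum of int8*int8 products).
acc = np.array([1234, -5678, 40000], dtype=np.int32)

# Combined requantization multiplier, then round, shift, and saturate.
m = (s_in * s_w) / s_out
out = np.clip(np.round(acc * m) + zp_out, -128, 127).astype(np.int8)
print(out)  # -> [ 12 -57 127]  (the last value saturates at int8 max)
```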
Yes, we should definitely sync up on this (especially to set some performance goals).