Issue in alter_conv2d_layout for arm_cpu

Hi all,

I am trying to evaluate int8 convolution performance on ARM devices.

After a lot of tinkering, I was able to:

  • Build a tflite model with a single quantized convolution (see the sketch below)
  • Compile and run the model in TVM
  • Run the model in tflite on an ARM device
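Roughly, the model can be generated like this (a minimal sketch assuming the TF 2.x post-training quantization API; the file name conv.tflite is a placeholder, and the shapes match the Relay module further below):

import numpy as np
import tensorflow as tf

# Single-conv2d model: 1x147x147x32 input, 64 output channels, 3x3 kernel.
inp = tf.keras.Input(shape=(147, 147, 32), batch_size=1)
out = tf.keras.layers.Conv2D(64, (3, 3), padding="same")(inp)
model = tf.keras.Model(inp, out)

# Representative data drives full-integer post-training quantization.
def representative_dataset():
    for _ in range(8):
        yield [np.random.rand(1, 147, 147, 32).astype("float32")]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
open("conv.tflite", "wb").write(converter.convert())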

The next step for me is to tune my single-op network with the auto-tuner. I was able to start the whole pipeline (tracker + devices) and to generate a log file with the configurations.
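The pipeline looks roughly like this (a sketch: mod and params are assumed to come from relay.frontend.from_tflite, and the tracker address, device key "arm", and log file name are placeholders from my setup):

from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

target = "llvm -device=arm_cpu -mtriple=aarch64-linux-gnu"
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

# Build locally, measure remotely on the devices behind the tracker.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.RPCRunner("arm", host="127.0.0.1", port=9190, number=10),
)

for task in tasks:
    tuner = XGBTuner(task)
    tuner.tune(
        n_trial=50,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("conv2d_tuning.log")],
    )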

However, when I try to build the binary through TVM, I receive an error:

assert kernel_layout == "HWIO"

This error stems from the _alter_conv2d_layout specialization for arm_cpu (in conv2d_alter_op.py), which is indeed forcing a new kernel layout, OHWI16o, on lines 83:96. Is this the expected behaviour?

This then hits the aforementioned assert in strategy/arm_cpu.py:93 with kernel_layout=OHWI16o.

If I simply return None from _alter_conv2d_layout, everything works fine, but I don’t know whether this affects the performance or even the correctness of the result.
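Concretely, the workaround is something like the following sketch (the import path for conv2d_alter_layout varies across TVM versions; returning None tells the pass to leave the conv2d untouched):

from tvm import topi

@topi.nn.conv2d_alter_layout.register(["arm_cpu"], override=True)
def _alter_conv2d_layout(attrs, inputs, tinfos, out_type):
    # None = keep the original data/kernel layouts, so the NHWC/HWIO
    # conv2d is lowered without any relayout.
    return None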

Could someone shed some light on this?

Thanks,

Giuseppe

Digging a bit more, I found that topi/arm_cpu/conv2d_spatial_pack.py contains the following assert:

assert len(kernel.shape) == 4, "AlterOpLayout not enabled for NHWC yet"

Please correct me if I am wrong: does this mean that the layout alteration for the kernel (from HWIO to OHWIxo) is not implemented yet?
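To make the transformation concrete, this is how I understand the relayout for my 3x3x32x64 kernel, assuming the usual tiling convention where the inner 16 output channels become the trailing axis:

import numpy as np

hwio = np.zeros((3, 3, 32, 64))    # HWIO, as in the tflite model
ohwi = hwio.transpose(3, 0, 1, 2)  # OHWI: (64, 3, 3, 32)
# Split O = 64 into 4 outer x 16 inner channels and move the inner
# block last, giving OHWI16o: (4, 3, 3, 32, 16).
ohwi16o = ohwi.reshape(4, 16, 3, 3, 32).transpose(0, 2, 3, 4, 1)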

Thanks again,

Giuseppe

@janimesh - this may be of interest to you.

Hi Giuseppe,

It will be difficult to debug without the tuning script and some more information, but I have some pointers.

  1. Can you print the Relay module/graph? It seems you are using the NHWC data layout to start with, which is understandable as TFLite is NHWC by default. For the NHWC data layout, the kernel layout is HWIO, so you should not have seen that error - assert kernel_layout == "HWIO". We can print out the Relay module and check what the kernel layout is.

  2. I have seen NCHW perform better on ARM edge devices, so you might benefit from converting the graph to NCHW first using the ConvertLayout pass (see the sketch below). But this is for getting better performance; maybe we should first resolve the first error.
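For reference, a minimal sketch of both suggestions (assuming the dict-based ConvertLayout API; the exact signature differs across TVM versions):

import tvm
from tvm import relay

print(mod["main"])  # inspect data_layout/kernel_layout on the conv2d first

desired_layouts = {"qnn.conv2d": ["NCHW", "default"], "nn.conv2d": ["NCHW", "default"]}
seq = tvm.transform.Sequential([relay.transform.ConvertLayout(desired_layouts)])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)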

By the way, it would be really useful to have this as a test in the TVM codebase for the TFLite frontend parser. We already do this for many other ops, but conv2d was missing. I tried once earlier, but I couldn’t make it work.

Hi Animesh, Ramana,

Thanks for your replies.

This is the Relay module before it gets compiled:

def @main(%input: Tensor[(1, 147, 147, 32), uint8], %v_param_1: Tensor[(3, 3, 32, 64), uint8], %v_param_2: Tensor[(64), int32]) -> Tensor[(1, 147, 147, 64), uint8] {
  %0 = nn.pad(%input, pad_value=128f, pad_width=[[0, 0], [1, 1], [1, 1], [0, 0]]) /* ty=Tensor[(1, 149, 149, 32), uint8] */;
  %1 = qnn.conv2d(%0, %v_param_1, 128 /* ty=int32 */, 109 /* ty=int32 */, 0.00784314f /* ty=float32 */, 0.0054902f /* ty=float32 */, padding=[0, 0, 0, 0], channels=64, kernel_size=[3, 3], data_layout="NHWC", kernel_layout="HWIO", out_dtype="int32") /* ty=Tensor[(1, 147, 147, 64), int32] */;
  %2 = nn.bias_add(%1, %v_param_2, axis=3) /* ty=Tensor[(1, 147, 147, 64), int32] */;
  qnn.requantize(%2, 4.30604e-05f /* ty=float32 */, 0 /* ty=int32 */, 0.027451f /* ty=float32 */, 36 /* ty=int32 */, out_dtype="uint8") /* ty=Tensor[(1, 147, 147, 64), uint8] */
}

This will go through some legalization passes (layout alteration is one of them) before generating the schedule to lower.

If I don’t use my tuner file, the legalization result for conv2d is:

%0 = nn.conv2d(%p0, %p1, padding=[0, 0, 0, 0], channels=64, kernel_size=[3, 3], data_layout="NHWC", kernel_layout="HWIO", out_dtype="int32") /* ty=Tensor[(1, 147, 147, 64), int32] */;

This is indeed fine and works without issues.

If I use my tuner file, when the module goes through the layout alteration pass (conv2d_alter_op.py), the kernel layout is changed from HWIO to OHWIxo (where x in my case is 16), hitting the assert.

Another experiment I did was to remove the assert from strategy/arm_cpu.py, but, as I was expecting, I then hit the one in conv2d_spatial_pack_nhwc:

assert len(kernel.shape) == 4, "AlterOpLayout not enabled for NHWC yet"

This surprises me, because I think tflite uses NHWC precisely for performance reasons. Is this because tflite uses gemmlowp, while in TVM we use a spatial convolution?

Anyway, I will try to force the NCHW layout to see if at least I can gather some initial performance results.

Finally, if you wish, I can share the Python code to generate the single quantized tflite convolution (see the first sketch above). Once you have the tflite file, it is simply a matter of going through the TVM compilation process.
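For completeness, the compilation flow is roughly the following sketch (the target triple and input name are from my setup; note that older TVM versions return a (graph, lib, params) tuple from relay.build instead of a single module):

import tflite
import tvm
from tvm import relay

buf = open("conv.tflite", "rb").read()
model = tflite.Model.GetRootAsModel(buf, 0)
mod, params = relay.frontend.from_tflite(
    model, shape_dict={"input": (1, 147, 147, 32)}, dtype_dict={"input": "uint8"}
)
target = "llvm -device=arm_cpu -mtriple=aarch64-linux-gnu"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)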

Thanks a lot,

Giuseppe

Hi again, I ran some experiments in NCHW (as suggested by @anijain2305). For the particular convolution I am testing, I actually see a 2x speed-up over tflite (with identical outputs) with no tuning. Is this expected? I am running one of the conv layers from inception_v3.

I will try to run the whole network and come up with a performance breakdown.
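In case it is useful, this is roughly how the timings can be collected (a sketch: it assumes the tracker from the tuning step and the lib built above; graph_executor is named graph_runtime in older TVM versions, and net.tar is a placeholder):

import numpy as np
import tvm
from tvm import rpc
from tvm.contrib import graph_executor

lib.export_library("net.tar")
remote = rpc.connect_tracker("127.0.0.1", 9190).request("arm")
remote.upload("net.tar")
rlib = remote.load_module("net.tar")
dev = remote.cpu()
module = graph_executor.GraphModule(rlib["default"](dev))
module.set_input("input", np.random.randint(0, 255, size=(1, 147, 147, 32)).astype("uint8"))
ftimer = module.module.time_evaluator("run", dev, number=50, repeat=3)
print("mean runtime: %.3f ms" % (np.mean(ftimer().results) * 1e3))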

However, with tuning enabled I am not able to get correct results in TVM. What surprises me is that the tuner selects a winograd compute to run the convolution, which I am not sure is supported for quantized networks. I will fiddle around a bit more and then might start another thread.

Thanks,

Giuseppe

For the NHWC issue, I was able to reproduce the error on my end.

The fix is in the above PR. We made a major refactor around a month ago, and I think we missed this change. For now, I have disabled the conversion altogether. There might be a way to improve the NHWC schedule and use alter_op_layout.


Regarding NCHW, I am surprised to see a 2x performance improvement without tuning, but it might be possible for a single conv layer. Yes, let’s do a full network evaluation. I am also working on pre-quantized models on ARM, so we can join hands.

Regarding winograd, this is interesting. I did not know about the accuracy issue. If it helps, we can first disable winograd and get a complete performance picture.

About the NHWC error: great, thanks for the PR!

About the NCHW performance: I think that the tophub files are quite good for known conv shapes (which is the case for inception_v3).

About the end-to-end performance evaluation: I ran a quantized version of inception_v3. With TVM I am getting a 16% speed-up over tflite. Is this expected?

Finally, about winograd: in theory winograd is not supported for quantized networks, right? At least not unless you convert everything to 32-bit (or float) and then requantize to int8 (but then performance might suffer). Disabling winograd gives the correct result, but it is slightly slower than the fallback case (although I only ran the tuner for 50 iterations).
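For anyone trying the same thing, one rough way to disable winograd is to filter the tuning log before compiling, assuming the winograd records can be identified by their workload name:

# Drop winograd records from the autotvm log so that only the
# spatial-pack configurations are picked at compile time.
with open("conv2d_tuning.log") as src, open("conv2d_no_winograd.log", "w") as dst:
    for line in src:
        if "winograd" not in line:
            dst.write(line)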

Yes, we should definitely sync up on this (especially to set some performance goals).

Thanks again,

Giuseppe