Improving the depthwise convolution core loop

Hi all,

I am trying to improve quantized NHWC depthwise convolution performance on AArch64 targets.

Background

In this process, I convert everything to int16, split the loop over the channels by a factor of 8, and vectorize the inner loop.

So the loop over the channels is similar to:

for (c = 0; c < channels; c += 8) {
    depthwise_conv2d_nhwc_output_2[c:c+8] += X[c:c+8] * W[c:c+8]
}
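As a sketch of the arithmetic in that loop (the names `channels`, `X`, `W` follow the snippet above; the NumPy emulation and the concrete sizes are my own illustration, not the actual TIR), each 8-channel block widens the int16 operands to int32 before accumulating:

```python
import numpy as np

# Hypothetical sizes: one output pixel, `channels` channels (a multiple of 8).
channels = 16
rng = np.random.default_rng(0)
X = rng.integers(-128, 128, channels).astype(np.int16)  # quantized input, widened to int16
W = rng.integers(-128, 128, channels).astype(np.int16)  # quantized weights, widened to int16
out = np.zeros(channels, dtype=np.int32)                # int32 accumulators

for c in range(0, channels, 8):
    # Widening multiply-accumulate: int16 * int16 -> int32, 8 lanes at a time.
    out[c:c + 8] += X[c:c + 8].astype(np.int32) * W[c:c + 8].astype(np.int32)
```

The point of widening to int32 in the accumulator is that the product of two int16 values does not fit in int16, which is exactly what the smlal family of instructions handles in one step.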

Question

If we consider the previous expression for c = 0, I am looking for a way to lower the following TIR expression:

depthwise_conv2d_nhwc_output_2[0:8] += X[0:8]*W[0:8]

Into the following assembly:

smlal  %[acc_a].4s, %[w].4h, %[x].4h
smlal2 %[acc_b].4s, %[w].8h, %[x].8h
str %[acc_a], [%[out]]
str %[acc_b], [%[out], #16]
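For reference, `smlal` widens and multiply-accumulates the low four int16 lanes into a 4s (int32) accumulator, and `smlal2` does the same for the high four lanes, so the pair covers all 8 channels of one block. A NumPy emulation of the two instructions (the helper names and test values are mine, for illustration only):

```python
import numpy as np

def smlal(acc, w, x):
    """Emulate SMLAL: acc (4s) += low 4 int16 lanes of w * x, widened to int32."""
    return acc + w[:4].astype(np.int32) * x[:4].astype(np.int32)

def smlal2(acc, w, x):
    """Emulate SMLAL2: acc (4s) += high 4 int16 lanes of w * x, widened to int32."""
    return acc + w[4:].astype(np.int32) * x[4:].astype(np.int32)

w = (np.arange(8) - 4).astype(np.int16)  # 8h vector of weights
x = np.full(8, 300, dtype=np.int16)      # 8h vector of inputs
acc_a = np.zeros(4, dtype=np.int32)
acc_b = np.zeros(4, dtype=np.int32)

acc_a = smlal(acc_a, w, x)   # lanes 0..3
acc_b = smlal2(acc_b, w, x)  # lanes 4..7
```

Together the two accumulators hold the full widened product `w * x` across all 8 lanes, which is then stored back with the two `str` instructions.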

Is that possible without using tensorization?

cc: @anijain2305 @ramana-arm @matt-arm

@FrozenGene You might be interested in this. Have you encountered this type of situation?

@vinx13 if you are interested

If tensorization is not used, the codegen rule will be decided by LLVM.

I haven’t focused on the inner loop, but I think that without tensorize we can hardly achieve this instruction pattern.