Improving the depthwise convolution core loop

Hi all,

I am trying to improve quantized NHWC depthwise convolution performance on AArch64 targets.

Background

In this process, I convert everything to int16, split the loop over the channels by a factor of 8, and vectorize the inner loop.

So the loop over the channels is similar to:

for (c = 0; c < channels; c += 8) {
    depthwise_conv2d_nhwc_output_2[c:c+8] += X[c:c+8] * W[c:c+8]
}
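As a sketch of the arithmetic in that loop (the names `channels`, `X`, `W` follow the snippet above; the NumPy emulation and the concrete sizes are my own illustration, not the actual TIR), each 8-channel block widens the int16 operands to int32 before accumulating:

```python
import numpy as np

# Hypothetical sizes: one output pixel, `channels` channels (a multiple of 8).
channels = 16
rng = np.random.default_rng(0)
X = rng.integers(-128, 128, channels).astype(np.int16)  # quantized input, widened to int16
W = rng.integers(-128, 128, channels).astype(np.int16)  # quantized weights, widened to int16
out = np.zeros(channels, dtype=np.int32)                # int32 accumulators

for c in range(0, channels, 8):
    # Widening multiply-accumulate: int16 * int16 -> int32, 8 lanes at a time.
    out[c:c + 8] += X[c:c + 8].astype(np.int32) * W[c:c + 8].astype(np.int32)
```

The point of widening to int32 in the accumulator is that the product of two int16 values does not fit in int16, which is exactly what the smlal family of instructions handles in one step.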

Question

If we consider the previous expression for c = 0, I am looking for a way to lower the following TIR expression:

depthwise_conv2d_nhwc_output_2[0:8] += X[0:8]*W[0:8]

Into the following assembly:

smlal  %[acc_a].4s, %[w].4h, %[x].4h
smlal2 %[acc_b].4s, %[w].8h, %[x].8h
str %[acc_a], [%[out]]
str %[acc_b], [%[out], #16]
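For reference, `smlal` widens and multiply-accumulates the low four int16 lanes into a 4s (int32) accumulator, and `smlal2` does the same for the high four lanes, so the pair covers all 8 channels of one block. A NumPy emulation of the two instructions (the helper names and test values are mine, for illustration only):

```python
import numpy as np

def smlal(acc, w, x):
    """Emulate SMLAL: acc (4s) += low 4 int16 lanes of w * x, widened to int32."""
    return acc + w[:4].astype(np.int32) * x[:4].astype(np.int32)

def smlal2(acc, w, x):
    """Emulate SMLAL2: acc (4s) += high 4 int16 lanes of w * x, widened to int32."""
    return acc + w[4:].astype(np.int32) * x[4:].astype(np.int32)

w = (np.arange(8) - 4).astype(np.int16)  # 8h vector of weights
x = np.full(8, 300, dtype=np.int16)      # 8h vector of inputs
acc_a = np.zeros(4, dtype=np.int32)
acc_b = np.zeros(4, dtype=np.int32)

acc_a = smlal(acc_a, w, x)   # lanes 0..3
acc_b = smlal2(acc_b, w, x)  # lanes 4..7
```

Together the two accumulators hold the full widened product `w * x` across all 8 lanes, which is then stored back with the two `str` instructions.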

Is that possible without using tensorization?

cc: @anijain2305 @ramana-arm @matt-arm

@FrozenGene You might be interested in this. Have you encountered this type of situation?

@vinx13 if you are interested

If tensorization is not used, the codegen rule will be decided by LLVM.

I haven’t focused on the inner loop, but I think that without tensorize we can hardly achieve this instruction pattern.