Hi all,
I am trying to improve the performance of quantized NHWC depthwise convolution on AArch64 targets.
Background
In this process, I convert everything to int16, split the loop over the channels by 8, and then vectorize the inner loop.
So the loop over the channels looks similar to:
for (c = 0; c < channels; c += 8) {
    depthwise_conv2d_nhwc_output_2[c:c+8] += X[c:c+8] * W[c:c+8]
}
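To make the slice notation concrete, here is a minimal scalar sketch of the same loop in C. The function name `depthwise_mac_ref` is hypothetical, and the sketch assumes that the inputs are already widened to int16, that the accumulator is int32, and that `channels` is a multiple of 8; none of these details are stated beyond what is written above.

```c
#include <stdint.h>

/* Scalar reference for the channel loop (hypothetical helper, not TVM output).
 * Assumptions: x and w are already int16, out is an int32 accumulator,
 * and channels is a multiple of 8. */
void depthwise_mac_ref(int32_t *out, const int16_t *x, const int16_t *w,
                       int channels)
{
    /* Outer loop: one iteration per block of 8 channels. */
    for (int c = 0; c < channels; c += 8) {
        /* Inner loop of fixed length 8: the part that is vectorized. */
        for (int ci = 0; ci < 8; ++ci) {
            /* Widen to 32 bits before multiplying so the product cannot
             * overflow the 16-bit input type. */
            out[c + ci] += (int32_t)x[c + ci] * (int32_t)w[c + ci];
        }
    }
}
```

Each outer iteration corresponds to one `[c:c+8]` slice update in the pseudocode.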
Question
If we consider the previous expression for c = 0, I am looking for a way to convert the following TIR expression:
depthwise_conv2d_nhwc_output_2[0:8] += X[0:8]*W[0:8]
into the following assembly:
smlal %[acc_a].4s, %[w].4h, %[x].4h
smlal2 %[acc_b].4s, %[w].8h, %[x].8h
str %[acc_a], [%[out]]
str %[acc_b], [%[out], #16]
Is that possible without using tensorization?
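For reference, the computation I expect from the instruction pair can be sketched in C with NEON intrinsics. This is only an illustration of the intended semantics, not what TVM generates: the function name `mac8_s16` is hypothetical, the scalar fallback exists only so the sketch also builds on non-AArch64 hosts, and a compiler will typically (but is not guaranteed to) lower `vmlal_s16` and `vmlal_high_s16` to `smlal` and `smlal2`.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* out[0:8] += x[0:8] * w[0:8] with int16 inputs and int32 accumulators.
 * Hypothetical helper used only to illustrate the target instructions. */
void mac8_s16(int32_t *out, const int16_t *x, const int16_t *w)
{
#if defined(__aarch64__)
    int16x8_t vx = vld1q_s16(x);          /* 8 x int16 activations */
    int16x8_t vw = vld1q_s16(w);          /* 8 x int16 weights */
    int32x4_t acc_a = vld1q_s32(out);     /* accumulators for lanes 0..3 */
    int32x4_t acc_b = vld1q_s32(out + 4); /* accumulators for lanes 4..7 */

    /* Widening multiply-accumulate on the low halves (smlal). */
    acc_a = vmlal_s16(acc_a, vget_low_s16(vw), vget_low_s16(vx));
    /* Widening multiply-accumulate on the high halves (smlal2). */
    acc_b = vmlal_high_s16(acc_b, vw, vx);

    vst1q_s32(out, acc_a);
    vst1q_s32(out + 4, acc_b);            /* 4 int32 elements = 16 bytes further */
#else
    /* Portable fallback with the same lane-by-lane semantics. */
    for (int i = 0; i < 4; ++i)           /* low half, as smlal */
        out[i] += (int32_t)w[i] * (int32_t)x[i];
    for (int i = 4; i < 8; ++i)           /* high half, as smlal2 */
        out[i] += (int32_t)w[i] * (int32_t)x[i];
#endif
}
```

The two accumulators are needed because eight int32 results do not fit in a single 128-bit register, which is why the store is split in two as well.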