AutoTVM and CPU vectorization: should I split?

Hello! Suppose my CPU supports AVX2, which provides operations on 256-bit registers (8 FP32 operands). Does that mean that in AutoTVM we can always configure the schedule like

# (suppose the length of x is 32)
xo, xi = s[A].split(x, factor=8)
s[A].unroll(xo)
s[A].vectorize(xi)

so that we can avoid searching over the split of the x axis? Does direct vectorization on x, like

s[A].vectorize(x)

generate different assembly and have different performance from the example above?
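
For reference, here is a minimal sketch of how I imagine comparing the two variants (the element-wise compute and the shapes are just placeholders, not my real workload):

import tvm
from tvm import te

# placeholder compute: a 32-element element-wise add
n = 32
B = te.placeholder((n,), name="B", dtype="float32")
A = te.compute((n,), lambda i: B[i] + 1.0, name="A")

# variant 1: explicit split + unroll + vectorize
s1 = te.create_schedule(A.op)
xo, xi = s1[A].split(A.op.axis[0], factor=8)
s1[A].unroll(xo)
s1[A].vectorize(xi)

# variant 2: vectorize the whole axis and let the compiler decide
s2 = te.create_schedule(A.op)
s2[A].vectorize(A.op.axis[0])

# compare the lowered IR of the two schedules
print(tvm.lower(s1, [B, A], simple_mode=True))
print(tvm.lower(s2, [B, A], simple_mode=True))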

Thanks in advance!

A similar question: suppose we parallelize some axis y like

s[A].parallel(y)

is making the length of y equal to OMP_NUM_THREADS (4 in my case) guaranteed to be the best solution?

@kevinthesun @vinx13 Can you help me with this question?

If you know the optimal split size (e.g. from the register size), you can split directly without searching. In contrast, s[A].vectorize(x) means vectorizing the whole loop, which is impossible in many cases. On CPU, LLVM will decide how to handle such vectorization.

In your simple case, it is possible to directly fill in the optimal (or near-optimal) value. However, for more complicated cases we still need AutoTVM.
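
For example, in a real template you usually let AutoTVM define the split as a tunable knob, because the best factor interacts with tiling, unrolling, and parallelization. A sketch only (the template name and knobs below are hypothetical):

import tvm
from tvm import te, autotvm

# hypothetical tunable template for a 2-D element-wise op
@autotvm.template("example/elemwise_2d")
def elemwise_2d(M, N):
    B = te.placeholder((M, N), name="B", dtype="float32")
    A = te.compute((M, N), lambda i, j: B[i, j] + 1.0, name="A")
    s = te.create_schedule(A.op)
    y, x = A.op.axis

    cfg = autotvm.get_config()
    # let AutoTVM search the split of x instead of hard-coding factor=8
    cfg.define_split("tile_x", x, num_outputs=2)
    cfg.define_knob("unroll_outer", [0, 1])

    xo, xi = cfg["tile_x"].apply(s, A, x)
    s[A].vectorize(xi)
    if cfg["unroll_outer"].val:
        s[A].unroll(xo)
    s[A].parallel(y)
    return s, [B, A]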

Can you give me an example of the complicated cases you mention?

Just to make it clear: are you saying that LLVM might generate better-performing code for s[A].vectorize(x) than for splitting x with the factor equal to the register width? Or will LLVM automatically generate the same code as the split version?

If the vectorization is impossible due to hardware constraints, in the worst case it may generate an ordinary loop (even if the loop is marked as vectorized in the TVM IR).
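
If you want to verify what actually happened, you can build the schedule and inspect the assembly. A sketch (the target string is an assumption for an AVX2 machine):

import tvm
from tvm import te

n = 32
B = te.placeholder((n,), name="B", dtype="float32")
A = te.compute((n,), lambda i: B[i] + 1.0, name="A")
s = te.create_schedule(A.op)
s[A].vectorize(A.op.axis[0])

lib = tvm.build(s, [B, A], target="llvm -mcpu=core-avx2")
# look for ymm registers / packed instructions (e.g. vaddps) in the output;
# if they are absent, LLVM fell back to a scalar loop
print(lib.get_source("asm"))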

I see! How about the parallel question I asked above? Is it a similar case?

Yes, that will make each iteration run in a worker thread.
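
For completeness, when the parallel axis is longer than the thread count, a common idiom is to split it by nparts and parallelize only the outer loop. A sketch (the shapes are made up, and 4 mirrors the 4 threads mentioned above):

import tvm
from tvm import te

M, N = 64, 32
B = te.placeholder((M, N), name="B", dtype="float32")
A = te.compute((M, N), lambda i, j: B[i, j] * 2.0, name="A")
s = te.create_schedule(A.op)
y, x = A.op.axis

# give each of the 4 worker threads a contiguous chunk of rows
yo, yi = s[A].split(y, nparts=4)
s[A].parallel(yo)

# vectorize the inner 8-wide piece of x, as discussed above
xo, xi = s[A].split(x, factor=8)
s[A].vectorize(xi)

print(tvm.lower(s, [B, A], simple_mode=True))  # the yo loop is annotated as parallel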