AutoTVM and CPU vectorization: should I split?

Hello! Suppose my CPU supports AVX2, which provides operations on 256-bit registers (8 FP32 operands). Does that mean that in AutoTVM we can always configure the schedule like

# (suppose the length of x is 32)
xo, xi = s[A].split(x, factor=8)
s[A].unroll(xo)
s[A].vectorize(xi)

so that we can avoid searching over the split of the x axis? Does direct vectorization on x, like

s[A].vectorize(x)

generate different assembly and have different performance from the example above?
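
For reference, here is a minimal sketch of how I imagine comparing the two variants (the element-wise compute and the shapes are just placeholders, not my real workload):

import tvm
from tvm import te

# placeholder compute: a 32-element element-wise add
n = 32
B = te.placeholder((n,), name="B", dtype="float32")
A = te.compute((n,), lambda i: B[i] + 1.0, name="A")

# variant 1: explicit split + unroll + vectorize
s1 = te.create_schedule(A.op)
xo, xi = s1[A].split(A.op.axis[0], factor=8)
s1[A].unroll(xo)
s1[A].vectorize(xi)

# variant 2: vectorize the whole axis and let the compiler decide
s2 = te.create_schedule(A.op)
s2[A].vectorize(A.op.axis[0])

# compare the lowered IR of the two schedules
print(tvm.lower(s1, [B, A], simple_mode=True))
print(tvm.lower(s2, [B, A], simple_mode=True))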

Thanks in advance!

A similar question: suppose we parallelize some axis y like

s[A].parallel(y)

is making the length of y equal to OMP_NUM_THREADS (4 in my case) guaranteed to be the best solution?

@kevinthesun @vinx13 Can you help me with this question?

If you know the optimal split size (e.g. from the register size), you can split directly without searching. In contrast, s[A].vectorize(x) means vectorizing the whole loop, which is impossible in many cases. On CPU, LLVM will decide how to handle such vectorization.

In your simple case, it is possible to directly fill in the optimal (or near-optimal) value. However, for more complicated cases we still need AutoTVM.
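
For example, in a real template you usually let AutoTVM define the split as a tunable knob, because the best factor interacts with tiling, unrolling, and parallelization. A sketch only (the template name and knobs below are hypothetical):

import tvm
from tvm import te, autotvm

# hypothetical tunable template for a 2-D element-wise op
@autotvm.template("example/elemwise_2d")
def elemwise_2d(M, N):
    B = te.placeholder((M, N), name="B", dtype="float32")
    A = te.compute((M, N), lambda i, j: B[i, j] + 1.0, name="A")
    s = te.create_schedule(A.op)
    y, x = A.op.axis

    cfg = autotvm.get_config()
    # let AutoTVM search the split of x instead of hard-coding factor=8
    cfg.define_split("tile_x", x, num_outputs=2)
    cfg.define_knob("unroll_outer", [0, 1])

    xo, xi = cfg["tile_x"].apply(s, A, x)
    s[A].vectorize(xi)
    if cfg["unroll_outer"].val:
        s[A].unroll(xo)
    s[A].parallel(y)
    return s, [B, A]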

Can you give me an example of the complicated cases you mention?

Just to make it clear: are you saying that LLVM might generate better-performing code for s[A].vectorize(x) than for splitting x with the factor equal to the register width? Or will LLVM automatically generate the same code as the split version?

If the vectorization is impossible due to hardware constraints, in the worst case it may generate an ordinary loop (even if the loop is marked as vectorized in the TVM IR).
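
If you want to verify what actually happened, you can build the schedule and inspect the assembly. A sketch (the target string is an assumption for an AVX2 machine):

import tvm
from tvm import te

n = 32
B = te.placeholder((n,), name="B", dtype="float32")
A = te.compute((n,), lambda i: B[i] + 1.0, name="A")
s = te.create_schedule(A.op)
s[A].vectorize(A.op.axis[0])

lib = tvm.build(s, [B, A], target="llvm -mcpu=core-avx2")
# look for ymm registers / packed instructions (e.g. vaddps) in the output;
# if they are absent, LLVM fell back to a scalar loop
print(lib.get_source("asm"))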

I see! How about the parallel question I asked above? Is it a similar case?

Yes, that will make each iteration run in a worker thread.
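
For completeness, when the parallel axis is longer than the thread count, a common idiom is to split it by nparts and parallelize only the outer loop. A sketch (the shapes are made up, and 4 mirrors the 4 threads mentioned above):

import tvm
from tvm import te

M, N = 64, 32
B = te.placeholder((M, N), name="B", dtype="float32")
A = te.compute((M, N), lambda i, j: B[i, j] * 2.0, name="A")
s = te.create_schedule(A.op)
y, x = A.op.axis

# give each of the 4 worker threads a contiguous chunk of rows
yo, yi = s[A].split(y, nparts=4)
s[A].parallel(yo)

# vectorize the inner 8-wide piece of x, as discussed above
xo, xi = s[A].split(x, factor=8)
s[A].vectorize(xi)

print(tvm.lower(s, [B, A], simple_mode=True))  # the yo loop is annotated as parallel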