Vectorization when the axis length is not divisible by the split factor

Hello! I found that TVM doesn’t vectorize a loop when the axis length is not divisible by the split factor. I discussed it in the following post with @FrozenGene.

This problem can sometimes hurt runtime performance badly, and it leaves users with very few choices for, say, the smallest block size in GEMM, because only a limited number of block sizes result in vectorized code after code generation.
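To make the issue concrete, here is a minimal plain-Python sketch of the loop shape a split produces when the factor does not divide the axis length. It is not TVM output, and the function name `split_with_guard` is made up for illustration. The point is only that the bounds check ends up inside the inner loop, which is the loop one would want to vectorize.

```python
def split_with_guard(n, factor):
    """Walk indices 0..n-1 through an (outer, inner) split with a bounds guard.

    Returns the visited indices and the number of inner iterations that the
    guard had to skip. Hypothetical helper, for illustration only.
    """
    visited = []
    skipped = 0
    n_outer = (n + factor - 1) // factor  # ceil(n / factor)
    for outer in range(n_outer):
        for inner in range(factor):
            i = outer * factor + inner
            # The guard only matters in the last outer iteration, but it sits
            # in the body of the inner loop and so gets in the way of turning
            # that loop into straight vector instructions.
            if i < n:
                visited.append(i)
            else:
                skipped += 1
    return visited, skipped


if __name__ == "__main__":
    indices, skipped = split_with_guard(10, 4)
    print(indices, skipped)
```

With an axis of length 10 and a factor of 4, the split covers 12 slots, so two inner iterations in the last chunk are masked out by the guard.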

Here’s an article discussing GEMM optimization on an AVX2 machine: https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0. It mentions that the best smallest block size for GEMM implemented with instruction sets like AVX2 is sometimes somewhat “uncommon”, e.g. 2x5, 3x4, etc.

Does TVM have any plans for a feature that handles this case? Any advice is appreciated!

@tqchen

cc @Hzfengsy @merrymercy

This is something that we eventually want to handle, by splitting the loop along a certain direction, so that most of the main body can be vectorized.
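One way to read this idea is as a "main body plus tail" split. The plain-Python sketch below is only an assumption about the general shape, not TVM's implementation, and `add_main_plus_tail` is a hypothetical name. The main loop covers whole chunks with no bounds check, and a short scalar loop handles the leftover elements.

```python
def add_main_plus_tail(a, b, factor):
    """Elementwise add, split into an unguarded main body and a scalar tail.

    Returns the result and the index where the tail starts. Hypothetical
    helper, for illustration only.
    """
    n = len(a)
    out = [0] * n
    main_end = (n // factor) * factor  # largest multiple of factor <= n

    # Main body: every chunk is exactly `factor` wide, so no guard is needed.
    # Each chunk stands in for one vector instruction.
    for start in range(0, main_end, factor):
        stop = start + factor
        out[start:stop] = [x + y for x, y in zip(a[start:stop], b[start:stop])]

    # Tail: fewer than `factor` elements remain, handled one at a time.
    for i in range(main_end, n):
        out[i] = a[i] + b[i]

    return out, main_end


if __name__ == "__main__":
    result, tail_start = add_main_plus_tail(list(range(10)), [10] * 10, 4)
    print(result, tail_start)
```

For an axis of length 10 and a factor of 4, eight elements go through the unguarded main body and only the last two fall into the scalar tail.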

Thank you for your reply! Would this fix involve a lot of changes? In the above post, @FrozenGene suggested a solution in Halide with a similar idea:

Do you think it fits with what TVM currently has?