TVM doesn't vectorize when the axis length is not divisible by the split factor

As the title says. A simple example:

import tvm

def schedule(A, B):
    s = tvm.create_schedule(B.op)
    x, = s[B].op.axis
    xo, xi = s[B].split(x, factor=8)
    xoo, xoi = s[B].split(xo, factor=2)
    s[B].vectorize(xi)

    return s

def test():
    A = tvm.placeholder((128,), name="A")
    B = tvm.compute((128,), lambda i: A[i] + 1, name="B")

    device = "llvm -mcpu=core-avx2"
    ctx = tvm.context(device, 0)
    with tvm.target.create(device):
        s = schedule(A, B)

    print(tvm.lower(s, [A, B], simple_mode=True))
    func = tvm.build(s, [A, B], device, name="test")

if __name__ == "__main__":
    test()

When the split factor of xo is 2, the IR code looks like:

produce B {
  for (i.outer.outer, 0, 8) {
    for (i.outer.inner, 0, 2) {
      B[ramp(((i.outer.outer*16) + (i.outer.inner*8)), 1, 8)] = (A[ramp(((i.outer.outer*16) + (i.outer.inner*8)), 1, 8)] + x8(1f))
    }
  }
}

which shows that the axis xi is vectorized. However, if we change the xo split factor to 3, we get:

produce B {
  for (i.outer.outer, 0, 6) {
    for (i.outer.inner, 0, 3) {
      for (i.inner.s, 0, 8) {
        if (likely(((((i.outer.outer*24) + (i.outer.inner*8)) + i.inner.s) < 128))) {
          if (likely((((i.outer.outer*3) + i.outer.inner) < 16))) {
            if (likely(((((i.outer.outer*24) + (i.outer.inner*8)) + i.inner.s) < 128))) {
              B[(((i.outer.outer*24) + (i.outer.inner*8)) + i.inner.s)] = (A[(((i.outer.outer*24) + (i.outer.inner*8)) + i.inner.s)] + 1f)
            }
          }
        }
      }
    }
  }
}

and we see the axis xi is not vectorized anymore.
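To make the arithmetic behind the two outputs explicit, here is a small plain-Python sketch (no TVM needed). The helper names are made up for illustration, and it only assumes that a split rounds the outer extent up, which matches the loop bounds printed above.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return (a + b - 1) // b


def split_extents(extent, inner_factor, outer_factor):
    """Return the (xoo, xoi, xi) loop extents for the two splits in the
    example, plus whether the loop nest covers more than `extent` indices."""
    xo = ceil_div(extent, inner_factor)   # extent of xo after split(x, factor=8)
    xoo = ceil_div(xo, outer_factor)      # extent of xoo after split(xo, factor=f)
    covered = xoo * outer_factor * inner_factor
    needs_guard = covered != extent       # extra iterations must be guarded
    return (xoo, outer_factor, inner_factor), needs_guard


print(split_extents(128, 8, 2))  # factor 2: nest covers exactly 128 indices
print(split_extents(128, 8, 3))  # factor 3: nest covers 144 indices
```

With factor 3 the nest iterates over 6 * 3 * 8 = 144 indices for a 128-element axis, which is where the bound checks come from. With factor 2 it covers exactly 128 and no check is needed.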

Is this behavior by design? What if we really want to use a factor like 3 in this case and still get correctly vectorized code?

Thanks in advance!

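The cause is the likely(...) condition. When the vectorizer reaches an if statement, it checks condition.dtype().is_vector(). If the condition is a vector, which is the case here because the bound check depends on the vectorized axis, the statement is passed to Scalarize, and that turns the loop back into a Serial for loop.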

I see. Can it be avoided?

Hmm… I think it is a little complicated. It should be the same issue as this one: https://stackoverflow.com/questions/15372885/if-statements-with-comparison-sse-in-c We would have to handle it very carefully, and we may need a masked store.
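To illustrate what a masked store would mean here, below is a rough plain-Python emulation. It is not TVM code and the function names are hypothetical; the lists simply stand in for vector lanes. The point is that a per-lane mask can replace the scalar if and still write exactly the same elements.

```python
def scalar_guarded(a, extent, outer, lanes):
    """Reference: serial loop with a bound check on every element."""
    b = [0.0] * extent
    for o in range(outer):
        for i in range(lanes):
            idx = o * lanes + i
            if idx < extent:
                b[idx] = a[idx] + 1.0
    return b


def masked_vector(a, extent, outer, lanes):
    """Same computation, one 'vector' of `lanes` elements per iteration.
    A per-lane mask takes the place of the scalar if statement."""
    b = [0.0] * extent
    for o in range(outer):
        idx = [o * lanes + i for i in range(lanes)]   # like ramp(o*lanes, 1, lanes)
        mask = [j < extent for j in idx]              # vector comparison
        # masked load + add: inactive lanes never touch memory
        vals = [a[j] + 1.0 if m else 0.0 for j, m in zip(idx, mask)]
        # masked store: only active lanes are written back
        for j, v, m in zip(idx, vals, mask):
            if m:
                b[j] = v
    return b


a = [float(i) for i in range(20)]
# 3 iterations of 8 lanes cover 24 indices for a 20-element buffer
print(masked_vector(a, 20, 3, 8) == scalar_guarded(a, 20, 3, 8))
```

On real hardware the masked load and store would map to something like the AVX2 maskload/maskstore instructions, so code generation would need to support them. That is the part that needs care.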

I would be curious to see how this can be done with TVM. Let me know if I can help in any way.

Maybe you could take inspiration from Halide: https://github.com/halide/Halide/blob/master/src/VectorizeLoops.cpp#L753 The old Halide implementation, like TVM, just scalarized. However, I see that they have improved it recently.
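One common way to avoid scalarizing everything is to test whether the guard holds for all lanes, take the vector path when it does, and fall back to scalar code only for the partial tail. I have not verified that this is exactly what the linked Halide revision does, so treat the plain-Python sketch below as an illustration of the idea with made-up names, not as Halide's or TVM's implementation.

```python
def add_one_fast_path(a, extent, lanes):
    """Compute b[i] = a[i] + 1, using an unguarded 'vector' step whenever a
    whole group of `lanes` elements is in bounds, and a guarded scalar loop
    otherwise. Also returns how many iterations took each path."""
    b = [0.0] * extent
    outer = (extent + lanes - 1) // lanes   # outer extent, rounded up
    vector_iters = 0
    scalar_iters = 0
    for o in range(outer):
        base = o * lanes
        if base + lanes <= extent:
            # Guard is true for every lane: one unguarded vector operation.
            b[base:base + lanes] = [v + 1.0 for v in a[base:base + lanes]]
            vector_iters += 1
        else:
            # Guard is false for at least one lane: scalarized fallback.
            for i in range(lanes):
                idx = base + i
                if idx < extent:
                    b[idx] = a[idx] + 1.0
            scalar_iters += 1
    return b, vector_iters, scalar_iters


a = [float(i) for i in range(20)]
b, vec, ser = add_one_fast_path(a, 20, 8)
print(vec, ser)
```

For a 20-element axis with 8 lanes, only the last iteration goes through the scalar fallback. The rest stay vectorized, which is the behavior the original question is after.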