[Codegen] codegen with minimum if condition


#1

If condition often slows things down, and it often occurs when split / tile factors do not divide the shape. For example, split an axis of length 1023 by 32 yields a if condition, while split an axis of length 1024 by 32 does not. Note that split an axis of length tvm.var('m') also yields a if condition, as 32 may not divide m .

What if we specify the shape as (32 * tvm.var('M'), 32 * tvm.var('N')) to give more hint to the compiler? I tried it out in a matrix multiplication, but unfortunately the if condition is not eliminated.

import tvm
import numpy
import timeit

M = tvm.const(32) * tvm.var('M')
K = tvm.const(32) * tvm.var('K')
N = tvm.const(32) * tvm.var('N')
M_num = 1024
K_num = 1024
N_num = 1024

# The default tensor type in tvm
dtype = "float32"

target = 'llvm'
ctx = tvm.context(target, 0)

# Random generated tensor for testing
a = tvm.nd.array(numpy.random.rand(M_num, K_num).astype(dtype), ctx)
b = tvm.nd.array(numpy.random.rand(K_num, N_num).astype(dtype), ctx)

# Algorithm
k = tvm.reduce_axis((0, K), 'k')
A = tvm.placeholder((M, K), name='A')
B = tvm.placeholder((K, N), name='B')
C = tvm.compute(
           (M, N),
           lambda x, y: tvm.sum(A[x, k] * B[k, y], axis=k),
           name='C')

bn = 32
s = tvm.create_schedule(C.op)

# Blocking by loop tiling
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
k, = s[C].op.reduce_axis
ko, ki = s[C].split(k, factor=4)

# Hoist reduction domain outside the blocking loop
s[C].reorder(xo, yo, ko, ki, xi, yi)

print(tvm.lower(s, [A, B, C], simple_mode=True))

The code generated:

produce C {
  for (x.outer, 0, (((M*32) + 31)/32)) {
    for (y.outer, 0, (((N*32) + 31)/32)) {
      for (x.inner.init, 0, 32) {
        for (y.inner.init, 0, 32) {
          if (likely((((x.outer*32) + x.inner.init) < (M*32)))) {
            if (likely((((y.outer*32) + y.inner.init) < (N*32)))) {
              C[(((y.outer*32) + (((x.outer*32) + x.inner.init)*(N*32))) + y.inner.init)] = 0f
            }
          }
        }
      }
      for (k.outer, 0, (((K*32) + 3)/4)) {
        for (k.inner, 0, 4) {
          for (x.inner, 0, 32) {
            for (y.inner, 0, 32) {
              if (likely((((x.outer*32) + x.inner) < (M*32)))) {
                if (likely((((y.outer*32) + y.inner) < (N*32)))) {
                  if (likely((((k.outer*4) + k.inner) < (K*32)))) {
                    if (likely((((x.outer*32) + x.inner) < (M*32)))) {
                      if (likely((((y.outer*32) + y.inner) < (N*32)))) {
                        if (likely((((k.outer*4) + k.inner) < (K*32)))) {
                          C[(((y.outer*32) + (((x.outer*32) + x.inner)*(N*32))) + y.inner)] = (C[(((y.outer*32) + (((x.outer*32) + x.inner)*(N*32))) + y.inner)] + (A[(((k.outer*4) + (((x.outer*32) + x.inner)*(K*32))) + k.inner)]*B[(((y.outer*32) + (((k.outer*4) + k.inner)*(N*32))) + y.inner)]))
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

As can be seen, 32 | 32 * m is not recognized and unnecessary if conditions are generated.

Is there any better ways to avoid the if condition and speed ops up? Note that I want to deal with inputs with general shapes (or at least a family of shapes, like the multiples of 32), instead of specific shapes like (1024, 1024).


#2

Thank @yzhliu for guidance.


#3

@hzfan would you mind to share how you resolved this issue?
I am thinking this could be related to Expr Simplifier for tvm.var.

Thanks.


#4

@comaniac actually the issue has not been resolved yet. I agree that the root cause of this issue is similar to the Expr Simplifier for tvm.var


#5