Expr Simplifier for tvm.var

kevinthesun · August 7, 2019, 9:01pm

Hi,

I got a warning while doing loop partitioning with a symbolic bound:

Cannot prove: ((((((n + 1)/2) - 1) - (((n - 4)/2) + 1)) + 1) >= 0), when generating the post doubt loop

This expression should be true. However, neither rewrite_simplify nor canonical_simplify can prove it. Did I miss something here?
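One way to sanity-check the claim outside TVM is a brute-force evaluation in plain Python. The sketch below assumes n is an integer and tries both truncated (C-style) and floor division, since the warning itself does not say which one applies. The helper names are made up for illustration and are not TVM APIs.

```python
def trunc_div(a, b):
    """Integer division rounding toward zero (C-style)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def floor_div(a, b):
    """Integer division rounding toward negative infinity."""
    return a // b


def extent(n, div):
    """The expression from the warning, parameterised by the division used."""
    return ((div(n + 1, 2) - 1) - (div(n - 4, 2) + 1)) + 1


def always_nonnegative(div, lo=-1000, hi=1000):
    """Check extent(n) >= 0 for every integer n in [lo, hi]."""
    return all(extent(n, div) >= 0 for n in range(lo, hi + 1))


if __name__ == "__main__":
    print("truncdiv:", always_nonnegative(trunc_div))
    print("floordiv:", always_nonnegative(floor_div))
```

A finite check like this is not a proof, but it does suggest the condition holds and that the simplifier is missing a rule rather than the expression being wrong.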

@tqchen

umangyadav · August 8, 2019, 12:16pm

I have also noticed that the Simplifier sometimes cannot simplify expressions involving an iteration variable,
e.g. it does not detect that i == -1 is always false.

@kevinthesun I am using SuperSimplify, as in the link below, with vrange passed to it. This won’t fix the root cause of the problem, but you can give it a try.

https://github.com/dmlc/tvm/blob/cd1375e21ee87dfdabd5adcb50d187b03011f47c/src/pass/zero_elimination.cc#L87


  return const_true();
}
}


// Create a select statement of the form cond ? on_true : 0
Expr SelectElseZero(const Expr& cond, const Expr& on_true) {
  return Select::make(cond, on_true, make_zero(on_true.type()));
}


// Simplify the expression as thoroughly as possible by using all available simplifiers.
Expr SuperSimplify(Expr e, const Map<Var, Range>& vranges = Map<Var, Range>()) {
  // For some reason no simplifier can detect that there is only one value of the variable
  std::unordered_map<const Variable*, Expr> vmap;
  for (const auto& var_range : vranges) {
    if (is_const_int(var_range.second->extent, 1)) {
      vmap[var_range.first.get()] = var_range.second->min;
    }
  }
  if (!vmap.empty()) {
    e = Substitute(e, vmap);
  }

kevinthesun · August 8, 2019, 5:30pm

Thank you for this information! I can’t find this pass in master. Has it been moved somewhere else?

kevinthesun · August 8, 2019, 6:39pm

I created a minimal example that reproduces the symbolic expression issue:

import tvm
import topi

dshape = (tvm.var("n"), 72, 96)
target = "cuda"

def compute(data):
    oshape = data.shape
    out = tvm.compute(oshape, lambda i, j, k: data[i, j, k] * 10)
    return out

def schedule(s, out):
    n, m, _ = s[out].op.axis
    bn_z, n = s[out].split(n, 32)
    bn_y, bn_x = s[out].split(n, 8)

    tm_z, m = s[out].split(m, 12)
    tm_y, tm_x = s[out].split(m, 1)

    s[out].bind(bn_z, tvm.thread_axis("blockIdx.z"))
    s[out].bind(bn_y, tvm.thread_axis("blockIdx.y"))
    s[out].bind(bn_x, tvm.thread_axis("blockIdx.x"))

    s[out].bind(tm_z, tvm.thread_axis("threadIdx.z"))
    s[out].bind(tm_y, tvm.thread_axis("threadIdx.y"))
    s[out].bind(tm_x, tvm.thread_axis("threadIdx.x"))
    return s


d = tvm.placeholder(dshape, name="data")
out = compute(d)
s = tvm.create_schedule(out.op)
s = schedule(s, out)
f = tvm.build(s, [d, out], target)

Lowered stmt printed in tvm.build:

produce compute {
  // attr [iter_var(blockIdx.z, , blockIdx.z)] thread_extent = ((n + 31)/32)
  // attr [iter_var(blockIdx.y, , blockIdx.y)] thread_extent = 4
  // attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 8
  // attr [iter_var(threadIdx.z, , threadIdx.z)] thread_extent = 6
  // attr [iter_var(threadIdx.y, , threadIdx.y)] thread_extent = 12
  // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 1
  if (((blockIdx.z < (((n - 32)/32) + 1)) && (blockIdx.z < (((n + 31)/32) - 1)))) {
    for (k, 0, 96) {
      compute[((((((blockIdx.z*221184) + (blockIdx.y*55296)) + (blockIdx.x*6912)) + (threadIdx.z*1152)) + (threadIdx.y*96)) + k)] = (data[((((((blockIdx.z*221184) + (blockIdx.y*55296)) + (blockIdx.x*6912)) + (threadIdx.z*1152)) + (threadIdx.y*96)) + k)]*10f)
    }
  } else {
    for (k, 0, 96) {
      if (((((blockIdx.z*32) + (blockIdx.y*8)) + blockIdx.x) < n)) {
        if (((((blockIdx.z*32) + (blockIdx.y*8)) + blockIdx.x) < n)) {
          compute[((((((blockIdx.z*221184) + (blockIdx.y*55296)) + (blockIdx.x*6912)) + (threadIdx.z*1152)) + (threadIdx.y*96)) + k)] = (data[((((((blockIdx.z*221184) + (blockIdx.y*55296)) + (blockIdx.x*6912)) + (threadIdx.z*1152)) + (threadIdx.y*96)) + k)]*10f)
        }
      }
    }
  }
}

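The two problems are: 1) The condition if (((blockIdx.z < (((n - 32)/32) + 1)) && (blockIdx.z < (((n + 31)/32) - 1)))) is not simplified. 2) Several if statements sit inside the for loop over k, even though their conditions do not depend on k (and the same condition is checked twice). They could be moved above the loop to reduce the number of times they are executed.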

@tqchen Do you think this can be improved by the simplifier, or by other parts of TVM?

tqchen · August 8, 2019, 8:09pm

This is likely because the simplifier was not given the bound information. Note that some of the cases depend on the division semantics; upgrading some of them to floordiv, which we have not done yet, might help in some of the cases.
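To illustrate why both the division semantics and the bound on n matter, here is a small plain-Python sketch (not TVM code; all helper names are hypothetical). It compares the first bound from the lowered guard, ((n - 32)/32) + 1, with n/32 under floor and truncated division, and checks whether the first conjunct of the guard is implied by the second when n >= 0 is assumed.

```python
def trunc_div(a, b):
    """Integer division rounding toward zero (C-style)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def floor_div(a, b):
    """Integer division rounding toward negative infinity."""
    return a // b


def mismatches(div, n_values):
    """Values of n for which ((n - 32)/32) + 1 differs from n/32."""
    return [n for n in n_values if div(n - 32, 32) + 1 != div(n, 32)]


def first_conjunct_redundant(div, max_n):
    """For 0 <= n <= max_n and every block index z in the launch grid,
    check that z < ((n + 31)/32) - 1 implies z < ((n - 32)/32) + 1."""
    for n in range(max_n + 1):
        for z in range(div(n + 31, 32)):
            second = z < div(n + 31, 32) - 1
            first = z < div(n - 32, 32) + 1
            if second and not first:
                return False
    return True


if __name__ == "__main__":
    print("floordiv mismatches:", mismatches(floor_div, range(0, 200)))
    print("truncdiv mismatches:", mismatches(trunc_div, range(0, 200)))
    print("redundant (floordiv):", first_conjunct_redundant(floor_div, 500))
    print("redundant (truncdiv):", first_conjunct_redundant(trunc_div, 500))
```

Under floor division the identity ((n - 32)/32) + 1 == n/32 holds for every n tried, while under truncated division it breaks for small positive n. That is the kind of case where a rewrite needs either floordiv or a known bound on n.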

umangyadav · August 8, 2019, 8:14pm

Yes, it is not on master, since it has not been merged. It is simply three calls to the simplifiers. You can get the implementation from the permalink I posted above.

CanonicalSimplify(Simplify(CanonicalSimplify(stmt, vrange), vrange), vrange). I am not sure whether this is any better with the new arith infrastructure.

So, as Tianqi said, you can first try passing bound info to the simplifier, and then maybe try that.

kevinthesun · August 9, 2019, 7:44pm

Do you think the second issue, regarding the position of the if statements, is also related to the expression simplifier? For more complicated symbolic expressions, such as in conv2d, this is the major performance bottleneck.

kevinthesun · August 12, 2019, 9:20pm

@tqchen Do you think we need to add a loop-invariant optimization pass to deal with this issue?

kevinthesun · August 23, 2019, 5:07pm

I prototyped a pass which detects IfThenElse statements with loop-invariant conditions and lifts them out of the loop. This resolves the performance issue for symbolic shape compilation on CUDA. I think we can add such a lightweight pass for TVM IR. We could even trigger this pass only when necessary (for example, symbolic shape compilation for the CUDA target). @tqchen What do you think?
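The idea of lifting loop-invariant conditions can be sketched on a toy IR in plain Python. This is not the prototype mentioned above and does not use TVM's IR classes. The Store, IfThenElse and For dataclasses and the hoist_invariant_ifs function are hypothetical stand-ins, and the sketch assumes conditions have no side effects and that the loop body does not modify anything a condition reads.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Union


@dataclass(frozen=True)
class Store:
    """Leaf statement; `text` is only a label in this toy IR."""
    text: str


@dataclass(frozen=True)
class IfThenElse:
    cond: str                   # printable form of the condition
    cond_vars: FrozenSet[str]   # variables the condition reads
    then_case: "Stmt"
    else_case: Optional["Stmt"] = None


@dataclass(frozen=True)
class For:
    loop_var: str
    extent: int
    body: "Stmt"


Stmt = Union[Store, IfThenElse, For]


def hoist_invariant_ifs(stmt):
    """Lift an IfThenElse out of a For when its condition ignores the loop var."""
    if isinstance(stmt, Store):
        return stmt
    if isinstance(stmt, IfThenElse):
        else_case = None
        if stmt.else_case is not None:
            else_case = hoist_invariant_ifs(stmt.else_case)
        return IfThenElse(stmt.cond, stmt.cond_vars,
                          hoist_invariant_ifs(stmt.then_case), else_case)
    # stmt is a For: simplify the body first, then try to swap loop and branch.
    body = hoist_invariant_ifs(stmt.body)
    if isinstance(body, IfThenElse) and stmt.loop_var not in body.cond_vars:
        then_loop = hoist_invariant_ifs(
            For(stmt.loop_var, stmt.extent, body.then_case))
        else_loop = None
        if body.else_case is not None:
            else_loop = hoist_invariant_ifs(
                For(stmt.loop_var, stmt.extent, body.else_case))
        return IfThenElse(body.cond, body.cond_vars, then_loop, else_loop)
    return For(stmt.loop_var, stmt.extent, body)


def example():
    """The else-branch loop from the lowered stmt, in toy form."""
    cond = "blockIdx.z*32 + blockIdx.y*8 + blockIdx.x < n"
    cond_vars = frozenset({"blockIdx.z", "blockIdx.y", "blockIdx.x", "n"})
    store = Store("compute[...] = data[...]*10f")
    return For("k", 96,
               IfThenElse(cond, cond_vars, IfThenElse(cond, cond_vars, store)))


if __name__ == "__main__":
    print(hoist_invariant_ifs(example()))
```

In this sketch both nested checks end up outside the loop over k, so each is evaluated once rather than 96 times. Merging the duplicated condition would be a separate simplification that is not attempted here.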

tqchen · August 29, 2019, 3:24am

I think it could be a useful pass indeed.

forrestm-quic · September 3, 2019, 2:09pm

We ran into this issue while working on the Hexagon backend. We solved it by adding extra rewrite rules in rewrite_simplifier.cc to factor out a common multiplier / denominator, e.g.

    TVM_TRY_RECURSIVE_REWRITE_IF((x / c1) - (y / c1) < c2, x - y < c1 * c2, c1.Eval()->value > 0);

and similar for common multipliers.
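Rules of this shape can be sanity-checked by brute force over small integers before relying on them. The sketch below is plain Python, assumes floor division (Python's //) with c1 > 0, and uses a hypothetical helper name. It tests each direction of the rewrite separately.

```python
from itertools import product


def check_rule(values, c1_values, c2_values):
    """Compare (x/c1) - (y/c1) < c2 with x - y < c1*c2 on small integers.

    Returns (lhs_implies_rhs, counterexample), where counterexample is the
    first (x, y, c1, c2) with the right side true and the left side false,
    or None if no such case was found.
    """
    lhs_implies_rhs = True
    counterexample = None
    for x, y, c1, c2 in product(values, values, c1_values, c2_values):
        lhs = (x // c1) - (y // c1) < c2
        rhs = x - y < c1 * c2
        if lhs and not rhs:
            lhs_implies_rhs = False
        if rhs and not lhs and counterexample is None:
            counterexample = (x, y, c1, c2)
    return lhs_implies_rhs, counterexample


if __name__ == "__main__":
    ok, cex = check_rule(range(0, 16), range(1, 5), range(-3, 4))
    print("left implies right on all samples:", ok)
    print("right true but left false at (x, y, c1, c2):", cex)
```

In this check the left side always implies the right side, but not the other way around. For example, with x = 4, y = 3, c1 = 2, c2 = 1 the left side is 2 - 1 < 1 (false) while the right side is 1 < 2 (true). Whether such a rule is safe therefore likely depends on how the simplifier uses the rewritten condition.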