[SOLVED] Possible bug with tvm.scan


#1

When tvm.scan is followed by tvm.compute, tvm.build incurs Segmentation fault (core dumped). I don’t know whether it’s a bug or I failed to use it correctly.

TVM version: 0.6.dev
Python version: 3.7.3
Minimum reproducible code:

import tvm
import numpy as _np

def compute_cumsum1():
    m = tvm.var("m")
    n = tvm.var("n")
    X = tvm.placeholder((m, n), name="X")
    s_state = tvm.placeholder((m, n))
    s_init = tvm.compute((1, n), lambda _, i: X[0, i])
    s_update = tvm.compute((m, n), lambda t, i: s_state[t-1, i] + X[t, i])
    s_scan = tvm.scan(s_init, s_update, s_state, inputs=[X])
    ret = s_scan
    s = tvm.create_schedule(ret.op)
    return s, X, ret


def compute_cumsum2():
    m = tvm.var("m")
    n = tvm.var("n")
    X = tvm.placeholder((m, n), name="X")
    s_state = tvm.placeholder((m, n))
    s_init = tvm.compute((1, n), lambda _, i: X[0, i])
    s_update = tvm.compute((m, n), lambda t, i: s_state[t-1, i] + X[t, i])
    s_scan = tvm.scan(s_init, s_update, s_state, inputs=[X])
    ret = tvm.compute((m, n), lambda i, j: s_scan[(i, j)])
    s = tvm.create_schedule(ret.op)
    return s, X, ret


s, A, ret = compute_cumsum1()
fscan = tvm.build(s, [A, ret])
print("build cumsum1 successfully")
s, A, ret = compute_cumsum2()
fscan = tvm.build(s, [A, ret])
print("build cumsum2 successfully")

As can be seen, compute_cumsum1() builds fine. But compute_cumsum2() incurs Segmentation Fault.
Thanks for kind help. @reminisce @yzhliu


#2

@tqchen do you have any idea about it?


#3

It is due to that in second case, tvm uses ComputeOpNode::PropBoundToInputs, while during ScanOpNode::GatherBound, it tries to access tensor_dom[output].data[k+1], which is out-of-bound


#4

@hzfan found another bug after the fix https://github.com/dmlc/tvm/pull/3723

import tvm
import numpy as np

def replay():
  m = tvm.var("m")
  n = tvm.var("n")
  l = tvm.var("l")
  s_state = tvm.placeholder((n, m, l), dtype="int32")
  s_init = tvm.compute((1, m, l), lambda *idx: 1)
  s_update = tvm.compute((n, m, l), lambda i, j, k: s_state[i-1, j, k] + 1)
  ret = tvm.scan(s_init, s_update, s_state)
  ret = tvm.compute((n, m, l), lambda *idx: ret[idx])
  s = tvm.create_schedule(ret.op)
  return s, ret

s, ret = replay()
print(tvm.lower(s, [ret], simple_mode=True))
f = tvm.build(s, [ret])
ctx = tvm.cpu()
a = tvm.nd.array(np.zeros((4, 6, 4), dtype="int32"), ctx)
f(a)
print(a)

The (incorrect) result is,

[[[1 1 1 1]
  [1 1 1 1]
  [1 1 1 1]
  [1 1 1 1]
  [2 2 2 2]
  [2 2 2 2]]

 [[2 2 2 2]
  [2 2 2 2]
  [2 2 2 2]
  [2 2 2 2]
  [3 3 3 3]
  [3 3 3 3]]

 [[3 3 3 3]
  [3 3 3 3]
  [3 3 3 3]
  [3 3 3 3]
  [4 4 4 4]
  [4 4 4 4]]

 [[4 4 4 4]
  [4 4 4 4]
  [4 4 4 4]
  [4 4 4 4]
  [5 5 5 5]
  [5 5 5 5]]]

instead of

[[[1 1 1 1]
  [1 1 1 1]
  [1 1 1 1]
  [1 1 1 1]
  [1 1 1 1]
  [1 1 1 1]]

 [[2 2 2 2]
  [2 2 2 2]
  [2 2 2 2]
  [2 2 2 2]
  [2 2 2 2]
  [2 2 2 2]]

 [[3 3 3 3]
  [3 3 3 3]
  [3 3 3 3]
  [3 3 3 3]
  [3 3 3 3]
  [3 3 3 3]]

 [[4 4 4 4]
  [4 4 4 4]
  [4 4 4 4]
  [4 4 4 4]
  [4 4 4 4]
  [4 4 4 4]]]

Note that the stride is incorrect, (should be scan.idx*m)

// attr [scan] storage_scope = "global"
allocate scan[int32 * ((n*l)*l)]
produce scan {
  for (i1, 0, m) {
    for (i2, 0, l) {
      scan[((i1*l) + i2)] = 1
    }
  }
  for (scan.idx, 0, (n - 1)) {
    for (j, 0, m) {
      for (k, 0, l) {
        scan[(((((scan.idx + 1)*l) + j)*l) + k)] = (scan[((((scan.idx*l) + j)*l) + k)] + 1)
      }
    }
  }
}
produce compute {
  for (i0, 0, n) {
    for (i1, 0, m) {
      for (i2, 0, l) {
        compute[((((i0*m) + i1)*l) + i2)] = scan[((((i0*l) + i1)*l) + i2)]
      }
    }
  }
}

#5

fixed by https://github.com/dmlc/tvm/pull/3723