In some cases local buffer is allocated in global memory as passed to the kernel as an argument.
When we use local buffer, the original IR is like
produce foo {
// attr [tmp] storage_scope = "local"
allocate total[float32 * 1]
...
}
After storage rewrite pass, the IR becomes
// attr [tmp] storage_scope = "local"
allocate total[float32 * 1]
produce foo {
...
}
This change is because the attach scope of the local buffer is nullptr if it is the first statement inside produce.
This bug only happens in ir builder or hybrid script.
In IR builder we have the workaround that we declare thread scope before allocation:
ib.scope_attr(tx, "thread_extent", nthread_tx)
tmp = ib.allocate(...)
But in hybrid we still have this issue.
Here is a reproducible hybrid script:
import tvm
@tvm.hybrid.script
def foo(a):
c = output_tensor((a.shape[0],), a.dtype)
total = allocate((1,), a.dtype, 'local')
len_i = a.shape[0]
len_j = a.shape[1]
for i in bind('threadIdx.x', len_i):
total[0] = 0.
for k in const_range(len_j):
total[0] += a[i, k]
c[i] = total[0]
return c
a_tvm = tvm.placeholder((8,4), 'float32')
c_tvm = foo(a_tvm)
s = tvm.create_schedule(c_tvm.op)
print(tvm.lower(s, [a_tvm, c_tvm], simple_mode=True))