Tensorize with stride for input tensors

I’m trying to use tensorize for a matrix multiplication workload. The assumed intrinsic is a matrix multiplication over smaller input sizes. I’m tiling the matrices and the reduce axis to fit the intrinsic, specifying strides for the input matrices to handle slicing them, but I’m getting the following error. The full code can be found in this gist.

Traceback (most recent call last):

  File "gemm.py", line 106, in <module>
    print(tvm.lower(s, [A, B, C], simple_mode=True))

  File "/home/jsteward/work/tvm/python/tvm/build_module.py", line 392, in lower
    stmt = ir_pass.StorageFlatten(stmt, binds, 64, cfg.instrument_bound_checkers)

  File "tvm/_ffi/_cython/./function.pxi", line 304, in tvm._ffi._cy3.core.FunctionBase.__call__

  File "tvm/_ffi/_cython/./function.pxi", line 249, in tvm._ffi._cy3.core.FuncCall

  File "tvm/_ffi/_cython/./base.pxi", line 160, in tvm._ffi._cy3.core.CALL

tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (8) /home/jsteward/work/tvm/build/libtvm.so(tvm::NodeFunctor<tvm::Stmt (tvm::runtime::ObjectRef const&, tvm::Stmt const&, tvm::ir::IRMutator*)>::operator()(tvm::runtime::ObjectRef const&, tvm::Stmt const&, tvm::ir::IRMutator*) const+0x62) [0x7f62d4095212]
  [bt] (7) /home/jsteward/work/tvm/build/libtvm.so(+0x6642eb) [0x7f62d42dd2eb]
  [bt] (6) /home/jsteward/work/tvm/build/libtvm.so(tvm::ir::IRMutator::Mutate_(tvm::ir::For const*, tvm::Stmt const&)+0xb9) [0x7f62d42df4c9]
  [bt] (5) /home/jsteward/work/tvm/build/libtvm.so(tvm::ir::IRMutator::Mutate(tvm::Stmt)+0x5b) [0x7f62d409537b]
  [bt] (4) /home/jsteward/work/tvm/build/libtvm.so(tvm::NodeFunctor<tvm::Stmt (tvm::runtime::ObjectRef const&, tvm::Stmt const&, tvm::ir::IRMutator*)>::operator()(tvm::runtime::ObjectRef const&, tvm::Stmt const&, tvm::ir::IRMutator*) const+0x62) [0x7f62d4095212]
  [bt] (3) /home/jsteward/work/tvm/build/libtvm.so(+0x66424b) [0x7f62d42dd24b]
  [bt] (2) /home/jsteward/work/tvm/build/libtvm.so(tvm::ir::StorageFlattener::Mutate_(tvm::ir::AttrStmt const*, tvm::Stmt const&)+0x86e) [0x7f62d438d45e]
  [bt] (1) /home/jsteward/work/tvm/build/libtvm.so(tvm::ir::StorageFlattener::HandleBufferBindScope(tvm::ir::AttrStmt const*)+0xafa) [0x7f62d438a79a]
  [bt] (0) /home/jsteward/work/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x7f62d4024013]
  File "../src/pass/storage_flatten.cc", line 432
TVMError: Check failed: slice->strides.size() == 0U (2 vs. 0) : Trying to bind compact buffer to strided one strides=[512, 1]

From python/tvm/tensor_intrin.py, the decl_tensor_intrin function is using the operator’s input tensors. Is this causing the issue? Or is it that I’m not using tensorize correctly in this case? Thanks in advance!

By default, the tensor buffer declaration requires a compact buffer, which means the tensorized region needs to be contiguous. To relax the constraint, you can declare a buffer with symbolic strides when declaring the tensor intrinsic. Of course, your low-level instruction must also support strided matrices as input.
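To make the stride requirement concrete, here is a plain-Python sketch (all names hypothetical, not TVM API): when the intrinsic operates on a tile sliced out of a larger matrix, element (i, j) of the tile sits at offset `i * stride + j` in the parent buffer, where `stride` is the parent's row length rather than the tile's width — which is why the tile is not contiguous and the low-level instruction must understand strides:

```python
# Row-major 4x4 parent matrix, flattened; row length (stride) = 4.
parent = list(range(16))
stride = 4

# Slice a 2x2 tile starting at row 1, column 2 of the parent.
row0, col0 = 1, 2

def tile_elem(i, j):
    # Element (i, j) of the tile, addressed through the parent's stride.
    return parent[(row0 + i) * stride + (col0 + j)]

tile = [[tile_elem(i, j) for j in range(2)] for i in range(2)]
print(tile)  # -> [[6, 7], [10, 11]]: rows are 4 elements apart, not 2
```

If the tile were assumed compact (stride 2), the second row would wrongly read elements 8 and 9 instead of 10 and 11.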

Following the tensorize tutorial's example of splitting the reduction axis, I've declared symbolic vars for the stride parameters during buffer declaration:

strideA = tvm.var("sA")
Ab = tvm.decl_buffer(a.shape, a.dtype,
                     name="A",
                     offset_factor=1,
                     strides=[strideA, 1])
strideB = tvm.var("sB")
Bb = tvm.decl_buffer(b.shape, b.dtype,
                     name="B",
                     offset_factor=1,
                     strides=[strideB, 1])
strideC = tvm.var("sC")
Cb = tvm.decl_buffer(c.shape, c.dtype,
                     name="C",
                     offset_factor=1,
                     strides=[strideC, 1])

However, the strides for the input tensors did not make their way to the tensor intrin_func ins: the input tensors have empty stride lists. This led me to use the symbolic strideA and friends directly in intrin_func, but that did not work and threw the error above.

It’s worth noting that the outs tensors (c in the sample code) do carry the correct stride information; only the ins don’t. To verify, we can print the strides both when declaring the buffers and when using them inside intrin_func:

def intrin_func(ins, outs):
    aa, bb = ins
    cc, = outs
    print(aa.strides, bb.strides, cc.strides)
    ...
with tvm.build_config(offset_factor=1):
    print(Ab.strides, Bb.strides, Cb.strides)
    return tvm.decl_tensor_intrin(c.op, intrin_func, binds={a: Ab, b: Bb, c: Cb}, name="sp_gemm")

And we get:

[sA, 1] [sB, 1] [sC, 1]
[] [] [sC, 1]

right before the error message.

Interesting, is it possible for you to dig a bit further?

I think I found the cause. When declaring the intrinsic, I accidentally used the global input tensors instead of the locally created ones (A instead of a), as https://gist.github.com/KireinaHoro/93a934edc3e472ccc7c592b88e915a59#file-gemm-py-L37 shows. This caused python/tvm/tensor_intrin.py to fail to match the declared buffers in the binds list when creating the TensorIntrin node. Thanks for the attention!
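The failure mode can be illustrated with a toy dictionary lookup (hypothetical classes, not the real TVM objects): binds is keyed by tensor object identity, so keying it with the global placeholder means the lookup for the intrinsic's own local input finds nothing, and a default compact buffer gets created instead:

```python
class Tensor:
    """Stand-in for a TVM placeholder; object identity matters, not the name."""
    def __init__(self, name):
        self.name = name

A = Tensor("A")          # global tensor used to build the full workload
a = Tensor("a")          # local tensor created inside the intrinsic
Ab = "strided buffer"    # the buffer declared with symbolic strides

binds = {A: Ab}          # bug: keyed by the global tensor A
# decl_tensor_intrin matches buffers against the intrinsic's own inputs (a):
print(binds.get(a))      # -> None: no match, so a compact buffer is used
print({a: Ab}.get(a))    # -> strided buffer: correct with the local tensor
```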

Maybe we should add a note about this situation to the tensorize tutorial, or provide a helpful error message indicating that something like this has happened. I don't have a concrete idea for it though.


Glad you figured it out. If you have any ideas to improve the error message, feel free to send a PR.

You said, “To relax the constraint, you can declare a buffer with symbolic strides when declaring the tensor intrin.”

I understand this a little, but I do not know how to declare the buffer and set the strides.

I want to use an intrinsic to tensorize a scheduled conv2d, but I hit the same error when lowering the tensorized compute. See: Methods to fix error when using tensorize schedule