The following is a matrix multiplication where I only care about the diagonal band of the result. I have two questions:
1- Any suggestions on how to optimize this code for GPU?
2- The formula in `tvm.compute` reads locations that are outside the memory allocated for
Y. Surprisingly, the code runs fine and doesn't crash; it just writes garbage values into certain locations of the result tensor
Z, which is what I expect for the out-of-band positions. Can I rely on this never segfaulting in the future, or is there a better way to handle the boundary?
```python
import tvm

W = 128
n = tvm.convert(16304)
m = tvm.convert(64)
c = tvm.convert(2 * W + 1)

X = tvm.placeholder((n, m), name='X')
Y = tvm.placeholder((n, m), name='Y')
k = tvm.reduce_axis((0, m), name='k')
Z = tvm.compute((n, c),
                lambda i, j: tvm.sum(X[i, k] * Y[i + j - W, k], axis=k),
                name='Z')

s = tvm.create_schedule(Z.op)
i, j = Z.op.axis  # bind each IterVar individually; binding the whole axis list fails
s[Z].bind(i, tvm.thread_axis("blockIdx.x"))  # not great binding
s[Z].bind(j, tvm.thread_axis("blockIdx.y"))  # not great binding
fmm = tvm.build(s, [X, Y, Z], 'cuda', target_host='llvm', name='fmm')
```
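For reference, here is a minimal pure-Python sketch of what I believe the intended semantics are: out-of-band positions (where `i + j - W` falls outside `[0, n)`) are defined as 0 instead of reading out of bounds. The function name `banded_matmul` and the toy sizes are my own; on the TVM side the same guard could presumably be expressed inside the compute rule with something like `tvm.if_then_else`, rather than relying on out-of-bounds reads not crashing.

```python
def banded_matmul(X, Y, W):
    """Reference: Z[i, j] = sum_k X[i, k] * Y[i + j - W, k],
    with rows of Y outside [0, n) treated as zero."""
    n, m = len(X), len(X[0])
    c = 2 * W + 1
    Z = [[0.0] * c for _ in range(n)]
    for i in range(n):
        for j in range(c):
            r = i + j - W          # row of Y this band entry reads
            if 0 <= r < n:         # boundary guard: skip out-of-band rows
                Z[i][j] = sum(X[i][k] * Y[r][k] for k in range(m))
    return Z

# Tiny example: n = 2, m = 2, W = 1, so c = 3 and the first/last band
# entries of the first/last rows are out of band and stay 0.
Z = banded_matmul([[1, 2], [3, 4]], [[1, 0], [0, 1]], 1)
```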