The following is a matrix multiplication where I only care about the diagonal band of the result. I have two questions:
1- Any suggestions on how to optimize this code for GPU?
2- The formula in `tvm.compute` reads locations that are outside the memory allocated for
Y. Surprisingly, the code runs fine and doesn't crash; it just writes garbage values into certain locations of the result tensor
Z, which is what I expect for the out-of-band positions. Can I rely on this never segfaulting in the future, or is there a better way to handle the boundary?
```python
import tvm

W = 128
n = tvm.convert(16304)
m = tvm.convert(64)
c = tvm.convert(2 * W + 1)

X = tvm.placeholder((n, m), name='X')
Y = tvm.placeholder((n, m), name='Y')
k = tvm.reduce_axis((0, m), name='k')
Z = tvm.compute((n, c),
                lambda i, j: tvm.sum(X[i, k] * Y[i + j - W, k], axis=k),
                name='Z')

s = tvm.create_schedule(Z.op)
i, j = Z.op.axis  # bind each IterVar individually; binding the whole axis list fails
s[Z].bind(i, tvm.thread_axis("blockIdx.x"))  # not great binding
s[Z].bind(j, tvm.thread_axis("blockIdx.y"))  # not great binding
fmm = tvm.build(s, [X, Y, Z], 'cuda', target_host='llvm', name='fmm')
```
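For reference, here is a minimal pure-Python sketch of what I believe the intended semantics are: out-of-band positions (where `i + j - W` falls outside `[0, n)`) are defined as 0 instead of reading out of bounds. The function name `banded_matmul` and the toy sizes are my own; on the TVM side the same guard could presumably be expressed inside the compute rule with something like `tvm.if_then_else`, rather than relying on out-of-bounds reads not crashing.

```python
def banded_matmul(X, Y, W):
    """Reference: Z[i, j] = sum_k X[i, k] * Y[i + j - W, k],
    with rows of Y outside [0, n) treated as zero."""
    n, m = len(X), len(X[0])
    c = 2 * W + 1
    Z = [[0.0] * c for _ in range(n)]
    for i in range(n):
        for j in range(c):
            r = i + j - W          # row of Y this band entry reads
            if 0 <= r < n:         # boundary guard: skip out-of-band rows
                Z[i][j] = sum(X[i][k] * Y[r][k] for k in range(m))
    return Z

# Tiny example: n = 2, m = 2, W = 1, so c = 3 and the first/last band
# entries of the first/last rows are out of band and stay 0.
Z = banded_matmul([[1, 2], [3, 4]], [[1, 0], [0, 1]], 1)
```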