GPU optimization and boundary conditions


The following is a matrix multiplication where I only care about the diagonal band of the result. I have two questions:

1- Any suggestions how to optimize this code for GPU?
2- The formula in tvm.compute accesses locations that are outside the memory allocated for X and Y, and surprisingly, the code works fine and doesn’t crash (it just writes some garbage values in certain locations in the result tensor Z which is expected. Is this expected? Can I rely on this that the code won’t give segfault in the future? is there a better solution?

import tvm
W = 128
n = tvm.convert(16304)
m = tvm.convert(64)
c = tvm.convert(2 * W + 1)
X = tvm.placeholder((n, m), name='X')
Y = tvm.placeholder((n, m), name='Y')
k = tvm.reduce_axis((0, m), name='k')
Z = tvm.compute((n, c), lambda i, j :  tvm.sum(X[i, k] * Y[ i + j - W, k], axis=k), name='Z')
s = tvm.create_schedule(Z.op)
s[Z].bind(Z.op.axis[0], tvm.thread_axis("blockIdx.x"))  # not great binding
s[Z].bind(Z.op.axis[1], tvm.thread_axis("blockIdx.y"))  # not great binding
fmm =, [X, Y, Z], 'cuda', target_host='llvm', name='fmm')


You can use AutoTVM find the optimal schedule. See


I didn’t know this exists. Very nice. Thanks.

Any idea about the second question?


Generally you shouldn’t rely on undefined behavior. But you can inspect the generated CUDA code to see if it is what you want


The second question,:
In my practice, it will crash in most time if using target cpu, and it’s unpredictable using target cuda.
So I’m also wondering the interpretation about how tvm handle the boundary conditions?


Can you give an example what kind of boundary condition you need?


I used tvm.if_then_else as in the example here to avoid accessing invalid memory.


Below is a gauss blur example, both input and output shape need to be (1080,1920), template shape is (3,3), the memory access of A will over the boundary and cause invalid memory error.

M = 1080
N = 1920
KH = 3
KW = 3
A = tvm.placeholder((M, N), name='A', dtype=dtype)
K = tvm.placeholder((KH, KW), name='K', dtype=dtype)

ry = tvm.reduce_axis((0, KH), name='ry')
rx = tvm.reduce_axis((0, KW), name='rx')

C = tvm.compute((M, N), lambda i, j: tvm.sum((A[i + ry - 1, j + rx - 1] *K[ry, rx]), axis=[ry, rx]), name='C')

If i want to avoid over boundary, i can only change the C shape to (1080-2,1920-2) like this:

C = tvm.compute((M-2, N-2), lambda i, j: tvm.sum((A[i + ry, j + rx] *K[ry, rx]), axis=[ry, rx]), name=‘C’)

And at last do a if_then_else transformation for C from shape (1080-2,1920-2) to (1080,1920),this always cause performance down on CPU.

Another way is do if_else_then in compute C within tvm.sum(if_then_else can’t be used outside the tvm.sum), this method also cause performance down on CPU.


It will cause illegal access (even it doesn’t crash)
Doing if_else_then inside tvm.sum is actually what we are doing in conv2d.


I can only find one place at here

The other conv2d actually do pad first as it’s necessary, pad operation also have performance loss if shape is large.