 GPU optimization and boundary conditions

#1

The following is a matrix multiplication where I only care about the diagonal band of the result. I have two questions:

1- Any suggestions on how to optimize this code for the GPU?
2- The formula in tvm.compute accesses locations outside the memory allocated for X and Y, and surprisingly the code runs fine and doesn't crash (it just writes garbage values into certain locations of the result tensor Z). Is this expected? Can I rely on this code not segfaulting in the future? Is there a better solution?

import tvm
W = 128
n = tvm.convert(16304)
m = tvm.convert(64)
c = tvm.convert(2 * W + 1)
X = tvm.placeholder((n, m), name='X')
Y = tvm.placeholder((n, m), name='Y')
k = tvm.reduce_axis((0, m), name='k')
Z = tvm.compute((n, c), lambda i, j: tvm.sum(X[i, k] * Y[i + j - W, k], axis=k), name='Z')
s = tvm.create_schedule(Z.op)
s[Z].bind(Z.op.axis[0], tvm.thread_axis("blockIdx.x"))  # not a great binding
s[Z].bind(Z.op.axis[1], tvm.thread_axis("blockIdx.y"))  # not a great binding
fmm = tvm.build(s, [X, Y, Z], 'cuda', target_host='llvm', name='fmm')

#2

You can use AutoTVM to find an optimal schedule. See https://docs.tvm.ai/tutorials/autotvm/tune_simple_template.html#
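For example, a minimal template along the lines of that tutorial, applied to the banded matmul above, might look roughly like this (the split and the thread bindings here are just placeholders for the tuner to explore, not the tutorial's exact code):

import tvm
from tvm import autotvm

# A hypothetical AutoTVM template for the banded matmul from #1.
# The split factor on the i axis is left for the tuner to choose.
@autotvm.template
def banded_matmul(n, m, W):
    c = 2 * W + 1
    X = tvm.placeholder((n, m), name='X')
    Y = tvm.placeholder((n, m), name='Y')
    k = tvm.reduce_axis((0, m), name='k')
    Z = tvm.compute((n, c),
                    lambda i, j: tvm.sum(X[i, k] * Y[i + j - W, k], axis=k),
                    name='Z')
    s = tvm.create_schedule(Z.op)

    cfg = autotvm.get_config()
    i, j = s[Z].op.axis
    cfg.define_split('tile_i', i, num_outputs=2)  # tuner picks the factor
    io, ii = cfg['tile_i'].apply(s, Z, i)
    s[Z].bind(io, tvm.thread_axis('blockIdx.x'))
    s[Z].bind(ii, tvm.thread_axis('threadIdx.x'))
    return s, [X, Y, Z]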

#3

I didn't know this existed. Very nice. Thanks.

Any idea about the second question?

#4

Generally you shouldn't rely on undefined behavior. But you can inspect the generated CUDA code to see if it is what you want.
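For example, using the fmm module built in #1:

# Print the generated CUDA kernel to check which loads are emitted
# for the out-of-bounds accesses.
dev_module = fmm.imported_modules[0]
print(dev_module.get_source())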

#5

Regarding the second question:
In my experience, it crashes most of the time with the CPU target, and it's unpredictable with the CUDA target.
So I'm also wondering how TVM handles these boundary conditions.

#6

Can you give an example of what kind of boundary condition you need?

#7

I used tvm.if_then_else as in the example here to avoid accessing invalid memory.
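Applied to the compute in #1, the guard might look roughly like this (a sketch; entries of Z whose band index falls outside [0, n) just accumulate zeros):

# Guard the Y index so no out-of-bounds load is ever issued.
Z = tvm.compute(
    (n, c),
    lambda i, j: tvm.sum(
        tvm.if_then_else(
            tvm.all(i + j - W >= 0, i + j - W < n),
            X[i, k] * Y[i + j - W, k],
            tvm.const(0.0, X.dtype)),
        axis=k),
    name='Z')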

#8

Below is a Gaussian blur example. Both the input and output shapes need to be (1080, 1920) and the kernel shape is (3, 3); the memory accesses into A go out of bounds and cause an invalid-memory error.

import tvm

dtype = 'float32'  # dtype was not defined in the snippet; float32 assumed
M = 1080
N = 1920
KH = 3
KW = 3
A = tvm.placeholder((M, N), name='A', dtype=dtype)
K = tvm.placeholder((KH, KW), name='K', dtype=dtype)

ry = tvm.reduce_axis((0, KH), name='ry')
rx = tvm.reduce_axis((0, KW), name='rx')

C = tvm.compute((M, N), lambda i, j: tvm.sum(A[i + ry - 1, j + rx - 1] * K[ry, rx], axis=[ry, rx]), name='C')

If I want to avoid going out of bounds, I can only shrink the shape of C to (1080-2, 1920-2) like this:

C = tvm.compute((M - 2, N - 2), lambda i, j: tvm.sum(A[i + ry, j + rx] * K[ry, rx], axis=[ry, rx]), name='C')

And at the end do an if_then_else transformation to embed C from shape (1080-2, 1920-2) back into (1080, 1920); this always causes a performance drop on the CPU.
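That transformation would look roughly like this (a sketch; the border value is assumed to be zero):

# Embed the valid (M-2, N-2) result back into the full (M, N) shape,
# filling the one-pixel border with zeros via if_then_else.
C_full = tvm.compute(
    (M, N),
    lambda i, j: tvm.if_then_else(
        tvm.all(i >= 1, i < M - 1, j >= 1, j < N - 1),
        C[i - 1, j - 1],
        tvm.const(0, dtype)),
    name='C_full')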

Another way is to do the if_then_else in the compute for C inside tvm.sum (if_then_else can't be used outside the tvm.sum); this method also causes a performance drop on the CPU.
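That guarded variant looks roughly like this (a sketch; out-of-range taps contribute zero):

# Clamp out-of-range taps to zero inside the reduction so every
# access into A stays in bounds; the output keeps the full (M, N) shape.
C = tvm.compute(
    (M, N),
    lambda i, j: tvm.sum(
        tvm.if_then_else(
            tvm.all(i + ry - 1 >= 0, i + ry - 1 < M,
                    j + rx - 1 >= 0, j + rx - 1 < N),
            A[i + ry - 1, j + rx - 1] * K[ry, rx],
            tvm.const(0, dtype)),
        axis=[ry, rx]),
    name='C')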

#9

It will cause an illegal access (even if it doesn't crash).
Doing if_then_else inside tvm.sum is actually what we are doing in conv2d.

#10

I can only find one place, here.

The other conv2d implementations actually do the padding first, as it's necessary; the pad operation also has a performance cost if the shape is large.
@vinx13
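For reference, a pad-first version of the blur from #8 might look like this (a sketch, assuming topi is available and a zero pad value):

import tvm
import topi

dtype = 'float32'
M, N, KH, KW = 1080, 1920, 3, 3
A = tvm.placeholder((M, N), name='A', dtype=dtype)
K = tvm.placeholder((KH, KW), name='K', dtype=dtype)

# Pad A by one pixel on each side so every tap of the 3x3 kernel
# stays in bounds; the inner compute then needs no per-element check.
Apad = topi.nn.pad(A, [1, 1], [1, 1], pad_value=0.0, name='Apad')

ry = tvm.reduce_axis((0, KH), name='ry')
rx = tvm.reduce_axis((0, KW), name='rx')
C = tvm.compute(
    (M, N),
    lambda i, j: tvm.sum(Apad[i + ry, j + rx] * K[ry, rx],
                         axis=[ry, rx]),
    name='C')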