[ERROR] FP16 CUDA compilation error

I agree, and I have the same question as @Hzfengsy asked: why do we need NVRTC instead of NVCC? NVRTC reduces compilation overhead, but here we are targeting better performance.

@vinx13 @Hzfengsy @xyzhou I guess a cleaner solution is to remove the +, *, <= overrides and add cross-thread-reduction support in codegen, which would then work with NVRTC as well, right?

I’m not sure about the cross-thread reduction. But it is fine as long as we can run fp16 ResNet with both NVCC and NVRTC.

Would adding a cast in codegen (i.e. (const __half) xx), instead of the volatile operator overloads, work?

The code compiles, but the output is wrong. This happens with fp16 and s.rfactor, and it is very easy to reproduce following the example in this tutorial: https://docs.tvm.ai/tutorials/language/reduction.html#sphx-glr-tutorials-language-reduction-py. The code I am using is master with the volatile overrides uncommented here: https://github.com/apache/incubator-tvm/pull/4331/files#diff-cb015f72761c78798065a2cd9ef96602R68-R73

This code snippet should reproduce the bug:

import tvm
from tvm.contrib import nvcc
import numpy as np
@tvm.register_func
def tvm_callback_cuda_compile(code):
    # use nvcc because nvrtc fails to compile the generated fp16 kernel
    ptx = nvcc.compile_cuda(code, target="ptx", arch='sm_75')
    return ptx

n = tvm.var("n")
m = tvm.var("m")
A = tvm.placeholder((n, m), name='A', dtype='float16')  # float16 instead of float32
k = tvm.reduce_axis((0, m), "k")
B = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="B")

s = tvm.create_schedule(B.op)
ko, ki = s[B].split(B.op.reduce_axis[0], factor=8)
BF = s.rfactor(B, ki)  # <<== other schedules that don't use `s.rfactor` work
print(tvm.lower(s, [A, B], simple_mode=True))
print(s[B].op.body)

tx = tvm.thread_axis("threadIdx.x")
s[B].bind(s[B].op.reduce_axis[0], tx)
s[BF].compute_at(s[B], s[B].op.reduce_axis[0])
s[B].set_store_predicate(tx.var.equal(0))

fcuda = tvm.build(s, [A, B], "cuda")
print(fcuda.imported_modules[0].get_source())
nn = 16
ctx = tvm.gpu(0)
a = tvm.nd.array(np.random.uniform(size=(nn, nn)).astype(A.dtype), ctx)
b = tvm.nd.array(np.zeros(nn, dtype=B.dtype), ctx)
fcuda(a, b)
tvm.testing.assert_allclose(b.asnumpy(), np.sum(a.asnumpy(), axis=1), rtol=1e-2)

# output: 
> Mismatch: 100%
> Max absolute difference: 4.74
> Max relative difference: 0.5747
>  x: array([3.354, 3.623, 4.797, 4.195, 3.457, 4.008, 3.768, 4.277, 3.785,
>        4.727, 5.004, 5.484, 3.248, 3.504, 4.336, 5.746], dtype=float16)
>  y: array([7.84 , 6.816, 8.48 , 7.344, 6.07 , 8.4  , 7.055, 7.035, 8.03 ,
>        9.36 , 9.09 , 9.51 , 7.387, 8.24 , 6.81 , 9.47 ], dtype=float16)

Any idea what could be wrong, @vinx13 @Hzfengsy @xyzhou @yzhliu? Thanks

Is it because some volatile qualifiers or syncs are missing?
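One quick way to check is to search the generated kernel source for those keywords; a minimal sketch, reusing the fcuda module built in the snippet above:

cuda_src = fcuda.imported_modules[0].get_source()
for keyword in ("volatile", "__syncthreads"):
    # print whether each qualifier / barrier shows up in the generated CUDA code
    print(keyword, "found" if keyword in cuda_src else "missing")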

I don’t know. How do I debug this?

Here is an example of how to override the generated code.
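As a minimal sketch of that pattern (assuming the tvm_callback_cuda_postproc hook, which the CUDA codegen calls on the generated source before compiling it; the file names below are placeholders):

import tvm

@tvm.register_func
def tvm_callback_cuda_postproc(code):
    # dump the generated kernel so it can be inspected or hand-edited
    with open("generated.cu", "w") as f:
        f.write(code)
    # to experiment, return a manually patched kernel instead,
    # e.g. after adding volatile qualifiers or casts by hand:
    # code = open("manual.cu").read()
    return code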


Did you find the reason behind it?

No, unfortunately I just switched back to fp32.
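For what it is worth, a possible middle ground (just a sketch, assuming a float32 output for B is acceptable) is to keep the fp16 inputs but accumulate the reduction in fp32:

A = tvm.placeholder((n, m), name='A', dtype='float16')
k = tvm.reduce_axis((0, m), "k")
# cast each element to fp32 before summing, so the rfactor/reduction runs in fp32
B = tvm.compute((n,), lambda i: tvm.sum(A[i, k].astype('float32'), axis=k), name="B")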

Thanks a lot. So this may be a bug in TVM.

Should be fixed by this PR.
