I agree. And I have the same question @Hzfengsy asked: why do we need NVRTC instead of NVCC? NVRTC reduces compilation overhead, but we are targeting better performance.
@vinx13 @Hzfengsy @xyzhou I guess a more decent solution is to remove the `+`, `*`, `<=` overrides and add cross-thread-reduction support in codegen, which could then work with NVRTC as well, right?
I’m not sure about the cross-thread reduction, but it is OK as long as we can run fp16 ResNet on both NVCC and NVRTC.
Will adding a cast in codegen (i.e. `(const __half)xx`), instead of overloading for volatile, work?
The code compiles but the output is wrong. This happens with fp16 and `s.rfactor`, and it is very easy to reproduce following the example in this tutorial: https://docs.tvm.ai/tutorials/language/reduction.html#sphx-glr-tutorials-language-reduction-py. The code I am using is master plus uncommenting the volatile overrides here: https://github.com/apache/incubator-tvm/pull/4331/files#diff-cb015f72761c78798065a2cd9ef96602R68-R73
This code snippet should reproduce the bug:
import tvm
from tvm.contrib import nvcc
import numpy as np
@tvm.register_func
def tvm_callback_cuda_compile(code):
    # use nvcc because nvrtc fails to compile this kernel
    ptx = nvcc.compile_cuda(code, target="ptx", arch='sm_75')
    return ptx
n = tvm.var("n")
m = tvm.var("m")
A = tvm.placeholder((n, m), name='A', dtype='float16')  # note: float16 instead of float32
k = tvm.reduce_axis((0, m), "k")
B = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="B")
s = tvm.create_schedule(B.op)
ko, ki = s[B].split(B.op.reduce_axis[0], factor=8)
BF = s.rfactor(B, ki) # <<== other schedules that don't use `s.rfactor` work
print(tvm.lower(s, [A, B], simple_mode=True))
print(s[B].op.body)
tx = tvm.thread_axis("threadIdx.x")
s[B].bind(s[B].op.reduce_axis[0], tx)
s[BF].compute_at(s[B], s[B].op.reduce_axis[0])
s[B].set_store_predicate(tx.var.equal(0))
fcuda = tvm.build(s, [A, B], "cuda")
print(fcuda.imported_modules[0].get_source())
nn = 16
ctx = tvm.gpu(0)
a = tvm.nd.array(np.random.uniform(size=(nn, nn)).astype(A.dtype), ctx)
b = tvm.nd.array(np.zeros(nn, dtype=B.dtype), ctx)
fcuda(a, b)
tvm.testing.assert_allclose(b.asnumpy(), np.sum(a.asnumpy(), axis=1), rtol=1e-2)
# output:
> Mismatch: 100%
> Max absolute difference: 4.74
> Max relative difference: 0.5747
> x: array([3.354, 3.623, 4.797, 4.195, 3.457, 4.008, 3.768, 4.277, 3.785,
> 4.727, 5.004, 5.484, 3.248, 3.504, 4.336, 5.746], dtype=float16)
> y: array([7.84 , 6.816, 8.48 , 7.344, 6.07 , 8.4 , 7.055, 7.035, 8.03 ,
> 9.36 , 9.09 , 9.51 , 7.387, 8.24 , 6.81 , 9.47 ], dtype=float16)
Any idea what could be wrong, @vinx13 @Hzfengsy @xyzhou @yzhliu? Thanks
Is it because some volatile qualifiers or syncs are missing?
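One quick check (a sketch, reusing `fcuda` from the reproduction snippet above) is to scan the generated kernel for those keywords:

# sketch: grep the generated CUDA source for volatile / __syncthreads()
src = fcuda.imported_modules[0].get_source()
for line in src.splitlines():
    if "volatile" in line or "__syncthreads" in line:
        print(line)  # shows which loads/stores are volatile and where syncs happen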
I don’t know. How do I debug this?
Here is an example of how to override the generated code:
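A minimal sketch of that approach, assuming the `tvm_callback_cuda_postproc` hook, which TVM applies to the generated CUDA source before compiling it (the file names here are just placeholders):

import os
import tvm

@tvm.register_func
def tvm_callback_cuda_postproc(code):
    # dump the generated kernel so it can be inspected and edited by hand
    with open("generated.cu", "w") as f:
        f.write(code)
    # if a hand-patched kernel exists, compile that instead
    if os.path.exists("patched.cu"):
        with open("patched.cu") as f:
            code = f.read()
    return code

With this in place you can, for example, add the suspected missing volatile qualifiers to patched.cu and rerun to see whether the results change.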
Did you find the reason behind it?
No, unfortunately I just switched back to fp32.
Thanks a lot, so this may be a bug in TVM.
Should be fixed by this PR.