I agree. And I have the same question @Hzfengsy asked: why do we need NVRTC instead of NVCC? NVRTC reduces compilation overhead, but we are targeting better performance.
@vinx13 @Hzfengsy @xyzhou I guess a more decent solution is to remove the `+`, `*`, `<=` overrides and add cross-thread-reduction support in codegen, which could then work with NVRTC as well, right?
I’m not sure about the cross-thread reduction, but it is OK as long as we can run fp16 ResNet on both NVCC and NVRTC.
Will adding a cast in codegen (i.e. `(const __half)xx`), instead of overloading for volatile, work?
The code compiles but the output is wrong. This happens with fp16 and `s.rfactor`, and it is very easy to reproduce following the example in this tutorial: https://docs.tvm.ai/tutorials/language/reduction.html#sphx-glr-tutorials-language-reduction-py. The code I am using is master plus uncommenting the volatile overrides here: https://github.com/apache/incubator-tvm/pull/4331/files#diff-cb015f72761c78798065a2cd9ef96602R68-R73
This code snippet should reproduce the bug:
import tvm
from tvm.contrib import nvcc
import numpy as np
@tvm.register_func
def tvm_callback_cuda_compile(code):
    # use nvcc because nvrtc fails to compile this kernel
    ptx = nvcc.compile_cuda(code, target="ptx", arch='sm_75')
    return ptx
n = tvm.var("n")
m = tvm.var("m")
A = tvm.placeholder((n, m), name='A', dtype='float16')  # note: float16 instead of float32
k = tvm.reduce_axis((0, m), "k")
B = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="B")
s = tvm.create_schedule(B.op)
ko, ki = s[B].split(B.op.reduce_axis[0], factor=8)
BF = s.rfactor(B, ki) # <<== other schedules that don't use `s.rfactor` work
print(tvm.lower(s, [A, B], simple_mode=True))
print(s[B].op.body)
tx = tvm.thread_axis("threadIdx.x")
s[B].bind(s[B].op.reduce_axis[0], tx)
s[BF].compute_at(s[B], s[B].op.reduce_axis[0])
s[B].set_store_predicate(tx.var.equal(0))
fcuda = tvm.build(s, [A, B], "cuda")
print(fcuda.imported_modules[0].get_source())
nn = 16
ctx = tvm.gpu(0)
a = tvm.nd.array(np.random.uniform(size=(nn, nn)).astype(A.dtype), ctx)
b = tvm.nd.array(np.zeros(nn, dtype=B.dtype), ctx)
fcuda(a, b)
tvm.testing.assert_allclose(b.asnumpy(), np.sum(a.asnumpy(), axis=1), rtol=1e-2)
# output:
> Mismatch: 100%
> Max absolute difference: 4.74
> Max relative difference: 0.5747
> x: array([3.354, 3.623, 4.797, 4.195, 3.457, 4.008, 3.768, 4.277, 3.785,
> 4.727, 5.004, 5.484, 3.248, 3.504, 4.336, 5.746], dtype=float16)
> y: array([7.84 , 6.816, 8.48 , 7.344, 6.07 , 8.4 , 7.055, 7.035, 8.03 ,
> 9.36 , 9.09 , 9.51 , 7.387, 8.24 , 6.81 , 9.47 ], dtype=float16)
Any idea what could be wrong, @vinx13 @Hzfengsy @xyzhou @yzhliu? Thanks
Is it because some volatile qualifiers or syncs are missing?
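One quick check (a sketch, reusing `fcuda` from the reproduction snippet above) is to scan the generated kernel for those keywords:

# sketch: grep the generated CUDA source for volatile / __syncthreads()
src = fcuda.imported_modules[0].get_source()
for line in src.splitlines():
    if "volatile" in line or "__syncthreads" in line:
        print(line)  # shows which loads/stores are volatile and where syncs happen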
I don’t know. How do I debug this?
Here is an example of how to override the generated code:
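A minimal sketch of that approach, assuming the `tvm_callback_cuda_postproc` hook, which TVM applies to the generated CUDA source before compiling it (the file names here are just placeholders):

import os
import tvm

@tvm.register_func
def tvm_callback_cuda_postproc(code):
    # dump the generated kernel so it can be inspected and edited by hand
    with open("generated.cu", "w") as f:
        f.write(code)
    # if a hand-patched kernel exists, compile that instead
    if os.path.exists("patched.cu"):
        with open("patched.cu") as f:
            code = f.read()
    return code

With this in place you can, for example, add the suspected missing volatile qualifiers to patched.cu and rerun to see whether the results change.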
Did you find the reason behind it?
No, unfortunately I just switched back to fp32.
Thanks a lot, so this may be a bug in TVM.
Should be fixed by this PR.