I have been attempting to write a single CUDA kernel reduction that takes FP16 inputs, casts them to FP32 for the reduction, and then stores the results back as FP16. Is this currently possible in a single kernel with Tensor Expressions? When I try to add a cast to the reduction result, I get an error saying that the reduction has to be the top-level operation. I have also tried adding another compute stage to the schedule, but I get the same error.
TVMError: Check failed: 0 == level_: Reductions are only allowed at the top level of compute. Please create another tensor for further composition.
Example:
import tvm
tgt_host="llvm"
tgt="cuda"
toks = tvm.var("tokens")
hidden = tvm.const(1024)
inputs = tvm.placeholder((toks, hidden), name='inputs', dtype='float16')
y = tvm.reduce_axis((0, hidden), "y")
outputs = tvm.compute((toks,), lambda x : tvm.sum(inputs[x][y].astype('float32'), axis=y).astype('float16'), name='outputs')
sched = tvm.create_schedule([outputs.op])
sched[outputs].bind(outputs.op.axis[0], tvm.thread_axis("blockIdx.x"))
sched[outputs].bind(outputs.op.reduce_axis[0], tvm.thread_axis("threadIdx.x"))