There are schedule primitives like rfactor() in Halide/TVM. For the CUDA backend, rfactor() can be used to map a reduction onto GPU threads. But I don't know how to realize the reduction within a thread block using shared memory. Could you give me some advice?
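For context, what I want to end up with is the classic block-level tree reduction that CUDA kernels usually implement through shared memory. A minimal plain-Python emulation of that per-block loop (the input length stands in for blockDim.x and is assumed to be a power of two; everything here is illustrative, not TVM API):

```python
def block_reduce(values):
    """Emulate a CUDA shared-memory tree reduction within one thread block.

    `values` plays the role of the shared-memory buffer, one slot per
    thread (length assumed to be a power of two). Each pass halves the
    active stride, mirroring the
    `for (s = blockDim.x / 2; s > 0; s >>= 1)` loop in a CUDA kernel,
    with a __syncthreads() barrier between passes.
    """
    shared = list(values)      # "shared memory", one element per thread
    stride = len(shared) // 2
    while stride > 0:
        # Threads 0..stride-1 each add their partner's partial sum.
        for tid in range(stride):
            shared[tid] += shared[tid + stride]
        # (an implicit __syncthreads() would sit here in the real kernel)
        stride //= 2
    return shared[0]           # thread 0 holds the block's result

# Example: reduce 8 partial sums produced by 8 threads.
print(block_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # → 36
```

As far as I understand, in TVM you normally don't write this loop yourself: once the rfactor-produced reduction axis is bound to threadIdx.x, the compiler lowers it to a cross-thread reduction for you.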
Are you talking about this?
Thank you, that is exactly what I needed.