[SOLVED] Can auto-tuner frameworks like Halide/TVM generate high-performance reduction algorithms?


There are schedule primitives like rfactor() in Halide/TVM. For the CUDA backend, we can use rfactor() to map a reduction onto GPU threads. But I don't know how to realize the reduction within a thread block using shared memory. Could you give me some advice?
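For context, a minimal plain-Python sketch (illustrative names, not TVM API) of the tree-reduction pattern that a schedule using rfactor() plus a threadIdx binding typically lowers to on the CUDA backend: each thread contributes one partial sum, and the active stride is halved each step, as if the partials lived in shared memory with a barrier between steps:

```python
def block_reduce(partials):
    """Reduce a list of per-thread partial sums the way a thread block does.

    Assumes len(partials) is a power of two, as in a typical fixed-size
    thread block; real kernels handle any remainder separately.
    """
    smem = list(partials)      # stand-in for the shared-memory buffer
    stride = len(smem) // 2
    while stride > 0:
        # Threads 0..stride-1 each add the element `stride` slots away.
        for tid in range(stride):
            smem[tid] += smem[tid + stride]
        # (In the real kernel, a __syncthreads() barrier goes here.)
        stride //= 2
    return smem[0]             # thread 0 ends up holding the result
```

This is only the access pattern; in practice TVM emits the cross-thread reduction for you once the rfactor-ed axis is bound to threads, so you don't hand-write the shared-memory loop.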


Are you talking about this?


Thank you, that is what I need.