[SOLVED] Can auto-tuner frameworks like Halide/TVM generate high-performance reduction algorithms?

xcenxcen · December 12, 2018, 6:32pm

There are schedule primitives such as rfactor() in Halide/TVM. For the CUDA backend, we can use rfactor() to map a reduction onto GPU threads, but I don’t know how to implement the reduction within a thread block using shared memory. Could you give me some advice?
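For context, the pattern being asked about is the classic two-phase block reduction: each thread first accumulates a private partial sum (the part rfactor() exposes), then the block combines those partials in shared memory with a halving tree. Below is a minimal sketch in plain Python that simulates the threads of one block sequentially. It is not Halide or TVM API: `block_reduce_sum` and `block_dim` are hypothetical names, and `block_dim` is assumed to be a power of two.

```python
from typing import List


def block_reduce_sum(row: List[float], block_dim: int) -> float:
    """Simulate one CUDA thread block summing `row` (hypothetical helper).

    Threads are simulated one after another; block_dim must be a power of two.
    """
    if block_dim <= 0 or block_dim & (block_dim - 1):
        raise ValueError("block_dim must be a positive power of two")

    # Phase 1 (the rfactor-style split): thread tx keeps a private partial
    # sum over elements tx, tx + block_dim, tx + 2 * block_dim, ...
    shared = [0.0] * block_dim  # stands in for a __shared__ array
    for tx in range(block_dim):
        acc = 0.0
        for k in range(tx, len(row), block_dim):
            acc += row[k]
        shared[tx] = acc
    # On a real GPU, a __syncthreads() barrier goes here.

    # Phase 2: tree reduction in shared memory. Only threads with
    # tx < stride are active, and the stride halves each step, so after
    # log2(block_dim) steps thread 0 holds the total.
    stride = block_dim // 2
    while stride > 0:
        for tx in range(stride):
            shared[tx] += shared[tx + stride]
        # Another __syncthreads() between steps on a real GPU.
        stride //= 2
    return shared[0]
```

For example, `block_reduce_sum([1.0] * 100, 32)` returns `100.0`. Running the active threads one after another is safe here because, within a step, each thread writes `shared[tx]` with `tx < stride` and reads only `shared[tx + stride]`, so no two threads touch the same slot.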

merrymercy · December 11, 2018, 4:11am

Are you talking about this?
https://docs.tvm.ai/tutorials/language/reduction.html#cross-thread-reduction
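The idea behind that section can be sketched as a derivation. Assume a row sum $B_i = \sum_k A_{i,k}$ whose reduction axis $k$ (extent $m$) is split with a hypothetical factor $F$ into $k = F k_o + k_i$, and rfactor() is applied on $k_i$:

```latex
B_i = \sum_{k=0}^{m-1} A_{i,k}
    = \sum_{k_i=0}^{F-1}
      \underbrace{\sum_{k_o=0}^{\lceil m/F \rceil - 1} A_{i,\,F k_o + k_i}}_{BF_{i,k_i}},
\qquad A_{i,\,F k_o + k_i} := 0 \ \text{if}\ F k_o + k_i \ge m .
```

Each $BF_{i,k_i}$ is a serial loop that a single thread can run. Binding $k_i$ to `threadIdx.x` turns the outer sum into a cross-thread reduction. As far as I understand the tutorial, TVM then generates the intra-block combining step itself, so the shared-memory code does not have to be written by hand.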

xcenxcen · December 11, 2018, 4:36am

Thank you, that is what I need.