OK to fuse a reduce axis with a normal axis? / Is partial reduction supported?

Hello! I’m trying to implement a depthwise convolution with TVM on NVIDIA GPUs.

Suppose the filter size is 1x3x3x32 (NHWC). If I let each thread handle a single input element, the code should do 32 reductions, each over 9 elements. However, when I fuse ry and rx (the two reduce axes) with c (the channel axis) and bind the fused axis to thread_x, the IR that TVM generates always contains a tvm_all_reduce function. I checked the generated CUDA kernel, and it does one single reduction over all 288 elements.
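To make the difference concrete, here is a minimal plain-Python sketch of the two behaviours for one output position. It is not TVM code: the function names, the `patch` and `weight` test data, and the fused-index layout `(ry * KW + rx) * C + c` are all assumptions made up for illustration.

```python
# Plain-Python illustration only (no TVM): one output position of a 3x3
# depthwise convolution over 32 channels, indexed as [ry][rx][c].
KH, KW, C = 3, 3, 32

# Hypothetical integer test data so the sums are exact.
patch = [[[ry * KW + rx + c for c in range(C)] for rx in range(KW)] for ry in range(KH)]
weight = [[[c + 1 for c in range(C)] for rx in range(KW)] for ry in range(KH)]


def per_channel_reduce(patch, weight):
    """What I expect: 32 independent reductions, each over 9 elements."""
    out = []
    for c in range(C):                      # c stays an output (spatial) axis
        acc = 0
        for ry in range(KH):                # only ry and rx are reduced
            for rx in range(KW):
                acc += patch[ry][rx][c] * weight[ry][rx][c]
        out.append(acc)
    return out


def fused_all_reduce(patch, weight):
    """What the generated kernel appears to do: one reduction over 288 elements."""
    acc = 0
    for f in range(KH * KW * C):            # fused (ry, rx, c) index
        c = f % C
        rx = (f // C) % KW
        ry = f // (C * KW)
        acc += patch[ry][rx][c] * weight[ry][rx][c]
    return acc                              # channel information is lost


partial = per_channel_reduce(patch, weight)
total = fused_all_reduce(patch, weight)
```

The per-channel version keeps c as an output index and reduces only over ry and rx, which is the result I am after. The fused version collapses everything into one scalar, which equals the sum of the 32 per-channel results.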

Is there any way to do a partial reduction in this case? Am I wrong to fuse a reduce axis with a normal axis? If so, any suggestions for making this work? Thanks in advance!