In my case there are four stages: A, B, C, and D. A is just an input op, B is a reduce op, C is a broadcast op, and D is an add op. I compute A at D (axis=2), B at C (axis=2, the reduce axis), and C at D (axis=2). However, A's buffer is allocated with size 102030 for B, not the 10 I expected. How can I make A's buffer allocate only 10?