[Discuss] Modify PassUpDomain for efficiency

When I read the implementation of PassUpDomain for FuseNode. I found that instead of only refer to the exactly sub-iter, it will return the whole dim. It will totally be a waste of time/memory.
FuseNode
For example, when I have the following tvm code ( although it seems very silly, for it can use compute_inline instead)

B = tvm.compute((4,8),lambda i,j: 0, name=‘B’)
C = tvm.compute((4,8),lambda i,j:B[i,j],name=‘C’)
s = tvm.create_schedule(C.op)
t = s[C].fuse(s[C].op.axis[0], s[C].op.axis[1])
outer, inner = s[C].split(t, factor = 2)
s[B].compute_at(s[C], outer)

I will get the following result after lowering:

produce C {
    for (i.j.fused.outer, 0, 16) {
        produce B {
            for (i, 0, 4) {
                for (j, 0, 8) {
                    B[((i*8) + j)] = 0
                }
            }
        }
        for (i.j.fused.inner, 0, 2) {
            C[((i.j.fused.outer*2) + i.j.fused.inner)] = B[((i.j.fused.outer*2) + i.j.fused.inner)]
        }
    }
}

It is wasteful because each iter of C, it only use 2 elements of B, but it will calculate the whole tensor B.

So I try to modify this parts of code. It seems very clear and easy to fix.
For example, if you merged to loop,
outer is of range 0-N,
inner is of range 0-M.
And then if you find you have to use elements of
fused loop of range i-j,
so you need to calculate outer from i/N to j/N, and inner of whole 0-M for safety,
or if you can assure i/M == j/M , you can set inner loop from i%M to j%M.
This will save memory and time and still maintain correctness.

Follow the rule I mentioned above, we can get the new code

produce C {
    for (i.j.fused.outer, 0, 16) {
        produce B {
            for (i, 0, (4 - max((i.j.fused.outer/4), 2))) {
                for (j, 0, 8) {
                    if (likely(((i.j.fused.outer/4) < (4 - i)))) {
                        B[((((i.j.fused.outer/4) + i)*8) + j)] = 0
                    }
                }
             }
         }
       for (i.j.fused.inner, 0, 2) {
           C[((i.j.fused.outer*2) + i.j.fused.inner)] = B[((i.j.fused.outer*2) + i.j.fused.inner)]
       }
   }
}

I just wanted to point out that there is a general documentation of the InferBound routine and the case you have here is also explained as a limitation of passupdomain.

If your algorithm could fix this limitation, I bet the TVM community would happily incorporate it.
Maybe consider creating a fix and allowing for code review?

Thank you very much. Thanks to your suggestion, I find there is someone who have already noticed this bug and tried to fix it https://github.com/dmlc/tvm/issues/3072
I have reviewed his code and think it is correct. So seems this discussion can be closed.