While reading the implementation of PassUpDomain for FuseNode, I found that instead of referring only to the exact sub-range of each original iterator, it returns the whole dimension. This wastes both time and memory.

FuseNode

For example, take the following TVM code (it is admittedly a silly example, since it could use compute_inline instead):

import tvm

B = tvm.compute((4, 8), lambda i, j: 0, name='B')

C = tvm.compute((4, 8), lambda i, j: B[i, j], name='C')

s = tvm.create_schedule(C.op)

t = s[C].fuse(s[C].op.axis[0], s[C].op.axis[1])

outer, inner = s[C].split(t, factor=2)

s[B].compute_at(s[C], outer)

I will get the following result after lowering:

```
produce C {
  for (i.j.fused.outer, 0, 16) {
    produce B {
      for (i, 0, 4) {
        for (j, 0, 8) {
          B[((i*8) + j)] = 0
        }
      }
    }
    for (i.j.fused.inner, 0, 2) {
      C[((i.j.fused.outer*2) + i.j.fused.inner)] = B[((i.j.fused.outer*2) + i.j.fused.inner)]
    }
  }
}
```

This is wasteful because each iteration of i.j.fused.outer uses only 2 elements of B, yet the whole tensor B is recomputed every time.
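To put a number on the waste, here is a small plain-Python sketch (no TVM involved) that counts the writes to B. The constants come from the example above; the helper name `count_b_writes` is made up for illustration.

```python
# Shapes taken from the example above: B and C are 4x8, split factor is 2.
ROWS, COLS, FACTOR = 4, 8, 2
OUTER_EXTENT = (ROWS * COLS) // FACTOR  # 16 iterations of i.j.fused.outer


def count_b_writes():
    """Return (writes to B in the lowered code above, writes C actually needs)."""
    performed = 0
    needed = 0
    for outer in range(OUTER_EXTENT):
        # The lowered code recomputes all of B inside every outer iteration.
        performed += ROWS * COLS
        # C only reads the fused indices outer*FACTOR + inner.
        needed += len({outer * FACTOR + inner for inner in range(FACTOR)})
    return performed, needed


performed, needed = count_b_writes()
print(performed, needed)  # 512 32
```

So the lowered code performs 512 writes to B where 32 would be enough.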

So I tried to modify this part of the code. The fix seems clear and easy.

For example, suppose you fuse two loops, so that fused = outer*M + inner:

outer has range *0-N*,

inner has range *0-M*.

If you then find that you only need the elements of the

fused loop in the range *i-j*,

you need to compute outer from *i/M* to *j/M*, and inner over the whole *0-M* for safety,

or, if you can guarantee *i/M == j/M*, you can restrict the inner loop to *i%M* through *j%M*.

This will save memory and time and still maintain correctness.
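The rule above can be sketched in plain Python. This is only an illustration of the arithmetic, not the actual PassUpDomain code; the function name `pass_up_fused_range` and the choice of inclusive bounds are my own assumptions.

```python
def pass_up_fused_range(lo, hi, outer_extent, inner_extent):
    """Map an inclusive fused-index range [lo, hi] back to inclusive
    (outer, inner) ranges, assuming fused = outer * inner_extent + inner.

    Hypothetical helper for illustration only; it is not a TVM API.
    """
    assert 0 <= lo <= hi < outer_extent * inner_extent
    outer_lo = lo // inner_extent
    outer_hi = hi // inner_extent
    if outer_lo == outer_hi:
        # The range stays within one outer iteration, so inner can be tight.
        inner_range = (lo % inner_extent, hi % inner_extent)
    else:
        # The range crosses an outer boundary: take the full inner loop for safety.
        inner_range = (0, inner_extent - 1)
    return (outer_lo, outer_hi), inner_range


# With the 4x8 example and split factor 2, i.j.fused.outer = 5 needs fused 10..11.
print(pass_up_fused_range(10, 11, 4, 8))  # ((1, 1), (2, 3))
# A range that crosses a row boundary falls back to the whole inner loop.
print(pass_up_fused_range(6, 9, 4, 8))    # ((0, 1), (0, 7))
```

In the first case only row 1, columns 2 and 3 of B are needed, which is exactly the two elements that one outer iteration of C reads.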

Following the rule above, we get the new code:

```
produce C {
  for (i.j.fused.outer, 0, 16) {
    produce B {
      for (i, 0, (4 - max((i.j.fused.outer/4), 2))) {
        for (j, 0, 8) {
          if (likely(((i.j.fused.outer/4) < (4 - i)))) {
            B[((((i.j.fused.outer/4) + i)*8) + j)] = 0
          }
        }
      }
    }
    for (i.j.fused.inner, 0, 2) {
      C[((i.j.fused.outer*2) + i.j.fused.inner)] = B[((i.j.fused.outer*2) + i.j.fused.inner)]
    }
  }
}
```