While reading the implementation of PassUpDomain for FuseNode, I found that instead of relating the fused domain back to exactly the sub-iterators that are actually needed, it returns the whole dimension. This wastes both time and memory.
For example, take the following TVM code (it looks silly, since one could simply use compute_inline instead, but it shows the problem):
B = tvm.compute((4, 8), lambda i, j: 0, name='B')
C = tvm.compute((4, 8), lambda i, j: B[i, j], name='C')
s = tvm.create_schedule(C.op)
t = s[C].fuse(s[C].op.axis[0], s[C].op.axis[1])
outer, inner = s[C].split(t, factor=2)
s[B].compute_at(s[C], outer)
I get the following result after lowering:
produce C {
  for (i.j.fused.outer, 0, 16) {
    produce B {
      for (i, 0, 4) {
        for (j, 0, 8) {
          B[((i*8) + j)] = 0
        }
      }
    }
    for (i.j.fused.inner, 0, 2) {
      C[((i.j.fused.outer*2) + i.j.fused.inner)] = B[((i.j.fused.outer*2) + i.j.fused.inner)]
    }
  }
}
This is wasteful: each outer iteration of C uses only 2 elements of B, yet the whole tensor B is recomputed every time.
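To quantify the waste, here is a plain-Python sketch (not TVM code, just loop counting) that tallies the stores to B versus the loads from B that C actually performs in the lowered code above:

```python
# Count memory traffic in the lowered code above.
writes = 0  # stores into B
reads = 0   # loads from B made by C

for outer in range(16):          # i.j.fused.outer
    # produce B: the full 4x8 tensor is recomputed on every outer iteration
    for i in range(4):
        for j in range(8):
            writes += 1
    for inner in range(2):       # i.j.fused.inner
        reads += 1               # C reads exactly one element of B

print(writes, reads)             # 512 writes of B for only 32 reads
```

So B is written 16x more often than it is read; ideally only the 2 elements needed per outer iteration would be produced.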
So I tried to modify this part of the code. The fix seems clear and straightforward.
For example, suppose you fuse two loops, where
outer has range 0-N,
inner has range 0-M,
so the fused index is outer*M + inner.
If you find you only need the elements covered by the
fused range i-j,
then outer only needs to run from i/M to j/M, with inner covering the whole range 0-M for safety;
and if you can guarantee i/M == j/M (i.e. the range lies within a single outer iteration), you can narrow the inner loop to run from i%M to j%M.
This saves memory and time while still maintaining correctness.
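The narrowing rule above can be sketched in plain Python (the function name fused_to_parts is mine, purely for illustration):

```python
def fused_to_parts(i, j, M):
    """Given a needed fused-index range [i, j] (inclusive) and inner
    extent M (fused = outer * M + inner), return conservative
    inclusive sub-ranges for the outer and inner loops."""
    outer_min, outer_max = i // M, j // M
    if outer_min == outer_max:
        # The whole range lies inside one outer iteration,
        # so the inner range can be narrowed exactly.
        inner_min, inner_max = i % M, j % M
    else:
        # The range spans outer iterations; keep the full
        # inner extent 0..M-1 for safety.
        inner_min, inner_max = 0, M - 1
    return (outer_min, outer_max), (inner_min, inner_max)

# e.g. for i.j.fused.outer == 3 in the example above, C needs B's
# fused indices 6 and 7; with inner extent M = 8 this narrows to a
# single outer iteration and inner indices 6..7:
print(fused_to_parts(6, 7, 8))  # ((0, 0), (6, 7))
```

When the range crosses an outer-iteration boundary, the inner loop falls back to the full 0-M extent, which is exactly the conservative case described above.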
Following the rule above, we get the new code:
produce C {
  for (i.j.fused.outer, 0, 16) {
    produce B {
      for (i, 0, (4 - max((i.j.fused.outer/4), 2))) {
        for (j, 0, 8) {
          if (likely(((i.j.fused.outer/4) < (4 - i)))) {
            B[((((i.j.fused.outer/4) + i)*8) + j)] = 0
          }
        }
      }
    }
    for (i.j.fused.inner, 0, 2) {
      C[((i.j.fused.outer*2) + i.j.fused.inner)] = B[((i.j.fused.outer*2) + i.j.fused.inner)]
    }
  }
}