Thank you for your response.
The target processor (Hexagon) has both a DSP and a vector multiplier (HVX), both of which benefited from prefetching in Halide, for example when computing convolutions. It would be nice to use the TVM-based function to do something similar.
Yes, the distance will be critical. The folks working on Halide found the behavior you describe, but they also found a significant speed-up for well-chosen parameters.
I have not yet looked at the timing improvements using (your?) TVM prefetch() function. I am still trying to work out how it behaves. I see, for example, that if s is a schedule built at B, and I call s[B].prefetch(A,axis,offset) where either A == B or B is contingent on A, then the TVM IR is of the form prefetch(some address in A, 0, 3, 1).
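To make my current mental model concrete, here is a toy sketch in plain Python. It is not TVM code. The function name prefetch_trace and its parameters are made up for illustration. It assumes that offset is a distance along the chosen axis, so that iteration i issues a hint for the data of iteration i + offset. Dropping hints that run past the end of the loop is also just an assumption of the sketch, and the real lowering may behave differently.

```python
def prefetch_trace(n, offset):
    """Toy model of a prefetch distance along one loop axis.

    Assumption (not confirmed against TVM): at iteration i the loop
    computes on element i and issues a prefetch hint for element
    i + offset. Hints that would fall past the end of the loop are
    recorded as None in this sketch.
    """
    trace = []
    for i in range(n):
        target = i + offset
        # Pair the element being computed with the element being hinted.
        trace.append((i, target if target < n else None))
    return trace


if __name__ == "__main__":
    for computed, hinted in prefetch_trace(6, 2):
        print(f"compute element {computed}, prefetch element {hinted}")
```

If this model is right, a small offset gives the hint too little time to complete before the data is needed, and a large one risks the data being evicted again, which would match the sensitivity to the distance discussed above.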
Can you tell me how the TVM-level call is translated into this IR (depending on the relationship between A and B)? And what do the 0, 3, 1 represent?