Thank you for your response.
The target processor (Hexagon) has both a DSP and a vector multiplier (HVX), both of which benefited from prefetching in Halide, for example when doing convolutions. It would be nice to do something similar with the TVM-based function.
Yes, the distance will be critical. The folks working on Halide saw the behavior you describe, but they also found a significant speed-up for well-chosen parameters.
I have not yet looked at the timing improvements using (your?) TVM prefetch() function. I am still trying to work out how it behaves. I see, for example, that if s is a schedule built at A == B or B is contingent on A, then the TVM IR is of the form prefetch(some address in A, 0, 3, 1).
Can you tell me how the TVM-level call is translated into this IR (depending on the relationship between A and B)? And what do the 0, 3, 1 represent?