Thanks for sharing your thoughts.
Let me share some more background. To achieve high performance for compute-heavy ops (close to hand-written kernels like MKLDNN or ACL), we need to perform vector register tiling. This is one level below cache tiling. Here, we have to craft a TVM schedule that carefully manages the data reuse in vector registers, the number of vector registers used, the number of vector FMA operations in the innermost loop, the number of vector memory accesses, and prefetcher-friendly access patterns. There are many factors to consider, and a developer has to balance the loop optimizations in the schedule carefully. @kevinthesun can back me up here.
Now, a simple optimization like loop unrolling can completely upset this balance. For example, my TVM schedule might keep the total vector register count below 32 (the number of ARM vector registers), but LLVM unrolling even by a factor of 2 will double the vfma operations and their live operands in the loop body, defeating the whole purpose of the loop tiling. I have dabbled in writing x86 assembly for SGEMM and have run into all of these issues.
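To make the register-pressure argument concrete, here is a back-of-the-envelope sketch. The tile sizes and the register model are hypothetical, chosen only to illustrate the effect; the one hard number is the 32-entry NEON vector register file on AArch64:

```python
# Rough register-budget model for a hypothetical 4x24 SGEMM micro-kernel
# on AArch64 (the model and tile sizes are illustrative, not from a real
# TVM schedule).
NUM_VEC_REGS = 32  # AArch64 NEON vector registers (v0-v31)
VLEN = 4           # f32 lanes per 128-bit vector register

def regs_needed(m_tile, n_tile, k_unroll):
    """Estimate live vector registers in the innermost loop body."""
    acc = m_tile * (n_tile // VLEN)       # C accumulators held across the k loop
    a_regs = k_unroll * 1                 # one broadcast register of A per k step
    b_regs = k_unroll * (n_tile // VLEN)  # B row loads per k step
    return acc + a_regs + b_regs

# The hand-crafted schedule fits exactly within the register file...
assert regs_needed(4, 24, 1) <= NUM_VEC_REGS  # 31 registers: fits
# ...but LLVM unrolling the k loop by 2 pushes it over, forcing spills.
assert regs_needed(4, 24, 2) > NUM_VEC_REGS   # 38 registers: spills
```

The schedule author budgeted for every register; an unroll factor the schedule never asked for silently breaks that budget.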
What about rerolling, unroll-and-jam and strip-mining?
I think rerolling is disabled by default; I don't know about unroll-and-jam. Strip-mining is TVM's responsibility (it is just tiling in 1D for vectorization, which is common in TVM). But I understand your overarching point, and yes, I am suggesting even more strongly that we give TVM more control over these loop optimizations. I also believe that different loop optimizations have different impact; I have observed that LLVM unrolling has a big impact.
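For reference, strip-mining here is just the 1-D split TVM performs so the inner chunk can be vectorized. A minimal Python sketch of the transformed loop structure (the vector width of 4 lanes is an assumption):

```python
VLEN = 4  # assumed vector width (f32 lanes per register)

def saxpy_strip_mined(a, x, y):
    """out = a*x + y, with the 1-D loop strip-mined into VLEN-sized chunks."""
    n = len(x)
    out = [0.0] * n
    for o in range(0, n, VLEN):               # outer loop over strips
        for i in range(o, min(o + VLEN, n)):  # inner loop: what TVM vectorizes
            out[i] = a * x[i] + y[i]
    return out
```

In a TVM schedule this corresponds to a `split` by the vector width followed by `vectorize` on the inner axis.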
Default schedules to use LLVM optimizations?
I was thinking about this as well, and I completely agree. I want more control for compute-intensive ops, but I want LLVM to optimize the default schedules. Going even further, if we could embed something in the TVM IR that disables a loop optimization for a specific section of the generated LLVM IR, that might be the best design.
Mix and match of TVM optimization and LLVM optimization
Yes, this is the same as the previous point.
Summary
Why should we disable LLVM unrolling?
- TVM schedules perform as expected. A developer can trust their schedule to deliver the performance it was designed for.
- This also helps AutoTVM, whose tuning can be painfully long today. By analyzing the loop structure, we can reason about how good the register tiling is and quickly discard bad configurations.
- Disabling LLVM unrolling does not mean we will miss a configuration. Our schedules are templated, so AutoTVM will have a configuration in which the axis that LLVM was unrolling is instead unrolled by TVM. (But I understand we need data to confirm this.)
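As an illustration of the pruning point above, a sketch of how a register-pressure check could discard configurations before they are ever compiled or measured. The cost model, tile sizes, and threshold logic are all hypothetical; this is not an actual AutoTVM API:

```python
# Hypothetical AutoTVM-style config pruning: discard register-tiling
# configurations whose estimated register pressure cannot fit the
# 32-entry NEON register file, skipping the measurement step entirely.
NUM_VEC_REGS = 32
VLEN = 4  # f32 lanes per vector register

def estimated_regs(m_tile, n_tile):
    # accumulators + one A broadcast + one strip of B loads (illustrative model)
    return m_tile * (n_tile // VLEN) + 1 + (n_tile // VLEN)

def prune(configs):
    """Keep only (m_tile, n_tile) configs that plausibly fit in registers."""
    return [c for c in configs if estimated_regs(*c) <= NUM_VEC_REGS]

candidates = [(2, 16), (4, 24), (8, 32), (16, 32)]
survivors = prune(candidates)  # the two oversized tiles are dropped
```

This kind of static reasoning is only trustworthy if LLVM does not later unroll the loop and invalidate the estimate, which is exactly why the two points go together.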
Why should we keep LLVM unrolling?
- Default schedules might see performance degradation.
- In the short term, the top-hub tuning logs might no longer be optimal; we might need to re-tune.
If we all see the theoretical benefits and agree that performance data is the only deciding factor, I can start collecting data for both x86 and ARM. Data collection will take time, so it is better to agree on the idea first.