Disabling LLVM Unrolling

I have been working on TVM schedules for ARM. One thing I notice is that LLVM has its own unrolling heuristics, which can completely invalidate the analysis one does for unrolling in TVM.

For example, a developer can choose to unroll a particular axis with the goal of better reuse in vector registers. But LLVM can then unroll one more axis, which destroys the vector register locality and causes lots of register spills to memory.
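As a concrete illustration, here is a minimal sketch using TVM's te API (the shapes, split factors, and names are made up for the example) of the kind of schedule where the developer has deliberately picked which axis to unroll:

```python
from tvm import te

# Illustrative matmul: the developer unrolls one small axis on purpose,
# expecting the inner tile to stay resident in vector registers.
M, N, K = 64, 64, 64
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
i, j = s[C].op.axis
io, ii = s[C].split(i, factor=4)
jo, ji = s[C].split(j, factor=8)
s[C].reorder(io, jo, ii, k, ji)
s[C].unroll(ii)      # the unroll TVM chose deliberately
s[C].vectorize(ji)   # innermost axis maps to vector lanes
```

If LLVM later unrolls, say, the `k` loop on its own, the careful 4x8 register tile above no longer reflects the actual register pressure.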

Since it is TVM's responsibility to decide whether or not to unroll, would it make sense to disable LLVM unrolling? Disabling it is pretty easy - https://llvm.org/doxygen/classllvm_1_1PassManagerBuilder.html#a6c669ca5ca0a6b7e47a9fe9ae9aaa32d

@tqchen @kevinthesun @haichen @yzhliu @FrozenGene


It’s an interesting issue that LLVM's optimizations can interfere with TVM's. But sometimes they also help improve performance. We probably need to study a few more ops and the x86 target to see whether disabling unrolling causes performance regressions. Otherwise, I have no objection to the proposal.

This is an interesting observation. I personally haven't played with the schedules in TVM before, and I haven't seen how LLVM's loop transformations affect TVM's, positively or negatively. But here are my two cents. I think we probably should not directly disable it. Unrolling is just one of the optimizations that can happen once we have LLVM IR. A bunch of other optimizations can happen as well, e.g., loop rerolling, loop unroll-and-jam, and strip-mining, which are usually related to unrolling or run after it. Should we disable them as a whole?

If we want to disable more transformations, another argument would be: should the ops for which we only provide default schedules be allowed to enjoy the performance benefits obtainable from LLVM at all?

My point is we may need more evidence or more benchmarking results before we proceed with this decision.

@anijain2305: Thanks for such great findings! I wonder whether it is possible to make the enabling and disabling of such low-level optimizations dynamic rather than static, dependent on target performance rather than on the state of TVM's own optimization. We could also maintain historical data that tells us whether a mix of TVM and LLVM optimizations yields an optimal result, and use it to predict whether to enable or disable the optimization. I believe LLVM's scope for optimization should not intersect with TVM's; otherwise, the hierarchical structure collapses. This is just a thought! I am eagerly watching for responses from the experts in the TVM community.

Thanks for sharing your thoughts.

Let me share some more background. To achieve high performance for compute-heavy ops (close to hand-written kernels like MKLDNN or ACL), we need to perform vector register tiling, one level below cache tiling. Here, we have to craft a TVM schedule that carefully manages the data reuse in vector registers, the number of vector registers, the number of vector FMA operations in the innermost loop, the number of vector memory accesses, and prefetcher-friendly access patterns. There are many factors to consider, and a developer has to balance the loop optimizations in the schedule to find a suitable trade-off. @kevinthesun can back me up here.

Now, a simple optimization like loop unrolling can completely upset this balance. For example, my TVM schedule might keep the total vector register count under 32 (the number of ARM vector registers), but LLVM unrolling even by a factor of 2 will double the vfma operations, defeating the whole purpose of the register tiling. I have dabbled in writing x86 assembly for SGEMM and have experienced all these issues.
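To make the arithmetic concrete, here is a back-of-the-envelope sketch (the numbers are illustrative, assuming AArch64 NEON with 32 128-bit vector registers at 4 fp32 lanes each, and a hypothetical micro-kernel shape):

```python
LANES = 4  # fp32 lanes per 128-bit NEON register

def regs_needed(mr, nr):
    """Rough vector-register pressure of an mr x nr fp32 micro-kernel."""
    acc = mr * nr // LANES  # accumulator tile of C
    b = nr // LANES         # one row of B held in vector registers
    a = mr                  # broadcast A values (pessimistic: one reg each)
    return acc + a + b

print(regs_needed(4, 16))  # 24 -> fits in 32 registers, no spills
print(regs_needed(8, 16))  # 44 -> what an extra 2x unroll of the row axis
                           #       amounts to: well over 32, so spills
```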

What about rerolling, unroll-and-jam and strip-mining?

I think rerolling is disabled by default; I don't know about unroll-and-jam. Strip-mining is a TVM responsibility (it is just tiling in 1-D for vectorization, which is common in TVM; see the sketch below). But I understand your overarching point. And yes, I am suggesting, even more strongly, that TVM be given more control over these loop optimizations. I also believe different loop optimizations will have different impacts; I observed that LLVM unrolling has a big one.
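For reference, strip-mining in TVM terms is just a 1-D split whose inner strip is vectorized (a minimal sketch with made-up sizes, using the tvm.lower form of this era):

```python
import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

s = te.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], factor=8)  # the 1-D "strip"
s[B].vectorize(xi)                           # strip becomes one vector op
print(tvm.lower(s, [A, B], simple_mode=True))
```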

Default schedules to use LLVM optimizations?

I was thinking about this as well, and I completely agree. I want more control for compute-intensive ops, but I want LLVM to optimize the default schedules. Even better, if I could embed something in TVM IR to disable a loop optimization for a certain section of the generated LLVM IR, that might be the best design. (LLVM's per-loop metadata, such as llvm.loop.unroll.disable, suggests this is feasible.)

Mix and match of TVM optimization and LLVM optimization

Yes, this is the same as the previous point.

Summary

Why should we disable LLVM unrolling?

  • TVM schedules behave as expected performance-wise. A developer can trust his/her schedule for performance.
  • It also helps improve AutoTVM, which can be painfully slow today. By analyzing the loop structure, we can reason about how good the register tiling is and discard bad configurations quickly.
  • Disabling LLVM unrolling does not mean we will miss a configuration. Our schedules are templated, so AutoTVM will have a configuration in which the axis that LLVM was unrolling is now unrolled by TVM; see the sketch after this list. (But I understand we need data.)
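To illustrate the third point, here is a sketch against the AutoTVM API (the template name and knob values are made up, and the decorator form varies across TVM versions): the unroll decision simply becomes another knob in the template, so the search space still covers it.

```python
from tvm import autotvm, te

@autotvm.template("example/matmul_unroll")  # hypothetical template name
def matmul(M, N, K):
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    s = te.create_schedule(C.op)
    i, j = s[C].op.axis

    cfg = autotvm.get_config()
    cfg.define_knob("unroll_factor", [1, 2, 4])  # TVM, not LLVM, owns this choice

    io, ii = s[C].split(i, factor=cfg["unroll_factor"].val)
    if cfg["unroll_factor"].val > 1:
        s[C].unroll(ii)
    return s, [A, B, C]
```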

Why should we keep LLVM unrolling?

  • Default schedules might see performance degradation.
  • In the short term, the tophub configs might no longer be optimal; we might need to re-tune.

If all of us see the theoretical benefits and agree that performance data is the only deciding factor, I can start collecting data for both x86 and ARM. Data collection will take time, so it is better if we agree on the idea first 🙂


In my experience, plain loop unrolling has always been a blunt hammer and is not useful in the general case, so turning it off by default in LLVM makes sense. Targeted unrolling combined with vectorization and other loop optimizations is more beneficial.

I hadn’t realized that LLVM turns on plain loop unrolling by default, and yes, this can hamper performance. There are some workloads where it can help, but in the TVM case I can see it hurting more than helping. At the very least, experimenting with that setting would make a lot of sense.


This is a quite valuable topic, and it can help us figure out what optimization-related information we can get from TVM IR itself after all the LLVM optimization passes have been applied. For x86 conv2d, my observation is that the work LLVM unrolling does can be implemented in the TVM schedule by adding more unroll options. Though I haven't seen a mess-up case like the ARM CPU one, I think disabling LLVM unroll is still helpful for predicting whether we have a good schedule template. I agree with the point from @zhiics about when we should turn off LLVM unrolling: we might want to keep it on for non-compute-intensive ops that don't have such detailed schedule templates, and turn it off for compute-intensive ops where we already have quite complicated templates, to let TVM developers control more aspects of the optimization. Still, to make the final decision, we need more benchmarking data (and some changes to existing schedule templates).


What we did in our backend is that when the user creates a target, they can pass extra arguments to the tvm.target.hexagon() function. These arguments include an optional list of options that is passed to the LLVM backend. These are the “internal” LLVM options (i.e., those you would pass with -mllvm in a clang invocation), which lets us control unrolling on a per-use basis; see the sketch below.
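For the record, usage looks roughly like this (a sketch: the llvm_options keyword follows our backend's convention rather than a guaranteed upstream API, and -unroll-threshold is one of the internal flags of LLVM's loop-unroll pass):

```python
import tvm

# Per-target control: hand internal LLVM flags (the -mllvm kind) to the
# backend when creating the target, e.g. to effectively disable unrolling.
target = tvm.target.hexagon("v66", llvm_options=["-unroll-threshold=1"])
```

The nice property of this approach is that nothing changes globally: each target instance decides for itself how aggressive LLVM's unroller is allowed to be.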