Comparison between Tiramisu and TVM (and Halide)

derisavi-huawei · October 11, 2018, 5:52pm

I have recently read this paper: Tiramisu: A Code Optimization Framework for High
Performance Systems. The paper has a table summarizing comparison of Tiramisu with other similar compiler frameworks such as Halide.

Capture

My question is whether the comparison results stays the same (at least on the criteria discussed in the paper) if we compare Tiramisu with TVM (instead of with Halide).

chrisn · October 16, 2018, 8:11pm

From the paper:

Halide is an image processing DSL that has a sched-
uling language; however, it uses intervals to represent it-
eration spaces instead of the polyhedral model. This limits
the expressiveness of Halide. For example, unlike
Tiramisu, Halide cannot naturally represent non-rectangular iteration
spaces. This is the reason why Halide distributed over-
approximates the amount of data to communicate (send and
receive) when generating distributed code. This also makes
certain Halide passes over-approximate non-rectangular it-
eration spaces, potentially leading to less efficient code (for
example, it prevents Halide from performing precise bounds
inference for non-rectangular iteration spaces). It also pre-
vents Halide from performing many complex affine transfor-
mations, such as iteration space skewing.
Halide does not have dependence analysis and thus it re-
lies on conservative rules to determine whether a schedule
is legal; for example, Halide does not allow the fusion of
two loops (using the compute_with command) if the sec-
ond loop reads a value produced by the first loop. While
this rule avoids illegal fusion, it prevents fusing many legal
common cases which may lead to suboptimal performance.
Halide also assumes the program has an acyclic dataflow
graph in order to simplify checking the legality of a schedule.
This prevents users from expressing many programs with
cyclic dataflow. It is possible in some cases to work around
the above restrictions, but such methods are not general.
Tiramisu avoids over-conservative constraints by relying
on dependence analysis to check for the correctness of code
transformations, enabling more possible schedules.

It is great to see this space heating up. I think the first question is valid but maybe a broader discussion about having TVM be less tightly bound to halide would be wise.

tqchen · October 16, 2018, 9:10pm

It is interesting. Please note that while TVM uses HalideIR that is derived from Halide, most of the code generation and optimization passes are done independently(with deep learning workloads in mind), while reusing sensible ones from Halide. So in terms of low level code generation, we are not necessarily bound to some of limitations listed.

In particular, we take a pragmatic approach, to focus on what is useful for deep learning workloads, so you can find unique things like more GPU optimization, accelerator support, recurrence(scan). If there are optimizations that Tiramisu have which is useful to get the state of art deep learning workloads, we are all for bringing that into TVM

I also want to emphasize that TVM is more than a low level tensor code generation, but instead trying to solve the end to end deep learning compilation problem, and many of the things goes beyond the tensor code generation.

aca88 · February 25, 2019, 3:29pm

I was going to ask about loop skweing on a different thread, but I think I’ll try my luck here:
@tqchen
So from what I read Halide cannot do loop skewing, but TVM is more than Halide. So a very easy question is:
How do we implement skewing in TVM and are there examples?
Thanks

tqchen · February 25, 2019, 5:25pm

We do not atm. The main reason is that we have yet find an usecase in deep learning to support such transformation. If there is a case, we are open to discuss how to achieve that

derisavi · February 26, 2019, 4:27pm

One of the motivating reasons, in deep learning workloads, for loop skewing transformations that was raised by Tiramisu authors during CGO 2019 was in the following scenario.

Let’s say we have a multi-layer LSTM. When we unroll the time steps, we end up with a number of LSTM cells in a rectangular grid where computation at each LSTM cell is dependent on the two cells on its left and bottom. Given these dependencies, one can still perform simultaneously the computations of all cells that lie on a diagonal. Then, to bring all those cells on a diagonal (involves two loop vars) into a single loop, you need to apply loop skewing.

It’s an interesting use case but IMHO, there is already enough parallelism in the computation of each cell such that we may not need to perform multiple of them at the same time. Has anyone seen any work that exploits this parallelism in practice and achieves higher performance?

tqchen · February 26, 2019, 4:30pm

That particular case usually can be handled by a dynamic task scheduler, like the one used by mxnet. In which case we dispatch the layers with their dependencies, and we will have a wave form pattern. So we don’t necessarily have to put that in the tensor expression level.

aca88 · February 27, 2019, 1:38pm

So in the LSTM example, we can process in parallel LSTM cells of different layers with the inputs from different times?
I can see this being handled by something as tqchen said because they are routines from different layers.

Wouldn’t loop skewing help to program systolic array HW? For example stuff like Eyeriss

Assuming that the HW works more like VLIW (where we have to control the low level dispatching) and not like superscalars

summerspringwei · October 27, 2021, 9:16am

Actually there is not enough parallelism in the computation of each cell. You can see “ Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks” for more details.