One more thing that just came to mind: we could have a documentation page on docs.tvm.ai that compares TVM with other NN compiler technologies (such as XLA and Glow), like Clang does on its site (comparing with GCC). I think many people would be interested in it. For example, we could list: frontend framework support, hardware backend support, performance data on common models, and which companies are involved. @tqchen
+1 for unifying TVM IR across different layers for optimization. I think we’ve already seen the need for and benefit of it when working on the Relay runtime. It would be interesting to see Relay (v2)/NNVM (v3) unify the different IRs.
I would also like to point out that this would be great. To be honest, at the beginning of my search for DL compilers I had a hard time grasping what each one was better at than the others.
This presentation is indeed very interesting. Regarding multi-layer IR, I think it may be Google’s answer to ONNX, which, as far as I understand, also aims at establishing some standard for representing ML models. We should probably wait for the specifications and then provide compatibility with Relay.
The really interesting thing is the polyhedral topic they announced. I think this needs close attention and action. TVM may benefit from adopting techniques from the polyhedral world. As I mentioned in https://github.com/dmlc/tvm/issues/2588, we may want to include ISL as a third-party dependency and run experiments to become more familiar with the semantics.
Edit: Interesting news is coming from the Tiramisu project, which combines the polyhedral model with a scheduling language: https://arxiv.org/pdf/1804.10694.pdf . Note their criticism of Halide’s approach to scheduling.
+1 for this. I have some notes on some of the alternative deep learning compiler work. Can share those.
TVM definitely needs dependency analysis infra. We can start from this point, since isl has very powerful dependency analysis.
So correct me if I am wrong in interpreting your view:
MLIR would be the abstract class of IRs, and IRs at different scopes (i.e. the scope of TVM IR is at a higher level than LLVM IR) would be derived classes of it. In these derived classes (MLIR dialects) the MLIR methods are overloaded (and extended) for the specifics of the derived classes. Therefore TVM IR passes are these overloaded methods and we don’t have to worry that MLIR will make TVM obsolete?
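To make the class analogy above concrete, here is a minimal Python sketch. All class and method names are hypothetical and invented for illustration; none of them are MLIR or TVM APIs, and the "passes" are toy rewrites on a list of op names.

```python
from abc import ABC, abstractmethod


class IRDialect(ABC):
    """Hypothetical stand-in for the 'abstract class' role of MLIR in the analogy."""

    name = "generic"

    @abstractmethod
    def run_passes(self, ops):
        """Dialect-specific transformation; every dialect supplies its own."""

    def verify(self, ops):
        # Shared infrastructure that every dialect inherits unchanged.
        return all(isinstance(op, str) and op for op in ops)


class TensorLevelDialect(IRDialect):
    """Hypothetical higher-level dialect (the role TVM IR plays in the analogy)."""

    name = "tensor"

    def run_passes(self, ops):
        # Toy pass: fuse an adjacent conv2d + relu pair into a single op.
        out = []
        for op in ops:
            if out and out[-1] == "conv2d" and op == "relu":
                out[-1] = "conv2d_relu"
            else:
                out.append(op)
        return out


class LowLevelDialect(IRDialect):
    """Hypothetical lower-level dialect (closer to the role of LLVM IR)."""

    name = "lowlevel"

    def run_passes(self, ops):
        # Toy pass: drop no-op instructions.
        return [op for op in ops if op != "nop"]
```

In this picture the base class only contributes shared infrastructure (`verify`), while the passes that matter live in the derived classes.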
A couple of things which still kind of bug me with this interpretation are:
- MLIR seems to (natively?) support descriptions targeted for polyhedral dialects. Since TVM does not use the polyhedral model, I guess we would be limited in the usage of this information but could it also be an impairment?
- If MLIR is the “root class” of all other dialects, then if a parser reads the highest level input form (e.g. ONNX) and translates it into MLIR, how can we ensure that the parser doesn’t omit any information which we need in the TVM dialect?
Good comments. I would like to separate the answer into two parts; this is an updated view after taking a look at MLIR’s codebase.
Interpretation of MLIR’s Vision
I think what you answered reflects MLIR’s vision: make the abstract class of IR and derive dialects. It does not necessarily provide specific passes for each dialect, though, so if X-IR is a dialect of MLIR, dialect-specific passes are still needed for it.
The polyhedral dialect is a dialect in MLIR. In the current case, the polyhedral IR is part of the MLIR codebase, which gives the view of “native”, but nonetheless it is a dialect just like any other automatic optimization dialect. The fact that it is part of the native code base does give an opinionated view of what automatic optimization should be like in the MLIR ecosystem. I think it is still very much an open problem. TVM has done a lot in this direction, and we can collectively innovate in this area.
How TVM can work with MLIR
First of all, MLIR won’t make TVM obsolete. On the contrary, it can help the TVM stack by providing insights into IR design and possibly some lowering infrastructure. The community will keep improving our current IR infrastructure toward a better, unified TVM-IR infra. We will try to define TVM dialects in MLIR to see whether it makes sense to allow bi-directional translation between MLIR and TVM-IR; this way we can benefit from some of the infra provided by MLIR and make TVM work together with MLIR’s ecosystem.
In my vision, there could be a vendor-neutral library that implements higher-level MLIR dialect operators at a lower (algebraic) level. There could be a graph optimizer, a tensor optimizer and a traditional compiler optimizer. The graph optimizer does higher-level graph optimizations like fusion and also serves as a driver. It partitions the graph, inlines operators from the vendor-neutral library and directs selected partitions to the tensor optimizer. It also invokes traditional compilers for traditional global optimizations. It should also accommodate vendor-specific libraries by keeping them as intrinsics to be lowered into function/kernel calls. The tensor compiler will not see the dialects.
My take is,
MLIR is a replacement for HalideIR: 1) compiler infra support, like CFG/DFA/SSA, with which we can avoid pattern-matching-style passes on Halide, which are hard to maintain; 2) other, better utilities, like a text IR; 3) a unified IR for multiple levels, graph and tensor.
I agree with the idea of having an MLIR phase in TVM. If it’s indeed better, we can move our work to MLIR gradually, or just write new optimization passes on MLIR.
Some of the good directions of MLIR, like the text format, are already present in Relay. A natural next step would be the unification of Relay and the tensor-level IR to bring a unified TVM IR.
Note that MLIR did not make the choice of IR design, so we still need to make a deliberate choice on how to design the IR. We could use more thoughts on “pattern-matching style pass” vs. transformations on CFG (which we could move to a different thread). Note that both are pattern matching, just on different structures. In my experience, Halide’s loop-nesting IR benefits from certain high-level information and quick prototyping, while CFG+SSA is good for codegen (the role of LLVM).
Would love to know your opinions about customized data types, like strs, lists, dicts, etc.
Personally I feel it is rather high-level but is necessary in the long term if we want to represent complicated deep learning models.
Personally I am not quite into polyhedral optimization for now, mainly because most kernels in deep learning can get fine performance with handcrafted scheduling. For very computationally intensive kernels we already have good vendor library support. By comparison, graph-level optimization is more of a low-hanging fruit.
Given that some of the discussions are about the design of TVM’s IR, how about we start a new thread on that?
Polyhedral optimization (or at least the ability to easily apply polyhedral-like analysis) might be attractive for ASICs though; it could help to build a smarter tensorizer.
It’s true. Handcrafting doesn’t scale when # of ASICs increases.
Hmm, I don’t think TVM really has a bigger problem with hand-crafting (read my comment on the next quote). I also think every ASIC developer would have to commit to “at least” defining TVM scheduling rules. Getting that for free would obviously be nice, but I don’t think it’s realistic. That scaling in # of ASICs would be completely transparent to the development of the TVM infrastructure.
There is some flexibility in TVM’s scheduling rules.
I mean given a certain layer-type with (or without) possible fusions, you can have more than one scheduling rule.
You would have a higher-level scheduling rule decision-making module (which is purely SW) to actually pick which of the scheduling rules to use. Yes, the scheduling rules are then hand-crafted, but most likely somewhat templated, so that at least to some degree you can generate diverse “flavours” (imagine the block sizes and ordering of loops) of the routine.
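As a rough sketch of that idea (one template, many flavours, plus a software module that picks one), assuming a made-up cost model: the function names, the `scratchpad` limit and the cost formula below are all hypothetical and are not TVM or AutoTVM APIs.

```python
from itertools import permutations, product


def schedule_candidates(block_sizes=(8, 16, 32), loops=("i", "j", "k")):
    """Enumerate flavours of one templated schedule: block size x loop order."""
    for block, order in product(block_sizes, permutations(loops)):
        yield {"block": block, "order": order}


def toy_cost(cfg, n=64, scratchpad=1024):
    """Made-up cost model (an assumption, not a real hardware model)."""
    tile_elems = cfg["block"] * cfg["block"]
    if tile_elems > scratchpad:
        # Tile does not fit on the hypothetical scratchpad: reject it.
        return float("inf")
    n_tiles = (n // cfg["block"]) ** 2
    # Penalise orders where the reduction axis k is not the innermost loop.
    penalty = 0 if cfg["order"][-1] == "k" else 10
    return n_tiles + penalty


def pick_schedule(candidates, cost_fn):
    """The 'decision-making module': pick the cheapest flavour."""
    return min(candidates, key=cost_fn)


best = pick_schedule(schedule_candidates(), toy_cost)
print(best)
```

The template itself is still hand-written; only the choice among its parameterizations is automated.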
I am no expert in polyhedral scheduling, but that sounds like a very complex problem to solve (at least fully automated).
Polyhedral would technically not require these templates, but would require the scheduling algorithm to conform to the capabilities of the ASIC datapaths, address generation patterns, accelerator system resources (possible scratchpad usage), etc. This holds for any kind of operator fusion. Here I would guess that some templated schedules or constraints would again be handcrafted.
The set of loop optimizations that TVM natively supports is a subset of all those possible with polyhedral, so it would be interesting to know which are not available (not even through a mix of TVM scheduling primitives). The only one I can think of is loop skewing (to generate a SW pipeline), but even then I have a mental sketch of how it could still be realizable without any extension of the TVM primitives.
If someone is a poly expert and totally against what I say, please contribute to the thread or contact me!!!
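For readers not familiar with loop skewing, here is a small plain-Python illustration of the transformation itself (not of how it would be expressed with TVM primitives). The stencil and function names are assumptions chosen for the example: each element depends on its upper and left neighbours, so neither original loop is parallel, but after skewing with t = i + j all iterations on the same wavefront t are independent of each other.

```python
def stencil_original(n, m):
    """Reference loop nest: a[i][j] depends on a[i-1][j] and a[i][j-1]."""
    a = [[1] * m for _ in range(n)]
    for i in range(1, n):
        for j in range(1, m):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


def stencil_skewed(n, m):
    """Same computation after skewing: the outer loop walks wavefronts t = i + j."""
    a = [[1] * m for _ in range(n)]
    for t in range(2, n + m - 1):
        # Bounds keep 1 <= i <= n-1 and 1 <= j = t - i <= m-1.
        for i in range(max(1, t - m + 1), min(n, t)):
            j = t - i
            # Both inputs lie on wavefront t - 1, which is already complete.
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


print(stencil_original(4, 5) == stencil_skewed(4, 5))
```

The inner loop of the skewed version could be run in parallel or pipelined, which is what makes the transformation interesting for accelerators.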
There is one thing which I think TVM could do better and would probably fit into the MLIR vision, and that is allowing the NNVM/Relay fusion rules of nodes to be an input from ASIC backend developers.
Obviously one path is to turn off all fusion and then implement “glue fusion” routines which are more target dependent (each ASIC developer would have to do this), but I am not sure whether it would break some of the reusability of TVM code (e.g. TVM routines to visit nodes in a graph or something like that). I guess another path would be to overwrite some layer type definitions (e.g. if I want to fuse conv and pool, then define pool as an element-wise operation; again, every ASIC developer would have to do this), but then again I have no idea what extra problems that brings down the road.
A good tensorizer is an open problem that we all need to solve. Poly has neither an advantage nor a disadvantage on this problem. This is a technical direction we should push to solve in TVM.
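A minimal sketch of what "fusion rules as a backend input" could look like, assuming a linear chain of ops and a rule table of fusable (producer, consumer) pairs. The function, the rule tables and the op names are hypothetical illustrations and do not reflect how NNVM/Relay fusion is actually implemented.

```python
def fuse_graph(ops, fusable_pairs):
    """Greedily fuse adjacent ops in a linear chain according to backend rules.

    ops: op-type names in execution order (a linear chain, for simplicity).
    fusable_pairs: set of (producer, consumer) pairs the backend can fuse.
    Returns a list of groups; each group is a list of ops fused together.
    """
    groups = []
    for op in ops:
        if groups and (groups[-1][-1], op) in fusable_pairs:
            groups[-1].append(op)
        else:
            groups.append([op])
    return groups


# Hypothetical rule tables supplied by two different backends.
GENERIC_RULES = {("conv2d", "relu")}
ASIC_RULES = {("conv2d", "relu"), ("relu", "max_pool2d")}

chain = ["conv2d", "relu", "max_pool2d", "dense"]
print(fuse_graph(chain, GENERIC_RULES))
print(fuse_graph(chain, ASIC_RULES))
```

With this shape, the graph-walking code stays shared and only the rule table changes per target, so a backend would not need to redefine pool as element-wise just to get conv and pool fused.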
The common part between Poly and TVM is the usage of integer and integer set analysis. I believe that is the direction where MLIR and TVM might collectively improve and learn from each other.
So the key idea here is to apply integer set analysis, which we could call polyhedral or hypercube analysis.
Good discussions here. The design principle of the TVM stack is to “be intelligent and pragmatic”. This means we want as much automation as possible, but also provide ways to make use of human domain information like schedule templates and tensorized micro-kernels when necessary. We will likely continue to use this principle.
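As a toy illustration of what integer set analysis means for dependence checking, the sketch below enumerates the set of iteration pairs where a later read touches the location written by an earlier iteration. The function name is hypothetical, and the brute-force enumeration is an assumption made for readability; libraries such as isl reason about these sets symbolically instead of enumerating them.

```python
def dependence_distances(n, write_index, read_index):
    """Brute-force integer-set analysis for a single loop `for i in range(n)`.

    Builds the set { i_r - i_w : 0 <= i_w < i_r < n,
                     write_index(i_w) == read_index(i_r) },
    i.e. the distances of loop-carried read-after-write dependences.
    An empty result means no such dependence exists between iterations.
    """
    return {
        i_r - i_w
        for i_w in range(n)
        for i_r in range(i_w + 1, n)
        if write_index(i_w) == read_index(i_r)
    }


# A[i] = A[i - 2] + 1: each iteration reads what was written 2 iterations ago.
print(dependence_distances(8, lambda i: i, lambda i: i - 2))

# A[2*i] = A[2*i + 1]: even writes never meet odd reads.
print(dependence_distances(8, lambda i: 2 * i, lambda i: 2 * i + 1))
```

The same question (is this set empty, and if not, what are the distances?) is what decides whether a loop can be parallelized, reordered or tensorized.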