[RFC] MLIR Frontend

RFC for Relay MLIR Frontend

Authors: Yanan Cao, Yong Wu Contributors: Yida Wang, Haichen Shen, Yao Wang

Summary

We propose a solution that can give TVM/Relay top-notch model/op coverage for TensorFlow with affordable effort.

Motivation

TensorFlow, as the most dominant machine learning framework, has a scarily large number of operations, in fact over 1200+ of them, out of which 500+ are commonly used computational ops. TVM currently has hand-written lowering functions for 145 of them from TF to Relay. These 145 operations are enough to support some well known computer vision models like ResNet and SSD. This is far from enough when users want to bring custom models with extra operations.

Proposal

Instead of converting from TF→Relay directly, we can consider using XLA HLO as an intermediate step to reduce engineering cost, namely TF→XLA HLO→Relay .

XLA is a Tensorflow-oriented machine learning compiler with XLA HLO as its frontend IR. HLO is designed exactly to address the problem of TF op set overwhelming a compiler.

HLO has a few characteristics that, I believe, make it a great target for Relay to import from:

  • HLO Operation Set is tiny : At the time of writing, there are <100 operations. Compared to number of TensorFlow operations, it is much easier to implement lowering from HLO→Relay than TF→Relay directly.
  • HLO Operation Set is stable : Rarely are any new operations added to HLO, saving us the trouble of playing catch-up game.
  • TensorFlow GraphDef → HLO lowering support is top-notch : In TensorFlow repo, there are symbolic execution kernels that can lower ~400 TensorFlow op to XLA HLO, covering vast majority of mathematical operations model creators use. They are all well tested and maintained.
  • Other than TensorFlow, major frameworks like PyTorch, JAX and Julia all have different degree of XLA lowering support. This means we can get additional model/op coverage for these 3 frameworks at almost no additional cost .

However, XLA HLO isn’t all perfect, it requires strict static shapes for all tensors and operations. This limits model coverage and sometimes requires modification to model in order to compile. Luckily, with MLIR and its HLO Dialect, representing dynamic shapes with HLO ops is now possible. More details can be found in lowering path design.

TensorFlow Lowering Paths Design

Lowering Flow Diagram

Instead of lowering from TensorFlow to HLO directly, we could alternatively leverage MLIR and its HLO dialect for faster implementation and support for model dynamism.

To be more specific, MLIR is a compiler infrastructure that helps define dialect IRs and their corresponding compilers. In our context, the most relevant dialects are:

  • MLIR-TF Dialect is an isomorphic representation of TensorFlow GraphDef.
  • MLIR-HLO Dialect is a near-isomorphic representation of XLA HLO. It is different from HLO in that it 1) represents dynamic shape 2) Has a slight different set of operations, but can still round-trip (convert to and from) vanilla XLA HLO.
  • MLIR-Relay Dialect should be isomorphic to Relay with a trivial conversion between the two.

Components needed:

  • tf-mlir-translate tool: Needs to be enhanced to functionalize Switch/Merge control flow nodes and automatically identify input/output node names/shapes.
  • tf-opt tool: Contains lowerings from MLIR TF to MLIR HLO. This is in active development by Google. There are currently ~150 operations implemented.
  • MLIR Relay Dialect and its thin conversion to Relay
  • Lowerings from MLIR HLO to MLIR Relay: Using MLIR infrastructure, this can be implemented with less effort than writing custom converter in option 1.

XLA HLO → Relay

To prove viability, we created a table with a rough mapping from each XLA HLO operand to their Relay equivalences. This table is created after referencing HLO operations and their semantics, Relay TOPI, HLO OpCode Definitions, XlaBuilder.

In following table, you can see all XLA HLO ops, in fact even more than what HLO has. This is because MLIR HLO and vanilla HLO ops, though largely overlap, still have some differences. Since I believe MLIR-HLO lowering path is better due to support of dynamic shapes, I listed all the vanilla HLO-only ops at bottom of the table with comment claiming that they are not in XLA Builder API.

XLA HLO Needed? Equivalent Relay Ops Comment
AfterAll No Control dependency operation, not applicable to Relay
AllReduce No Multi-core only, not applicable to single-core inference
AllToAll No Multi-core only, not applicable to single-core inference
BatchNormGrad No training related
BatchNormInference Yes batch_norm
BatchNormTraining No training related
BitcastConvertType Yes reinterpret Maps to BitcastConvert in hlo_opcode
Broadcast Yes broadcast_to slight difference in “shape” input, maps to broadcast in hlo_opcode
BroadcastInDim Yes broadcast_to maps to broadcast in hlo_opcode
Call Yes
Cholesky Yes composable with existing ops, formula here (https://www.tensorflow.org/xla/operation_semantics#cholesky)
Clamp Yes clip
Collapse Yes reshape Only in XlaBuilder API, maps to reshape in hlo_opcode
CollectivePermute No Multi-core only, not applicable to single-core inference
Concatenate Yes concatenate
Conditional Yes relay.expr.If
Conv Yes conv2d_{nchw, hwcn}, conv1d, conv3d Limited support for 3-4-5 ranks
ConvWithGeneralPadding Yes conv2d_{nchw, hwcn}, conv1d, conv3d Limited support for 3-4-5 ranks, maps to convolution in hlo_opcode
ConvertElementType Yes cast Only in XlaBuilder API, maps to “convert” in hlo_opcode
CrossReplicaSum No Only in XlaBuilder API, multi-core only, not applicable to single-core inference
CustomCall No This is a way to invoke arbitrary CPU code, very rarely used.
Dot Yes multiply Only in XlaBuilder API, the “dot” in hlo_opcode actually maps to DotGeneral
DotGeneral Yes reshape+multiply
DynamicSlice Yes strided_slice
DynamicUpdateSlice Yes May need to add new op into relay
Add Yes add
Sub Yes subtract
Mul Yes mul Maps to multiply in hlo_opcode
Div Yes divide
Rem Yes divide, subtract Maps to “remainder” in HloOpcode
Max Yes max
Min Yes min
And Yes logical_and
Or Yes logical_or
Eq Yes equal Maps to “compare” in HloOpcode
Ne Yes not_equal Maps to “compare” in HloOpcode
Ge Yes greater, equal Maps to “compare” in HloOpcode
Gt Yes greater Maps to “compare” in HloOpcode
Le Yes less, equal Maps to “compare” in HloOpcode
Lt Yes less Maps to “compare” in HloOpcode
Abs Yes abs
Ceil Yes ceil
Cos Yes cos
Exp Yes exp
Floor Yes floor
IsFinite Yes isfinite
Log Yes log
Not Yes logical_not
PopulationCount Yes May need to add new op into relay
Neg Yes negative
Sign Yes sign
Tanh Yes tanh
Fft Yes May need to add new op into relay
Gather Yes gather_nd
GetDimensionSize No ndarray_size Needed only for XLA dynamic padder
SetDimensionSize No Needed only for XLA dynamic padder
GetTupleElement Yes TupleGetItem
Infeed No not inference related
Iota Yes May need to add new op into relay
Map No Found no use in OSS TensorFlow
Pad Yes nn.pad
Recv No Cross device communication, not applicable to single-core inference
Reduce Yes Cannot support full flexibility because Relay doesn’t allow op to take function as argument. However, we can pattern match to support common cases.
ReducePrecision Yes cast, cast
ReduceWindow Yes Cannot support full flexibility because Relay doesn’t allow op to take function as argument. However, we can pattern match to support common cases.
ReduceWindowWithGeneralPadding Yes Cannot support full flexibility because Relay doesn’t allow op to take function as argument. However, we can pattern match to support common cases.
ReplicaId No Needed when data parallelism is involved
Reshape Yes reshape
Rev Yes reverse
RngNormal No Inference graphs should not need RNG ops
RngUniform No Inference graphs should not need RNG ops
Scatter Yes May need to add new op into relay
Select Yes select
SelectAndScatter Yes Same problem as Scatter, may need to add new op into relay
Send No Corresponds to “Send” and “SendDone” in HloOpcode, cross device communication
Slice Yes strided_slice
Sort Yes Argsort Cannot support full flexibility because Relay doesn’t allow op to take function as argument. However, we can pattern match to support common cases.
Transpose Yes transpose
TriangularSolve Yes Should be composable with existing Relay ops, semantics here (https://www.tensorflow.org/xla/operation_semantics#triangularsolve)
Tuple Yes tuple
While Yes Recursive calls
AddDependency No Only for XLA internal use
Atan2 No Not in XLA builder API
Clz No Not in XLA builder API
Compare No Not in XLA builder API, internal implementation for all comparisons, no need to support separately
Complex No Not in XLA builder API
Constant No Not in XLA builder API
Copy No Not in XLA builder API
CopyDone No Not in XLA builder API
CopyStart No Not in XLA builder API
Domain No Not in XLA builder API, only for partitioning computation, no need to support
Expm1 No exp, subtract Not in XLA builder API
Imag No Not in XLA builder API
Log1p No log, subtract Not in XLA builder API
Parameter No Not in XLA builder API, represents input to computation
PartitionId No Not in XLA builder API, only needed for multi-device computations
Power No power Not in XLA builder API
RngGetAndUpdateState No Not in XLA builder API
RoundNearestAfz No Not in XLA builder API
Rsqrt No rsqrt Not in XLA builder API
ShiftLeft Yes Not in XLA builder API
ShiftRightArithmetic Yes Not in XLA builder API
ShiftRightLogical Yes Not in XLA builder API
Sin No sin Not in and not used by XLA builder API
Sqrt No sqrt Not in XLA builder API
Trace No Not in XLA builder API, only for profiling and tracing
TupleSelect No Not in XLA builder API
Xor Yes xor Not in XLA builder API

We can see from the table that:

  • Most of operations have a straightforward one to one mapping
  • A few operations require decomposition into several existing Relay ops, like Cholesky, TriangularSolve
  • 4 HLO operations require addition of new Relay Ops
  • 6 operations can be partially supported, namely convolutions (Relay supports rank <= 5), reduce and sort ops (due to lack of fully-flexible function as op argument support). However most common use cases are covered, like low-rank tensor convolution, ReduceMax, ReduceMin, ReduceMean, ReduceSum etc.

Overall, even though HLO→Relay Coverage isn’t perfect, I believe it is enough for all us to cover all but the most extreme cases, like rank>5 convolution, reduction ops other than Min/Max/Sub/Mean etc.

Summary

We believe that TF→HLO→Relay is a good alternative path to address model/op coverage that minimizes amount of effort while making TVM a first-class TensorFlow backend . Additionally, same work can be reused to acquire coverage for PyTorch, JAX and Julia.

13 Likes

Thanks for the nice RFC. And happy to see folks other than us also pay attention to the MLIR-as-a-bridge design to integrate TVM as a backend for TensorFlow(or maybe more than TensorFlow^-^).

Inside Alibaba, we are also working on the related things.

To be more specific, for static shape JIT compilation scenario, we heavily customize XLA, such as adding aggressive optimization to tease performance, enhancing its infrastructures to make sure it can be turned on by default for lots of production workloads. Also some of our colleagues already implemented an internal version for integrating TVM as a backend of XLA:).

For dynamic shape, we do leverage MLIR since in our opinions it has native support for dynamic shape(some of our thoughts haven been reflected in this MLIR ODM), better modularized design philosophy and enable use to integrate different pieces of our AI compiler in a unified approach. It would be more than happy to share some of our ongoing work and to see whether there is potential collaborations with you folks.

Thanks

1 Like

Nice proposal, one fruit for thoughts would be whether can we have a hybrid TF / HLO combo, which is made possible under the TVM unified IR infra.

The idea is that we could run a cut during conversion, making MLIR function as a possible extern function in the IRModule under the TVM unified IR. We run the high level translation(from TF high-level) to relay, cut up the un-supported fragments as MLIR sub-functions(that relay can call into), then use the HLO lowering to lower these tvm::MLIRFunc that are not covered by the TF->relay translator. The advantage is that we can preserve high level info when necessary but still is able to make use of HLO for coverage

Thank you for bringing up this great proposal! Like @yangjunpro mentioned, we at Alibaba, along with @lfengad @dingli have move along this direction for quite some time now. We choose GraphDef->HLO->Relay for the exact same reasons, that it is a stable and concise IR to work on. Glad to be looking at a concrete analysis table on discrepancy on op sets of different IR!

The work we have done is to use XLA framework to cluster a subgraph of tensorflow and use TVM as codegen, the rest remain executed by Tensorflow itself. We now have a demo running with resnet18 and mlp (where most of network can be clustered to tvm part). We made this choice to fully utilize the extensive op set of Tensorflow Graph that allow flexible neural network construction, while for those TVM can do better in code generation, we offload them to TVM for better performance and our demo is showing promising results.

MLIR is definitely on our road map to be the future IR infra, one of the reason is to leverage "a hybrid TF / HLO combo made possible under the unified IR infra." as mentioned by @tqchen. Also dynamic shape is a true problem for HLO IR, that we hope HLO dialect in MLIR is helpful on that regard, and we have many colleagues working on that as well.

One concern is still about node coverage. While relay op set is growing, we still roughly see

Relay < HLO < TF Graph

The unsupported part can be codegened by MLIR as @tqchen discussed (in that case TF/HLO combo conversion becomes even more important as TF Graph is not necesarily covered by HLO), or the other way around like what we did to reuse tensorflow runtime. But at this stage, we are more than happy to have an MLIR dialect/frontend for relay and we anticipate to exchange our thoughts and experience on this.

My specific comment is under the context of the TVM’s unfied IR infra, to elaborate further.

The current TVM’s new IR Infra incorporates three variants of functions in the same IRModule, tir::PrimFunc, relay::Function, and we will have additional ExternFunc for MLIR variants of that to represent an MLIR dialect functions. And use the same set of type system to represent the function signatures so that we can freely enable calls among the functions.

To enable a hybrid comobo translation, here is the ultimate TVM ingestion path that we can envision: start from TF GraphDef

  • Step 1: GraphDef -> tvm::IRModule: translate as much as GraphDef as we can, including high-level control flows, the rest parts of are collapsed as several tvm::ExternGraphDefFunc. The fragments becomes relay calls into these tvm::ExternGraphDefFunc
  • Step 2: Lower the remaining GraphDef fragments(in tvm::ExternFunc) to HLO (as tvm::ExternHLOFunc)
  • Step 3: Translate the rest of the HLO functions to relay

Step 1 ensures us to cover high-level important operators, as well as control flow constructs so that they can be natively converted without losing the information. Step 2-3 ensures more coverage through HLO. By combining the two, we will be able to get the best of both worlds.

Generalizing a bit, because we can represent different variants of functions under the same tvm::IRModule, such kind of multi-stage lowering might not only be relevant to TF, but also other frameworks with multi-stage IRs. The conventional view was to only tap into one of them, but the new design allows us to tap into multiple of them and gives more advantages under the performance/coverage tradeoffs.