[IR] Unified TVM IR Infra

Introduction

The TVM stack has been evolving for more than two years now. The current compiler stack contains several components that were designed at different points in time. This RFC proposes a unified IR infrastructure for the TVM stack by combining past lessons.

From a high-level the proposed infrastructure will consist of:

  • A unified module, pass and type system for all IR function variants.
  • Two major variants of IR expressions and functions: the high-level functional IR(relay)and the tensor-level IR for loop optimizations.
  • First-class Python and hybrid script support, and a cross-language in-memory IR structure.
  • A unified runtime::Module to enable extensive combination of traditional devices, microcontrollers and NPUs.

Unified IR Module and Pass Infra

The main component of our proposal is to introduce a unified IRModule structure that can contain different variants of functions(relay::Function/te::Function). The namespace te(tensor expression) is a tentative namespace for low-level functions, and comments and suggestions about name choices are more than welcome.

This change will simplify the intermediate representation data structures and terminology used in our previous compilation flow, which can be shown as follows:

 importer: model -> 
 high-level optimizations: relay::Module -> optimizations -> relay::Module  ->
 low-level optimizations: compute/schedule declaration -> Stmt-> ir passes ->Stmt ->
 device-specific codegen: LoweredFunc -> runtime::Module

As shown above, our current flow has different intermediate data structures at different stages (relay::Module, Stmt, LoweredFunc) with different terminologies. Under the new design, we will have a single module structure, and transformations between IRs become ir::Module to ir::Module transformations.

More importantly, we can use specific calling conventions, enabling different function variants to call each other. The following code snippet is a mock-up to demonstrate a module containing both a relay.Function and te.Function. The relay_add_one function can call into the te_add_one function using the destination-passing convention, where outputs are passed as inputs to the function.

def @relay_add_one(%x : Tensor((10,), f32)) {
    call_destination_passing @te_add_one(%x,  out=%b) 
} 

def @te_add_one(%a: NDArray, %b: NDArray) {
    var %n
    %A = decl_buffer(shape=[%n], src=%a)
    %B = decl_buffer(shape=[%n], src=%b) 
    // body ir contents need to be evolved
    for %i = 0 to 10 [data_par] {
        %B[%i] = %A[%i] + 1.0 
    }
}

Enabling both high-level functions(relay.Function) and tensor-expression functions to coexist in the same module enables the potential for cross layer optimizations. Of course, most of the transformations will only operate on one type of function and will simply ignore other functions in the same module.

Most importantly, the proposed change will minimize concepts for developers. Developers only need to know about ir::Module and runtime::Module. Every transformation is ir::Module -> ir::Module.

AutoTVM and other schedule transformations can be viewed as an intelligent way to transform an ir::Module. We are exploring ways to embed the current tensor expressions as a special type of Function in ir::Module.

Discussions

One of the design questions is whether to unify the te::Expr and relay::Expr into a single base-class. relay::Expr allows tensor types and supports broadcasting in operations. te::Expr only points to primitive data types(int, float, pointers, buffers). Unifying them into a single base allows reuse of certain AST nodes and potentially mix the expressions. On the other hand, combining two namespaces brings additional complexity of mixing expressions that could be invalid. The separate namespace also allows us to use the low level expression in the tensor types to express shape constraints. Given these considerations, thus we suggest that the two needs to be separated, at least for now. Note that however, both use the same FunctionExpr, to enable calling across different types of functions.

Another design question is how to express shape and data type buffer constraints in the te.Function. One option is to express it as a constraint in the type and make te.Function polymorphic to n. While this approach works well for shape constraints, it is harder to express the sharing of data field for in-place update. The current alternative is to express the constraint via bind expression, we can also make bind information as a part of the function signature, but not as type. This signature is a more faithful representation of the final generated code. This approach allows us to make use of the constraints during analysis. It does introduce less information for shape-related typing checking when a relay function calls into a te.Function. We can resolve that by providing an auxiliary function that reconstructs the function type with constraints.

External Function Interpolation

Besides relay.Function and te.Function, we can also introduce other external function types. As long as we define a clear function calling convention into these modules. For example, we can introduce an ASM function to embed external assembly code. We can also incorporate functions expressed in other IRs such as TorchScript, MLIR and LLVM to make use of these ecosystems.

The only limitation of external function is the ability to do cross-function optimizations (as special code-path has to be written for each external IR). Given most of our use-cases partitions functions in coarse grained matter, we expect that such impact will be low, as long as the major chunk of the code is optimized using the unified in-house variants.

First-class Python and Hybrid Script Support

Python is the most popular language for deep learning frameworks due to its great flexibility and rich ecosystem. We plan to provide first-class python support. This is also a fundamental infrastructure design choice that makes TVM different from other stacks.

Specifically, we expose a programmatic interface which allows developers and researchers to customize compilation, and write new passes in Python. All IR data structures can be created, manipulated and transformed using Python APIs. We will leverage the rich ecosystem of Python ML to further explore ML guided compilation.

Additionally, we plan to specify a subset of python AST that can express every possible function in the TVM IR. The new dialect will be an extension of the current TVM hybrid script, and will serve as a way to construct and inspect the IR in Python. Eventually, it can serve as a secondary text format. In order to cover all the possible IR nodes to enable round trip, we will take a gradual upgrade approach by introducing the meta keyword in the hybrid script similar to the meta support in the current relay text format. The meta is a dictionary of IR nodes that are opaque in the text format but can be populated by json blob. We can always serialize the IR components that are not in the hybrid dialect specification into meta and improve the dialect specifications gradually by adding more native constructs. The code below gives a same function in the hybrid script format.

ML compilation is an open research area, while its great value has already leads to quick transfer to production. We believe these additional flexibilities will increase the rate of innovation, and are critical to help accelerate the wildly open field of deep learning compilation. When the optimization passes become stable, we can seamlessly move passes directly into the C++ core when product ready.

Pave Ways for to Other Languages

Besides the python support, the cross-language runtime object protocol also paves ways for bringing native compiler support into other languages besides python and C++. For example, one possible candidate is rust based the current community interest. We can build language bindings that allows us to write additional compilation passes in rust and expose them to the core tvm compiler. We already provide runtime support on python, rust go, javascript, java.

Extensible Unified Runtime Module

Because TVM stack supports different kinds of compilation targets, we need a unified runtime interface to expose the compiled module to the developer. runtime::Module is an abstract interface for all possible compilation targets. Codegen is defined as a transformation from ir::Module -> runtime::Module. The current runtime module interface already handles the serialization(as shared library) and runtime linking(via PackedFunc).

As an artifact of compilation, we will get a module that imports several modules of different types. For example, we could have a graph runtime module that calls into host functions defined on the DSOModule, which then calls into CUDA module to invoke device functions. It is important to create an extensible mechanism to be able to run and export any heterogeneous runtime modules. We have an on-going RFC that proposes a unified exportation format to package all possible runtime::Module into a single shared library.

We can also evolve the interface to better handle cases such as hetrogenuous devices and flexible task scheduling.

Discussions

One tension in the runtime design is whether to introduce a more opinionated view of serialization format (e.g. ONNX for model serialization) or allow each module define their own format. We eventually use the later to give more flexibility to each of the module, and only ask each module to expose necessary interface for a unified packaging. Any module packaged in this way will immediately enjoy the benefit by interacting with other components of the stack, including automatic exposure python, java and other runtime APIs, and use of the RPC infrastructure for scalable remote profiling.

Relation with other IRs

Deep learning compilation is an active field and there are other related efforts such as MLIR, LLVM, TorchScript, TensorRT and vendor specific compilation and runtime toolchains.

We believe the most important principle as an open source project is to allow our developers to tap into other ecosystems. In particular, we can incorporate these compilation flow and runtime into tvm by using ExternFunc in the compilation flow, and a specific runtime in the unified runtime interface.

We would also like to facilitate cross IR translations, for example, TorchScript to Relay importer and expose compilation back to a PyTorch compatible function. Another area of interest would be to create translation from MLIR-TF dialect to TVM IR, and make enable MLIR functions as part of ExternFunc support to give better support and interpolation for TF ecosystem.

The scope of TVM’s IR remains very focused – automating the optimization of deep learning workloads on diverse hardware backends and enable our developers to easily integrate the TVM into their stack.

Timeline Upgrade Path

Once we agree on the general design, we can step several refactor steps to bring the current codebase to the new IR infrastructure.

  • Introduce ir::Module and ir::Function
  • Move relay::Module to make use of ir::Module
  • Move the current low-level IR infra ir_pass and LoweredFunc transformations to Module->Module
  • Introduce text format and hybrid script specifications of the unified ir::Module
  • Continue to evolve the design of the IRs themselves

Please share your comments:)

12 Likes

Thanks Tianqi for proposing this RFC, especially in such a swift manner right after the completion of Shanghai meetup:)

One thing I think important to make the discussion of this RFC converge faster is to provide more concrete examples and use cases.

e.g, by mixing optimization with in different stages(Relay and Tensor Expression IR), what kind of benefit that may bring?

Several small comments:

  • Is it really necessary to expose the first-class Python support to let ML-guys practice with more aggressive and fancier stuffs such as AutoSchedule. Since in my opinion, only exposing the python support is not enough, ML-guys still need to know the details about the IR representation. By providing too many flexibilities, it may also bring mind overhead since there are multiple ways doing the same things;

  • Regarding to the unification of runtime, I like this idea. One thing may deserve to be pointed out is that ideally multiple different runtimes should work happily simultaneously. However, in practical case, multiple runtime may be at odds with each other due to several detailed things like memory management, thread management, etc. So in my opinion it is tempting to design an unified runtime architecture, however this may bring some potential side effect. Another design choice is to make TVM as a compact core component and let other system/runtime integrate TVM. In Alibaba cloud, this is what we usually called for our product—“to be integrated”:). Also some key algorithms some as graph partition algorithm should be put at TVM’s radar. In one word, I am a little bit concerned about the practical result of the unification of runtime. But I do think several key components for integrating TVM with other system/runtime/engine should be provided by TVM such as graph partition algorithm.

Thanks.

Thanks @yangjunpro for great points, The purpose of RFC is to get everyone’s idea so that our eventual design would be even better.

Here are some of my thoughts:

Benefit of Having Function Variants in the Same Module

One potential benefit is that we might get a module that fit this situation naturally in compilation. For example, when we compile a program, the primitive operations will need to be optimize in the low-level form, while the high-level function that calls into these primitive ops remain in the functional form. Being able to represent such state of compilation allows us to look at them together. For example, propagate tied memory allocation decisions into the low-level function(from the high-level callers), select the right calling convention(if there are multiple ones).

Of course, most of the current optimizations we know are still tiered, and we will need to see how will the future developments unfold.

First-class Python Support

First-class python support is not the necessary condition for ML, but it will accelerate innovations. If there is a known algorithm X that is the ultimate silver bullet, then perhaps we could go and solve that by implementing X in the language we like. On the other hand, we all agree that there is yet a silver bullet. In the case of the open area such as ML compilation, I think the best way to reach our final goal is to enable more people – ML scientists, compiler engineers and deep learning system engineers to innovate together.

Of course, we will introduce layered abstractions, so ML scientists can choose to get directly representations in graphs, feature, or cost model, but still provide mechanism to go into these layer if necessary, and switch between language boundaries.

First-class python support’s impact also goes beyond the ML-based optimization themselves. The same mechanisms can be used by developers and researchers to easily integrate tensorized micro-kernels, supporting new accelerators, or simply as ways to write test-cases for C++ base compilation pipelines.

Finally, the support is build in a way that does not compromise production usage. In contrary, it enhances the speed to the production. Given that all data structure is accessible in languages such as C++ and python, we can quickly migrate the right idea into the C++ pipeline. If we choose to have c++ based solution in the first place, it still serves as a way to quickly write test cases and complicated logics in a compact fashion.

Unification of Runtime

I want to echo your thoughts about the value of compactness and minimalism of the core tvm runtime and the ability to integrate into larger systems. That is exactly our design goal. I would also be very worried if we start to dictate choices of threading, model packaging, serialization or allocation.

There are several roles that a runtime Module API can play:

  • F1: Interface to the user of the runtime when embedding it to a bigger system, keeping it minimum and simple is the primary concern as you mentioned.
  • F2: Ability to package the necessary states and reconstruct them when runtime contains multiple components.
  • F3: Ways to allow modules to work together with one and another(e.g. host calls into GPU code, graph runtime calls into primitive functions).
  • F4: Auxiliary features for profiling, data collection(for ML models), and execution.
  • F5: Runtime resource orchestration(memory, threads).

To reword your concerns, we want a minimum runtime (F1), and would be worried that if we dictates F5, they would interference with other parts of the system. The way to resolve this challenge is to define a minimum modular, interface, and each module to have their own way of exposing functions(by defining ways to return PackedFunc), and serialization – rather than dictating these factors.

As a result, the unified runtime interface remains minimum to allows developers to integrate the runtime into their own system, via the same set of runtime API(F1). The runtime support is also modular, in a sense that developers can choose to pick the runtime module they want, while avoid importing other dependencies. Because of the deleter and compact C ABI support, the runtime object system is also designed in a way such that a resource can be allocated by different part of the module, without having to force everything into a single handler(e.g. memory allocator). So the result is that other system/runtimes can still freely integrate without harmful additional side-effect.

In your particular example of memory manager interference, most of the TVM’s low-level runtime accepts DLTensor, which allows the users to use system-level(e.g. the framework) manager rather than its own. Of course, if the user choose to run a larger-piece of program using graph module in the graph runtime or relay vm, some of the the orchestration(e.g. memory allocation) could be handled by the high-level module themselves(F5). But these additional orchestration and managing features are bought in a modular way, so that can be customized at any timepoint for better interaction with a larger system piece.

In terms of the practical values, having such minimum unified runtime allows us do something that we were not able to do before:

  • Compile a minimum standalone ML module that uses heterogenous devices(CPU, GPU, DSP) and export it as a single dynamic library.
  • Containerize ML applications in minimum sized containers.
  • Bring more dynamic workloads to embedded devices.
  • Easily move models between devices of interest, as long as they are supported by TVM.
  • Gain immediate infrastructure, such as free front-end APIs(python/go/rust/javascript), RPC support for ML data collection.

Finally, because we still need to handle packaging of a host program(CPU) that interacts with a GPU module and other components(e.g. NPUs) (F3). These requirements will eventually leads us to the same unified runtime interface design as the best design choice.

In summary, it is important to have a minimum unified runtime, and propose a design that works for both cases of standalone minimum packaging and system-level integration.