[BYOC][runtime] JSON runtime for BYOC

We have currently built the infra for Bring-Your-Own-Codegen. For demonstration purposes, a simple CSourceModule-style codegen and runtime is used for ccompiler and dnnl (now called oneDNN). The CSourceModule runtime works reasonably well on small examples and is easy to understand. However, it also poses quite a few challenges for the development and deployment of relatively large models, or models with relatively large inputs.

  • Serialization is quite cumbersome, as it normally works on a per-operator basis and emits a wrapper to invoke the library.
  • Handling large constants is difficult. We currently either have to introduce countless assignments or allocate a large chunk of memory in the static segment. These approaches may significantly increase the compilation time.
  • For certain backends, like TRT and dnnl, CSourceModule complicates the use of their execution engines, or even makes it impossible to use them.

This RFC proposes a JSON runtime associated with a JSON serializer for BYOC, which effectively solves the above problems. In addition, this type of runtime is more familiar to the community, as the graph runtime is more or less in this style and we have already implemented a minimal example runtime. This RFC extends the minimal example and generalizes it to all backends with an execution engine.

  • JSON nodes and code generator/serializer

    • Data structures to represent the nodes and entries in a JSON runtime. The serializer converts a Relay program into the JSON format (a sketch of the possible fields follows the declarations below).
    class JSONGraphNodeEntry {};
    class JSONGraphNode {};
    // Serialize a Relay program into JSON format; the graph and params
    // should be saved in the same artifact.
    class JSONSerializer : public ExprVisitor {};
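
    As a rough sketch (the field names here are illustrative assumptions, not the final design), the two node structures might carry something like the following, loosely following the graph runtime's JSON conventions:

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // One input of a node: which node produces it, and which of that
    // node's outputs to read.
    class JSONGraphNodeEntry {
     public:
      uint32_t id;     // index of the producing node in the graph
      uint32_t index;  // output index of that node
    };

    // One node of the subgraph: an input, a constant, or an operator call.
    class JSONGraphNode {
     public:
      std::string name;                        // e.g. "nn.conv2d"
      std::string op_type;                     // "input", "const", or "kernel"
      std::vector<JSONGraphNodeEntry> inputs;  // data dependencies
      // attributes such as shapes and dtypes, kept as strings for JSON
      std::unordered_map<std::string, std::vector<std::string>> attrs;
    };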
    
  • JSONRuntimeDriver

    • Deserialize the artifact and manage the initialization and invocation of the runtime (a usage sketch follows the declaration below).
    • Cache the engine when loading the library.
    class JSONRuntimeDriver : public ModuleNode {
      void Deserialize();             // Deserialize the artifact and engines
      PackedFunc GetFunction();       // Invoke a subgraph using its symbol
      static Module LoadFromBinary(); // Load the JSON binary
      void SaveToBinary();            // Save the module
    };
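
    For illustration, driving this from C++ might look like the sketch below; the symbol dnnl_0 and the artifact name xyz.so are hypothetical examples:

    #include <tvm/runtime/module.h>
    #include <tvm/runtime/ndarray.h>
    #include <tvm/runtime/packed_func.h>

    // Load the exported artifact and invoke one offloaded subgraph by its
    // symbol. "dnnl_0" stands in for whatever symbol the codegen emitted.
    void RunSubgraph(tvm::runtime::NDArray in, tvm::runtime::NDArray out) {
      tvm::runtime::Module lib = tvm::runtime::Module::LoadFromFile("xyz.so");
      tvm::runtime::PackedFunc f = lib.GetFunction("dnnl_0", /*query_imports=*/true);
      f(in, out);
    }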
    
  • JSONRuntimeBase

    • The base class for handling a graph. It will be extended by concrete backends, like TRT, dnnl, and other accelerators (an illustrative extension follows the declaration below).
    class JSONRuntimeBase : public ModuleNode {
      virtual void Run() = 0; // Invoke an engine
      virtual void Init() = 0; // Build an engine
      // Utilities to save and load a json graph.
    };
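
    As an illustration of how a backend would extend this base, here is a sketch; ExampleEngine is a hypothetical stand-in for a TRT/DNNL/ACL engine object, and is not part of the proposal itself:

    // Hypothetical engine type; a real backend would wrap its library here.
    struct ExampleEngine {
      void Build() { /* parse the JSON graph, pre-pack weights, ... */ }
      void Execute() { /* run on the buffers bound by the base class */ }
    };

    class ExampleJSONRuntime : public JSONRuntimeBase {
      void Init() override { engine_.Build(); }   // build once, at load time
      void Run() override { engine_.Execute(); }  // reuse the cached engine
      ExampleEngine engine_;
    };

    Because the engine is built in Init and cached, subsequent calls to Run skip the setup cost, which matters for stateful libraries.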
    
  • Open questions

    • Symbolic representation of op attributes, e.g. Expr start and Expr end in the arange op. Normally we should not offload this type of node to accelerators, but if we do want to support them (some may not be data-dependent), how can we serialize such attributes?
    • It is intuitive for BYOC to be used along with uTVM. How will this JSON runtime be connected with other runtimes like uTVM?

@tqchen @thierry @mbaret @masahi @comaniac @manupa-arm @jonso @ramana-arm

This would be a welcome addition to the BYOC infrastructure, particularly in reducing the fragmentation between approaches for different backends. I also think it's important that we have a robust alternative to the CSourceModule approach, as it's becoming clear that it is not yet suitable for full-scale integrations.

@lhutton1 has been working on JSON serialisation for Arm Compute Library, so I’d be interested to see his thoughts on whether we can align on a common format.

Thanks, I think this will be very useful. The benefit of this approach is that it allows the runtime to be customized much more easily. I like the idea of being able to cache an engine (in my case this will be a series of ACL functions); this opens up an opportunity to optimize stateful libraries and thereby reduce overhead on subsequent runs of the same module. It looks like this will generalize much of what I have in my current ACLRuntimeModule implementation.

Just some questions I have, mostly relating to problems I ran into with my current implementation… I may be pushing the idea behind this RFC slightly:

  • Is there any plan to include a generic JSON graph representation which can be traversed in the runtime, or will this be up to the library integrator? I can see many libraries implementing similar representations.
  • How do you plan on serializing params, i.e. as a base64-serialized NDArray? Or will this again be left to the integrator to decide?

I think these are fair problems, and JSON is an OK solution for some particular backends. However, I think it is particularly important for us to think about the infrastructure implications in the long run, and we want to discuss the solution in a case-by-case manner.

The JSON runtime is essentially another layer of abstraction over the graph. The code path becomes

  • IRModule -> compile -> runtime::Module (JSON-style) -> interpret -> external API

As usual, introducing an additional layer of abstraction always solves our problem, but we want to ask whether that is really the approach we want to take. Right now there are three types of external APIs.

External API Types

  • E0: Library functions (ArmCompute, DNNL, cuDNN) that have routines in the libraries, but not necessarily a serialization format for weights and functions.
  • E1: Graph-runtime-like libraries that construct the graph on the fly with a series of APIs and then run it.
  • E2: Graph runtime-style frameworks (e.g., TF) that have a serialization format (e.g., protobuf) for both weights and functions.

Problems to Solve

  • P0: How to serialize the constants (weights)
  • P1: How to serialize the computation (code)

Discussion

Our overall principles are:

  • Minimize “external specific” passes: make sure that the compilation stays in IRModule as much as possible.
  • Reduce the layers of abstraction as much as possible.

For E2 (e.g., TF): we should definitely avoid the additional layer of abstraction, because we can simply go ahead and use the native serialization format. Both P0 and P1 are naturally solved in this case.

For E0, the best approach is to lower the sequence of library calls into TIR calling sequences once the unified IR lands. The argument is that since these are already API functions, direct compilation opens the path for future AOT. See also the discussion in [Guideline] Relay AOT.
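
To make the E0 case concrete, here is a sketch of the kind of calling sequence such a lowering could emit; the lib_* wrappers and the subgraph symbol are hypothetical stand-ins, not actual DNNL APIs:

#include <cstddef>

// Hypothetical library wrappers (stand-ins for e.g. DNNL routines).
extern "C" void lib_conv2d(const float* data, const float* weight,
                           float* out, size_t n);
extern "C" void lib_relu(float* buf, size_t n);

// The lowered "calling sequence" for one offloaded subgraph: a plain
// function that invokes the library routines in dataflow order, with no
// interpreter in between. This directness is what makes future AOT easy.
extern "C" void ext_subgraph_0(const float* data, const float* weight,
                               float* out, size_t n) {
  lib_conv2d(data, weight, out, n);  // conv2d writes into out
  lib_relu(out, n);                  // fused relu applied in place
}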

The case of E1 is certainly more complicated; it requires two functions:

  • Init, which constructs the graph (Module) and runs only once.
  • Run, which executes using the existing library.

They could still certainly be lowered first to TIR and then to calls into the C API, at least for the code (P1) part.

The main challenge for us is how to come up with a solution for P0: weights, or large constants. I do think this part deserves more careful thought, so I will discuss it in the next section.

Weight (Constant) Serialization

My first reaction to weight serialization is that the weights should be lifted outside of the external codegen when possible. This way we can reuse the TVM runtime's native mechanism to store these NDArrays, and as the serialization mechanism grows more varieties (static code section, binary) for different cases, we will be able to benefit from all of them. We won't have this problem for most APIs in E0.

Of course, the main problem with such an approach arises for the cases in E1, as many of these APIs need to “pre-compute” some intermediate representations of the weights from the existing ones. There are two possible mechanisms:

  • M0: Couple the code serialization and the meta-data (constant) serialization into a single format in a single runtime::ModuleNode.
  • M1: De-couple the code serialization and the meta-data serialization.

All of our current runtime module designs are based on M0. Let me give an example of M1.

A Layered Approach

Here is an example of the layered approach,

// Using a DSO library as an example for code serialization.

static Engine cached_engine;

TVM_EXPORT_TYPED_PACKED_FUNC(__InitModule, [](Array<NDArray> metadata) {
   InitEngine(&cached_engine, metadata);
});

// Alternatively, the destroy can be a C API that the DSOModule recognizes.
TVM_EXPORT_TYPED_PACKED_FUNC(__DestroyModule, []() {
   DestroyEngine(&cached_engine);
});
class ModuleMetaDataWrapper : public runtime::ModuleNode {
  public:
    void InitSubModule() {
       // get the function from the imported module
       PackedFunc init =
            this->imported_modules[0]->GetFunction("__InitModule");
       // can also pass the meta data in via a positional sequence
       // before runtime::Array lands
       init(metadata_);
       initialized_ = true;
    }
    ~ModuleMetaDataWrapper() {
       PackedFunc destroy =
             this->imported_modules[0]->GetFunction("__DestroyModule");
       destroy();
    }

    GetFunction(name) {
       if (!initialized_) this->InitSubModule();
       if (name != "__InitModule" && name != "__DestroyModule") {
          return this->imported_modules[0]->GetFunction(name);
       }
    }

  private:
   bool initialized_{false};
   // meta data serialized in along with the ModuleMetaDataWrapper;
   // can also support other kinds of meta data
   Array<NDArray> metadata_;
};

When generating code, we can generate a ModuleMetaDataWrapper{imports={DSOModule(dnnl_code)}};

The main advantage of the layered approach is that we can de-couple the constant serialization from the code serialization itself. We can mix and match the serialization mechanisms, for example, build a common constant serialization format, or reuse existing ones.

Another advantage is that it opens a path for AOT to completely discard the interpreter if necessary, while still keeping other runtimes possible (e.g., the DSOModule can still be replaced by a JSON one).

Discussions

My take on the JSON runtime is that it is a short-term solution to the problem we want to solve. While it is fine for a particular runtime to adopt JSON as a serialization format, I don't think it is a good idea to introduce another layer of common abstraction, so it may not be the long-term solution we are seeking.

As always, it would be helpful to discuss the problems in a case-by-case manner. In particular, I would love to get everyone's view about the de-coupling and to refine the ideas here, so that we can build a modular solution that works for all runtime cases (AOT, VM, graph) with a single API.

also cc @FrozenGene @junrushao

@tqchen Thanks for the comments and for sharing your thoughts. Yes, the fundamental problem here is the serialization of code and weights. Code is relatively easy to handle; weights are the real problem. I agree that a JSON runtime introduces another layer of abstraction for the graph, which the current CSourceModule way doesn't. However, I don't think I fully understand the layered approach you proposed here.

Could you please elaborate a bit more on the execution flow after introducing it? Also, when should we build and cache the engine, i.e., what is the input to the process that builds the engine?

Thanks.

Here is an example (I also updated my code above accordingly, as there was a minor problem) of constructing the module manually:

mod = ModuleMetaDataWrapper(metadata)
mod.import_module(CSourceModule(dnnl_code))

mod.export_library("xyz.so")

loaded = tvm.runtime.load_module("xyz.so")

After we load it back in, it becomes

loaded = ModuleMetaDataWrapper()
loaded.imported_modules = [DSOModule()]

Now when we call f = loaded.get_function(name), the call steps are as follows:

  • Calls into ModuleMetaDataWrapper.GetFunction, which checks whether the imported module is initialized.
  • If the imported DSO module is not initialized, ModuleMetaDataWrapper.GetFunction calls __InitModule in the DSO module, passing the meta data to it.
  • All subsequent get-function calls redirect into the DSOModule, which can then return the function as normal.
  • When ModuleMetaDataWrapper destructs, it calls into the destroy function.
// implementation of ModuleMetaDataWrapper.GetFunction
class ModuleMetaDataWrapper : public runtime::ModuleNode {
  public:
    GetFunction(name) { 
       if (!initialized_) this->InitSubModule();
       if (name != "__InitModule" && name != "__DestroyModule") {
          return this->imported_modules[0]->GetFunction(name);
       }
    }

  private:
   bool initialized_{false};
   Array<NDArray> metadata_;  
};

Thanks for the explanation. I have a further question based on your example.

If I understand correctly, this example works for a scenario where a customized codegen generates metadata and kernel code. The kernel code here may include external library APIs, or a graph execution engine that interprets a subgraph in some form. When the user calls export_library, we compile the kernel code (or the engine) into a CSourceModule binary and use it at runtime.

My question is that in this case we compile an engine every time we export a module. In contrast, the original objective of this RFC was to propose the following flow:

  • Customized codegen: generates metadata and JSON for subgraphs (a hypothetical JSON sketch follows this list). Since these are all data, we do not have to compile anything; we just need to serialize them when exporting the module. Since data serialization should be general, we may even implement this as a JSONCodegen so developers do not have to worry about this module at all.
  • Customized runtime: a standalone runtime engine based on ModuleNode that invokes an engine to execute a subgraph in JSON format. The runtime is compiled only once, when building the TVM runtime. For example, a user may run make runtime on an edge device to compile this runtime engine and then feed it the metadata and JSON generated by the customized codegen (or JSONCodegen).
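
To make the codegen bullet concrete, here is a hypothetical example of the JSON it might emit for one subgraph, embedded as a C++ raw string; the schema and the symbol dnnl_0 are illustrative assumptions, loosely following the graph runtime's nodes/heads convention, not a final format:

// A hypothetical JSON payload for one offloaded subgraph.
const char* kSubgraphJson = R"json({
  "symbol": "dnnl_0",
  "nodes": [
    {"op": "input",  "name": "data"},
    {"op": "const",  "name": "weight"},
    {"op": "kernel", "name": "nn.conv2d", "inputs": [[0, 0], [1, 0]]}
  ],
  "heads": [[2, 0]]
})json";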

While I understand that the ModuleMetaDataWrapper you proposed could be used in the customized runtime, I'm not sure we have to import this runtime module into the model-specific module library.

Thanks for the questions.

The JSON proposal is another layer of abstraction that serves as an interpreter for general workloads: it defers the running of the library code by interpreting a “bytecode”, in this case defined by a JSON format.

I understand the objective this RFC proposes and all the new features it enables (e.g., deferred compilation on the target); a new layer of abstraction certainly brings flexibility and new features. The main challenge, though, is how we can maintain a minimal number of layers of abstraction, which is important for the infrastructure overall.

The DSO-engine-compilation-based approach, on the other hand, is more like AOT, which compiles the model into a sequence of library calls.

The ModuleMetaDataWrapper is a way to de-couple the code serialization from the meta-data serialization. It does not, however, force the exported code to be in the DSO format. DSO is used as the example since it is the most restricted case; if an implementation would like a JSON or other bytecode-based interpretation, it can certainly choose to do so.

One observation is that meta-data serialization is the more common need. I do not quite get the “model specific module library” part, but just like DSOModules, it would be nice if we could make ModuleMetaDataWrapper a generic module that can be composed on top.

I am not trying to push ModuleMetaDataWrapper as the final solution. But I do think we should refine along the direction of “decoupling code from data” so that the same mechanism works for interpretation and AOT in general.

@tqchen Thanks for the detailed explanation. The AOT and interpreter analogy is quite accurate: the two trade off effort between the codegen and runtime sides. I think I pretty much understand what you propose now. It is more consistent with what we currently do to handle DNNL in terms of emitting C++ code wrappers for library calls, but it extends the support for metadata and is more pluggable.

The proposal I had was intended to ease the work on the codegen side (likely they don't really have to do anything), because we have seen many people have difficulties generating the CSourceModule. It does require users to do more work on the runtime side (i.e., they need to parse the JSON and build an engine); their engine can then interpret the subgraph for execution.

I want to clarify when we would create the engine in your proposal. My understanding is that we do this at the codegen stage (i.e., in the AOT manner, by emitting the code for setting up the engine and caching it in the wrapper module), right?

Yes, in the case of the DSO module, the engine creation is a function emitted by the codegen.

Note that my main point is about de-coupling the meta data (weights) from the code, and it would be good to discuss further what the class should look like. In terms of the code part, we could certainly allow the user to have a JSON interpreter, or an AOT version.

Per offline discussions, here is a summary of the updated proposal:

  • The original proposal uses a runtime module to maintain both the JSON and the metadata (e.g., constant weights) together. As @tqchen pointed out, although this is simple to implement, it is hard to debug and cannot be shared across other runtimes such as the VM and AOT.

  • Accordingly, we now aim to separate code and metadata. In other words, while the code can be in any format (e.g., C++ for CSourceModule; JSON or other forms for a customized runtime), we will use the same mechanism to deal with the metadata across all kinds of runtime modules.

In terms of the implementation and user interface, the current API is:

json, lib, params = relay.build(mod, target='llvm', params=params)

One problem with the current API is that we have only one fixed output from relay.build for the built runtime module, lib, so it is not straightforward to get both code and metadata via this API. As a result, instead of returning a plain runtime module, we are thinking of returning a PackingModule for Relay modules with external functions, and using a modular approach to build the pipeline on the Python side.

PackingModule is composed of a mapping from symbol to (implement, {var_name: NDArray}). The implement would be the C++ code for CSourceModule, or the JSON string for a JSONRuntimeModule. In this way, we could have a unified interface for both cases.
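
For concreteness, here is a rough sketch of what a PackingModule might hold, assuming the mapping described above; the class name, members, and type key are all tentative:

#include <string>
#include <unordered_map>
#include <tvm/runtime/module.h>
#include <tvm/runtime/ndarray.h>
#include <tvm/runtime/object.h>
#include <tvm/runtime/packed_func.h>

// Sketch only: symbol -> (implement, {var_name: NDArray}). "implement" is
// C++ source for CSourceModule, or a JSON string for a JSON runtime.
class PackingModuleNode : public tvm::runtime::ModuleNode {
 public:
  const char* type_key() const override { return "packing"; }

  tvm::runtime::PackedFunc GetFunction(
      const std::string& name,
      const tvm::runtime::ObjectPtr<tvm::runtime::Object>& sptr_to_self) override {
    return tvm::runtime::PackedFunc();  // lookup elided in this sketch
  }

 private:
  std::unordered_map<std::string, std::string> code_;  // symbol -> implement
  std::unordered_map<std::string,
      std::unordered_map<std::string, tvm::runtime::NDArray>> metadata_;
};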

@tqchen do you think it’s ok to use the PackingModule, or you have a better idea?

I like the modularized setup that decouples the meta-data from the code. It would be great to have a brainstorm and discussion about the naming candidates for the PackingModule.

Also cc @junrushao @FrozenGene

I am not sure the clarification of the packaging part was clear enough, and there is actually a potential problem. The goal is to be able to conveniently assemble code and metadata separately from the frontend in a modular way. The generated artifact is intended to be usable by AOT, the graph runtime, and the VM, for both CSourceModule and JSON-style runtimes.

Here we need to pass the code and weights back to the Python side. They may have to be part of lib, because we only return bytecode/(graph, params) and lib for the VM and graph runtime, respectively. We are trying to use the so-called PackagingModule (name TBD) to do this; it will be imported into lib after compilation. Therefore, for CSourceModule, the module with external library code would look like the following after compilation for the graph runtime and VM:

DSO (A)
    |---PackageModule(code, {var_name, metadata}) (B)

From the Python side, we can assemble it by extracting code and metadata from the imported PackageModule B (i.e. code = code of A.imported_modules[0], metadata = metadata of A.imported_modules[0]).

Then we can assemble the modules and compile/interpret them. But I do have one question about the assembly. The DSO module (A) contains the other part of the graph, which should be handled by TVM. We actually want to replace (B) with the newly created module (i.e., the ModuleMetaDataWrapper), then do export_library and load everything back for execution. It seems we are not really able to remove/replace it. One possible way I can think of is to add a ClearImports method to Module to clear the imports of the DSO module, so that we can then package the new modules. @tqchen Does this sound good? Or do you have any comments/suggestions?

I don’t think that would become a problem, under the new module serialization https://tvm.apache.org/docs/dev/introduction_to_module_serialization.html

We will simply recover several DSOModules, all of which share the same library.

Here is the draft PR: https://github.com/apache/incubator-tvm/pull/5770

We may need to use a Map to save the variable-to-constant/NDArray mapping. Should we move the ModuleInitWrapper out of runtime, given that it otherwise needs Map to be in the runtime namespace?

I used SourceMetadataModule as the name for the packing module. Do we have any better names?

@tqchen

cc @junrushao as well

We want to think about alternative ways to pass in the meta data; for example, we could call initialize using an Array instead of a Map. While it is OK to use Map in the runtime, we would face a similar issue on microcontrollers, where it is harder to pass in a Map structure.

I thought about an Array as well. Passing an Array to initialize is relatively simple. The trickier part is packing the data and passing it around using a PackedFunc.

An alternative approach might be to just use an Array to store key-value pairs, but the problem there is that we need extra effort to serialize/deserialize a Map to/from Arrays; a sketch of this follows.
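
For illustration, the Array-based alternative might look like the following sketch using tvm::runtime types (the function names are hypothetical, and exact header paths vary across TVM versions):

#include <string>
#include <unordered_map>
#include <tvm/runtime/container.h>  // Array and String; path varies by version
#include <tvm/runtime/ndarray.h>

using namespace tvm::runtime;

// Flatten {var_name -> NDArray} into [key0, val0, key1, val1, ...] so the
// whole mapping can cross a PackedFunc boundary as a single Array argument.
Array<ObjectRef> FlattenConstants(
    const std::unordered_map<std::string, NDArray>& consts) {
  Array<ObjectRef> flat;
  for (const auto& kv : consts) {
    flat.push_back(String(kv.first));
    flat.push_back(kv.second);
  }
  return flat;
}

// Rebuild the map on the receiving side.
std::unordered_map<std::string, NDArray> UnflattenConstants(
    const Array<ObjectRef>& flat) {
  std::unordered_map<std::string, NDArray> out;
  for (size_t i = 0; i < flat.size(); i += 2) {
    out[std::string(Downcast<String>(flat[i]))] = Downcast<NDArray>(flat[i + 1]);
  }
  return out;
}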