[DISCUSS] Module based Model Runtime Interface

The module interface works great for deploying the generated operator libraries in TVM. However, we still face the challenge of providing a common deployment interface for machine learning models.

Our most commonly used interface is the graph runtime, whose mechanism is a bit different from low-level code loading (it is not as simple as module.load). In past RFCs we have argued that we should use the same module mechanism to deploy ML models. However, there are still quite a few challenges that need to be addressed:

  • R1: The creation of an ML model can be context dependent. For example, the user needs to be able to specify which GPU to run the graph runtime on.
  • R2: We are starting to have multiple variations of model runtime, such as the RelayVM. While it does not make sense to force all model runtimes to have the same set of APIs, it would be helpful to have the same mechanism for packaging and loading.
  • R3: In advanced use cases, we want to be able to bundle multiple models into a single shared library.

It is not hard to propose a few interfaces to “solve” the above challenges. However, it is hard to agree on “the interface convention” for TVM’s ML model packaging.

As a basic principle, we build a convention directly on top of the current Module system. An important thing to keep in mind is that we want users to have to learn as little as possible.

The Raw Interface Strawman

Given that all the additional wrapping boils down to the raw module interface, we start the discussion with a strawman proposal that uses the raw module interface.


# lib is a GraphRuntimeFactoryModule
# that contains json and parameters
lib = tvm.module.load("resnet18.so")

# Call into the factory module to create a graph runtime
# Having this additional factory create step solves R1
# Note that parameters are already set
#
# The first argument is a key that helps to solve R3
#

# Alternative API 0: take model as key.
gmod = lib["runtime_create"]("resnet18", tvm.cpu(0))
# Alternative API 1: select the model then construct
gmod = lib["runtime_select"]("resnet18")(tvm.cpu(0))
# Alternative API 2: directly key constructor by model name.
gmod = lib["resnet18"](tvm.cpu(0))

set_input = gmod["set_input"]
run = gmod["run"]
get_output = gmod["get_output"]

# We do not need to set the parameters here
# as they are already bundled in the factory module
set_input(data=my_data)
run()
get_output()

# Alternative: sklearn style predict function
# that might work better for the VM; might help solve R2
predict = gmod["predict"]
output = predict(data=my_data, out=out_data)

A few highlights:

  • H1: Instead of directly returning a GraphRuntime module on load, we only load a factory module that contains the necessary metadata. A subsequent call to the create function then creates the actual graph runtime module.
  • H2: The create function takes a model name as a key, which potentially allows bundling multiple models/modules into the same shared library (see the sketch after this list).
  • H3: The json and parameters can be bundled into the factory module, which means the create function only has to take a context parameter. This interface also brings future benefits: for example, we can use AOT to generate the logic that runs the graph runtime while keeping the same interface.
  • H4: Depending on which interface we encourage (set/run/get vs. predict), we can have different levels of interface sharing between the VM and the graph runtime.
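
Going one step further on H2/R3, here is a minimal sketch of what bundling two models into one library could look like under Alternative API 2. The file name "models.so", the second model, and the export-side packing are assumptions for illustration, not part of the proposal yet.

# Hypothetical: two factory modules packed into a single shared library,
# each keyed by its model name (Alternative API 2 style).
lib = tvm.module.load("models.so")

resnet = lib["resnet18"](tvm.cpu(0))
mobilenet = lib["mobilenet"](tvm.cpu(0))

# each returned module exposes the same set/run/get interface as above
resnet["set_input"](data=my_data)
resnet["run"]()
out = resnet["get_output"]()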

Discussions

Here are some points for discussion

  • D1: Do you like the factory pattern? Shall we always require a model name field (and allow “default”), or shall we take the alternative API specialization approach?
  • D2: set/run/get interface vs. predict
    • The set interface is useful to allow users to set parameters at runtime.
    • run is useful for fine-grained benchmarking.
    • predict is a higher-level, user-friendly API; note that we still want to allow destination-passing style (passing out) for more flexibility.
    • predict forces us to support runtime tuples in the case of multiple outputs, while get_output keeps things simple and minimal.
  • D3: The runtime argument specification convention for multiple contexts in the heterogeneous setting, under the restriction that PackedFunc only takes positional arguments (see the sketch after this list).
  • D4: Do you like the new way of packaging, or is it fine to continue using the old graph runtime API?
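
To make D3 a bit more concrete, one purely illustrative convention (not part of the proposal) would be to pass all contexts positionally, with the first one treated as the default device; the ordering rule is exactly what needs to be agreed on.

# Illustrative only: positional contexts for the heterogeneous case.
gmod = lib["resnet18"](tvm.cpu(0), tvm.gpu(0))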

API Wrapping

Most of the above discussion is about the raw APIs. To make life easier for our users, we can still provide a minimal wrapping around the raw API.

Ask the user to specify the wrapper type

The first way is to ask the users to directly construct the wrapper API using a constructor/create function.

# lib is a GraphRuntimeFactoryModule
# that contains json and parameters
lib = tvm.module.load("resnet18.so")

gmod = graph_runtime.create(lib["resnet18"], ctx=tvm.cpu(0))
gmod.set_input(data=my_data)

# sklearn style predict API
out = gmod.predict(data=my_data)

Note that the main purpose of the wrapper API is to provide clear documentation for the most common use cases. The full power of the module is always available through the raw API.

Automatically create wrapper type via type_key

Suggested by @FrozenGene. Alternatively, we can automatically create a wrapped module class using the type key. This requires us to handle RPC in a clean way (instead of using RPCModule as the key, we need to get the key from the remote side).

Compared with the approach above, the user does not need to specify the module wrapper; the wrapper class is created directly during load.

It does complicate the module loading and return logic a bit (e.g. do we also need to do this for all of our modules, just like the different node variations?). The possible inconsistency between the wrapper class and the type (e.g. an RPCModule can be wrapped as a GraphRuntime) also makes this API a bit confusing, whereas in the node and object system the type of the sub-class directly corresponds to the type of the object.

Would love to hear everyone’s thoughts about these two kinds of APIs.

tvm.register_module_wrapper("GraphRuntimeFactory", GraphRuntimeFactory)
tvm.register_module_wrapper("GraphRuntime", GraphRuntime)

# gmodfactory is a wrapper automatically created from the type key
gmodfactory = tvm.module.load("resnet18.so")
# name up to discussion
assert isinstance(gmodfactory, GraphRuntimeFactory)

# automatically return the corresponding module wrapper by type key.
gmod = gmodfactory["resnet18"](ctx=tvm.cpu(0))
assert isinstance(gmod, GraphRuntime)
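
For reference, the type-key based dispatch in the snippet above could boil down to a simple registry lookup on the Python side. A minimal sketch with assumed names (not an existing API):

# Hypothetical registry: map a module type key to its wrapper class.
_MODULE_WRAPPERS = {}

def register_module_wrapper(type_key, cls):
    _MODULE_WRAPPERS[type_key] = cls

def wrap_module(raw_mod):
    # pick the wrapper class by the loaded module's type key,
    # falling back to the raw module when nothing is registered
    cls = _MODULE_WRAPPERS.get(raw_mod.type_key)
    return cls(raw_mod) if cls is not None else raw_mod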

Related: Standardize GraphRuntime Exports into a Single DLL

Would love to see everyone’s opinion about the proposal

When we pack the params into the shared library, we hit a compile-time problem. After serializing the params, dev.cc becomes much larger than before. When compiling one resnet18 workload for the GPU context, the dev.cc file size goes from 2.6M (before) to 380M (after), and the compilation time goes from 24.97s to 53.64s on my machine (Intel i7-7700 CPU @ 3.60GHz). It is even worse when we compile for the CPU context (18.91s vs. 86.65s).

@tqchen

I used the -ftime-report option to investigate this issue and found that the bottleneck is the parsing phase, not the optimization phase. So if we could write the __tvm_dev_mblob array into the object file directly, we should be able to resolve it.

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      1408 kB (  0%) ggc
 phase parsing           :  78.28 (89%) usr  47.62 (99%) sys 125.91 (91%) wall  14727874 kB (100%) ggc
 phase lang. deferred    :   3.02 ( 3%) usr   0.00 ( 0%) sys   3.01 ( 2%) wall         0 kB (  0%) ggc
 phase opt and generate  :   6.81 ( 8%) usr   0.24 ( 1%) sys   7.06 ( 5%) wall         5 kB (  0%) ggc
 phase finalize          :   0.00 ( 0%) usr   0.04 ( 0%) sys   1.80 ( 1%) wall         0 kB (  0%) ggc
 garbage collection      :   3.02 ( 3%) usr   0.00 ( 0%) sys   3.01 ( 2%) wall         0 kB (  0%) ggc
 callgraph construction  :   6.81 ( 8%) usr   0.24 ( 1%) sys   7.06 ( 5%) wall         5 kB (  0%) ggc
 preprocessing           :  14.37 (16%) usr  24.75 (52%) sys  39.85 (29%) wall        23 kB (  0%) ggc
 parser (global)         :  63.91 (73%) usr  22.87 (48%) sys  86.06 (62%) wall  14727850 kB (100%) ggc
 TOTAL                   :  88.11            47.98           137.87            14729306 kB

I also tested the tcc compiler, whose parsing speed is very fast; the compile time is only 5.04s.

I thought of one trick that should solve this problem.

The compilation is slow because we have a blob like this:

const unsigned char __tvm_dev_mblob[46788038] = {0x.., 0x...}

However, if we change it to this:

const unsigned char __tvm_dev_mblob[46788038] = {"TVM_BLOB_SIG"};

I have tested this; the compiler compiles it very fast, because it only has to parse a few characters. We then get a dev.o (in this case, the size is 45M). At the same time, we store our original binary data 0x...0x... in a string (we could name it blob) and expose it to Python via a function.

On the Python side, we open dev.o in binary mode, find the start position of TVM_BLOB_SIG, and then write the blob content over it.

This way, we avoid the compiler’s parsing speed problem.
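
For concreteness, a minimal sketch of that patching step (the function name and argument names are made up; it assumes the blob is exposed to Python as a hex string without the 0x prefixes):

import binascii

def patch_blob(obj_path, blob_hex, sig=b"TVM_BLOB_SIG"):
    data = binascii.unhexlify(blob_hex)
    with open(obj_path, "rb+") as f:
        content = f.read()
        pos = content.find(sig)          # locate the placeholder signature
        assert pos != -1, "TVM_BLOB_SIG not found in the object file"
        f.seek(pos)
        f.write(data)                    # overwrite the placeholder in place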

I think this should also speed up our current GPU compilation when exporting to a library, not only the params case we are discussing here.

Nice investigation. So the key question is how to quickly export an object file that contains the binary symbol we want.

What you mentioned (export then patch) is certainly one interesting way to do it; it might depend on the spec of the binary file, though.

Another possible way is to use related utilities. For example, we can create an LLVM module that contains the binary symbol and export it without running any of the optimization passes. Or we could even dig into whether LLVM has utilities to build such an object file directly.

Note: this is relatively independent from the problem of minimizing device blob object generation. Looking at the size of the parameters, two things come to mind:

  • First of all, if we are going to export the parameters with the DSO, should we offer a quick option to compress them in the binary (e.g. zlib the parameters before serialization), or at least offer that as a storage format?
  • We should have a simple way to allow the user to specify that the parameters should not be included in the DSO; it would be interesting to ask what that API should look like.

For example, here is one possibility. It might also be worth asking everyone’s opinion about the API and the default option:

mod.export_library("xx.so", package_params=False)
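
On the compression point in the first bullet, a rough illustration of what the zlib option could mean (relay.save_param_dict is the existing serialization helper; whether and where the compression step happens is exactly the open question):

import zlib
from tvm import relay

def compress_params(params):
    # serialize the param dict with the existing helper, then zlib it
    serialized = bytes(relay.save_param_dict(params))
    return zlib.compress(serialized, 9)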

I think we had better not add this parameter to the export_library API, because it is a base function for all modules, not just the graph runtime.

I prefer creating a helper method in

class GraphRuntimeFactoryModule(Module)

We could name it

def PackageParams(self, value: bool)

which will set is_package_params_ in the C++ class GraphRuntimeFactory. When we do SaveToBinary, we can check is_package_params_ to decide whether to pack the params or not.

What we do for exporting stays the same, mod.export_library("xx.so"); to control whether the params are packaged, you call PackageParams(True) or PackageParams(False).
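
Concretely, the flow under this proposal would look like the following sketch (the method name and the default behavior are exactly what is up for discussion):

mod.PackageParams(False)         # tell the factory not to bundle the params
mod.export_library("xx.so")      # the export call itself stays unchanged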

Of course, we could discuss the default behavior of PackageParams.

Sure, this is also a viable option (setting state on the module) :slight_smile: After we have a few candidates, we can list these options (including the one in export_library) with their pros and cons and ask for everyone’s thoughts.

Yes, correct. If we can get past the parsing stage, time is no longer our bottleneck. I have tested it too. The steps:

clang++ dev.cc -S -emit-llvm -o dev.ll
llc dev.ll -O0

Step 2 is very fast. So maybe we could create an .ll file / LLVM module for it, emit the object file, and finally return it to the Python side.
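
A rough sketch of that route (the helper name is made up): emit the blob as a global byte array in a tiny .ll file, then let llc produce the object file without ever going through a C++ parser.

def emit_blob_ll(blob_bytes, ll_path="dev.ll", symbol="__tvm_dev_mblob"):
    # write the blob as an LLVM IR global constant byte array
    elems = ", ".join("i8 %d" % b for b in blob_bytes)
    ir = "@%s = constant [%d x i8] [%s], align 1\n" % (symbol, len(blob_bytes), elems)
    with open(ll_path, "w") as f:
        f.write(ir)

# afterwards: llc dev.ll -O0 -filetype=obj -o dev.o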

OK, can you explore that a bit? I imagine we can use something similar to CodeGenLLVM (but a simpler version) to build an LLVM module, wrap it in an LLVMModule (which already exposes the save function), and return it.

Yes, I will. I also plan to separate this work out from the graph runtime work :slightly_smiling_face:

About the spec of the binary file: on second thought, I think the way I mentioned shouldn’t depend on the spec of the binary file.

In the way I mentioned, we read the object file in rb+ mode using Python’s open, then search for the pattern in the binary file. Once we find it, we replace it with the blob data (0x…0x…0x…), writing it in binary form (https://docs.python.org/3/library/binascii.html#binascii.unhexlify). We don’t touch anything about the object file format.

If we use an LLVM module, we will in fact do a similar thing of writing the blob data (0x…0x…0x…), but we construct the global variable @__tvm_dev_mblob as a char array. The one extra thing we need to do compared with the patch way is to get the target we want to generate for (ARM CPU / x86 CPU or others), so that we generate the right object file.

Anyway, I will also explore the LLVM module way next and compare the two approaches.


The solution will depend on the fact that the object file’s content is written directly as a binary section (without compression). This is indeed the case, I guess, for most object files, but I am not sure whether it holds in all cases.


OK, I understand your concern now. However, I also don’t know whether some object file formats compress the blob binary data, so I cannot answer your question. From a technical point of view, I think an object file format shouldn’t do that, because when the loader loads it back, it would add decompression time; in the worst case the decompression could take longer than parsing the binary data.

However, I will also explore the LLVM module way as said before. No doubt that way requires more work than the patch way :slightly_smiling_face:

I have completed the patch-and-export way. It works fine on Linux / macOS (I have no Windows machine, so I cannot verify Windows, but I used clang to generate a Windows object file and could find the pattern (TVM_BLOB_SIG) using Python’s binary open and read). Compared with the previous GPU export-library flow (export to a C program and embed the blob data into a char array directly), it gives a small compile-time improvement when building CUDA resnet18 on one server machine: 11.224s -> 10.748s (average of 5 runs). However, when we package the params into the shared library, the compilation time increases to 128.46s with the old way, while the new implementation stays at 13.28s. This is the result we expect.

Next I will investigate the LLVM module way.

I have roughly completed the LLVM module method. The effect is the same as my previous way; both achieve the same speed. :slightly_smiling_face:

I plan to send a PR for this work first, because it is isolated from this RFC. However, there is some design to discuss.

Do you think it is worth creating an isolated code generator like CodeGenLLVM?

My current rough implementation is just one method inside llvm_module.cc that returns an LLVMModule back to Python.

Would love to hear everyone’s thoughts. @tqchen @zhiics @yzhliu @wweic @junrushao

Reusing the LLVM module sounds fine. The code path can still live in the llvm folder and expose a PackedFunc that takes in the binaries and packs them into a module; if LLVM is not available, we can fall back to the source module.

Thanks for the suggestion! So you agree we should reuse the LLVM module and not create an isolated code generator.

When you say the code path can still live in the llvm folder, do you mean we should create a new file (like codegen_blob.cc) that contains the PackedFunc? I like this way. +1 here!

Or do you think we could just add a new packed function in llvm_module.cc, so that the llvm folder only contains codegen files for real hardware (like codegen_arm, codegen_cpu)?