Bring Your Own Codegen to TVM

Given that @jonso originally proposed the idea, it would be great if you could also work together:

  • To get everyone’s input and summarize your key choices
  • “Challenge” the RFC by asking hard questions that might affect the design
    • How would we do X

@zhiics @comaniac I’m happy to work with you on the design. A couple of thoughts I had:

  • I think that op-level annotation and registration through a decorator is a clean way to specify ops that are compiled and run by an external module

  • Should we split up compilation / runtime? For example, each runtime should have its own runtime::Module that can do serialization independently. It will probably be best to put compilation in a different place.

  • For graph partitioning and compiling in relay.build, I can think of a couple of solutions:

    • Add a new field to the target string. For example llvm -mcpu=core-avx2 -extern=mkldnn. In relay.build, we can extract the extern target, partition the graph, compile, and generate the runtime.
    • Add a parameter to the build_config as you suggested. We can then extract that value and partition, compile, and generate the runtime.

    Personally, I feel that the target string option is the cleanest. What do you think?

  • I don’t think that naming functions in a module by their op name is sufficient. For example, say I am plugging in the TensorFlow runtime so I can run unsupported ops. Before partitioning, I can have an IR pass to group nodes that are entirely enclosed in TensorFlow ops into a single node. Running this whole subgraph in the TensorFlow runtime will be more performant than running each node individually. Maybe the naming convention can be [external runtime name]_op1_op2_op3...

Let me know if you want to schedule a call to discuss the details or if I can help with the implementation in any way :slight_smile: This will be a really useful feature, and I’m glad to see that other people think so too.

@jonso thanks for the comments.

  • We put both compilation and runtime in runtime::Module for two reasons: first, it is consistent with the current third-party support such as CBLAS and CUBLAS; second, it is simpler for contributors since they only need to maintain one place.

  • I’m fine with both for relay.build.

  • We do not name functions by their op name but by an internally assigned ID, which serves the purpose you mentioned. @zhiics is working on the tutorial for this RFC as Tianqi suggested, and I believe it will be clearer once you go through it, but let me provide brief use cases here to give a flavor:

Our proposal includes two ways of graph partitioning: op-level annotation and graph-level annotation. For op-level annotation, developers only need to specify which ops are supported by the backend, and our partitioning pass will generate a function call for each supported op. Currently every supported op becomes its own function, but we plan to develop an algorithm that groups supported ops into one function to reduce the overhead you mentioned. On the other hand, while the benefit of grouping ops is obvious in the graph runtime, it is less pronounced in the interpreter and VM. That’s also why we put the algorithm development in the second phase.
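To give a flavor of the op-level side, here is a rough sketch of the per-op check a backend author would provide (the signatures and the illustrative restriction are taken from the spirit of the PoC tutorial and may change):

# Declared by the backend author for op-level annotation: a per-op check that
# tells the partitioning pass whether the external codegen can take the op.
def conv2d(attrs, args):
    # e.g. only offload NCHW convolutions (illustrative restriction)
    return attrs.data_layout == "NCHW"

def multiply(attrs, args):
    return True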

For graph-level annotation, we allow developers to write a Relay pass that inserts subgraph_begin and subgraph_end annotations into the graph. As you can imagine, this gives developers the freedom to implement whatever partitioning algorithm they design for their backend. It can also serve as a workaround for the first phase of this RFC, while we don’t yet have a well-developed partitioning algorithm.
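As a rough sketch of what such a pass could look like (subgraph_begin/subgraph_end are the annotation ops proposed in this RFC and do not exist in mainline yet; everything else uses the existing Relay Python API):

from tvm import relay

# Fence off conv2d/relu calls so the partitioner offloads them to "my_backend".
class MyBackendAnnotator(relay.ExprMutator):
    def visit_call(self, call):
        new_args = [self.visit(arg) for arg in call.args]
        if getattr(call.op, "name", None) in ("nn.conv2d", "nn.relu"):
            # subgraph_begin/subgraph_end: annotation ops proposed in this RFC
            new_args = [relay.annotation.subgraph_begin(a, "my_backend")
                        for a in new_args]
            new_call = relay.Call(call.op, new_args, call.attrs, call.type_args)
            return relay.annotation.subgraph_end(new_call, "my_backend")
        return relay.Call(call.op, new_args, call.attrs, call.type_args)

# usage (before partitioning): annotated_func = MyBackendAnnotator().visit(func)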

For the TVM-unsupported ops, my suggestion is the following: since our POC branch has a mechanism to check whether an op is supported by a specific backend, we can treat all TVM-unsupported ops as customized ops when converting the model from TF, and let this mechanism offload the unrecognized ops back to TF. As a result, your use case falls within this RFC, and you can work directly on our POC with all of our features reused.

@jonso Thanks for making these points, and I am very glad to work together. Most of the questions have been answered by @comaniac. One thing is that putting extern in the target string might not be sufficient, because 1) we would need to change how the target string is parsed now, and 2) what if multiple targets are invoked? That may not be possible for now. Directly adding an extern_target option to build or build_config might be simpler.
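For example (a sketch only: extern_target does not exist yet, and mod/params are assumed to be an existing Relay module and its parameters):

from tvm import relay

# Proposed usage: the external target is given in build_config rather than in
# the target string, so the normal target parsing stays untouched.
with relay.build_config(opt_level=3, extern_target="mkldnn"):
    graph, lib, params = relay.build(mod, target="llvm", params=params)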

This is the tutorial we have for now:

I plan to iterate on it to clean up the branch and integrate the points we agreed on here, once the memory PR is merged.

This is a really well-written tutorial :slight_smile: very easy to understand.

  • I personally feel that the functions at python/relay/backend/op/contrib/gcc/extern_op.py slightly overcomplicate the registration logic. It requires us to explicitly define all possible ops in python/tvm/relay/op/contrib/extern_op.py as well as their backend support in the subfolders. This can get a little confusing.

    It seems like it would be easier for python/relay/backend/op/contrib/gcc/extern_op.py to register the operators directly. They could use a decorator like @reg.register_extern_op("multiply", "my_external_compiler_name")

  • I also totally understand the point of having codegen sit side-by-side with the runtime module for developer simplicity, but it feels weird to have external runtimes live under the relay subfolder.

@jonso We actually have it

I am also aware of your second point. The reason I put it there is that putting it under the TVM runtime is not ideal, since it needs to take a Relay expr/module as input. We can discuss a better place to put it.

Good discussions. Some high level thoughts.

I like the idea of separating runtime and compiler, as it forces us to think more carefully about the runtime requirements (e.g. the runtime cannot touch compiler data structures).

We should avoid complicating the build config and think about the multiple-target case. One way is to make use of special function attributes that tag the desired target of each function in the module.
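A minimal sketch of that idea (the attribute key is illustrative, not a fixed name):

from tvm import relay

# The partitioner tags each external function with the codegen that should
# compile it, so relay.build itself does not need extra options and the
# information stays with the module when it is serialized.
x = relay.var("x", shape=(10,), dtype="float32")
y = relay.var("y", shape=(10,), dtype="float32")
sub = relay.Function([x, y], relay.multiply(x, y))
sub = sub.with_attr("ExternCompiler", "mkldnn")  # hypothetical attribute key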

@zhiics I mean enforcing that this registration goes inside the subfolder (contrib/gcc/extern_op.py). In this way, we don’t have to worry about defining all possible external operators in the outer folder, and instead let individual libraries handle it themselves. I think that will save us some headaches in the long term.

@tqchen can you provide a little more detail on the multiple-target case? If the overall goal is to automatically call PartitionGraph() based on the external targets a user specifies, where would be the best place for the user to specify them? I suppose we could pass a list of targets to relay.build, in order of preference.

Also, by module are you talking about the runtime module? If so, I definitely agree that we will need to add extra attributes so the runtime knows which module to call into.

Our concerns with putting the supported operators in each codegen are readability and maintenance. If we skip the central extern op module definition, we will need every extern op function to use a decorator, similar to the current TOPI implementation:

@reg.register_extern_op("nn.conv2d", "gcc")

Considering that most external codegen writers are not familiar with the TVM codebase, this could be confusing and hard for others to maintain. That’s why we chose the most straightforward approach. In fact, the current TOPI implementation suffers from this problem as well, and we do plan to get rid of it somehow. In addition, even if a writer wants to support an op that has never been supported before and forgets to add it to the outer module, that is easy to catch at an early stage. More importantly, it gives us a clear list of which ops can be supported externally.
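To make the trade-off concrete, here is a rough sketch of the two-level layout (bodies are simplified and the dispatch helper is hypothetical):

# contrib/extern_op.py (central module): the list of ops that can be offloaded
# externally; each entry dispatches to the chosen backend's own module.
def multiply(attrs, args, backend):
    return _get_backend_module(backend).multiply(attrs, args)  # hypothetical helper

# contrib/gcc/extern_op.py (backend subfolder): the gcc backend only declares
# which of those ops it actually supports.
def multiply(attrs, args):
    return True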

Got it, I didn’t realize that there was already some discussion on getting rid of this pattern in TOPI. In that case, the solution is fine with me :slight_smile:

re multiple targets.

The main benefit of embedding this information into the Module itself is to make the IR module self-contained. If we serialize and then load back a module with some functions already pre-specified to be built with certain backends (e.g. an NPU), we don’t have to specify this additional context information in the build. It also prevents the options of relay.build from growing as we add more features.

By attributes, I also meant attributes on the functions of the IR Module.

@tqchen Just to double check: for multiple targets, do you mean we still need to pass extern_target=xx to the build interface so that we can later attach it to the subgraphs/functions? Or do we add an attr field to Module and let users annotate it themselves?

I’m also a little unclear here: the user shouldn’t have to set the external target of each op themselves; we should handle it automatically if the op supports it.

After solving this, maybe we can send out an initial PR for more specific code comments?

The interface could still handle things automatically, but we could divide the handling into two steps: a partition function that partitions the functions and sets the attributes, and a compilation function that compiles the resulting module.

I think we can just do the following:

mod = relay.build_extern(mod, target='xx')
mod = relay.build(mod, target='llvm')

build_extern splits the graph into subgraphs and annotates them with external targets. The second call is the normal compilation path; it can take any partitioned and annotated graph and produce the compiled artifacts.

Below are some areas where it wasn’t entirely clear to me how they’d be handled. This relates more so to parallelism, but it seems like this particular approach to bringing your own codegen would also impact how parallelism is represented, so I’ll ask here; let me know if I should be looking at a different proposal for that.

  1. Some single ops will need to be executed concurrently and cooperatively across multiple devices; how do we represent that? This is typical for sea-of-nodes hardware and, in general, for model parallelism.

  2. Just because an op can run on a device doesn’t mean it should. For an extreme case, consider a TF graph that got split into several pieces where one of the pieces is just an addition by itself, or just a relu by itself. It doesn’t make sense to transfer the inputs to the device and retrieve the output back just to do an addition. For some ops it may also make sense to transfer back to the CPU (because the CPU can do them faster) and then back to the device, e.g. sparse lookups on hardware not suited to sparsity. I think this proposal would require cutting the graph into pieces in this case, which has the usual problems that cutting the graph entails. Automatically or manually determining which ops should run where to optimize performance and minimize memory usage (not just to do something that works) is going to be a big thing over time, and specifying such a placement should be a good fit.

  3. Two devices may need to communicate with each other and we do not want to force them to go through the host for a large transfer. This is again typical in sea-of-nodes hardware and comes up in other contexts like just plain model parallelism. How do we represent sending data between devices? Can two functions refer to each other’s nodes?

  4. (follows on from last point) How do we represent overlapping transfers of data with compute in a fine-grained way? This is an important optimization.

Thanks for the comments; they are valuable in my opinion. Here are my thoughts, mainly about the scope of this RFC. We can refine it based on the discussion.

  1. Could you give some examples so that we can see directly how it should work? It seems to me that this is an open question, and we should narrow it down to practical scenarios.

  2. Correct. This proposal doesn’t cover a mechanism to decide which device an op should go to. We simply offload based on the subgraph attribute. As the first step of supporting graph partitioning, we aim to make the offloading mechanism work so that we can follow up on those more advanced issues easily.

  3. We keep the runtime mechanism straightforward by simply transferring data at the subgraph boundaries. As a result, it is true that unnecessary data transfer can happen between two consecutive subgraphs. We plan to address this issue as part of the subgraph merging problem, for which we will file another RFC. There, our goal is to minimize the number of subgraphs while preserving correctness.

  4. Similar to 3, we aim to merge offloadable ops and minimize the number of subgraphs in the follow-up RFC.

If you have a giant tensor coming into an op, you might want that op to be parallelized across multiple devices. This would apply to sea-of-nodes inference hardware and also comes up in training. TF has some documentation on it here:

I think minimizing subgraphs by merging them isn’t going to be a full solution here. I think you want to be able to represent transfers between devices (including the host) in a way that goes beyond just function calls and return values. Essentially you want some way of representing an edge between ops in different functions that are placed on different devices (counting the host as a device here) that can transfer asynchronously. The question is how to actually represent that, where thorny issues include things like safe memory allocation and deadlocks (e.g. ensuring that both sides of the transfer arrive at the transfer point at the right time).

A related question is how to pass in a buffer that is already on the device, e.g. passing in weights on the device that shouldn’t be transferred for each inference. I think in this approach that would not be possible.

When tagging a function to run on a specific target, it doesn’t seem enough to say what the device type/backend is; it’s also necessary to say which device of that type to run the op on, since a host can contain multiple devices.

I wonder if this proposal isn’t mixing up two things that could be separate: A) the ability to compile and run ops with a custom codegen/runtime and mixing this within a graph, and then B) how to represent parallelism and cross-device communication (counting the host as a device), as in tying this to function calls. The solution to A has implied a lot of things about B, like how graph partition occurs and tying data transfer to function calls. I was responding just to the B part. I wonder if A and B could or should be separate proposals? (not sure - maybe they are just too intertwined to split apart)

In the interest of keeping things concise, I think runtime parallelization is something that deserves its own RFC.

If the decision of separation can be made at compile time (or JIT time), likely the same abstraction will serve that purpose, e.g. have a DynamicDispatcher module that schedules work onto the functions provided by each runtime module, while the runtime modules themselves remain unaware of parallelism.

I think we don’t really bake the custom/codegen information into the subgraph/sub-function. We need this sort of information at the high level to help us partition the graph. Once the graph is partitioned, the new graph will contain super-nodes that are dispatched to the target backends. The runtime of each subgraph is independent and is coordinated by the TVM runtime/executors.

We can extend this to support multiple backends by specifying multiple targets, but we need some mechanism to decide where an operator should be offloaded when it could be supported by multiple devices. This is currently not supported. The short-term goal is to prepare the basic infrastructure first; more sophisticated coloring/annotation will be considered later.

I agree that the current design doesn’t consider runtime parallelism (at both the op and model level), but as Tianqi said, this is worth a separate discussion.

I think we’ve resolved most of the problems in the discussion above. I will prepare a WIP PR and see if we’ve missed anything important.
