How to effectively retarget/port TVM to new AI accelerators?

Great to hear that the offloading feature is WIP!

I was wondering what the API/interface with the device/compiler looks like. Is a library provided by the accelerator vendor required, or are a cross-compiler for the accelerator and some communication API enough?

Looking forward to the RFC and the tutorial!

The graph partitioner will take the entire Relay graph representing the network and create subgraphs from that which can be compiled via your external compiler. You then need to create a pass which consumes that Relay subgraph and translates it to something your external compiler can understand and then compile (you can write this directly in TVM). src/relay/backend/contrib/dnnl/codegen.cc shows an example of this (although it sounds like your case may be a bit more complex).
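To make that concrete, here is a rough Python sketch of the build flow. This is an illustration rather than code from this thread: it assumes the offloadable regions have already been wrapped in compiler_begin/compiler_end annotations for a hypothetical external compiler named "myaccel", and uses pass names from recent TVM that may differ in your version.

```python
import tvm
from tvm import relay

def build_with_external_codegen(mod, params, target="llvm"):
    # PartitionGraph lifts each annotated region into its own Relay function
    # tagged with Compiler="myaccel"; relay.build then hands those functions
    # to the external codegen registered under that name, while everything
    # else goes through TVM's normal code generation.
    mod = relay.transform.PartitionGraph()(mod)
    with tvm.transform.PassContext(opt_level=3):
        return relay.build(mod, target=target, params=params)
```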

Ok, understood. I will have a look at the file that you mentioned.

I have one more quick question: is AutoTVM going to be supported to optimize offloaded operators?

It’s not something I’ve seen any proposals for yet. However, my assumption would be no because AutoTVM acts on TVM functions (that is, operators that go through TVM’s code generation).

For details about the API/interface for vendor compilers, I suggest taking a look at the tutorial.

For AutoTVM, we do not support auto-tuning for BYOC now. Like @mbaret has mentioned, AutoTVM targets TVM functions. AutoTVM in one sentence: find the best config from a tuning space defined in a given TVM schedule function (e.g., a TOPI schedule). In other words, AutoTVM cannot figure out a tuning space if the schedule implementation is not written in TVM schedule primitives.
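For concreteness, here is a minimal sketch of what "a tuning space defined in a TVM schedule" means; the template name and the knob are made up for illustration:

```python
import tvm
from tvm import te, autotvm

@autotvm.template("example/matmul")  # hypothetical template name
def matmul_template(N, M, K):
    A = te.placeholder((N, K), name="A")
    B = te.placeholder((K, M), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((N, M),
                   lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
                   name="C")
    s = te.create_schedule(C.op)

    # This is the tuning space: AutoTVM only sees knobs declared on the
    # config object inside a TVM schedule. An operator offloaded through
    # BYOC never goes through such a schedule, so there is nothing for
    # AutoTVM to search over.
    cfg = autotvm.get_config()
    cfg.define_split("tile_i", C.op.axis[0], num_outputs=2)
    io, ii = cfg["tile_i"].apply(s, C, C.op.axis[0])
    return s, [A, B, C]
```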

In the long term, we might propose a representation for vendors to specify a tuning space so that we can leverage AutoTVM to tune the performance of external codegen as well, but we currently do not prioritize this task due to the lack of bandwidth and driving applications.


Ok, now I see the challenge regarding the AutoTVM support.

Now, after looking at some answers from @tqchen in the following post, I was wondering what the differences are, in terms of use cases and implementation effort, between retargeting TVM via BYOC and creating a new backend in the src/target/source directory.

One of the major differences is that BYOC allows vendors to only generate code (mainly wrappers) that can be understood by their own backends, without really exposing the backend/library details. For example, you can register your contrib codegen to TVM and generate a wrapper (e.g., a conv2d_ function) that calls your own library. You then create a simple runtime to interpret the generated artifact: when it sees conv2d, it invokes your own kernel.

This is different from what we have under src/target/source, which are TVM-compatible codegen tools whose generated code can be understood by the TVM runtime for execution. This is possible because we have already added the schedules and computes for such kernels.
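Purely as an illustration of that last point (this is not TVM API; the real DNNL example in src/relay/backend/contrib/dnnl/codegen.cc emits C code), the vendor-side runtime conceptually does something like this:

```python
def run_offloaded_subgraph(subgraph, tensors, vendor_lib):
    """Walk the artifact emitted by the external codegen and dispatch each op
    to the vendor library. `subgraph` is a list of (op_name, attrs) records
    and `vendor_lib` is a made-up stand-in for the vendor's bindings."""
    out = None
    for op_name, attrs in subgraph:
        if op_name == "conv2d":
            # This is the "when you see conv2d, invoke your own kernel" step.
            out = vendor_lib.conv2d(tensors["data"], tensors["weight"], **attrs)
        else:
            raise NotImplementedError(f"op {op_name} is not offloaded")
    return out
```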

I have a follow up question regarding BYOC.

If I create a wrapper to call a Conv2D from my own library, do I also have to implement all the other operators required by my model, or can the existing backends generate code as usual for those operators? In other words, I am wondering how I can actually execute parts of a Relay graph using my own library (e.g., Conv2D operators) and the rest using standard backends (e.g., LLVM for ARM/x86). From the BYOC tutorial this is not yet clear to me.

Thanks!

Yep, you can use both your own library and TVM codegen for the rest :slight_smile:

Ok, that's great, as this is of course what makes the most sense, but I was not 100% sure :slight_smile:

Is there any example that shows what this looks like with an actual model, let's say in TensorFlow? I guess my question is whether there is any concrete example that shows how the partitioning is used.

We are discussing the way of annotating supported operators in [RFC] Op based annotation for external codegen. You are welcome to provide your thoughts :slight_smile:

I think BYOC assumes your new AI accelerator supports programming in C/C++. If the new AI accelerator doesn't support that, I think these steps might be necessary (a rough sketch of the first two steps follows the list):

  1. set target=ext_dev and -device=new_AI_accelerator_name
  2. optionally, quantize the input model into a target precision that the new accelerator supports, e.g. int8, bfloat16
  3. define instruction layout (load, compute, store etc.)
  4. implement host-device memory interface (e.g. dma_copy: dram->sram, sram->dram)
  5. implement runtime and driver for your target device, in order to properly handle instruction execution sequence and dependency
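Here is a rough sketch of steps 1 and 2, assuming Relay's built-in quantizer and a made-up device name ("my_accel"); the remaining steps are hardware-specific:

```python
import numpy as np
import tvm
from tvm import relay

# Step 1: the generic external-device target; "my_accel" is a hypothetical name.
target = tvm.target.Target("ext_dev -device=my_accel")
target_host = tvm.target.Target("llvm")

# Step 2 (optional): quantize the model with Relay's built-in quantizer.
# A single-conv2d module stands in for a network imported via a Relay frontend.
data = relay.var("data", shape=(1, 3, 224, 224))
weight = relay.var("weight", shape=(16, 3, 3, 3))
body = relay.nn.conv2d(data, weight, kernel_size=(3, 3), channels=16)
mod = tvm.IRModule.from_expr(relay.Function([data, weight], body))
params = {"weight": tvm.nd.array(
    np.random.uniform(size=(16, 3, 3, 3)).astype("float32"))}

# skip_conv_layers=[] forces quantization of the first conv as well.
with relay.quantize.qconfig(nbit_input=8, nbit_weight=8,
                            global_scale=8.0, skip_conv_layers=[]):
    qmod = relay.quantize.quantize(mod, params=params)
```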

Hi @comaniac, Thanks for the pointer to the RFC! I will go through the discussion :slight_smile:

Hi @liangfu,

Thanks for sharing the steps. Actually, I have been considering both programmable (C/C++) and non-programmable AI accelerators, so supporting non-programmable AI accelerators would be a very interesting and useful capability for TVM.

However, I thought BYOC could already be used to target non-programmable AI accelerators as long as there is a library that allows accessing them from C/C++?

That's correct; hence, the library should at least contain the runtime and driver for the accelerator.

@comaniac you mentioned in [RFC] Op based annotation for external codegen the following:

“Currently, we expect users to write a custom pass to annotate a Relay program and then send it for partitioning.”

Can you please point me to some example of this annotation pass, in which I can tell which ops should be implemented using the codegen from BYOC?

@mbaret mentioned above that the annotation mechanism is not there yet and is currently a painful manual process, so I am a bit confused about what I can do at this point to annotate my Relay IR to use BYOC.

Thanks

@tico See the test cases in my PR https://github.com/apache/incubator-tvm/pull/4741. This is exactly what "custom annotation pass" + partitioning is about. It also demonstrates the "painful" process you mentioned.
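For reference, a minimal sketch of what such a custom annotation pass can look like (this is not the code from the PR; "myaccel" and the conv2d-only whitelist are assumptions for illustration):

```python
import tvm
from tvm import relay
from tvm.relay.op.annotation import compiler_begin, compiler_end

class WhitelistAnnotator(relay.ExprMutator):
    """Wrap every nn.conv2d call in compiler_begin/compiler_end for "myaccel"."""
    def visit_call(self, call):
        new_args = [self.visit(arg) for arg in call.args]
        if call.op == relay.op.get("nn.conv2d"):
            # Mark the inputs and the output of the call as an offload region.
            new_args = [compiler_begin(arg, "myaccel") for arg in new_args]
            new_call = relay.Call(call.op, new_args, call.attrs, call.type_args)
            return compiler_end(new_call, "myaccel")
        return relay.Call(call.op, new_args, call.attrs, call.type_args)

def annotate_and_partition(mod):
    mod["main"] = WhitelistAnnotator().visit(mod["main"])
    mod = relay.transform.InferType()(mod)
    # Each annotated region becomes a function with Compiler="myaccel".
    return relay.transform.PartitionGraph()(mod)
```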

The good thing is that anything is possible if you go the hard way :slight_smile: But the composite + composite aware annotation should make my PR way simpler.


@masahi thanks for the pointer. This is what I was looking for: a way to try the external codegen and offload ops to an accelerator until the op-based annotation is ready, which will hopefully make this process easier.

@comaniac @liangfu I managed to create an external codegen and annotate the Relay IR of my model using compiler_begin and compiler_end to offload specific ops.

However, I was wondering if there is any mechanism already in place to avoid expensive memcpys between TVM and the external library. In my case, the platform has a particular way to allocate memory so that it is shared between the host and the AI accelerator. One solution is that I could potentially use that allocator in TVM for all tensors, and that way I won't need the memcpys, but maybe there is a better solution in the context of BYOC.

@tico The ir_pass.py script in VTA might be helpful to you. It transforms the IR into dma_copy instructions to achieve a similar goal.