BYOC and the VTA missing link?

Over at the [RFC] [ETHOSN] Arm Ethos-N integration thread, @tqchen made an interesting point. I wanted to expand on it without hijacking the other thread.

The reason I find this interesting is as follows:

  • BYOC has been a successful interface between TVM and vendors
  • VTA is the historical template for programming an accelerator with the full TVM stack
  • Yet the VTA codebase does not currently have a BYOC adaptation

It seems to me that BYOC has mostly been assumed to take Relay as the input language of the dedicated codegens (see ARM and Hexagon). But as tqchen pointed out, nothing prevents doing the codegen hand-off at the TIR level. This would in turn allow AutoTVM to eventually be used as part of the BYOC compilation flow.

I was wondering what a BYOC adaptation of the VTA example would look like. I think it would be a good way to give the TE/TIR-level infrastructure of TVM better visibility, and it would also matter for adoption.

Just in terms of BYOC’s capabilities, it’s worth mentioning that one of the reasons all the codegens accept Relay rather than TIR is that BYOC is implemented in Relay. There is currently no infrastructure to partition TIR functions, and I don’t think it would be a simple extension (maybe the unified IR would help).
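To illustrate what "partitioning" means here: at the Relay level, the graph is split into maximal regions that a single compiler owns. The sketch below is a toy on a linear chain of op names, not TVM's real implementation (which works on dataflow graphs through passes such as `AnnotateTarget`, `MergeCompilerRegions` and `PartitionGraph`). The `SUPPORTED` set and the `partition` helper are hypothetical.

```python
from itertools import groupby
from typing import List, Set, Tuple

# Hypothetical set of Relay ops that the accelerator's codegen can handle.
SUPPORTED: Set[str] = {"nn.conv2d", "nn.bias_add", "nn.relu"}


def partition(chain: List[str], supported: Set[str]) -> List[Tuple[str, List[str]]]:
    """Split a linear chain of ops into maximal contiguous regions per compiler.

    Each region is returned as (compiler, ops), where compiler is "vendor"
    for supported ops and "default" for everything TVM keeps compiling itself.
    """
    def owner(op: str) -> str:
        return "vendor" if op in supported else "default"

    # groupby only merges *consecutive* ops with the same owner, which is
    # exactly the "contiguous region" notion for a linear chain.
    return [(compiler, list(ops)) for compiler, ops in groupby(chain, key=owner)]


if __name__ == "__main__":
    model = ["nn.conv2d", "nn.relu", "nn.max_pool2d", "nn.conv2d", "nn.bias_add"]
    for compiler, ops in partition(model, SUPPORTED):
        print(compiler, ops)
```

A TIR equivalent would need the same "who owns this region" decision, but on loop nests and buffers rather than on operator calls.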

I agree with you that the current infrastructure seems to be limited to Relay. But tqchen did mention:

So we can assume that BYOC is at least planned for the TIR level. Since VTA shows the “original” way of coupling an accelerator to TVM, a TIR-level BYOC could reuse much of the conceptual design that the VTA developers used to dock onto TVM.

I think the most natural reason BYOC will always start at the Relay level is that accelerators are more often designed with the “framework” operators in mind, so an accelerator typically encapsulates the execution of what we would consider Relay subgraphs.

The bridge between Relay and TIR is the definition of the compute and scheduling rules for a given Relay operator/function. So a first step for BYOC-TIR would be:

  • the vendor would need to override the standard TVM translation from Relay operators to their TOPI implementations
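One rough way to picture that override: keep a table from (Relay op, target) to a compute rule plus the schedule rule that lowers it, and let the vendor register an entry for its own target. This is a plain-Python sketch. `Strategy`, `STRATEGY_REGISTRY`, `register_strategy`, `lower_op` and the `"vendor_npu"` target are hypothetical names, not TVM's actual op-strategy API, and the schedule is only a label standing in for a real TE schedule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


@dataclass
class Strategy:
    """A compute rule paired with the (named) schedule rule used to lower it to TIR."""
    compute: Callable[[Vector, Matrix], Vector]
    schedule: str  # placeholder for a real TE schedule


# Hypothetical registry: (relay_op_name, target) -> Strategy.
# The "generic" entry plays the role of the default TOPI implementation.
STRATEGY_REGISTRY: Dict[Tuple[str, str], Strategy] = {}


def register_strategy(op_name: str, target: str, strategy: Strategy) -> None:
    STRATEGY_REGISTRY[(op_name, target)] = strategy


def lower_op(op_name: str, target: str) -> Strategy:
    """Return the vendor override for `target` if one exists, else the generic rule."""
    return STRATEGY_REGISTRY.get((op_name, target), STRATEGY_REGISTRY[(op_name, "generic")])


def dense_compute(x: Vector, w: Matrix) -> Vector:
    # y[i] = sum_k x[k] * w[i][k], i.e. a dense layer without bias.
    return [sum(xk * wik for xk, wik in zip(x, row)) for row in w]


# Default translation, used by every target without an override.
register_strategy("nn.dense", "generic", Strategy(dense_compute, "generic_loop_nest"))
# Vendor override: same compute rule, accelerator-specific schedule.
register_strategy("nn.dense", "vendor_npu", Strategy(dense_compute, "vendor_gemm_tiled"))

if __name__ == "__main__":
    for target in ("llvm", "vendor_npu"):
        s = lower_op("nn.dense", target)
        print(target, s.schedule, s.compute([1.0, 2.0], [[1.0, 0.0], [0.5, 0.5]]))
```

In this sketch both targets share the compute rule and differ only in the schedule, which is the case where a vendor could reuse an existing compute definition and supply just the scheduling part.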

What I have described above is, to some extent, what you will find in the TOPI/nn folder of the codebase, and also in the VTA part of the stack.

The compute rule needs a corresponding scheduling rule at the TE level in order to generate a TIR description. The TE design is obviously vendor-specific. This is what you will find in the TOPI/ folder of the codebase. It would be interesting to discuss whether vendor-specific TE extensions can be included in the BYOC concept.

At the TIR level, the Relay BYOC concept could be largely reused (e.g. defining TIR patterns that should be matched). AFAIK, the TIR pass infra is very similar to the Relay pass infra, so decisions made at the higher level about how to “customize” passes could be adopted at the lower level. To some extent, VTA does this. My biggest concern is that some of the TIR passes are “triggered” by pragma injection and others aren’t. I don’t know the underlying reason for dividing them this way. What is missing in BYOC-TIR is a pattern-matching mechanism (with an API similar to the one at the Relay level) to ease this process at the TIR level.
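To make the pragma question concrete, the sketch below uses a toy loop-nest IR in plain Python (not TVM's TIR node classes) and a single rewrite with two different triggers. One trigger fires only on loops annotated with a `"dma_copy"` pragma. The other fires on any loop whose body structurally matches a copy. All node types, the pragma string and the `vendor.dma_copy` intrinsic name are hypothetical; the pragma variant is only loosely inspired by the way VTA schedules annotate DMA copies.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union


@dataclass
class Copy:
    """Toy statement dst[i] = src[i], indexed by the enclosing loop variable."""
    dst: str
    src: str


@dataclass
class Compute:
    """Any other statement; opaque to the rewrite below."""
    text: str


@dataclass
class Call:
    """An accelerator intrinsic call produced by the rewrite."""
    name: str
    args: Tuple


@dataclass
class Loop:
    var: str
    extent: int
    body: List["Stmt"]
    pragma: Optional[str] = None  # annotation a schedule may attach, e.g. "dma_copy"


Stmt = Union[Copy, Compute, Call, Loop]


def is_copy_loop(loop: Loop) -> bool:
    """Structural pattern: a loop whose whole body is a single Copy."""
    return len(loop.body) == 1 and isinstance(loop.body[0], Copy)


def rewrite_copies(stmt: Stmt, trigger: Callable[[Loop], bool]) -> Stmt:
    """Replace copy loops accepted by `trigger` with one vendor DMA intrinsic call."""
    if isinstance(stmt, Loop):
        if is_copy_loop(stmt) and trigger(stmt):
            copy = stmt.body[0]
            return Call("vendor.dma_copy", (copy.dst, copy.src, stmt.extent))
        # Not rewritten here: recurse into the body and keep the loop itself.
        new_body = [rewrite_copies(s, trigger) for s in stmt.body]
        return Loop(stmt.var, stmt.extent, new_body, stmt.pragma)
    return stmt


def by_pragma(loop: Loop) -> bool:
    # Pragma-triggered: only loops the schedule explicitly marked.
    return loop.pragma == "dma_copy"


def by_pattern(loop: Loop) -> bool:
    # Pattern-triggered: the structural match in is_copy_loop is enough.
    return True


program = Loop("o", 4, [
    Loop("i", 16, [Copy("A_buf", "A")], pragma="dma_copy"),
    Loop("j", 16, [Copy("B_buf", "B")]),  # same shape, but no pragma
    Compute("C_buf = gemm(A_buf, B_buf)"),
])

if __name__ == "__main__":
    print(rewrite_copies(program, by_pragma))
    print(rewrite_copies(program, by_pattern))
```

With the pragma trigger, the schedule author decides where the intrinsic is used; with the pattern trigger, the pass decides, which is closer to what a TIR-level pattern language would offer.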

Given:

  • The accelerator can handle a composite pattern beyond the capabilities of standard TVM fusion at the Relay level. Therefore, the vendor designs the composite pattern at the Relay level and at the compute-definition level

  • The accelerator can handle a composite pattern beyond the capabilities of standard TVM at the TIR level (e.g. DMA loads/stores). Therefore, the vendor designs the composite pattern at the TIR level using TE
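For the first case, the Relay-level half of a composite pattern can be sketched as a table of op sequences that the accelerator executes as one unit. The example greedily collapses matching sub-sequences of a linear op chain into named composites. Real Relay graphs are dataflow graphs matched with the pattern language, so this is only an illustration; the pattern table and composite names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical vendor pattern table: composite name -> op sequence it replaces.
PATTERN_TABLE: List[Tuple[str, Tuple[str, ...]]] = [
    ("vendor.conv2d_bias_relu", ("nn.conv2d", "nn.bias_add", "nn.relu")),
    ("vendor.dense_relu", ("nn.dense", "nn.relu")),
]


def merge_composites(chain: Sequence[str]) -> List[str]:
    """Greedily collapse matching sub-sequences of a linear op chain into composites.

    At each position the patterns are tried in table order and the first match wins;
    ops that match no pattern are left for the standard TVM flow.
    """
    out: List[str] = []
    i = 0
    while i < len(chain):
        for name, pattern in PATTERN_TABLE:
            if tuple(chain[i:i + len(pattern)]) == pattern:
                out.append(name)
                i += len(pattern)
                break
        else:  # no pattern matched at position i
            out.append(chain[i])
            i += 1
    return out


if __name__ == "__main__":
    model = ["nn.conv2d", "nn.bias_add", "nn.relu", "nn.max_pool2d",
             "nn.dense", "nn.relu", "nn.softmax"]
    print(merge_composites(model))
```

Each composite name would then be bound to the vendor's compute definition, which is the second half of the first bullet.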

The BYOC-TIR compilation flow could follow this outline:

  • The framework model is translated into Relay, and standard TVM optimizations are applied here

    • Let TVM handle the frontend work (part 1 of frontend)
  • The Relay workload is searched for Relay composite patterns; if one is found, the original Relay subgraph is “deleted” and the compute definitions designed by the vendor are “inserted”

    • Insert “your” special way of rewriting Relay graphs (part 2 of frontend)

      • To me, this is, to date, the most common way of docking onto TVM from outside, and it is what the documentation currently covers
    • Create valid Relay graphs and continue compiling in TVM

      • I guess this could be optional if you want to generate TIR from here
  • The compute definitions are lowered into a TIR representation, either through vendor-specific use of standard TEs, through vendor-specific TEs, or by bypassing the TE representation entirely

  • The TIR representation is searched for patterns defined by the vendor (with an API similar to the pattern language available at the Relay level); if one is found, the original TIR subgraph is “deleted” and the TIR generated/designed by the vendor is “inserted”

    • Insert “your” special way of handling TIR graphs (part 1 of backend)
      • The autotuner is available if TE is used to generate these new TIR graphs
      • This is, for example, how VTA does it
  • The TIR representation is further handled by the standard TVM stack (part 2 of backend)

    • Array size computation/pointer arithmetic

    • Dead-code elimination

  • Continue lowering from TIR -> your codegen
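The outline above can be sketched as a fixed TVM pipeline with two optional vendor hooks, one for part 2 of the frontend and one for part 1 of the backend. Every stage here only appends a label to a trace; the function names, labels and the `stop_after_relay` switch are hypothetical, and nothing below is TVM API.

```python
from typing import Callable, List, Optional

# Each stage receives the trace of what has happened so far and returns it extended.
Stage = Callable[[List[str]], List[str]]


def tvm_frontend(trace: List[str]) -> List[str]:
    # Part 1 of frontend (TVM): framework import plus standard Relay optimizations.
    return trace + ["relay: import and optimize"]


def lower_to_tir(trace: List[str]) -> List[str]:
    # Compute definitions lowered, via standard or vendor-specific TE, into TIR.
    return trace + ["te -> tir"]


def tvm_backend(trace: List[str]) -> List[str]:
    # Part 2 of backend (TVM): standard TIR passes.
    return trace + ["tir: pointer arithmetic", "tir: dead-code elimination"]


def compile_flow(frontend_part2: Optional[Stage] = None,
                 backend_part1: Optional[Stage] = None,
                 stop_after_relay: bool = False) -> List[str]:
    """Run the outlined flow, calling the vendor hooks where they were supplied."""
    trace = tvm_frontend([])
    if frontend_part2 is not None:
        trace = frontend_part2(trace)
    if stop_after_relay:
        # Relay-level hand-off: the vendor codegen (and its runtime) takes over here.
        return trace + ["vendor codegen (from relay)"]
    trace = lower_to_tir(trace)
    if backend_part1 is not None:
        trace = backend_part1(trace)
    trace = tvm_backend(trace)
    return trace + ["vendor codegen (from tir)"]


# Hypothetical vendor hooks.
def vendor_relay_rewrite(trace: List[str]) -> List[str]:
    return trace + ["vendor: relay composite rewrite"]


def vendor_tir_rewrite(trace: List[str]) -> List[str]:
    return trace + ["vendor: tir pattern rewrite"]


if __name__ == "__main__":
    print(compile_flow(vendor_relay_rewrite, stop_after_relay=True))
    print(compile_flow(vendor_relay_rewrite, vendor_tir_rewrite))
```

Calling it with only the frontend hook and `stop_after_relay=True` gives the Relay-level hand-off; passing both hooks places the vendor's TIR rewrite between TE lowering and TVM's standard TIR passes.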

The interesting thing is:

  • developing only part 2 of the frontend and not continuing the TVM flow is what the current BYOC documentation shows (obviously it still needs your runtime). This will be the path taken by vendors with substantial prior in-house development of a software stack

  • developing part 2 of the frontend and part 1 of the backend allows for a very focused effort while piggybacking on core TVM functionality across the stack. This will be the path taken by vendors with little prior in-house development, as well as by researchers and hobbyists

What are your opinions on this? @thierry