[RFC] [ETHOSN] Arm Ethos-N integration

Thanks for the RFC. While I’m also interested in the questions @zhiics raised, I have a few more questions:

  1. It seems like the major annotation process would be done by composite functions. Will your flow use patterns to form composite functions only, or will you still have a list of supported single operators?

  2. You mentioned that the codegen will generate a command stream. Would you elaborate a bit more on the command stream? Is it a sequence of assembly code like other processors, or is it more like a bit-stream to program FPGAs?

Thanks.

@zhiics We have our own codegen, not based on CSourceModule. This is pretty simple as these things go because the Support library takes care of a lot of this work.
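For readers less familiar with BYOC, here is a minimal sketch of the hook being referred to; the compiler name "demo_npu" and the stubbed body are purely illustrative, not the actual Ethos-N codegen. The point is that a codegen registered under `relay.ext.<compiler>` just has to return a `runtime.Module`, which does not need to be a CSourceModule.

```python
import tvm

# Illustrative sketch only: "demo_npu" is a made-up compiler name and the
# body is a stub. A BYOC codegen is registered under "relay.ext.<compiler>";
# the build flow calls it with each partitioned Relay function, and it must
# return a tvm.runtime.Module. For Ethos-N that module wraps the command
# stream produced by the Support Library rather than generated C source.
@tvm.register_func("relay.ext.demo_npu")
def demo_npu_compile(func):
    # 1. Hand the partitioned function `func` to the external compiler.
    # 2. Wrap whatever artefact it produces in a runtime.Module so the
    #    TVM runtime can load and run it later.
    raise NotImplementedError("sketch only - no real codegen here")
```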

This is entirely separate to ACL. We intend to use the CI for some of the testing. We can test the graph partitioning for Ethos-N with the Ethos-N GitHub repo, using the backdoor mechanism described. This obviously does not run a real inference on real hardware but it will allow near-end-to-end unit tests for the partitioning and module generation. We’ll take care of pulling down the repo, so this is an automagic thing as far as CI is concerned. Obviously, this is only enabled if Ethos-N support is enabled.

I am trying to keep the size of the PRs small. The conv2d operator part (+ build support + runtime, etc.) is the smallest part we can upstream that still gives something usable by itself. The rest will follow on an operator-by-operator basis, or in small batches. That makes it a bit easier to review. There is no other reason for staging.

@comaniac We mostly use pattern tables now but it depends. Why the question?

The command stream is a serialised form of a graph containing instructions and parameters for the operators. It is higher-level than CPU instructions; it operates at the level of DMA, Conv2d, etc., but in concept it is similar.
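To make "serialised form of a graph" concrete, here is a purely hypothetical illustration (the command names and parameters are invented, not the Ethos-N format): the stream is an ordered list of high-level commands plus the parameters each one needs, which the driver/runtime later feeds to the NPU.

```python
from dataclasses import dataclass, field

# Hypothetical illustration - these opcodes and fields are invented and do
# not reflect the real Ethos-N command stream format.
@dataclass
class Command:
    opcode: str          # e.g. "DMA_IN", "CONV2D", "DMA_OUT"
    params: dict = field(default_factory=dict)

# Conceptually, a command stream is just an ordered, serialisable list of
# such commands rather than CPU instructions.
stream = [
    Command("DMA_IN",  {"src": "input0", "dst": "sram0"}),
    Command("CONV2D",  {"ifm": "sram0", "weights": "sram1", "ofm": "sram2",
                        "stride": (1, 1), "pad": (1, 1)}),
    Command("DMA_OUT", {"src": "sram2", "dst": "output0"}),
]
```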

@Leo-arm thanks for the response.

We mostly use pattern tables now but it depends. Why the question?

Just out of curiosity, because we are planning to rewrite the composite pass using the recently merged pattern language. I'm thinking that if your flow only leverages composite functions, then how we do pattern matching would be even more important than how we merge annotation regions.

The command stream is a serialised form of a graph containing instructions and parameters for the operators. It is higher-level than CPU instructions; it operates at the level of DMA, Conv2d, etc., but in concept it is similar.

Got the point. That's also why you don't base it on CSourceModule.

I figured that was the reason you asked. I looked at the pattern language when it was first announced and did not see anything particularly worrying at the time. We use merged-composite in a few places, so I'll look at it again in the next few days to see if there are any issues we can anticipate.

@comaniac Regarding composite vs. single operators, we actually have more single ops than patterns. The composite functions appear more frequently, though, pretty much entirely because of qnn.conv2d.

From our perspective, composite pattern matching and merging annotation regions are each no more or less important than the other - they're both essential. This is because the Ethos-N compiler (Support Library) is meant to operate on subgraphs, not single NPU operators.
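As an illustration of the qnn.conv2d composite mentioned above, here is a minimal sketch written with the new pattern language; the exact pattern, name, and any check functions in the actual Ethos-N backend may differ.

```python
from tvm.relay.dataflow_pattern import is_constant, is_op, wildcard
from tvm.relay.op.contrib.register import register_pattern_table

# Sketch of a composite pattern table: qnn.conv2d -> bias_add -> requantize
# is grouped into one composite function that maps to a single NPU conv.
# The real Ethos-N table may use different patterns and add check functions.
@register_pattern_table("ethos-n")
def ethosn_pattern_table():
    def qnn_conv2d_pattern():
        conv = is_op("qnn.conv2d")(
            wildcard(), is_constant(),       # data, weights
            is_constant(), is_constant(),    # input/kernel zero points
            is_constant(), is_constant(),    # input/kernel scales
        )
        bias = is_op("nn.bias_add")(conv, is_constant())
        return is_op("qnn.requantize")(
            bias, is_constant(), is_constant(),   # input scale, zero point
            is_constant(), is_constant(),         # output scale, zero point
        )

    return [("ethos-n.qnn_conv2d", qnn_conv2d_pattern())]
```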

@comaniac, I believe when you say you are going to re-write merge-composite with the pattern language, that means you are essentially going to replace the pattern tables with patterns from the language (as far as the interface is concerned). Correct?

I also don't see any problems as long as the patterns can be expressed in the language (which I believe they can; it will probably end up needing fewer patterns). Agreed with @mbaret: the merging of regions is complementary to patterns. It is not a replacement for expressing larger patterns, but rather a way of combining patterns (multi- and single-op) while respecting the data dependencies between them.

@manupa-arm that’s the ultimate goal, to solve two problems the merge composite pass is currently facing:

  1. As you pointed out, the current approach to specifying composite patterns requires users to write lots of similar patterns. This problem was reported by @masahi and @jonso, if I remember correctly. With the pattern language, we will have a unified and general pattern specification.

  2. The current merge composite pass includes a pattern matching algorithm, but it actually uses a small Relay graph as the pattern to perform pattern matching. This, however, runs into many issues when handling Relay graph node attributes. On the other hand, the pattern language's graph nodes include all the information required for matching, and the pattern language infrastructure covers the necessary analyses, such as dominator analysis, to make the pattern matching more robust.

Consequently, we should use the pattern language infrastructure to implement merge composite. By doing so, 1) we don’t have to worry about the matching algorithm anymore, and 2) users only need to learn one unified pattern language that TVM will use everywhere in the future.
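A minimal sketch of what that direction looks like in practice (the backend name and pattern are illustrative): the composite pattern becomes a dataflow-pattern object rather than a small Relay graph, and MergeComposite only consumes the table while the pattern-language infrastructure does the matching.

```python
from tvm import relay
from tvm.relay.dataflow_pattern import is_op, wildcard

# Illustrative pattern: conv2d followed by relu, expressed with the pattern
# language instead of a small Relay graph.
def conv2d_relu_pattern():
    conv = is_op("nn.conv2d")(wildcard(), wildcard())
    return is_op("nn.relu")(conv)

# "my_backend" is a placeholder name, not a real compiler.
pattern_table = [("my_backend.conv2d_relu", conv2d_relu_pattern())]

def merge_composites(mod):
    # Each match is wrapped in a function carrying the "Composite" attribute,
    # ready for target annotation and partitioning afterwards.
    return relay.transform.MergeComposite(pattern_table)(mod)
```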

By the way, the merge composite discussion is not directly related to this topic, so I’d suggest we stop it here and focus on the Ethos-N (@Leo-arm sorry about that!). Anyone interested in merge composite is welcome to raise a new topic.

@comaniac No problem at all. This discussion is very relevant to the Ethos-N integration. I touched base with everyone here and I don’t think there are concerns for us at this point other than the ones discussed.

Hello,

I have a more general question relating to the decision of using the BYOC infrastructure.

The Ethos-N coprocessor, to some extent, works similarly to VTA, since that is also the level of programming VTA accepts. VTA is integrated into TVM at a lower level than BYOC, namely below the TE/TIR level.

I would assume that ARM wants to dock onto TVM’s BYOC to leverage some previous in-house development (ARM-NN, CMSIS NN, etc.), but I am interested in knowing what the further reasons were for not going with a more VTA-style integration, especially since AutoTVM (at least to my current understanding) only works at the TE/TIR level.

Thanks

I have a question about why you chose BYOC. We can’t use AutoTVM with BYOC, and I think AutoTVM is a very important feature of TVM.

You are right; we leverage the existing compiler/optimiser (the Support Library) for Ethos-N, see https://github.com/ARM-software/ethos-n-driver-stack. Integration at the Relay level makes sense for the type of operations that are implemented on Ethos-N. The accelerator is always faster than anything else, so there is no need for auto-tuning the operators that are greedily handed off at the TVM level, and the optimisation of those operators is done inside the Support Library as it requires detailed hardware knowledge.

So I want to know what the benefits of BYOC are. I'll list some advantages:

1: if the new backend doesn’t support all ops, we can fall back to the CPU

2: use Relay to support different frameworks

3: some graph-level optimizations (can we use all graph-level optimizations?)

But TVM may be too heavy for a new backend.

You covered the main advantages. You could add a thriving OSS community. With BYOC it is fairly easy to add a new backend, depending on exactly what you want to do. If one of the TVM runtimes matches your use-case then it is worth considering. It also depends of course on what alternatives you are considering.

I want to follow up and note that the general infrastructure of pattern matching and rewriting does not conflict with AutoTVM.

It is important to take a composite view of the infrastructure, where BYOC becomes a natural feature by combining parts of the infrastructure together, rather than a monolithic piece. Specifically, BYOC involves a few steps:

  • S0: Graph pattern detection and partitioning
  • S1: Sub function customized transformation
  • S2: Sub function to customized code backend

Only S2 is really specific to the hardware backend. For example, we could use the generic pattern language to perform S0 and S1 and still use TIR for further lowering. This kind of composability is where we are heading.
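A sketch of how those steps compose in today's Relay passes, assuming a pattern table and a codegen registered under the placeholder name "my_backend": S0 is handled by generic passes, S1 is whatever backend-specific Relay transforms you slot in, and only S2 (the codegen invoked at build time) is hardware specific.

```python
from tvm import relay, transform

def partition_for_my_backend(mod, pattern_table):
    # "my_backend" is a placeholder; substitute your registered compiler name.
    seq = transform.Sequential([
        relay.transform.MergeComposite(pattern_table),   # S0: match composite patterns
        relay.transform.AnnotateTarget("my_backend"),    # S0: mark supported operators
        relay.transform.MergeCompilerRegions(),          # S0: merge annotated regions
        relay.transform.PartitionGraph(),                # S0: split out sub-functions
        # S1: backend-specific transforms on the partitioned sub-functions go here.
    ])
    return seq(mod)

# S2 then happens inside relay.build(), which dispatches each partitioned
# function to the codegen registered as "relay.ext.my_backend".
```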


By this do you mean, for instance, partitioning a graph for GPU/CPU with both going via TIR lowering?

Yep, or choose TIR lowering for some sub-functions :slight_smile:


Good point; I totally agree. AutoTVM and BYOC are orthogonal. It is simply the case that we don’t use AutoTVM for our backend because the optimisation occurs in the support library.

Why is the command stream a serialised form and not an executable? Thanks

hi @uni-alex, I suspect that the Ethos-N is not a general-purpose processor; instead it is most likely an ASIP (application-specific instruction set processor). This means that its ISA is limited to its very narrow field of application and it is probably controlled by the runtime on the system CPU. So the command stream is going to be processed by the runtime, and does not need to be an executable.