[RFC] [ETHOSN] Arm Ethos-N integration

Motivation and Scope

The Arm Ethos-N series is a high-throughput, low-area neural network processor for ML inference from cloud to edge to endpoint. The processor and its software driver stack support a variety of popular neural networks, including CNNs and RNNs, for classification, object detection, image enhancement, speech recognition and natural language understanding. Arm has recently open-sourced the ethos-n-driver-stack. The intention of this RFC is to integrate the driver stack into TVM so that operations supported by the stack can be offloaded to the Ethos-N neural network processors.

Proposal

Over the past several months, work has been ongoing in the area of graph partitioning. We propose to build on top of this work by defining merge-composite patterns that partition the Relay graph into sections that can be offloaded through the bring-your-own-codegen (BYOC) infrastructure. The Ethos-N driver stack provides a compiler front-end, the Support Library (SL), that accepts a graph structure similar to the Relay graph structure. The “compile” phase of the BYOC flow passes the Relay operators to the SL, which builds an internal graph. This graph is then compiled into a command stream: a description of the processing steps required to execute the inference on the Ethos-N processors. The command stream is included in the generated module as a blob.
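
As a rough illustration of the partitioning idea, the sketch below detects a composite pattern in a linear op sequence and splits it into offloadable and fallback partitions. This is plain Python with no TVM dependency; the op names and the flat-list representation are simplifications for illustration (real Relay graphs are DAGs and the actual pass is far more general).

```python
# Hypothetical sketch of composite-pattern detection over a flat op
# sequence. A composite such as conv2d -> bias_add -> relu is handed to
# the NPU as one unit; anything unmatched falls back to the CPU.

# A pattern the NPU compiler could accept as a single fused operation
# (assumed shape, for illustration only).
CONV_PATTERN = ("qnn.conv2d", "bias_add", "relu")

def partition(ops, pattern):
    """Greedily split a flat op list into (target, ops) partitions."""
    partitions, i = [], 0
    while i < len(ops):
        if tuple(ops[i:i + len(pattern)]) == pattern:
            partitions.append(("ethos-n", list(pattern)))
            i += len(pattern)
        else:
            partitions.append(("cpu", [ops[i]]))
            i += 1
    return partitions

ops = ["qnn.conv2d", "bias_add", "relu", "softmax"]
print(partition(ops, CONV_PATTERN))
# [('ethos-n', ['qnn.conv2d', 'bias_add', 'relu']), ('cpu', ['softmax'])]
```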

The packed function that is also generated by the BYOC infrastructure calls into a runtime inference function, passing in the command stream. This function sets up the necessary buffers, if required. The command stream is then executed by a driver library included in the Ethos-N driver stack.

A conversion needs to take place between the Relay operators and the SL operators, e.g. for tensor descriptors and attributes; in addition, some operations that are separate in Relay are combined in the SL. This conversion takes place when the composite functions are processed and handed over to the SL.

TVM supports a larger range of operators than the Ethos-N processor. In order to determine what is supported on Ethos-N, the SL provides an IsSupported() query mechanism. This will be used in the existing “check functions” as implemented in PR 5261.
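
To show the shape of such a check-function layer, here is a minimal stand-in in plain Python. The predicates and the stride limit are invented for illustration; the real checks would call through to the Support Library's IsSupported() queries rather than encode hardware constraints themselves.

```python
# Sketch of per-operator "check functions" wrapping an IsSupported()-style
# query. All predicates here are hypothetical stand-ins.

def conv2d_supported(attrs):
    # Assumed constraint for illustration: only unit or 2x2 strides.
    return attrs.get("strides", (1, 1)) in [(1, 1), (2, 2)]

def relu_supported(attrs):
    return True

# Dispatch table: op name -> support predicate.
CHECKS = {"qnn.conv2d": conv2d_supported, "nn.relu": relu_supported}

def is_supported(op_name, attrs):
    """An op is offloadable only if a check exists and it passes."""
    check = CHECKS.get(op_name)
    return check is not None and check(attrs)

print(is_supported("qnn.conv2d", {"strides": (1, 1)}))  # True
print(is_supported("nn.softmax", {}))                   # False
```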

The integration requires changes in several areas.

Build system

The driver stack code can be cloned from the GitHub repository. A build script, similar to for example the existing Vulkan support, builds the driver stack libraries for use in TVM. The Ethos-N support in TVM can be enabled by adding a path to the USE_ETHOSN configuration variable. This causes the build process to pick up the required header files and libraries, compile in the support for Ethos-N, enable the relevant tests, and enable the graph partitioning code to detect Ethos-N compatible operations.
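
In a local config.cmake this might look something like the following (the path is a placeholder for wherever the driver stack is installed):

```cmake
# Enable Ethos-N support by pointing USE_ETHOSN at the driver stack
# install location (placeholder path; OFF disables the support).
set(USE_ETHOSN /path/to/ethos-n-driver-stack)
```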

Operator support

Partitioning pattern definitions are created for the operators that are supported on Ethos-N, so that they are picked up by the graph partitioning code. A layer between the graph partitioning code and the SL translates the graph partitions (the composite functions) from Relay to the Ethos-N compatible formats and adds the converted operators to the Ethos-N Support Library. The partition is then compiled, resulting in a command stream. The command stream and the constant data (weights), if any, are added to the generated module for this partition.

Runtime support

The packed function that is compiled for each graph partition calls into a packed function in the TVM runtime to do the heavy lifting. It passes in the command stream for the section of the graph it is concerned with, and the input and output tensors. The runtime function sets up buffers using information stored in the command stream and calls into the Ethos-N driver library to execute the inference. The result of the inference is passed back as usual.
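
As a toy model of that runtime step, the sketch below reads buffer sizes out of a mock command-stream header and allocates intermediate buffers before a (pretend) dispatch to the driver library. The byte layout here is entirely invented; the real command stream format is internal to the Ethos-N driver stack.

```python
# Toy model of runtime buffer setup from a serialised command stream.
# Assumed layout (not the real format): a uint32 buffer count followed
# by one uint32 size per buffer, all little-endian.
import struct

def make_command_stream(buffer_sizes):
    """Pack a mock header: count, then one size per buffer."""
    return struct.pack(f"<I{len(buffer_sizes)}I",
                       len(buffer_sizes), *buffer_sizes)

def setup_buffers(command_stream):
    """Allocate one bytearray per buffer described in the header."""
    (count,) = struct.unpack_from("<I", command_stream, 0)
    sizes = struct.unpack_from(f"<{count}I", command_stream, 4)
    return [bytearray(size) for size in sizes]

cs = make_command_stream([1024, 4096])
buffers = setup_buffers(cs)
print([len(b) for b in buffers])  # [1024, 4096]
```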

Testing

There are two sets of tests. The network tests test a network end to end and assume the hardware is available. These tests push a network through and compare against known-good results. The Ethos-N driver stack is required; the tests will be disabled if it is not available.

The unit tests test the individual operator sequences that can be offloaded to the Ethos-N processor. These tests do not need hardware to run and are enabled when the driver stack is available. They use a small Relay graph as a model, partition it and run an inference with random data. They do this once for the CPU and once for the Ethos-N processor. The results for the CPU are passed into the runtime inference code via a backdoor mechanism. When the actual inference is run through the Ethos-N flow, these results are passed back, simulating a hardware inference. This allows end-to-end testing of the TVM integration for each of the supported operators.
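
The backdoor round trip can be sketched in a few lines of plain Python. All class and function names here are illustrative, not the actual test code; the real tests build a small Relay graph and go through the TVM runtime.

```python
# Sketch of the backdoor test strategy: compute a reference "CPU"
# result, preload it into a stand-in for the NPU runtime, then check
# that the simulated inference hands it back unchanged.
import random

def cpu_inference(data):
    # Stand-in for the reference model executed on the CPU.
    return [x * 2.0 for x in data]

class BackdoorRuntime:
    """Returns a pre-loaded result instead of running real hardware."""
    def __init__(self):
        self.expected = None

    def preload(self, result):
        self.expected = result

    def run(self, data):
        # A real runtime would execute the command stream here.
        return self.expected

data = [random.random() for _ in range(8)]
reference = cpu_inference(data)
rt = BackdoorRuntime()
rt.preload(reference)
assert rt.run(data) == reference  # round trip matches the CPU result
```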

Code locations

Build system: cmake/modules/contrib/ethosn.cmake, cmake/util/FindEthosN.cmake

Compiler code: src/relay/backend/contrib/ethosn. Parsing of graph partition, conversion into SL data structures, compile into module.

Runtime code: src/runtime/contrib/ethosn directory. Run an inference given a command stream and input/output tensors.

Unit test code: tests/python/contrib/test_ethosn directory conforming to pytest.

Network test code: tests/python/contrib/test_ethos_compiler.py, also conforming to pytest.

Phasing

In order to facilitate code review, the code changes are split into a number of PRs.

  1. Unit test support for the conv2d operator. This is the minimal amount of code that can work end to end.
  2. Build support. This includes CMake support and updates to scripts in tests/scripts/task_config_build_cpus.sh, driver stack build, minor changes in docker scripts.
  3. Runtime support. This is the inference code in the runtime.
  4. Unit test support. This is the directory that contains the common test code and a test for conv2d.
  5. Full operator support for Mobilenet. Complete the unit tests with all necessary operators, with a PR issued for most operators separately.
  6. End to end test for Mobilenet. This cannot be fully tested without hardware support but we will add a round-trip test that re-uses results from a CPU execution so the flow can be tested end-to-end, as described above.
  7. IsSupported() support, based on PR BYOC #5261.

The following steps add support for more operators and networks. The required changes follow the same pattern: add compiler code, unit tests for operators, add a network test once a network is supported. Most if not all of the changes are in the area of front-end compiler support and appropriate tests.

We intend to track the BYOC infrastructure development in TVM as it happens as this work is heavily reliant on it.

As always, comments and suggestions are more than welcome.


@Leo-arm Thanks for the proposal and the interest in BYOC. I have a few questions: 1) are you using the CSourceModule runtime/serialization or something different? 2) Is the codegen toolchain ACL, and do you plan to set up the CI for testing? I see there are several stages for testing.

Thanks for the RFC. While I’m also interested in the questions @zhiics raised, I have a few more questions:

  1. It seems like the major annotation process would be done by composite functions. Will your flow use patterns to form composite functions only, or will you still have a list of supported single operators?

  2. You mentioned that the codegen will generate a command stream. Would you elaborate a bit more on the command stream? Is it a sequence of assembly code as for other processors, or is it more like a bit-stream to program FPGAs?

Thanks.

@zhiics We have our own codegen, not based on CSourceModule. This is pretty simple as these things go, because the Support Library takes care of a lot of this work.

This is entirely separate from ACL. We intend to use the CI for some of the testing. We can test the graph partitioning for Ethos-N with the Ethos-N GitHub repo, using the backdoor mechanism described. This obviously does not run a real inference on real hardware but it will allow near-end-to-end unit tests for the partitioning and module generation. We’ll take care of pulling down the repo, so this is an automagic thing as far as CI is concerned. Obviously, this is only enabled if Ethos-N support is enabled.

I am trying to keep the size of the PRs small. The conv2d operator part (+ build support + runtime etc) is the smallest part we can upstream that still gives something that is usable by itself. The rest will follow on an operator by operator basis, or small batches. That makes it a bit easier to review. There is no other reason for staging.

@comaniac We mostly use pattern tables now but it depends. Why the question?

The command stream is a serialised form of a graph containing instructions and parameters for the operators. It is more high-level than CPU instructions; it operates on the level of DMA, Conv2d etc. but in concept it is similar.

@Leo-arm thanks for the response.

We mostly use pattern tables now but it depends. Why the question?

Just out of curiosity, because we are planning to rewrite the composite pass using the recently merged pattern language. I was thinking that if your flow only leverages composite functions, then how we do pattern matching would be even more important than how we merge annotation regions.

The command stream is a serialised form of a graph containing instructions and parameters for the operators. It is more high-level than CPU instructions; it operates on the level of DMA, Conv2d etc. but in concept it is similar.

Got the point. That’s also why you don’t base it on CSourceModule.

I figured that was the reason you asked. I looked at the pattern language when it was first announced and did not see anything particularly worrying at the time. We use merge-composite in a few places, so I’ll look at it again in the next few days to see if there are any issues we can anticipate.

@comaniac Regarding composite vs. single operators, we actually have more single ops than patterns. The composite functions appear more frequently though pretty much entirely because of qnn.conv2d.

From our perspective, composite pattern matching and merging annotation regions are no more or less important than each other; they’re both essential. This is because the Ethos-N compiler (Support Library) is meant to operate on subgraphs, not single NPU operators.

@comaniac, I believe that when you say you are going to rewrite merge-composite with the pattern language, you mean you are essentially going to replace the pattern tables with patterns from the language (as far as the interface is concerned). Correct?

I also don’t see any problems as long as the patterns can be expressed in the language (which I believe they can; it will probably end up needing fewer patterns). Agreed with @matt-arm: the merging of regions is complementary to patterns, as it is not a replacement for expressing larger patterns but more a matter of combining patterns (multi- and single-op) while respecting the data dependencies between them.

@manupa-arm that’s the ultimate goal, to solve two problems the merge composite pass is currently facing:

  1. As you pointed out, the current approach to specify the composite pattern requires users to make lots of similar patterns. This problem was reported by @masahi and @jonso if I remember correctly. With the pattern language, we will have a unified and general pattern specification.

  2. The current merge composite pass includes a pattern matching algorithm, but it actually uses a small Relay graph as a pattern to perform pattern matching. This, however, would encounter many issues when handling Relay graph node attributes. On the other hand, the graph nodes of pattern language include all required information for matching, and the pattern language infra covers necessary analysis such as dominator analysis to make the pattern matching more robust.

Consequently, we should use the pattern language infra to implement merge composite. By doing so, 1) we don’t have to worry about the matching algorithm anymore, and 2) users only need to learn one unified pattern language that TVM will use everywhere in the future.

btw, the merge composite related discussion is not directly related to this topic so I’d suggest we stop the merge composite discussion and focus on the Ethos-N here (@Leo-arm sorry about that!). Anyone interested in the merge composite is welcome to raise a new topic.

@comaniac No problem at all. This discussion is very relevant to the Ethos-N integration. I touched base with everyone here and I don’t think there are concerns for us at this point other than the ones discussed.

Hello,

I have a more general question relating to the decision of using the BYOC infrastructure.

The Ethos-N coprocessor, to some extent, works similarly to VTA, since that is also the level of programming that VTA accepts. VTA is integrated into TVM at a lower level than BYOC, namely below the TE/TIR level.

I would assume that Arm wants to dock onto TVM’s BYOC to leverage some previous in-house development (ARM-NN, CMSIS NN, etc.), but I am interested in knowing what the further reasons were not to go for a more VTA-style integration, especially since AutoTVM (at least to my current understanding) only works at the TE/TIR level.

Thanks

I have a question about why you chose BYOC. We can’t use AutoTVM with BYOC, and I think AutoTVM is a very important feature of TVM.

You are right; we leverage the existing compiler/optimiser (the support library) for Ethos-N, see https://github.com/ARM-software/ethos-n-driver-stack. Integration at the relay level makes sense for the type of operations that are implemented in Ethos-N. The accelerator is always faster than anything else so there is no need for auto tuning the operators that are greedily handed off at the TVM level, and the optimisation of those operators is done inside the support library as it requires detailed hardware knowledge.

So I want to know what the benefits of BYOC are. I can list some advantages:

1: if the new backend doesn’t support all ops, we can fall back to the CPU

2: use Relay to support different frameworks

3: some graph-level optimizations (can we use all graph-level optimizations?)

But TVM may be too heavy for a new backend.

You covered the main advantages. You could add a thriving OSS community. With BYOC it is fairly easy to add a new backend, depending on exactly what you want to do. If one of the TVM runtimes matches your use-case then it is worth considering. It also depends of course on what alternatives you are considering.

I want to followup that the general infrastructure of pattern matching and rewriting does not conflict with AutoTVM.

It is important to take a composite view of the infrastructure, where BYOC becomes a natural feature by combining parts of the infrastructure together, rather than being a monolithic piece. Specifically, BYOC involves a few steps:

  • S0: Graph pattern detection and partitioning
  • S1: Sub function customized transformation
  • S2: Sub function to customized code backend

Only S2 is really specific to the hardware backend. For example, we could use the generic pattern language to perform S0 and S1 and still use TIR for further lowering. This kind of composability is where we are heading.


By this do you mean, for instance, partitioning a graph for GPU/CPU with both going via TIR lowering?

Yap, or choose TIR lowering for some sub-functions :slight_smile:


Good point; I totally agree. AutoTVM and BYOC are orthogonal. It is simply the case that we don’t use AutoTVM for our backend because the optimisation occurs in the support library.