Introducing Hexagon backend

kparzysz · May 1, 2019, 10:15pm

We have implemented basic support for generating and executing Hexagon code from TVM. We would like to upstream our work, and we are asking for the community’s feedback on how to proceed.

Below is some background information about Hexagon itself, how it works, and what we’ve done. The “Summary” section at the end lists the main points with questions. We are looking forward to your comments.

Background

Hexagon DSP is a 32-bit processor with its own ISA, and despite being called “DSP” it is really no different from a general purpose processor. In other words it can execute any kind of code, not just signal-processing computations, although with the addition of HVX, it is well suited for computational workloads. In practice in Snapdragon SoCs it always appears as a “subsystem”, where the main CPU is some version of ARM/AArch64, never as the CPU itself. As a consequence all communication with the Hexagon processor is done via the ARM CPU.

The communication mechanism between ARM and Hexagon is called FastRPC. Any function that is callable across processor boundaries needs to have a definition in an IDL. From a set of IDL definitions, the build process ultimately produces two shared libraries: one for the ARM side (called “stub”) and the other for the Hexagon side (called “skel”). The IDL compiler is included in the Hexagon SDK.

There is also a Hexagon simulator (for Linux on x86), which allows users to run Hexagon code without Hexagon hardware. The simulator exists both as a standalone program and as a library that can be linked into an x86 Linux binary. The simulator library provides a mechanism for the x86 application to access the memory of the simulated process, which can serve as another cross-processor communication mechanism (although much different from FastRPC). The simulator is part of the Hexagon toolchain, which is itself part of the Hexagon SDK.

While the ARM CPU (typically) runs Android, the OS functions on Hexagon are rather limited. On hardware, the Hexagon processor runs a special RTOS, and the simulator implements a set of built-in system calls, but neither is as developed as, for example, Linux.

Implementation design

The current implementation uses Hexagon in a somewhat limited way: it follows the GPU model, i.e. only computational kernels (loop nests marked as “pipeline”) are offloaded. The primary reason for this was to avoid having to support the TVM runtime on Hexagon. In the longer term the goal is to offload entire subgraphs to Hexagon. For the sake of testing, it should be possible to offload the entire graph to Hexagon.

We use LLVM to generate Hexagon code from TVM IR.

We support both execution environments: simulator and hardware. The hardware must run Android, and the SoC must have a Hexagon CDSP (we don’t support ADSP at the moment).

The simulator always executes a complete ELF binary, just like binaries that are executed from a Linux shell. In order to facilitate loading modules (i.e. shared libraries) and executing functions from them, we implemented a simple driver (or a “simulator runtime”). That runtime is a process that waits for commands from the program that instantiated the simulator (that is, the TVM runtime). It implements the simulator’s end of the DeviceAPI as well as the ability to load shared libraries and run functions. The executable for this process must be compiled with the Hexagon toolchain. TVM itself can be compiled with any C++ compiler.
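To illustrate the "wait for commands, then load or call" structure described above, here is a minimal sketch in Python. It is not the actual driver: the class name `SimDriver`, the command names `load`, `call` and `exit`, and the use of a plain dictionary in place of a shared library are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Hypothetical command format: (operation name, argument tuple).
Command = Tuple[str, tuple]


class SimDriver:
    """Toy model of a driver that loads modules and runs functions on request."""

    def __init__(self) -> None:
        # Module name -> {function name -> callable}. A real driver would
        # load shared libraries instead of holding Python callables.
        self.modules: Dict[str, Dict[str, Callable[..., Any]]] = {}

    def load(self, name: str, functions: Dict[str, Callable[..., Any]]) -> str:
        """Register a module (stand-in for loading a shared library)."""
        self.modules[name] = dict(functions)
        return name

    def call(self, module: str, func: str, *args: Any) -> Any:
        """Look up a function in a loaded module and run it."""
        return self.modules[module][func](*args)

    def run(self, commands: Iterable[Command]) -> List[Any]:
        """Process commands in order until 'exit' or the input ends."""
        results: List[Any] = []
        for op, args in commands:
            if op == "load":
                results.append(self.load(*args))
            elif op == "call":
                results.append(self.call(*args))
            elif op == "exit":
                break
            else:
                raise ValueError(f"unknown command: {op}")
        return results


if __name__ == "__main__":
    driver = SimDriver()
    print(driver.run([
        ("load", ("kernels", {"add": lambda a, b: a + b})),
        ("call", ("kernels", "add", 2, 3)),
        ("exit", ()),
    ]))
```

In this sketch the host side would be whatever produces the command sequence; how commands actually reach the simulated process is not modeled here.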

For running on hardware, the IDL libraries must be built. First, the IDL definitions must be processed by the IDL compiler (qaic, included in the Hexagon SDK). This will produce a set of C sources. These sources must then be compiled with an Android toolchain (to build the “stub” library) and with the Hexagon toolchain (to build the “skel” library). The TVM runtime (only the runtime) must then be compiled with the NDK compiler.

The Hexagon toolchain is based on Clang. In all builds we use libc++ (not libstdc++) as the C++ library implementation.

For either simulator or hardware, we need to create a persistent object representing the state of the “executor”. Currently we do it via a global variable, but maybe there is a better way.

Structure of the code

We introduced a new device type: kDLHexagon, and a new target “hexagon”.

Most of the code is in src/runtime/hexagon, except for the IDL definitions and build scripts that we currently maintain in a separate repository. Ideally, we’d like to have it all in the TVM repository and we welcome feedback on how best to do this.

The minimum requirements for building TVM with Hexagon support depend on whether the execution environment is the simulator or hardware. For running on the simulator, Hexagon toolchain v8.3 is needed; for running on hardware, Hexagon SDK 3.4.3+ and Android NDK r19 are required. Note: Hexagon toolchain 8.3 is included in the Hexagon SDK.

Miscellaneous

Hexagon HVX (vector engine) requires that all vectors are aligned in memory to a 128-byte boundary. Because of that, we have changed the following constants to 128:

kAllocAlignment
kTempAllocaAlignment

We are looking for a better way to implement this than hardcoding it.
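For readers less familiar with alignment arithmetic, here is a minimal Python sketch of what a 128-byte alignment requirement means for sizes and addresses. The names `HVX_VECTOR_BYTES`, `align_up` and `is_aligned` are made up for this illustration and are not TVM identifiers.

```python
HVX_VECTOR_BYTES = 128  # alignment required for HVX vectors, per the text above


def align_up(value: int, alignment: int = HVX_VECTOR_BYTES) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (value + alignment - 1) & ~(alignment - 1)


def is_aligned(address: int, alignment: int = HVX_VECTOR_BYTES) -> bool:
    """Check whether an address sits on an alignment boundary."""
    return address % alignment == 0


if __name__ == "__main__":
    for size in (1, 100, 128, 129):
        print(size, "->", align_up(size))
```

With a 128-byte requirement, even a 1-byte request occupies a full 128-byte slot, which is where the fragmentation concern for small allocations comes from.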

Summary

We implemented Hexagon as a device (kDLHexagon), added target “hexagon”, added basic codegen via LLVM, implemented Hexagon runtime, and implemented a set of schedules.
Should we divide it up into smaller patches for review? Are there any suggestions as to what each patch should contain?
We support the Hexagon simulator and Android hardware as execution targets. Both require special steps when building TVM, and both can be assumed to require at least the Hexagon SDK to be installed ahead of time.
How do we incorporate it into the build system? Should we ask users to run something like “make hexagon-prepare” (to build the IDL and the simulator driver), or should all the steps be done with just “make”?

cc: @FrozenGene, @tqchen

tqchen · May 1, 2019, 10:57pm

Thanks for the RFC, this looks exciting, here are some comments:

I would recommend sending the PRs in the following steps:

Runtime: everything under runtime folder
Compiler: support for hexagon codegen
AutoTVM templates, topi support

Build system

Currently, TVM uses CMake. It would be great to add a USE_HEXAGON option just like the other options. When USE_HEXAGON equals sim, cmake should be able to build the IDL and the simulator driver together and perform the linking step. I assume that, at least for the simulator, this is doable via cmake.

Please also help prepare an installation script to install the Hexagon SDK in the docker image (as part of the runtime PR). We could consider adding it as a separate env, Dockerfile.hexagon: https://github.com/dmlc/tvm/tree/master/docker
We will use this for integration testing via simulator.

For Android, I can see why things could be more tricky, and we could have a separate command that prepares the IDL component for Android.

srkreddy1238 · May 2, 2019, 4:49am

Good to hear about Hexagon support in an open-source DL runtime, apart from Qualcomm’s in-house NPE SDK.

A Hexagon runtime with quantization and heterogeneous execution in TVM could be of great value for Qualcomm platforms.

FrozenGene · May 2, 2019, 5:02am

Thanks for the great work!

I think we could use a macro to control the alignment, for example the ‘USE_HEXAGON’ option from Tianqi’s response. If the macro is defined, we set the values to 128; if not, they stay at 64 as before.

I agree with Tianqi’s suggestions. The kHexagonDeviceAPI is important and isolated. Then we could add the codegen part, at which point we could run code. Finally we could add the topi schedules.

The most important thing is to show how to run an end-to-end model, for example mobilenet.

I have one quick question: why do we restrict this to Android? The Hexagon SDK has a Linux toolchain too. I ask because our DSP will run on Linux.

yidawang · May 2, 2019, 6:22am

Thanks for the RFC! Just for my own education, what is the theoretical peak FLOPS of Hexagon, and what FLOPS do you observe when running CONV on it?

tqchen · May 2, 2019, 2:03pm

For consistency reasons, I think it is fine to increase the alignment requirement by default. Note that this will only cause a bit of fragmentation for small allocas, and most temp allocations are usually big.

FrozenGene · May 2, 2019, 3:10pm

However, the GetTempAllocaAlignment function repeatedly halves the alignment in a while loop. I am worried that this will violate the DSP’s rule, which requires 128-byte alignment.

tqchen · May 2, 2019, 3:22pm

OK, that makes sense, and we might need a special target-aware function for that.

FrozenGene · May 2, 2019, 3:32pm

Maybe the GetTempAllocaAlignment function should be modified for the special DSP target to return 128 (kTempAllocaAlignment) directly.
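To make the concern above concrete, here is a small Python model of the halving behavior described in this exchange, together with the suggested target-aware variant. This is a sketch, not TVM's source: the function names, parameters and the `"hexagon"` string check are assumptions chosen for illustration.

```python
TEMP_ALLOCA_ALIGNMENT = 128  # value of kTempAllocaAlignment after the change in the RFC


def temp_alloca_alignment(elem_bits: int, lanes: int, const_size: int,
                          max_align: int = TEMP_ALLOCA_ALIGNMENT) -> int:
    """Model of the halving loop: shrink the alignment for small allocations."""
    align = max_align
    if const_size > 0:
        nbytes = const_size * elem_bits * lanes // 8
        while align > nbytes:
            align //= 2
    return align


def temp_alloca_alignment_for_target(target: str, elem_bits: int, lanes: int,
                                     const_size: int) -> int:
    """Hypothetical target-aware variant: never go below 128 on Hexagon."""
    if target == "hexagon":
        return TEMP_ALLOCA_ALIGNMENT
    return temp_alloca_alignment(elem_bits, lanes, const_size)


if __name__ == "__main__":
    # Four 32-bit values occupy 16 bytes, so the generic rule drops to 16.
    print(temp_alloca_alignment(32, 1, 4))
    print(temp_alloca_alignment_for_target("hexagon", 32, 1, 4))
```

Under this model, a small constant-size buffer ends up with an alignment below 128, which is the case that would conflict with the HVX requirement.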

kparzysz · May 3, 2019, 6:23pm

I’m working on some changes and will have patches for review soon.

kparzysz · May 9, 2019, 3:47pm

Patch #1 is ready. This is the runtime support for Hexagon: https://github.com/dmlc/tvm/pull/3163

frankgt · June 19, 2019, 8:05am

It seems that support for the Hexagon CDSP is not ready yet. Is there any plan, e.g. to add it in the next release of TVM?
Another question: besides the CDSP, there is also a dedicated NPU in the SDM8150. Will Qualcomm add TVM support for the NPU in the future?

kparzysz · June 19, 2019, 5:54pm

There was a delay caused by the need for another legal review (since there were comments on the license headers in the patch).

I will post an updated patch within a few days.

kparzysz · June 27, 2019, 1:22pm

Update: I posted an updated patch two days ago. See https://github.com/dmlc/tvm/pull/3163.

LanTn · July 1, 2019, 3:48am

Hello, I briefly looked at the code and I have a question. You use LLVM to generate Hexagon code from TVM IR. Why add a Hexagon target directly instead of adding it under LLVM like X86 or ARM?

kparzysz · July 1, 2019, 5:08pm

This patch only contains runtime support for Hexagon. There is also codegen, but it is not included here (it lives in src/codegen/llvm).

LanTn · July 2, 2019, 3:06am

My current problem is that I don’t see special X86 or ARM support in the runtime’s code.

kparzysz · July 2, 2019, 1:47pm

The special runtimes are mostly for offloading to extra devices from the host. ARM and X86 are usually the hosts, so the runtime for them is fairly simple. It’s still there; look at cpu_device_api.cc for example.

csuhawk · August 13, 2019, 8:23am

I think that whether or not a copyright notice is added to the Hexagon code will not be a problem, because this code will only be used on Qualcomm SoCs.

FrozenGene · September 24, 2019, 6:35pm

Maybe this is a little late, but I think this slide deck is very good: http://pages.cs.wisc.edu/~danav/pubs/qcom/hexagon_hotchips2013.pdf
Also, Hexagon DSP has good support for fixed-point integer computation, but is not well suited to floating-point arithmetic. Hexagon has some instructions that can be very useful for convolution, for example vrmpy, which can be used for dot products.
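As a rough illustration of why a reduce-multiply instruction helps with dot products, here is a scalar Python model of the general idea: groups of four 8-bit elements are multiplied pairwise and each group is summed into one wider accumulator lane. This is a sketch under stated assumptions, not a definition of vrmpy. The function name is made up, the signedness variants of the real instruction are not modeled, and 32-bit wraparound is ignored.

```python
from typing import List, Optional, Sequence


def reduce_multiply_bytes(a: Sequence[int], b: Sequence[int],
                          acc: Optional[Sequence[int]] = None) -> List[int]:
    """Multiply bytes pairwise and sum each group of four into one output lane."""
    if len(a) != len(b) or len(a) % 4 != 0:
        raise ValueError("inputs must have equal length, a multiple of 4")
    lanes = len(a) // 4
    out = list(acc) if acc is not None else [0] * lanes
    if len(out) != lanes:
        raise ValueError("accumulator must have one entry per output lane")
    for lane in range(lanes):
        base = 4 * lane
        out[lane] += sum(a[base + i] * b[base + i] for i in range(4))
    return out


if __name__ == "__main__":
    activations = [1, 2, 3, 4, 5, 6, 7, 8]
    weights = [1, 1, 1, 1, 1, 1, 1, 1]
    print(reduce_multiply_bytes(activations, weights))
```

A quantized convolution inner loop is essentially a long sequence of such 4-element partial dot products accumulated into 32-bit results, so one vector instruction of this kind replaces many scalar multiply-adds.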

And according to my test, SNPE on 8155 Hexagon could execute mobilenet v2 quant model in less than 5ms.

One more thing: thanks to the help of and discussion with @kparzysz, and based on the runtime PR he contributed, I have completed TVM support for Hexagon (LLVM Hexagon codegen, schedules, Hexagon parallel support and so on). I must admit that we still have a big gap compared with SNPE, and I am working on it. When the time is right, and following up with @kparzysz, I could contribute this back upstream.