Note: This is a repost/follow-up to a question I originally sent to Tianqi.
I am working on adding support for the Hexagon processor to TVM and I ran into a problem that will likely require more work, so I need some guidance.
A little background: Hexagon is a full-fledged processor (like a CPU), but in actual devices (phones, etc) it is never the main CPU. The main CPU is typically some variant of an ARM processor. The execution model that we are targeting is that the graph runtime will run on the ARM, while the nodes will be executed on Hexagon. There is some performance penalty for doing cross-processor calls, so we really need to minimize the number of times that the ARM calls functions on Hexagon.
The flip side is that Hexagon typically doesn’t have a lot of memory, and it only has very basic operating-system support, so complex things like a Python runtime are generally not available.
I got things to work (in terms of getting correct answers) by offloading only the computation kernels, following the GPU approach, and adding Hexagon as a device_type to TVMContext. This generates one cross-processor call per kernel, so it won’t work in practice. In the end we’d want to offload entire subgraphs, but what I’m trying to do now is to develop a prototype where the entire graph is offloaded to Hexagon, and I’m running into issues here.
- I hacked _build_for_device in python/tvm/build_module.py to mark every function as DeviceFunc when the target is Hexagon. I can live with this for now, but obviously this won’t work in the long term.
- There was an assertion somewhere, expecting the number of functions targeted for the CPU to be non-zero. I worked around this by adding a phony function to the CPU list that only returns 0.
- The graph runtime expects a function for each graph node to exist on the CPU. With every function designated as “device”, they are all compiled for Hexagon and none of them exist in the CPU module. I suppose that the entire graph should be collapsed into a single node in this case, but I don’t know where this type of transformation should be done.
Tianqi’s response:
Build a minimal TVM runtime (including the graph runtime) on Hexagon itself, which exposes a few C APIs that allow you to get a “remote” handle (of a function/array/module) and run execution. Then we can create a driver module on the host (ARM) that talks to the Hexagon module via the cross-CPU calls. In this way, we can get a single Hexagon function, graph_runtime.run, that executes all the kernels. There are two related runtime modules in TVM that are implemented in this way:
- RPC module: https://github.com/dmlc/tvm/tree/master/src/runtime/rpc, which talks to a remote device. We don’t really need RPC because I assume Hexagon can directly access the ARM’s address space, but the protocol can serve as a reference.
- SGX runtime: https://github.com/dmlc/tvm/tree/master/src/runtime/sgx, which has an in-enclave TVM runtime; the outside module talks to the in-enclave runtime via a C API.
Both can serve as good references. Most likely the on-Hexagon runtime will expose the same C API that the normal TVM runtime exposes, and the on-ARM runtime will just wrap the on-Hexagon one to provide the same interface to the user:
For example:
- ARMModule.GetFunction calls HexDeviceModule.GetFunction, which gets back a function handle
- DeviceAPIOnArm.Allocate calls HexDeviceModule.Allocate to allocate the memory space
- ARMModule.Function.operator() sets up the necessary arguments on the argument stack, and then invokes the handle on the Hexagon.
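To make the wrapper idea above concrete, here is a minimal sketch of a host-side module forwarding calls through remote handles. Everything in it (HexRemote, RemoteHandle, ArmModule, the int-vector calling convention) is hypothetical scaffolding standing in for the real cross-processor C API; it only illustrates the handle-based control flow.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Stand-in for the remote (Hexagon-side) runtime. In reality these would be
// cross-processor calls into a minimal TVM runtime living on the DSP.
using RemoteHandle = uint64_t;

struct HexRemote {
  std::map<std::string, std::function<int(const std::vector<int>&)>> funcs;
  std::map<RemoteHandle, std::function<int(const std::vector<int>&)>> handles;
  RemoteHandle next = 1;

  // Resolve a name to an opaque handle the host can hold on to.
  RemoteHandle GetFunction(const std::string& name) {
    RemoteHandle h = next++;
    handles[h] = funcs.at(name);
    return h;
  }
  int Invoke(RemoteHandle h, const std::vector<int>& args) {
    return handles.at(h)(args);
  }
};

// Host-side (ARM) wrapper module: presents the usual module interface to the
// user but forwards every call across the processor boundary.
class ArmModule {
 public:
  explicit ArmModule(HexRemote* remote) : remote_(remote) {}

  // ArmModule::GetFunction -> remote GetFunction, returning a callable that
  // pushes arguments and invokes the remote handle in one cross-processor call.
  std::function<int(const std::vector<int>&)> GetFunction(const std::string& name) {
    RemoteHandle h = remote_->GetFunction(name);
    HexRemote* remote = remote_;
    return [remote, h](const std::vector<int>& args) {
      return remote->Invoke(h, args);
    };
  }

 private:
  HexRemote* remote_;
};
```

With this shape, asking the wrapper for graph_runtime.run yields a single callable whose invocation crosses the processor boundary exactly once per inference.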
I have further questions:
GraphRuntime::SetupOpExecs walks the list of graph nodes and creates a callable object for each one. GraphRuntime::Run would then walk over the list of these objects and invoke each one of them. What would this look like if we were to support offloading subgraphs? The goal is to minimize calls across processor boundaries, so I’m imagining something like
```
void GraphRuntime::Run() {
  for (size_t s = 0; s < num_function_groups; ++s) {
    auto& b = function_groups[s];
    auto& r = get_runtime_for(b);  // CPU or offload target
    // Ignore argument passing for the sake of clarity.
    r.exec_group(b.begin(), b.end());  // Single call to execute a batch of functions.
  }
}
```
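One way the grouping could be computed: partition the topologically ordered node list into maximal consecutive runs that share a device type, then make one exec_group call per run. The types and the GroupByDevice helper below are hypothetical illustrations, not TVM code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of building function_groups from the node list.
struct Node {
  int device_type;  // e.g. 1 for CPU, some other value for the offload target
};

struct FunctionGroup {
  int device_type;
  size_t begin, end;  // half-open range [begin, end) into the node list
};

// Walk the topologically ordered nodes and start a new group whenever the
// device type changes; otherwise extend the current group.
std::vector<FunctionGroup> GroupByDevice(const std::vector<Node>& nodes) {
  std::vector<FunctionGroup> groups;
  for (size_t i = 0; i < nodes.size(); ++i) {
    if (groups.empty() || groups.back().device_type != nodes[i].device_type) {
      groups.push_back({nodes[i].device_type, i, i + 1});
    } else {
      groups.back().end = i + 1;
    }
  }
  return groups;
}
```

The number of cross-processor calls then scales with the number of device transitions in the schedule rather than with the number of nodes.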
I’d like this to be a generic solution, not something specific to Hexagon.
Tianqi’s suggestion to create a driver module for the host seems simpler, but only in the case where all functions are offloaded. Otherwise, that driver would have to contain the logic above. Also, the graph on the host side would need to be reduced to a single node, wouldn’t it?
In summary:
- If there was a general mechanism to offload entire subgraphs, what would it look like on a conceptual level? Is there any support for having that functionality?
- If any graph node(s) were to be offloaded, would the graph on the host need to be updated?
- If I had a graph runtime on Hexagon, would it essentially be GraphRuntime, but compiled for Hexagon, or just something that implements “run”?
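On the last question, here is a rough sketch of what a minimal on-Hexagon C API might look like if it only needs to implement “run”. All names and the stub bodies are illustrative assumptions, not TVM’s actual interface; a real version would also need set-input/get-output entry points and DLTensor plumbing.

```cpp
#include <string>

// Hypothetical minimal C API an on-Hexagon graph runtime could expose to the
// ARM host: essentially GraphRuntime compiled for Hexagon, reduced to the
// entry points the host driver needs. The stub bodies only simulate behavior
// so the interface shape can be exercised.
extern "C" {

struct HexGraphRt {        // opaque to the host in a real implementation
  std::string graph_json;  // serialized graph, loaded once at creation
  int runs;                // stand-in for the runtime's internal state
};

HexGraphRt* HexagonGraphRuntimeCreate(const char* graph_json) {
  return new HexGraphRt{graph_json, 0};
}

// Executes every node of the graph locally: the host pays exactly one
// cross-processor call per inference, regardless of node count.
int HexagonGraphRuntimeRun(HexGraphRt* rt) {
  rt->runs += 1;  // stand-in for walking the op list and invoking each kernel
  return 0;       // 0 on success, nonzero on error
}

void HexagonGraphRuntimeRelease(HexGraphRt* rt) { delete rt; }

}  // extern "C"
```

Under this sketch, whether the Hexagon side is the full GraphRuntime or just something that implements “run” becomes an internal detail: the host only sees the C entry points.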