Offloading subgraphs to Hexagon

Note: This is a repost/follow-up to a question I originally sent to Tianqi.

I am working on adding support for the Hexagon processor to TVM, and I ran into a problem that will likely require more work, so I need some guidance.

A little background: Hexagon is a full-fledged processor (like a CPU), but in actual devices (phones, etc.) it is never the main CPU. The main CPU is typically some variant of an ARM processor. The execution model that we are targeting is that the graph runtime will run on the ARM, while the nodes will be executed on Hexagon. There is some performance penalty for doing cross-processor calls, so we really need to minimize the number of times that the ARM calls functions on Hexagon.
The flip side is that Hexagon typically doesn’t have a lot of memory, and it only has very basic operating system support, so complex things like a Python runtime are generally not available.

I got things to work (in terms of getting correct answers) with only offloading computation kernels, following the GPU approach, and adding Hexagon as a device_type to TVMContext. This generates a number of calls from CPU to Hexagon, so it won’t work in practice. In the end we’d want to offload entire subgraphs, but what I’m trying to do now is to develop a prototype where the entire graph is offloaded to Hexagon, and I’m running into issues here.

  1. I hacked _build_for_device in python/tvm/build_module.py to mark every function as DeviceFunc when the target is Hexagon. I can live with this for now, but obviously this won’t work in the long term.
  2. There was an assertion somewhere, expecting the number of functions targeted for the CPU to be non-zero. I worked around this by adding a phony function to the CPU list that only returns 0.
  3. The graph runtime expects a function for each graph node to exist on the CPU. With every function designated as “device”, they are all compiled for Hexagon and none of them exist in the CPU module. I suppose that the entire graph should be collapsed into a single node in this case, but I don’t know where this type of transformation should be done.

Tianqi’s response:

Build a minimal TVM runtime (including the graph runtime) on Hexagon itself, which exposes a few C APIs that allow you to get a “remote” handle (of a function/array/module) and run execution. Then we can create a driver module on the host (ARM) that talks to the Hexagon module via the cross-CPU calls. In this way, we can get a Hexagon function which is graph_runtime.run, and that single function executes all the kernels. There are two related runtime modules in TVM that are implemented in this way:

Both of them can serve as a good reference; likely the on-Hexagon runtime will expose the same C API that the normal TVM runtime exposes. Then the on-ARM runtime just wraps the on-Hexagon one to provide the same interface to the user:

For example:

  • ARMModule.GetFunction calls HexDeviceModule.GetFunction, which gets back a function handle
  • DeviceAPIOnArm.Allocate calls HexDeviceModule.Allocate to allocate the memory space
  • ARMModule.Function.operator() sets up the necessary arguments onto the argument stack, and then invokes the handle on Hexagon.
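
To make this concrete, here is a rough C++ sketch of what such a host-side driver module might look like. Everything in it is hypothetical: the hexagon_* C functions stand in for whatever C API the on-Hexagon runtime would actually expose, and the classes are simplified stand-ins for the TVM module/device-API interfaces, not existing TVM code.

#include <cstddef>
#include <string>

// Assumed C API exposed by the on-Hexagon runtime (names are made up).
typedef void* HexFuncHandle;
extern "C" HexFuncHandle hexagon_get_function(const char* name);
extern "C" void* hexagon_allocate(std::size_t nbytes, std::size_t alignment);
extern "C" void hexagon_invoke(HexFuncHandle f, void* args, int num_args);

// Host-side module: every operation forwards across the processor boundary.
class ARMModule {
 public:
  class Function {
   public:
    explicit Function(HexFuncHandle h) : handle_(h) {}
    // Marshal the arguments, then make a single cross-processor call
    // to run the function on Hexagon.
    void operator()(void* args, int num_args) const {
      hexagon_invoke(handle_, args, num_args);
    }
   private:
    HexFuncHandle handle_;
  };

  // Ask the Hexagon-side module for a remote function handle.
  Function GetFunction(const std::string& name) {
    return Function(hexagon_get_function(name.c_str()));
  }
};

// Device API on the ARM side: the memory lives on Hexagon; the host
// only holds pointers into that space.
class DeviceAPIOnArm {
 public:
  void* Allocate(std::size_t nbytes, std::size_t alignment) {
    return hexagon_allocate(nbytes, alignment);
  }
};

The point is that every host-side object is a thin proxy, and each method is one cross-processor call; in particular, fetching graph_runtime.run through GetFunction and invoking it once runs the whole graph on Hexagon.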

I have further questions:

GraphRuntime::SetupOpExecs walks the list of graph nodes and creates a callable object for each one. GraphRuntime::Run would then walk over the list of these objects and invoke each one of them. What would this look like if we were to support offloading subgraphs? The goal is to minimize calls across processor boundaries, so I’m imagining something like

void GraphRuntime::Run() {
  for (size_t s = 0; s < num_function_groups; ++s) {
    auto& b = function_groups[s];
    auto& r = get_runtime_for(b);  // CPU or offload target
    // Ignore argument passing for the sake of clarity.
    r.exec_group(b.begin(), b.end());  // Single call to execute a batch of functions.
  }
}
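
For concreteness, the function_groups above could hypothetically be built by walking the nodes in execution order and starting a new group whenever the device type changes. Node and FunctionGroup below are made-up types, not existing TVM classes:

#include <vector>

struct Node { int device_type; /* op, inputs, outputs, ... */ };

struct FunctionGroup {
  int device_type;                 // where this batch of nodes runs
  std::vector<const Node*> nodes;  // contiguous run of same-target nodes
};

std::vector<FunctionGroup> BuildFunctionGroups(const std::vector<Node>& order) {
  std::vector<FunctionGroup> groups;
  for (const Node& n : order) {
    // Start a new group whenever the target changes between consecutive nodes.
    if (groups.empty() || groups.back().device_type != n.device_type)
      groups.push_back(FunctionGroup{n.device_type, {}});
    groups.back().nodes.push_back(&n);
  }
  return groups;
}

Note that this makes the quality of the grouping depend entirely on the node ordering, which is the issue discussed further below.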

I’d like this to be a generic solution, not something specific to Hexagon.

Tianqi’s suggestion to create a driver module for the host seems simpler, but only in the case where all functions are offloaded. Otherwise, that driver would have to contain the logic above. Also, wouldn’t the graph on the host side need to be reduced to a single node?

In summary:

  1. If there were a general mechanism to offload entire subgraphs, what would it look like at a conceptual level? Is there any support for having that functionality?
  2. If any graph node(s) were to be offloaded, would the graph on the host need to be updated?
  3. If I had a graph runtime on Hexagon, would it essentially be GraphRuntime, but compiled for Hexagon, or just something that implements “run”?

OK, to answer your question with regard to 3: if you have a graph runtime on Hexagon, yes, it would simply be the TVM runtime (including the graph runtime) compiled for Hexagon. I can understand the thoughts on why you want the more generic way (see my follow-up post).

OK, now to follow up on the generic function call case. I agree that there is an interesting issue of cross-processor calls which we did not consider so far. Our current approach is fine for traditional GPU/CPU because these calls are handled asynchronously by the driver, so the multiple calls won’t be too much of a problem (there is no synchronization cost).

Here is a possible second thought that is a bit more general:

  • Make a small “driver” for the Hexagon runtime that “lazily” calls into the Hexagon cores; basically, we queue up the function handle and arguments in an execution queue.
  • We only actually run the offloading when there is a “synchronization” happening:
      1. When there is a memory copy in/out of Hexagon
      2. When synchronize is called

This way we offload a sequence of packed function calls to the device, and there is no concept of the graph necessary. The only drawback is that we will still need to record the packed calls to the queue on the fly, but given that all the arguments and handles are minimal, I feel this could be a good compromise.
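
A minimal sketch of what such a lazy driver could look like, assuming a single batched cross-processor call is available (all types and names below are made up, not an existing TVM API):

#include <cstddef>
#include <utility>
#include <vector>

typedef void* HexFuncHandle;

// One queued packed-function call: the handle plus its (simplified) arguments.
struct PendingCall {
  HexFuncHandle func;
  std::vector<void*> args;
};

class LazyHexagonDriver {
 public:
  // Called by the per-node stubs: record the call, do not cross the
  // processor boundary yet.
  void Enqueue(HexFuncHandle f, std::vector<void*> args) {
    queue_.push_back(PendingCall{f, std::move(args)});
  }

  // Flush the whole queue in one cross-processor round trip.
  void Synchronize() {
    if (queue_.empty()) return;
    ExecuteBatchOnHexagon(queue_);  // assumed single batched call
    queue_.clear();
  }

  // Copying results back to the host is a synchronization point:
  // flush first, then perform the actual copy (elided here).
  void CopyFromDevice(void* dst, const void* src, std::size_t nbytes) {
    Synchronize();
    (void)dst; (void)src; (void)nbytes;
  }

 private:
  void ExecuteBatchOnHexagon(const std::vector<PendingCall>& calls);
  std::vector<PendingCall> queue_;
};

With this, the number of cross-processor round trips is bounded by the number of synchronization points rather than by the number of graph nodes.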

I like the lazy evaluation idea, but it still needs some changes in the graph runtime on host. Consider this scenario:

        inputs
           |
        f0_off()
         /     \
        /       \
f1_host()        f2_off()
        \       /
         \     /
        f3_host()
           |
        outputs

Where _off are offloaded functions and _host are functions that execute on the host. The desired execution is one where both offloaded functions execute in a single group. If I understand your idea correctly, the host would have stubs for f0 and f2, which would do the queuing. However, if the graph runtime executes the nodes as f0, f1, f2, f3, then the two offloaded functions will execute separately (since f1 needs the output from f0). At the same time, a functionally equivalent ordering of the nodes would be f0, f2, f1, f3, which would allow the offloaded functions to execute together (in the same group). As it is now, there is nothing (that I’m aware of) that could be done in the stubs for f0/f2 to prevent the first scenario in favor of the second. Some form of graph partitioning would be needed that finds an ordering which groups nodes offloaded to the same target together (subject to dependency constraints). Would such a change be acceptable?
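
For illustration, one naive form of such a pass would be a greedy topological sort that, among the ready nodes, prefers nodes on the same target as the last emitted node. This is only a sketch of the idea, not existing TVM code:

#include <vector>

struct GNode {
  int device_type;          // target this node runs on
  std::vector<int> succs;   // indices of nodes consuming this node's output
  int unresolved_deps = 0;  // inputs not yet scheduled
};

std::vector<int> GroupingOrder(std::vector<GNode> nodes) {
  std::vector<int> ready, order;
  for (int i = 0; i < static_cast<int>(nodes.size()); ++i)
    if (nodes[i].unresolved_deps == 0) ready.push_back(i);

  int last_dev = -1;
  while (!ready.empty()) {
    // Prefer a ready node on the same target as the previously emitted one.
    int pick = 0;
    for (int i = 0; i < static_cast<int>(ready.size()); ++i)
      if (nodes[ready[i]].device_type == last_dev) { pick = i; break; }
    int n = ready[pick];
    ready.erase(ready.begin() + pick);
    order.push_back(n);
    last_dev = nodes[n].device_type;
    for (int s : nodes[n].succs)
      if (--nodes[s].unresolved_deps == 0) ready.push_back(s);
  }
  return order;
}

On the graph above this yields f0, f2, f1, f3, so the stubs for f0 and f2 would queue back to back and the lazy driver could flush them in one group.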

What you said is true. And that does not require a change to the graph runtime; we should instead add compiler support to generate the ordering of the nodes, to do the reorder if necessary. We can certainly add a compiler pass to do so, and especially with the new nnvmv2 update this would be easier. In the meanwhile, we can start by implementing the driver via lazy evaluation.