[RFC][µTVM] Standalone µTVM Roadmap

areusch · June 16, 2020, 9:12pm

This RFC outlines a high-level roadmap towards what we might consider a standalone µTVM (Micro TVM). By “standalone,” we mean a cohesive set of features that enables a few end-user goals, one of them being standalone execution of optimized TVM models on-device.

In the coming weeks, I’ll be posting RFCs (as they’re written) for work that enables these goals. This RFC is meant to serve as context for those RFCs, as well as an overall place to discuss high-level goals of µTVM. I’m definitely interested in everyone’s thoughts on this overall direction for µTVM.

Goals

This roadmap aims to enable these potential end-user goals:

1. Test simple models on supported hardware without writing any microcontroller code. “Simple models” means models without conditional execution that fit wholly in the device flash and can be evaluated without reusing RAM. A user should be able to view execution time (timed accurately on device), code size, memory consumption, and model output. It should be feasible to test performance under different SoC configurations, though this may involve writing microcontroller firmware or, e.g., tweaking RTOS settings.
2. Easily package a tested model (from #1) into a C library with the following properties:
   - BYO memory allocator, plus a provided standard allocator with BYO buffer
   - no malloc() calls (outside of calls to the internal memory allocator)
   - graph-based runtime, but the graph can be fixed/compiled AOT
   - can be configured to use the same supporting library functions as are used during autotuning/eval. Specifically, this means that the same TVMBackend functions invoked during autotuning are also invoked in production.
3. Easily autotune supported operators without having to write much TVM code beyond the model definition. You shouldn’t have to understand how TVM works to try autotuning.

Projects

We think these projects are the right ones to pursue in order to enable these goals. More detail for each of these projects will be given in RFCs to follow this one.

1. µTVM On-Device RPC Server (PoC)

Description: Following the RPC modularization PR, we propose to port the TVM C Runtime to bare metal targets, and use the MinRPC server to implement a (limited) TVM RPC server on-device using any pipe-like transport (e.g. UART, Ethernet, USB, semihosting, etc). This implements only the C++ RPCEndpoint on device, not other features implemented behind PackedFuncs, such as LoadModule, GraphRuntime, etc.

Rationale: TVM currently encodes device-specific memory layouts in the repository. In addition, TVM needs some way to specify the SoC configuration (e.g. oscillator, caches, power modes, etc) to reliably reproduce results. In order to scale past a few devices, use flash effectively, and take advantage of platform efforts such as Zephyr, mBED, Mynewt, and others, TVM should adopt a more portable µTVM compilation/linking strategy.
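To make “pipe-like transport” in the description above concrete, here is a minimal sketch of the kind of byte-stream interface an on-device RPC server could sit on top of. Everything here (`Transport`, `TransportWriteAll`, the `Loopback` FIFO) is a hypothetical name invented for illustration and is not part of the TVM C runtime or MinRPC; the in-memory loopback only stands in for a UART or similar so the sketch can be exercised on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-stream transport; not a TVM API. A UART, USB, or
 * semihosting driver would each supply its own read/write callbacks. */
typedef struct {
  void* ctx; /* driver-private state passed back to the callbacks */
  /* Each callback returns the number of bytes transferred (possibly fewer
   * than requested), or a negative value on error. */
  ptrdiff_t (*write)(void* ctx, const uint8_t* data, size_t nbytes);
  ptrdiff_t (*read)(void* ctx, uint8_t* data, size_t nbytes);
} Transport;

/* Calls write() until every byte is sent. Returns 0 on success, or -1 if a
 * write fails or makes no progress (a real driver might block instead). */
int TransportWriteAll(const Transport* t, const uint8_t* data, size_t nbytes) {
  assert(t != NULL && t->write != NULL);
  while (nbytes > 0) {
    ptrdiff_t n = t->write(t->ctx, data, nbytes);
    if (n <= 0) return -1;
    data += n;
    nbytes -= (size_t)n;
  }
  return 0;
}

/* In-memory FIFO used as a stand-in transport for host-side testing. */
#define LOOPBACK_CAPACITY 16
typedef struct {
  uint8_t data[LOOPBACK_CAPACITY];
  size_t head;  /* index of the next byte to read */
  size_t count; /* bytes currently queued */
} Loopback;

static ptrdiff_t LoopbackWrite(void* ctx, const uint8_t* data, size_t nbytes) {
  Loopback* lb = (Loopback*)ctx;
  size_t n = 0;
  while (n < nbytes && lb->count < LOOPBACK_CAPACITY) {
    lb->data[(lb->head + lb->count) % LOOPBACK_CAPACITY] = data[n++];
    lb->count++;
  }
  return (ptrdiff_t)n;
}

static ptrdiff_t LoopbackRead(void* ctx, uint8_t* data, size_t nbytes) {
  Loopback* lb = (Loopback*)ctx;
  size_t n = 0;
  while (n < nbytes && lb->count > 0) {
    data[n++] = lb->data[lb->head];
    lb->head = (lb->head + 1) % LOOPBACK_CAPACITY;
    lb->count--;
  }
  return (ptrdiff_t)n;
}

/* Resets the FIFO and returns a Transport bound to it. */
Transport LoopbackTransport(Loopback* lb) {
  Transport t = {lb, LoopbackWrite, LoopbackRead};
  lb->head = 0;
  lb->count = 0;
  return t;
}
```

Keeping the interface down to two callbacks and a context pointer is one way the same server code could be reused across transports; the actual interface would be settled in the follow-up RFCs.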

2. µTVM CI in TVM

Description: Write a CI test for the On-Device Runtime against x86 and potentially simulated bare-metal implementations (e.g. QEMU or other device emulators). Run the CI as part of the TVM pre-submit. We don’t intend to include real hardware in the TVM pre-submit. Outside of the pre-submit, we’d like to encourage use of the CI test to validate implementations of the on-device runtime on real hardware.

Rationale: Some CI test is needed to protect against breakages in the CRT on bare metal. The CI should be executable by all TVM contributors, since it will be in the pre-submit. The same test as is used in the pre-submit should be sufficient to validate real hardware.

3. Enable AutoTVM using the on-device runtime

Description: Modify the AutoTVM build process to create µTVM On-Device Runtime binaries and flash them as appropriate for the target platform.

Rationale: AutoTVM needs to evaluate performance in scenarios that exactly mimic real-world device configuration.

4. Place Model Weights in Flash

Description: Modify C codegen to output supplied model weights as const arrays, possibly with a user-specified section.

Rationale: Allows for more realistic use of device memory and allows larger models to fit.
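As a rough illustration of what “const arrays, possibly with a user-specified section” could look like in generated C, consider the sketch below. The section name `.rodata.model_weights`, the macro `MODEL_WEIGHTS_SECTION`, and the tiny dense layer are assumptions made up for this example, not actual TVM codegen output; the `section` attribute is a GCC/Clang extension, and whether the data ends up in flash depends on the platform’s linker script.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical section name. The attribute is GCC/Clang-specific, so it is
 * compiled out elsewhere (and on Apple targets, which use a different
 * section-name syntax). */
#if defined(__GNUC__) && !defined(__APPLE__)
#define MODEL_WEIGHTS_SECTION __attribute__((section(".rodata.model_weights")))
#else
#define MODEL_WEIGHTS_SECTION
#endif

/* Because the array is const, the toolchain can keep it in read-only
 * storage (typically flash on a microcontroller) instead of copying it
 * into RAM at startup. */
MODEL_WEIGHTS_SECTION
static const float dense_weights[2][3] = {
    {0.5f, -1.0f, 2.0f},
    {1.5f, 0.0f, -0.5f},
};

/* Computes y = W * x, reading the weights directly from the const array. */
void DenseForward(const float x[3], float y[2]) {
  assert(x != NULL && y != NULL);
  for (size_t i = 0; i < 2; ++i) {
    float acc = 0.0f;
    for (size_t j = 0; j < 3; ++j) {
      acc += dense_weights[i][j] * x[j];
    }
    y[i] = acc;
  }
}
```

A named section lets a firmware engineer place all model weights with one rule in the linker script, for example to move them into external flash.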

5. Graph Runtime on bare metal

Description: Make the graph runtime or full-model execution work on bare metal with a firmware-friendly interface. Without this project, models still need to be driven end-to-end by a connected TVM “supervisor” instance containing the GraphRuntime. This change enables firmware engineers to integrate TVM models into production applications.

Rationale: Supports goal #2, and allows us a chance to ensure that the on-device runtime executes graphs in the same way both during AutoTVM and during production.

6. Export stats from the on-device runtime

Description: Provide RPC calls for stats like execution time, memory usage.

Rationale: Supports goal #1. Allows firmware engineers to better evaluate TVM model output and collaborate with other engineers/data scientists involved with model development.
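One possible shape for the on-device bookkeeping behind such stats is sketched below. The `RuntimeStats` struct and its functions are hypothetical, not an existing TVM interface; the sketch assumes the platform provides a free-running 32-bit tick counter whose current value the caller passes in, and it leaves out the RPC call that would return the struct to the host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stats record kept by the on-device runtime. */
typedef struct {
  uint32_t run_start_ticks;  /* tick count captured when the current run began */
  uint32_t last_exec_ticks;  /* duration of the most recent run */
  uint32_t total_exec_ticks; /* sum of all run durations */
  uint32_t num_runs;         /* number of completed runs */
  size_t peak_memory_bytes;  /* largest memory usage reported so far */
} RuntimeStats;

/* Call immediately before executing the model. */
void StatsRunBegin(RuntimeStats* stats, uint32_t now_ticks) {
  assert(stats != NULL);
  stats->run_start_ticks = now_ticks;
}

/* Call immediately after executing the model. Unsigned subtraction gives
 * the right duration even if the counter wrapped around once mid-run. */
void StatsRunEnd(RuntimeStats* stats, uint32_t now_ticks) {
  assert(stats != NULL);
  uint32_t elapsed = now_ticks - stats->run_start_ticks;
  stats->last_exec_ticks = elapsed;
  stats->total_exec_ticks += elapsed;
  stats->num_runs++;
}

/* Call whenever the allocator's bytes-in-use figure changes. */
void StatsRecordMemory(RuntimeStats* stats, size_t bytes_in_use) {
  assert(stats != NULL);
  if (bytes_in_use > stats->peak_memory_bytes) {
    stats->peak_memory_bytes = bytes_in_use;
  }
}
```

Passing tick values in, rather than calling a timer from inside these functions, keeps the sketch independent of any particular SoC or RTOS timer API.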

Proof of Concept

Parts of project #1 work to some degree here. The short-term plan is to split this PoC into a couple of pieces, each with its own RFC, and discuss/merge piece by piece.

Next Steps

This roadmap is just an initial concept and we’d definitely like to work with the community to make sure this direction is useful for others. We intend to drive some of this work from OctoML, but there are a lot of tasks and there are plenty of ways to get involved.

More immediately we’d love feedback on the overall direction. The On-Device RPC Server (project #1) underpins most of the rest of the work, so we’d welcome review on our initial implementation (RFCs and PRs to come soon). Once that lands in the CI, it should be much easier to collaborate on the rest of this effort.

We’ll also have a µTVM-focused meetup on Thursday 6/18 9am PDT if you’d like to discuss in a higher-bandwidth setting. We’ll post any followup points for discussion on the forum.

manupa-arm · June 16, 2020, 1:45pm

Thanks for the RFC @areusch, and especially for posting this ahead of the meetup. I am trying to understand some bits around the subgoal of a BYO memory allocator. Would you be able to elaborate on this? (That is, is this about allocating tensors to memory blocks/regions/addresses? If so, are we looking at static or dynamic allocation?)

Moreover, I’m not sure I entirely follow what “BYO buffer” means.

I’d appreciate it if you could shed some light on this.

areusch · June 16, 2020, 3:45pm

Hi @manupa-arm,

Thanks for reading over the RFC!

BYO = bring your own; this would be akin to allowing developers to reimplement vmalloc. This comment mostly reflects how the CRT is organized today, but some changes may need to be made to the compilation process to make it easy to replace the supplied memory allocator. In the context of a standard allocator, “BYO buffer” means that the global buffer used by the standard allocator is supplied by the application’s main().

For the CRT, we initially expect this would be used to dynamically allocate long-lived global state (e.g. the global function registry) as well as tensors. Most of the details beyond this still need to be decided; I’ll post an RFC on this in the near future.
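As a rough sketch of the “standard allocator with BYO buffer” idea, here is a minimal bump allocator that hands out memory from a buffer the application supplies. The names `ArenaAllocator`, `ArenaInit`, and `ArenaAlloc` are hypothetical and are not the CRT’s actual allocator; this version never frees, which only suits long-lived state, and the real design is still to be worked out in the RFC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bump allocator over a caller-supplied ("BYO") buffer.
 * It never calls malloc() and never frees individual allocations. */
typedef struct {
  uint8_t* base; /* start of the caller-supplied buffer */
  size_t size;   /* total bytes available in the buffer */
  size_t used;   /* bytes consumed so far, including alignment padding */
} ArenaAllocator;

/* Binds the allocator to a buffer owned by the application. */
void ArenaInit(ArenaAllocator* a, uint8_t* buffer, size_t size) {
  assert(a != NULL && buffer != NULL);
  a->base = buffer;
  a->size = size;
  a->used = 0;
}

/* Returns nbytes of memory aligned to `align` (which must be a power of
 * two), or NULL if the buffer does not have enough room left. */
void* ArenaAlloc(ArenaAllocator* a, size_t nbytes, size_t align) {
  assert(a != NULL && align != 0 && (align & (align - 1)) == 0);
  uintptr_t start = (uintptr_t)a->base + a->used;
  uintptr_t aligned = (start + (align - 1)) & ~((uintptr_t)align - 1);
  size_t offset = (size_t)(aligned - (uintptr_t)a->base);
  if (offset > a->size || nbytes > a->size - offset) {
    return NULL;
  }
  a->used = offset + nbytes;
  return (void*)aligned;
}
```

In firmware, main() would typically declare a static array sized for the model and pass it to `ArenaInit`, so the memory budget is visible at link time.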

Andrew

tgall_foo · June 16, 2020, 10:09pm

Thanks for putting this out Andrew. Lots to unpack, but generally looks like a good list.

On #6, I suspect a few of those stats could be gathered and reported prior to shipping down to a device. (like mbed does for instance)

On #2, I wonder if perhaps something that runs daily against master would be wise; then, for code targeting µTVM, there might be a CI off to the side that runs against some set of representative hardware on demand. This could be done as part of a Gerrit review perhaps?

areusch · June 18, 2020, 8:50pm

For #6 (export stats), I think you’re absolutely right. I think there can be other interesting on-device stats (e.g. IRQs triggered, number of function executions, etc). This is also the last one on the roadmap since it’s a bit less planned relative to the others.

On #2, I think some part should run in the pre-submit. I don’t think we should include custom hardware in the TVM pre-submit, for a few reasons:

- It’s harder for contributors to reproduce errors. Only contributors with that hardware could resolve CI errors that happen there.
- It’s easier for hardware to run into heisenbugs, and we shouldn’t use the presence or absence of those to gate TVM code submission.
- There’s some logistical challenge around hosting the hardware for a CI (this one we can overcome, but we should think about how to place the CI for some piece of hardware closer to those with specific knowledge of that hardware, in case some offline troubleshooting is needed).

Right now for the presubmit, I’m thinking that we should run a suite of “black-box acceptance tests” against an x86 RPC server running in a child process. Those can also serve to validate the C runtime on x86, when compiled standalone.

I do think some regular automated job against hardware is important, too. I need to think a bit more about how we might put this together, and I’m open to thoughts from the community as well!