TVM Compiler and MAERI Support

BracketMaster · January 12, 2020, 7:55pm

Hello there, I’m a Georgia Tech graduate student working in the Synergy Lab. We focus on discrete machine learning accelerators, and some time ago we released an ML accelerator architecture called MAERI.

We have implemented it in an FPGA and wish to add support for arbitrary CNN models.

I am fascinated with TVM - but am wondering if it supports a particular use case that we need.

In MAERI, a 3x3 * 128x128 convolution requires 3x3 = 9 mults. So if a particular instantiation of MAERI has 64 mults, that means I can support convolutions with 3x3, 4x4, and 5x5 kernels all in parallel.
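To make the arithmetic concrete, here is a rough sketch. It assumes one multiplier per kernel weight, as in the 3x3 example; the function names are made up for illustration and are not part of MAERI or TVM.

```python
def mults_needed(kernel_shapes):
    """Multipliers needed to run all kernels at once, one per kernel weight."""
    return sum(h * w for h, w in kernel_shapes)


def fits(kernel_shapes, num_mults):
    """True if every kernel in the list can be mapped at the same time."""
    return mults_needed(kernel_shapes) <= num_mults


kernels = [(3, 3), (4, 4), (5, 5)]
print(mults_needed(kernels))  # 9 + 16 + 25 = 50
print(fits(kernels, 64))      # True: 50 <= 64
```

So the three kernels use 50 of the 64 multipliers, leaving 14 idle.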

Is there a way to make TVM aware of this - that is - can I write ordering rules that support this?

Often in CNNs, a particular layer can have multiple outputs. Consider an ML layer that has 4 outputs. The physical constraints of the MAERI accelerator might only support finishing 3 of those 4 outputs in parallel. This leaves 1 output to be computed by itself. So while this 1 output is being computed, the MAERI accelerator can begin evaluating outputs of the next layer (those that depend only on the 3 outputs we just computed) in parallel.
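The kind of ordering I have in mind can be sketched as a greedy list scheduler. This is a simplification: it assumes every output takes one slot and one time step, whereas the real constraint depends on multiplier counts. The task names and the `schedule` function are hypothetical.

```python
def schedule(deps, capacity):
    """Greedy list scheduling: each step runs up to `capacity` tasks whose
    dependencies all finished in earlier steps."""
    done = set()
    steps = []
    remaining = list(deps)  # dicts preserve insertion order
    while remaining:
        ready = [t for t in remaining if all(d in done for d in deps[t])]
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        batch = ready[:capacity]
        steps.append(batch)
        done.update(batch)
        remaining = [t for t in remaining if t not in done]
    return steps


# Layer 1 has 4 independent outputs; layer 2 outputs depend on some of them.
deps = {
    "L1_out0": [], "L1_out1": [], "L1_out2": [], "L1_out3": [],
    "L2_out0": ["L1_out0", "L1_out1", "L1_out2"],
    "L2_out1": ["L1_out0", "L1_out1", "L1_out2", "L1_out3"],
}
for step, batch in enumerate(schedule(deps, capacity=3)):
    print(step, batch)
```

With a capacity of 3, the second step runs the leftover `L1_out3` together with `L2_out0`, which is the overlap described above.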

Given a Keras model and the physical constraints of MAERI, can I have TVM determine a valid ordering for a CNN? I have browsed through the tutorials, but I don’t think they answered this question.

Yehowshua

masahi · January 13, 2020, 3:22am

cc @thierry he can probably answer your question.

thierry · January 13, 2020, 6:22am

Hi @BracketMaster, it’s interesting work you’re introducing here! I believe you can implement a simple Relay pass that can iterate through a network and based on rules that you set, determine how many layers can be evaluated in pipelined fashion with MAERI.

There is some documentation on how to write a Relay pass here: https://docs.tvm.ai/tutorials/dev/low_level_custom_pass.html

Next there is code on how to import a Keras model into Relay: https://docs.tvm.ai/tutorials/frontend/from_keras.html

Hope this helps.

Thierry

aca88 · January 13, 2020, 8:01am

Hey there,

I would say there are (at high level) two ways of doing what you want.

1. Coarse grain programming: This is mostly relevant if MAERI’s ISA is very similar to the TOPI operator library. In this case, you could most likely overload the schedules for your specific accelerator (to be more specific, you would not go into the Tensor Expression (TE) representation of TVM).
- In its simplest form, a complete TOPI operator would be one “call” to the instruction construction/dispatch of the MAERI “instruction stream”.
- You will probably still have to deal with some aspects of the codegen, but I would guess this would be minimal, and most of your effort would go into bridging the Relay/TOPI representation to your ISA.
- Example: how TVM targets cuDNN or any other linear algebra library.

2. Fine grain programming: This is what you will need to do if you want to map a low level ISA to TOPI operators. I am not an expert in MAERI, but the obvious example I can give is VTA. VTA requires an instruction stream to represent a simple form of conv2d, because its ISA is lower level (more general). In this case, you could expose “the way” to do the loop level transformations to fit the finer granularity of your accelerator. Once this is done, you can “call” the specific implementations that generate the bit representation of your instructions, etc.

Although this sounds similar to the previous point, it MAY require the following added complexity:

- The set of low level loop transformations currently supported by the main branch has some limitations. Nonetheless, TVM does give you the liberty to write your own low level passes, which might sidestep some of these limitations.
- It requires you to work at the TE representation and its Abstract Syntax Tree (AST), which is considerably more complex than the Relay/TOPI representation.
- Example: VTA

One advantage of this last option is that you may profit from some of the low level optimizations that TVM already has at the TE level. One obvious example is minimizing the number of operations/data transmissions by eliminating dead code, as well as other memory size checks. Another, less obvious advantage is being able to use the AutoTVM functionality (which to my understanding works only at the TE level).

Illustrative example: Assume you have a conv2d parametrized such that the output tensor’s H and W dimensions are odd (as in not evenly divisible by 2, not that they are strange), followed by a factor 2 pooling layer with the padding='valid' parameter.

- IF you declare this as one schedule, TVM should optimize the H, W dimensions to drop the computation of “one” row and column, since they lead to no outputs after pooling.
- IF you map some TVM arrays to some local memory regions and you know the limits of these regions, then you can build in extra compiler checks to make sure stuff runs on your system.
- IF you design the mapping steps as part of the supported AutoTVM scheduling knobs, you can allow for workload-specific automatic schedule space exploration.

Please note all the IFs: TVM does some things out of the box, but most of the time for specialized HW you need to incorporate the “differentiating factors” into the compilation process in order to make use of them during compilation (this should be obvious).
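The size arithmetic behind the first IF can be checked by hand. Below is a minimal sketch using the standard output-size formulas; the concrete sizes (9x9 input, 3x3 kernel) are only an assumed example, the pooling stride is assumed equal to the pooling window, and none of this is TVM API.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output length of a convolution along one dimension."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_valid_out(size, pool=2):
    """Output length of 'valid' pooling with stride equal to the window."""
    return (size - pool) // pool + 1


h = conv_out(9, kernel=3)      # 7, an odd conv2d output height
pooled = pool_valid_out(h)     # 3 pooling windows fit
rows_used = pooled * 2         # the windows cover rows 0..5
rows_unused = h - rows_used    # 1 conv output row never reaches the pool output
print(h, pooled, rows_used, rows_unused)
```

The last conv2d row (and, by the same argument, the last column) feeds no pooling window, so its computation is dead work that a fused schedule could skip.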

The first of the options is what I would almost refer to as “integrating a hand coded library into TVM to support a loosely coupled accelerator”. The second I would refer to as “integrating valid mapping concepts of your HW into TVM’s codegen, in order to reuse most of the TVM stack and allow fine grain programming of a tightly coupled/fine grain accelerator”.

Disclaimer: Most of what I have laid out here takes into account only the compilation flow, not the runtime/deployment flow.

BracketMaster · July 23, 2020, 6:29am

Hi again. So I’ve started working on using TVM again. I think I will have to use option 2 from what you mentioned.

MAERI has three primitives: Conv2d, Matmul, and vec add. With these primitives, it is possible to evaluate most NNs.
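As a first step, I imagine checking which operators in a model land on one of these primitives. Here is a minimal sketch; the mapping table and the primitive names are my own assumptions for illustration, not an existing TVM or MAERI interface (the keys are Relay operator names).

```python
# Hypothetical mapping from Relay operator names to MAERI primitives.
PRIMITIVE_FOR_OP = {
    "nn.conv2d": "conv2d",
    "nn.dense": "matmul",
    "add": "vec_add",
}


def split_by_support(op_names):
    """Split operators into (op, primitive) pairs and unsupported leftovers."""
    supported = [(op, PRIMITIVE_FOR_OP[op]) for op in op_names
                 if op in PRIMITIVE_FOR_OP]
    unsupported = [op for op in op_names if op not in PRIMITIVE_FOR_OP]
    return supported, unsupported


ops = ["nn.conv2d", "add", "nn.relu", "nn.dense"]
supported, unsupported = split_by_support(ops)
print(supported)
print(unsupported)
```

Anything in the unsupported list would have to run elsewhere or be expressed in terms of the three primitives.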

Is there some code showing how VTA mapped its ISA to TOPI, as well as how it generates its custom bitstream?