Code Generation for a Custom DLA Instruction Set

Hello everybody, I have developed an instruction set architecture for my own (for now abstract) deep learning accelerator and now want to try to build a backend that uses TVM to translate and optimize a model description into my own assembly code.

My problem is that I am not sure where to start. I have read the PR and the information about “Bring Your Own Codegen”, but am still confused about where to extend TVM.

The accelerator uses int8 inputs and weights and has dedicated instructions for convolution, ReLU, matmul, elementwise operations, and a look-up-table activation function. The data for these operations needs to come from internal SRAM and has to be transferred there by the DMA.

Should I just write my own compiler passes for Relay and lower it into the custom assembly myself?

Hi,

I am also in the phase where I want to modify an existing codegen. I would suggest you take an existing codegen, such as the CUDA codegen, as a reference if you have some background in CUDA programming. You can also check the passes and learn how to modify the TVM IRs. Once you understand the data structures (ASTs) of the TVM IRs and the fundamental classes (like StmtExprMutator, StmtExprVisitor, …) and their methods, you will know how to mutate the IRs and thus bind specific nodes (like Add, Mul, or Div nodes) to your own ISA if your hardware cannot be programmed in a general-purpose language (e.g. C).
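
To make that concrete, here is a minimal sketch (not from this thread) of the Python-side counterpart of the same idea: a Relay ExprMutator that intercepts specific call nodes, which is where a DLA-specific representation could later be substituted. The rewrite below is only a no-op placeholder.

import tvm
from tvm import relay
from tvm.relay.expr_functor import ExprMutator

class RewriteAdd(ExprMutator):
    """Single out `add` calls; this is where a DLA-specific node could be substituted."""

    def visit_call(self, call):
        # Recurse into the arguments first so nested calls are handled too.
        new_args = [self.visit(arg) for arg in call.args]
        if call.op.name == "add":
            # Placeholder: a real backend would emit/mark its own representation here.
            return relay.add(new_args[0], new_args[1])
        return relay.Call(call.op, new_args, call.attrs, call.type_args)

x = relay.var("x", shape=(10, 10), dtype="int8")
f = relay.Function([x], relay.add(x, x))
print(RewriteAdd().visit(f))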

Hope my tips help; I am still learning the TVM framework myself. Comments from the developers will be more helpful. :slight_smile:

Your tips already helped a lot, as I was completely lost. I am not experienced with CUDA, but I will still take a look.

Maybe we can help each other. I am very new to TVM as well, but I have spent a lot of time benchmarking various deep learning compilers and have settled on TVM for now.

We have a PR up at the moment showing the BYOC integration process for our accelerator: https://github.com/apache/incubator-tvm/pull/6222. I think this may be a suitable approach for your case, as our accelerator also operates on int8 (uint8). It will depend, though, on how interested you are in taking advantage of the scheduling language, which is implemented through the lower-level IR, TIR.

I just want to point out that there is also the alternative explained in the VTA documentation. It is an example of using the TVM functionality end-to-end. It is also the way you would need to go if you want to use TVM’s autotuner.

I do not want to use autotuning at this point, as my accelerator only supports one version of each operation, one data layout, and one quantization scheme, but I will keep it in mind :slight_smile:

I do not even want to use the runtime, to be honest…

An advantage of integrating with the runtime is that you can make use of its heterogeneous execution capabilities to offload operators that otherwise wouldn’t run on your accelerator to some other device (e.g. a CPU). This can be particularly useful for things like object detection networks, which are mostly convolutions but may contain a custom operator that is difficult to accelerate.
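
As a rough sketch of that flow (assuming annotation rules for a target called "test_dla" are registered, as attempted further down in this thread): everything the annotation does not claim stays in the main module and is compiled for the CPU by the normal build.

import tvm
from tvm import relay
from tvm.relay import transform

x = relay.var("x", shape=(10, 10), dtype="int8")
y = relay.var("y", shape=(10, 10), dtype="int8")
mod = tvm.IRModule.from_expr(relay.Function([x, y], relay.add(x, y)))

mod = transform.AnnotateTarget("test_dla")(mod)  # mark the ops the DLA supports
mod = transform.PartitionGraph()(mod)            # split them into external functions

# Unclaimed operators are compiled for the CPU as usual; building also requires
# a codegen registered as "relay.ext.test_dla" for the offloaded functions.
lib = relay.build(mod, target="llvm")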

So your accelerator is able to store even the largest intermediate tensor of the specific network you are trying to compile? What happens if you want to run another network and this condition no longer holds?

Sure, auto-tuning is mostly used to find fast operator implementations, but to some degree it can also be used to find a workload split (tiling) that actually fits on your DLA and is also fast.
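
For illustration, a hedged sketch of what such a split looks like in the TE scheduling language (the tile factor 32 is a placeholder, not a recommendation for any particular SRAM size); these split factors are exactly the kind of knob AutoTVM can search over.

import tvm
from tvm import te

N = 1024
A = te.placeholder((N, N), dtype="int8", name="A")
B = te.compute((N, N), lambda i, j: A[i, j] + tvm.tir.const(1, "int8"), name="B")

s = te.create_schedule(B.op)
io, ii = s[B].split(B.op.axis[0], factor=32)  # tile rows so one block fits in SRAM
jo, ji = s[B].split(B.op.axis[1], factor=32)  # tile columns
s[B].reorder(io, jo, ii, ji)                  # iterate block by block
print(tvm.lower(s, [A, B], simple_mode=True))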

As a starting point, the blog post we wrote about BYOC may help: https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm

Thank you, the blog post is really helpful!

I have started to look into the graph annotation for my accelerator, but I am quite confused by attrs and args and by what types are passed around.

And a couple of points I could not really figure out after reading the blog post:

Optimization passes:

  • how to enforce a quantization scheme that fits my accelerator
  • how to enforce my data format (NHWC)
  • can I implement tiling of operations that do not fit into my memory as a pass in Relay?

Memory Management/DMA:

  • my accelerator uses an internal SRAM and all functional blocks can only access this cache; can I add nodes to my graph to handle that?

  • how to enforce a quantization scheme that fits my accelerator

If the current TVM QNN flow fits your accelerator you can directly use it and take a quantized model as the input of your codegen; otherwise you could bring your own quantization. This RFC includes a feature to collect calibration data for you to perform custom quantization.
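
As one concrete example of obtaining an int8 Relay graph inside TVM, the built-in automatic quantization pass can be configured roughly as below; this is only a sketch, and the qconfig values are assumptions about what an int8 accelerator might want rather than settings from this thread.

import tvm
from tvm import relay

def quantize_for_dla(mod, params):
    # Quantize a float model to int8 inputs/weights with int32 accumulation.
    with relay.quantize.qconfig(nbit_input=8,
                                nbit_weight=8,
                                dtype_input="int8",
                                dtype_weight="int8",
                                dtype_activation="int32"):
        return relay.quantize.quantize(mod, params)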

  • how to enforce my data format (NHWC)

Approach 1: You could use the ConvertLayout pass to convert the layout of the entire model to NHWC.
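
A small sketch of this approach; the op-to-layout mapping below only covers conv2d and is an assumption, so extend it to whatever your DLA needs.

import tvm
from tvm.relay import transform

def to_nhwc(mod):
    """Convert the convolutions of a module to NHWC before annotation/partitioning."""
    seq = tvm.transform.Sequential([
        transform.ConvertLayout({"nn.conv2d": ["NHWC", "default"]}),
        transform.InferType(),
    ])
    with tvm.transform.PassContext(opt_level=3):
        return seq(mod)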

Approach 2: You could include the layout transform in your codegen. This is more general but adds more overhead.

  • can I implement tiling of operations that do not fit into my memory as a pass in Relay?

Not exactly. Relay is a graph-level IR which doesn’t include schedule information such as tiling. BYOC offloads Relay subgraphs to your codegen, which means you are in charge of the scheduling. As you can see from the blog post, your codegen will just get a subgraph like conv2d - add - relu - conv2d and you need to generate the corresponding code for your accelerator.

  • my accelerator uses an internal SRAM and all functional blocks can only access this cache; can I add nodes to my graph to handle that?

You cannot change the Relay graph to add hardware-specific nodes in your codegen, but you can customize the generated subgraph code to achieve this goal. For example, Relay offloads subgraph1 from the graph conv2d - subgraph1(conv2d - add - relu - conv2d) - dense to your codegen, and your codegen may generate a JSON representation for subgraph1 with cache_read - conv2d - add - relu - conv2d - cache_write. Unlike the Relay graph, this JSON representation can be completely customized, because the whole graph is still conv2d - subgraph1 - dense from TVM/Relay’s point of view.
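
Purely as an illustration of that idea, the emitted description for subgraph1 could look something like the following Python dict before serialization; every field and node name here is hypothetical and not a TVM format.

subgraph1 = {
    "name": "test_dla_subgraph1",
    "nodes": [
        {"op": "cache_read",  "dst": "sram"},   # DMA inputs/weights into SRAM
        {"op": "conv2d",      "dtype": "int8"},
        {"op": "add",         "dtype": "int8"},
        {"op": "relu",        "dtype": "int8"},
        {"op": "conv2d",      "dtype": "int8"},
        {"op": "cache_write", "src": "sram"},   # DMA the result back out
    ],
}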

Thank you very much for your support! I was finally able to set up the FFI in my environment, but I still do not understand the difference between attrs and args in the DNNL and ACL examples at python/tvm/relay/op/contrib.

I suspect attrs are the attributes of the Relay ops that I want to annotate? But what are args?

I would say your understanding of attrs is correct: they are the parameters used to specialize the Relay operators. args are the actual numerical inputs/outputs expected by the Relay operator; a simple example in the case of a conv2d would be the IFM, kernel, and OFM tensors.
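
A small sketch of how the two typically appear in an annotation helper; the "target.test_dla" attribute name follows the convention used by the contrib integrations and is an assumption here, as is the NHWC check.

import tvm

@tvm.ir.register_op_attr("nn.conv2d", "target.test_dla")
def _conv2d(attrs, args):
    """attrs: operator parameters; args: the input expressions of the call."""
    if attrs.data_layout != "NHWC":     # operator parameter (attrs)
        return False
    data = args[0].checked_type         # type of the IFM tensor (args)
    kernel = args[1].checked_type       # type of the weights (args)
    return data.dtype == "int8" and kernel.dtype == "int8"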

Thank you :slight_smile:

I have another problem:

The graph annotation has been defined (by adding my own version at python/tvm/relay/op/contrib/test_dla.py).

But it seems that this is not enough to get the annotation going, as

mod_t = transform.AnnotateTarget("test_dla")(mod)

followed by

mod_t = transform.PartitionGraph()(mod_t)

results in no change to the representation.

What did I miss to enable my specific annotation?

The graph in question looks like this:

def @main(%x: Tensor[(10, 10), int8], %y: Tensor[(10, 10), int8]) -> Tensor[(10, 10), int8] {
  %0 = multiply(%y, %y) /* ty=Tensor[(10, 10), int8] */;
  %1 = add(%x, %x) /* ty=Tensor[(10, 10), int8] */;
  subtract(%0, %1) /* ty=Tensor[(10, 10), int8] */
}

And my annotations (I tried replacing qnn.add with add, and the same for subtract, but maybe I have forgotten to register my annotation somewhere?):

@tvm.ir.register_op_attr("qnn.add", target_name)
def add(attr, args):
    ''' check if tensor addition is supported by DLA'''
    typ = args[0].checked_type

    if typ.dtype != "int8":
        return False

    #TODO: how to test for equal shapes?
    return True

@tvm.ir.register_op_attr("qnn.subtract", target_name)
def sub(attr, args):
    ''' check if tensor subtraction is supported by DLA'''
    typ = args[0].checked_type

    if typ.dtype != "int8":
        return False

    #TODO: how to test for equal shapes?
    return True