How to retarget TVM to a new ASIC chip as a device code generator?

We have developed a compiler for a new AI processor. The compiler front end is based on Clang, and we added a new LLVM backend for this ASIC.

The ASIC has two on-chip memories (implemented as SPM, i.e. scratchpad memory) to buffer the neural network input, output, and synapse data; we call them NRAM (Neural Data RAM) and WRAM (Weight Data RAM). It also provides a large off-chip memory to hold input and output data.

The ASIC also provides some powerful instructions that implement whole NN operations, e.g. convolution, pooling, activation, matmul, transpose, MLP, etc. It also provides variable-length (not fixed-size) vector instructions, e.g. vector addition, subtraction, less-than, greater-than, etc., as well as IO instructions to load data from off-chip memory into on-chip memory (NRAM or WRAM) and store on-chip results back to off-chip memory.

The high-level programming language we designed is similar to CUDA; it is an extension of a subset of C/C++. It provides memory-space attributes to represent the location of data, e.g. __nram__ means the data is buffered in the NRAM, and __wram__ means the data is buffered in the WRAM. Besides that, the compiler provides a series of intrinsic functions, similar to AVX intrinsics on x86; e.g. __sv_add will be lowered into a vector addition by the LLVM backend, while __conv will be lowered into the convolution instruction.

Let’s take the AXPY kernel as an example; the device code is shown below:

#define LEN 1024
__global__ void AxpyKernel(half* a, half *x, half *y, half *out) {
    // on-chip NRAM buffers for the operands and the result
    __nram__ half t_a[LEN];
    __nram__ half t_x[LEN];
    __nram__ half t_y[LEN];
    __nram__ half t_out[LEN];
    __load(t_a, a, LEN * sizeof(half));      // off-chip -> NRAM
    __load(t_x, x, LEN * sizeof(half));
    __sv_mul(t_out, t_a, t_x, LEN);          // t_out = a * x (vector multiply)
    __load(t_y, y, LEN * sizeof(half));
    __sv_add(t_out, t_out, t_y, LEN);        // t_out = t_out + y (vector add)
    __store(t_out, out, LEN * sizeof(half)); // NRAM -> off-chip
}

The above kernel is the basic implementation for a small data scale. However, if the input data size exceeds the on-chip buffer size, a highly tuned implementation requires loop tiling, vectorization, loop unrolling, double buffering to overlap IO and computation, data prefetching, and so on. Manual optimization may take weeks to complete. Moreover, implementing a more complex NN operator, e.g. ROI pooling and proposal in Faster R-CNN, or some newly emerging NN operators, will take even more engineering effort.

Therefore, we want to retarget TVM to our ASIC. As a first step, we prefer to use TVM as an in-house code generator to emit device kernel code. I have already glanced over TVM’s implementation; much of it is about GPU support.

If I port TVM to our ASIC from scratch, much time will be consumed by runtime support, but I only want to use TVM as an in-house code generator that emits device kernel source code containing the intrinsic functions, or even LLVM IR or LLVM backend assembly code. So, do you have any constructive suggestions?


This is like the scenario of Huawei’s Davinci chip (the CCE target), which emits code much as the CUDA code generator does. Ref: [RFC] support cce target name in tvm


Thanks for replying.
From GitHub issue #1963, I gather that the CCE backend support plan is still under discussion :smile:
The current git repo has nothing about the CCE backend.

If you already have an LLVM backend, I’d suggest taking a look at TVM’s NVPTX and AMDGPU backends here: https://github.com/dmlc/tvm/tree/master/src/codegen/llvm. Both use the corresponding LLVM backends to generate PTX or AMDGCN assembly from TVM IR.

You also need to add a runtime API for your target. Take a look at how it is implemented for CUDA and AMDGPU.

@masahi Yes, thanks for your reply. But I think GPUs are different from our ASIC. In our compiler model, the on-chip memories, e.g. NRAM and WRAM, are scratchpad memory, which means they are managed manually by the programmer. How the on-chip buffers are reused and how the vector intrinsic functions are used directly affects performance, while GPUs have no such vector intrinsics.

OK, then I suggest taking a look at the tensorize primitive and the VTA tutorial. I think these are the closest to what you are looking for.

@masahi Okay, thanks a lot!

Hi there.
@js-lee Do you have any ideas on that problem now?
I think I’m facing similar questions when using TVM on platforms like ARM-DSP systems or ASICs. The following questions are still confusing me while coding a schedule template for an operation:

  1. How to leverage on-chip memory (like NRAM/WRAM here) with DMA data transfers.
  2. How to overlap computation and data transfer in a certain mode, e.g. double buffering or ring buffering.

TVM so far generates a very low-level, scalar-based IR; tensorize can help map it back to tensor computation, e.g. __conv in the example, while virtual_thread can help with ILP and latency hiding.
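For instance, here is a minimal tensorize sketch for the __sv_add intrinsic from the original post, written against the old dmlc/tvm Python API; the dtypes, buffer declarations, and the 1024-element vector width are assumptions for illustration:

import tvm

def intrin_sv_add(n):
    # declare the computation the hardware vector-add instruction implements
    a = tvm.placeholder((n,), dtype="float16", name="a")
    b = tvm.placeholder((n,), dtype="float16", name="b")
    c = tvm.compute((n,), lambda i: a[i] + b[i], name="c")
    Ab = tvm.decl_buffer(a.shape, a.dtype, name="Ab", offset_factor=1)
    Bb = tvm.decl_buffer(b.shape, b.dtype, name="Bb", offset_factor=1)
    Cb = tvm.decl_buffer(c.shape, c.dtype, name="Cb", offset_factor=1)

    def intrin_func(ins, outs):
        ib = tvm.ir_builder.create()
        aa, bb = ins
        cc = outs[0]
        # emit the device intrinsic; the "int32" return type is a dummy,
        # the call is evaluated only for its side effect
        ib.emit(tvm.call_extern("int32", "__sv_add",
                                cc.access_ptr("w"),
                                aa.access_ptr("r"),
                                bb.access_ptr("r"), n))
        return ib.get()

    return tvm.decl_tensor_intrin(c.op, intrin_func, binds={a: Ab, b: Bb, c: Cb})

# split the loop so the inner axis matches the vector width, then replace
# the inner loop with the single hardware instruction
n = 4096
A = tvm.placeholder((n,), dtype="float16", name="A")
B = tvm.placeholder((n,), dtype="float16", name="B")
C = tvm.compute((n,), lambda i: A[i] + B[i], name="C")
s = tvm.create_schedule(C.op)
xo, xi = s[C].split(C.op.axis[0], factor=1024)
s[C].tensorize(xi, intrin_sv_add(1024))
print(tvm.lower(s, [A, B, C], simple_mode=True))

The split factor matches the width the intrinsic was declared with, so tensorize can pattern-match the inner loop and replace it with a single call.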

For DMA transfers, you can refer to VTA to see how to generate DMA intrinsics, or you can also tensorize cache_read.
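For example, here is a rough sketch of staging an input through the on-chip scratchpad with cache_read; the "local.nram" storage scope is hypothetical, following VTA’s convention of tagged local scopes such as "local.acc_buffer":

import tvm

n = 4096
A = tvm.placeholder((n,), dtype="float16", name="A")
C = tvm.compute((n,), lambda i: A[i] * tvm.const(2, "float16"), name="C")
s = tvm.create_schedule(C.op)
xo, xi = s[C].split(C.op.axis[0], factor=1024)

# stage A through the on-chip scratchpad one tile at a time
AL = s.cache_read(A, "local.nram", [C])
s[AL].compute_at(s[C], xo)                  # load one 1024-element tile per outer step
s[AL].pragma(s[AL].op.axis[0], "dma_copy")  # tag the copy loop for a lowering pass
s[AL].double_buffer()                       # ping-pong buffers to overlap DMA and compute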

Thanks for your suggestions!!

I am learning about VTA as you suggested, and it seems that DMA tasks are processed as follows:

  1. Label the DMA task with the dma_copy pragma during scheduling.
  2. Inject the label into the DMA intrinsic function defined in the runtime via call_extern during lowering (see the sketch below).

Am I right?
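If I understand it correctly, the lowering side might look like the following sketch, modeled on VTA’s ir_pass.py; __load/__store are the intrinsics from the original post, the "local.nram" scope is again hypothetical, and dispatching on the buffer scope is my own simplification:

import tvm

# the same staging schedule as in the earlier sketch
n = 4096
A = tvm.placeholder((n,), dtype="float16", name="A")
C = tvm.compute((n,), lambda i: A[i] * tvm.const(2, "float16"), name="C")
s = tvm.create_schedule(C.op)
xo, xi = s[C].split(C.op.axis[0], factor=1024)
AL = s.cache_read(A, "local.nram", [C])
s[AL].compute_at(s[C], xo)
s[AL].pragma(s[AL].op.axis[0], "dma_copy")

# callback invoked for every copy loop tagged "dma_copy"; src and dst are Buffers
def _inject_copy(src, dst, pad_before, pad_after, pad_value):
    assert not pad_before and not pad_after   # no padding support in this sketch
    # passing the element count here; a real pass would compute the byte size
    if dst.scope == "local.nram":             # off-chip -> NRAM
        call = tvm.call_extern("int32", "__load",
                               dst.access_ptr("w"), src.access_ptr("r"), dst.shape[0])
    else:                                     # NRAM -> off-chip
        call = tvm.call_extern("int32", "__store",
                               src.access_ptr("r"), dst.access_ptr("w"), src.shape[0])
    return tvm.make.Evaluate(call)

def inject_dma_intrin(stmt):
    # rewrite every "dma_copy"-tagged copy into a call_extern
    return tvm.ir_pass.InjectCopyIntrin(stmt, "dma_copy", _inject_copy)

# run the pass during lowering (phase 1, after StorageFlatten), as VTA does
with tvm.build_config(add_lower_pass=[(1, inject_dma_intrin)]):
    print(tvm.lower(s, [A, C], simple_mode=True))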

By the way, the backend module (op decl & sch -> tvm ir -> apply ir_pass -> codegen) is really not easy for me to read, with its complex and deep class structure. I would appreciate it if you could provide some guidance on that.


@yzhliu Yes, he is right; as mentioned before, you can refer to VTA’s implementation and tensorize.


@js-lee I hope that the VTA documentation and the tensorization tutorials helped somewhat in navigating TVM support for ASICs. If you have any feedback, let us know how we could improve our documentation to make it easier for other developers who want to build a compiler backend for their AI chips.
