How to retarget TVM to a new ASIC chip as a device code generator?


#1

We have developed a compiler for a new AI processor. The compiler front end is based on Clang, and we added a new LLVM backend for this ASIC.

The ASIC has two on-chip memories (implemented as scratchpad memory, i.e. SPM) to buffer the neural network input, output, and synapse data; we call them NRAM (Neural Data RAM) and WRAM (Weight Data RAM). It also provides a large off-chip memory to hold input and output data.

The ASIC also provides some powerful instructions for common NN operations, e.g. convolution, pooling, activation, matmul, transpose, MLP, etc. It further provides variable-length (not fixed-size) vector instructions, e.g. vector addition, subtraction, less-than, greater-than, etc., as well as IO instructions to load data from off-chip memory into on-chip memory (NRAM or WRAM) and store on-chip results back to off-chip memory.

The high-level programming language we designed is similar to CUDA; it is an extension of a subset of C/C++. It provides memory space attributes to represent the location of data, e.g. __nram__ means the data is buffered in NRAM and __wram__ means the data is buffered in WRAM. Besides that, the compiler provides a series of intrinsic functions, similar to AVX intrinsics on x86: e.g. __sv_add is lowered into a vector addition by the LLVM backend, while __conv is lowered into the convolution instruction.

Let’s take an AXPY-style kernel as an example; the device code is shown below:

#define LEN 1024
__global__ void AxpyKernel(half *a, half *x, half *y, half *out) {
    // On-chip scratchpad buffers in NRAM.
    __nram__ half t_a[LEN];
    __nram__ half t_x[LEN];
    __nram__ half t_y[LEN];
    __nram__ half t_out[LEN];
    __load(t_a, a, LEN * sizeof(half));      // DMA: off-chip -> NRAM
    __load(t_x, x, LEN * sizeof(half));
    __sv_mul(t_out, t_a, t_x, LEN);          // t_out = a * x (elementwise)
    __load(t_y, y, LEN * sizeof(half));
    __sv_add(t_out, t_out, t_y, LEN);        // t_out = a * x + y
    __store(t_out, out, LEN * sizeof(half)); // DMA: NRAM -> off-chip
}

The above kernel is a basic implementation for a small data scale. However, if the input size exceeds the on-chip buffer size, a highly tuned implementation requires loop tiling, vectorization, loop unrolling, double buffering to overlap IO and computation, data prefetching, etc. Manual optimization may take weeks to complete. Moreover, implementing a more complex NN operator, e.g. ROI pooling or proposal in Faster R-CNN, or some newly emerging operator, takes even more engineering effort.
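To make the tiling part concrete, here is a minimal pure-Python sketch of how the AXPY kernel above would be restructured once the input no longer fits in NRAM. Lists stand in for off-chip memory, slicing stands in for the __load/__store DMA, and the per-tile arithmetic stands in for __sv_mul/__sv_add; the names and the tiny TILE size are illustrative, not the real device API.

```python
# Loop-tiling sketch for AXPY when the input exceeds the on-chip buffer.
# Each iteration stages one NRAM-sized tile: load, compute, store.

TILE = 4  # NRAM capacity in elements (tiny here; e.g. 1024 on the real chip)

def axpy_tiled(a, x, y):
    n = len(a)
    out = [0] * n
    for base in range(0, n, TILE):                    # one iteration per tile
        end = min(base + TILE, n)                     # ragged last tile
        t_a = a[base:end]                             # __load(t_a, a + base, ...)
        t_x = x[base:end]                             # __load(t_x, x + base, ...)
        t_out = [ai * xi for ai, xi in zip(t_a, t_x)] # __sv_mul
        t_y = y[base:end]                             # __load(t_y, y + base, ...)
        t_out = [oi + yi for oi, yi in zip(t_out, t_y)]  # __sv_add
        out[base:end] = t_out                         # __store(t_out, out + base, ...)
    return out

print(axpy_tiled([1, 2, 3, 4, 5], [10] * 5, [1] * 5))
# -> [11, 21, 31, 41, 51]
```

Double buffering would then split NRAM into two halves and issue the loads for tile i+1 while computing on tile i, which is exactly the kind of transformation that is tedious to write by hand for every operator.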

Therefore, we want to retarget TVM to our ASIC. As a first step, we prefer to use TVM as an in-house code generator to emit device kernel code. I have already skimmed TVM’s implementation; much of it is about GPU support.

If I port TVM to our ASIC from scratch, much time will be spent on runtime support. But I only want to use TVM as an in-house code generator that emits device kernel source code containing these intrinsic functions, or even LLVM IR or LLVM backend assembly. So do you guys have any constructive suggestions?


#2

This is like the scenario of Huawei’s Davinci chip (CCE target), whose code generator emits source much like the CUDA one. Ref: [RFC] support cce target name in tvm


#3

Thanks for replying.
From GitHub issue #1963, I gather that the CCE backend support plan is still under discussion :smile:
The current git repo has nothing about the CCE backend.


#4

If you already have an LLVM backend, I’d suggest taking a look at TVM’s NVPTX and AMDGPU backends here:
https://github.com/dmlc/tvm/tree/master/src/codegen/llvm. Both use the corresponding LLVM backends to generate PTX or AMDGCN asm from TVM IR.

You also need to add runtime APIs for your target. Take a look at how they are implemented for CUDA and AMDGPU.


#5

@masahi Yes, thanks for your reply. But I think GPUs are different from our ASIC. In our compiler model, the on-chip memories, e.g. NRAM and WRAM, are scratchpad memories, which means they are managed manually by the programmer. How the on-chip buffers are reused and how the vector intrinsic functions are used directly affects performance, while a GPU has no such vector intrinsics.


#6

OK, then I suggest taking a look at the tensorize primitive and the VTA tutorial. I think these are the closest to what you are looking for.
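For what it's worth, the core idea of tensorize can be shown without TVM itself: the schedule splits a loop to match the intrinsic's shape, and the inner fixed-shape loop is then replaced wholesale by one call to the hardware intrinsic. A pure-Python sketch of that transformation (sv_add below is a stand-in mimicking the __sv_add intrinsic from post #1; names are illustrative):

```python
# Conceptual sketch of what tensorize does: after splitting the axis by
# the intrinsic width, the inner loop body is swapped for a single
# intrinsic call. sv_add stands in for the ASIC's __sv_add; on the real
# target the code generator would emit the hardware instruction instead.

VLEN = 4  # vector length the intrinsic handles per call (illustrative)

def sv_add(dst, a, b, base, n):
    """Stand-in for the __sv_add intrinsic: dst[base:base+n] = a + b."""
    for i in range(n):
        dst[base + i] = a[base + i] + b[base + i]

def vector_add_scalar(a, b):
    """Naive schedule: one scalar add per loop iteration."""
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

def vector_add_tensorized(a, b):
    """After split + tensorize: outer loop over tiles, inner loop
    collapsed into one intrinsic call of width VLEN."""
    n = len(a)
    out = [0] * n
    for base in range(0, n, VLEN):
        sv_add(out, a, b, base, min(VLEN, n - base))
    return out

# Both schedules compute the same result.
assert vector_add_tensorized([1, 2, 3, 4, 5], [9, 8, 7, 6, 5]) == \
       vector_add_scalar([1, 2, 3, 4, 5], [9, 8, 7, 6, 5])
```

In TVM proper, you would declare the intrinsic's compute rule and its lowering (an extern call to your __sv_add) as a tensor intrinsic, split the axis by VLEN, and tensorize the inner axis; the exact declaration API has moved between TVM releases, and the VTA tutorial walks through the same pattern for its GEMM core.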


#7

@masahi Okay, thanks a lot!