Use Tensorize to support new NPU

Now we have a new NPU, and we want to use tensorize to extend TVM to support it. Unfortunately, we have run into a problem. The following is the schedule after tensorize:

for (co.init, 0, 4) {
  tvm_call_packed("tvm.contrib.bmcv.memset", 0, tvm_address_of(C_buf[(co.init*16)]))
}
for (ko, 0, 4) {
  for (co, 0, 4) {
    tvm_call_packed("tvm.contrib.bmcv.gemm.forward", (uint1)0, (uint1)1, 1, 16, 16, 1.000000f, tvm_address_of(A[(ko*16)]), 0, tvm_address_of(B[((co*1024) + (ko*256))]), 0, 0.000000f, tvm_address_of(C_buf[(co*16)]), 0)
  }
}

We can see that tvm_address_of(A[(ko*16)]) applies an offset to the address of A on each loop iteration. But for our NPU, A’s address is not the NPU address: A is a struct that contains the NPU address. So A[(ko*16)] does not change the NPU address; it only changes the struct address. Is there any way to apply the offset to the address stored in A’s member instead of to A’s own address? Can anyone help me?

@yzhliu @tqchen

What do you mean by member address and struct address? Can you show the code that you have and the code you expect?

As we know, we should add a new xxx_device_api.cc file for new hardware.

The following is the AllocDataSpace function for the new hardware in my xxx_device_api.cc:

void* AllocDataSpace(TVMContext ctx,
                     size_t nbytes,
                     size_t alignment,
                     TVMType type_hint) final {
  LOG(INFO) << "############bmdnn alloc data space#############nbytes:" << nbytes;
  bm_handle_t handle = GetBmHandle(ctx.device_id);
  // The value handed back to TVM is a host-side descriptor, not a raw NPU address.
  bm_device_mem_t* pmem = new bm_device_mem_t();
  bm_status_t ret = bm_malloc_device_byte(handle, pmem, nbytes);
  if (BM_SUCCESS != ret) {
    LOG(ERROR) << "bm_malloc_device_byte failed";
    delete pmem;  // do not leak the descriptor when the device allocation fails
    return nullptr;
  }
  LOG(INFO) << "alloc data pmem address " << pmem;
  return pmem;
}

You can see that the return value is a pointer to bm_device_mem_t. bm_device_mem_t is the struct I mentioned. Its definition is as follows:

typedef struct bm_mem_desc {
	union {
		struct {
			unsigned long         device_addr;
			unsigned int         revesved;
			int         dmabuf_fd;
		} device;
		struct {
			HOST_MEM    host_mem;
			unsigned int reserved0;
			int         reserved1;
		} host;
		struct {
			void *      system_addr;
			unsigned int reserved0;
			int         reserved1;
		} system;
	} u;

	bm_mem_flags_t         flags;
	unsigned int                    size;
} bm_mem_desc_t;

typedef struct bm_mem_desc   bm_device_mem_t;

We can see that bm_device_mem_t has a member named device_addr. This device_addr is the member address I mentioned. Does that make sense? I’m sorry that my explanation was not clear.

Please point out if I misunderstand anything. If my understanding is correct, you want to use your struct to build a buffer. In traditional C++ code, we usually have int A[1024], which is an int buffer of size 1024. In your case, you want a buffer like bm_mem_desc A[1024], and you want to control a member of that struct. Is that right?

I am afraid we do not have such an API, since this is very rare in traditional hardware, even on accelerators. Can you calculate the device_addr using the struct offset? If so, you can pass Buffer.data and Buffer.elem_offset separately (see https://docs.tvm.ai/api/python/tvm.html#tvm.decl_buffer) in your packed function.

Thanks for your reply. Our new hardware has only one type of memory, so we don’t want to use a buffer in tensorize. Is there any other method to resolve my problem?

I have figured out how to resolve my problem. The BufferNode has the elem_offset member, so I can use this member directly. I was so stupid. :joy:
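
For readers who land on this thread later, here is a minimal sketch of the idea discussed above: the packed function receives the descriptor pointer and the element offset as two separate arguments, and adds the offset to the stored device address rather than to the descriptor pointer. MockDeviceMem, ResolveDeviceAddr and GemmInputAddr are hypothetical names used only for illustration; MockDeviceMem is a simplified stand-in for bm_device_mem_t, and the float element type is an assumption. This is not the actual bmcv or TVM code.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified stand-in for bm_device_mem_t (assumption: only device_addr
// and size matter for this illustration).
struct MockDeviceMem {
  unsigned long device_addr;  // address inside NPU memory
  unsigned int size;          // allocation size in bytes
};

// Hypothetical helper: NPU address of element `elem_offset` in the buffer
// described by `desc`. The descriptor pointer itself is never offset.
inline unsigned long ResolveDeviceAddr(const MockDeviceMem* desc,
                                       std::size_t elem_offset,
                                       std::size_t elem_bytes) {
  return desc->device_addr +
         static_cast<unsigned long>(elem_offset * elem_bytes);
}

// Hypothetical stand-in for part of a packed GEMM call body: the handle
// (what AllocDataSpace returned) and the element offset arrive separately.
// Elements are assumed to be 4-byte floats.
inline unsigned long GemmInputAddr(void* handle, std::int64_t elem_offset) {
  const MockDeviceMem* desc = static_cast<const MockDeviceMem*>(handle);
  return ResolveDeviceAddr(desc, static_cast<std::size_t>(elem_offset),
                           sizeof(float));
}
```

With this split, the tensorize intrinsic would pass the unshifted buffer handle together with the buffer's elem_offset, instead of tvm_address_of(A[(ko*16)]), so the descriptor stays valid on every loop iteration.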