How to realize device-specific memory allocation

Hi,

There is an application scenario on my side. We want to do zero-copy memory access to guarantee efficiency. In this case, the I/O memory allocated in the user app should be used directly by the target device at runtime.

But the target device requires that the I/O memory be physically contiguous. With ctx=kDLCPU, the allocated memory is not physically contiguous.

So I wonder whether there is a way for me to achieve zero copy with my specific target device?
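To illustrate what zero copy means here: the user buffer is wrapped in a tensor view so the device reads the very same memory the app wrote, with no intermediate copy. The struct below is a simplified stand-in mirroring part of DLPack's `DLTensor` layout so the sketch is self-contained; real code would include `dlpack/dlpack.h` instead:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for DLPack's DLTensor (real code would use
// dlpack/dlpack.h); shown here only to make the sketch self-contained.
struct DLDeviceLite { int device_type; int device_id; };
struct DLTensorLite {
  void* data;          // borrowed pointer -- no copy is made
  DLDeviceLite device;
  int32_t ndim;
  int64_t* shape;
  uint64_t byte_offset;
};

// Wrap an existing user buffer as a 1-D tensor view (zero copy).
DLTensorLite WrapAsTensor(void* user_buf, int64_t* shape) {
  DLTensorLite t{};
  t.data = user_buf;             // device and app share this memory
  t.device = {1 /* kDLCPU */, 0};
  t.ndim = 1;
  t.shape = shape;
  t.byte_offset = 0;
  return t;
}
```

The catch, as described above, is that `user_buf` must come from an allocator that returns physically contiguous memory, or the device cannot consume it.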

Options I can think of:

  1. Add support for allocating physically contiguous memory with device ctx=kDLCPU.
  2. Define a specific device type derived from DeviceAPI and do the memory alloc/free management according to what the target device requires.
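Option 2 might look roughly like the sketch below: an allocator class shaped after TVM's DeviceAPI alloc/free hooks whose implementation routes to the platform's CMA pool. This is only a hypothetical sketch; `posix_memalign` stands in for the real CMA allocator (e.g. the libcma calls on PYNQ), which is what would actually provide physical contiguity:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical sketch of option 2: an allocator with the same shape as
// TVM's DeviceAPI AllocDataSpace/FreeDataSpace hooks. posix_memalign is
// only a host-side stand-in; a real implementation would call the
// platform's CMA allocator, which guarantees physical contiguity.
class CMADeviceAPI {
 public:
  void* AllocDataSpace(size_t nbytes, size_t alignment) {
    void* ptr = nullptr;
    // Real code: ptr = cma_alloc(nbytes);  (platform CMA allocator)
    if (posix_memalign(&ptr, alignment, nbytes) != 0) return nullptr;
    return ptr;
  }
  void FreeDataSpace(void* ptr) {
    // Real code: cma_free(ptr);
    free(ptr);
  }
};
```

The point of keeping the DeviceAPI shape is that the runtime can then route all allocations for this device type through the CMA pool without the user app changing.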

Please help suggest.

Actually we need CMA (contiguous memory allocation) for my specific target. Is it currently supported in TVM?

IMHO, the second option seems more reasonable semantically. In this case, you have full control over memory.

cc @zhiics @haichen @tqchen

The second option looks better to me as well. But my concern is that we may end up with many such devices/targets. I am not sure what the best way is to register and maintain them (downstream vs. upstream).

As far as I can see, both the PYNQ and De10nano devices require CMA. CMA is a common requirement for targets that run an embedded Linux OS. Do you plan to support such devices with CMA? If there were a device type that supports CMA, I think we could use that.

I understand your concern. If there were a generic device type that supports CMA (for example, DeviceCMA), we could just use that device type and would not have to register MySpecificDevice in TVM.

That way, other devices that require CMA could also use this DeviceCMA.

@tqchen Can you comment on this? Thanks.

Device-pinned memory should be distinct from normal memory and thus belong to a different runtime device type code, similar to CPUPinned in the CUDA case.

@tqchen, Thanks for your comment.

Shall I take CPUPinned as an example and define a CPUCMA device that provides generic CMA support?
And if that is doable, shall I upstream this support to TVM?

Please help suggest.

There is one more requirement for memory allocation on my specific device: the allocated memory must be 32-byte aligned. I didn't find a parameter for setting the alignment in NDArray or DLTensor.

So I think one generic CPUCMA device may not be enough; we also need to consider the 32-byte alignment requirement.

I guess I have to define one specific device that handles all of these memory allocation requirements.
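On the alignment point: even though NDArray/DLTensor do not expose an alignment parameter, a device-specific allocator can simply enforce 32-byte alignment itself on every allocation. A minimal self-contained sketch, with `std::aligned_alloc` standing in for the device's real allocator (the helper names here are hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Stand-in for a device allocator that must always return 32-byte-aligned
// buffers. std::aligned_alloc requires the size to be a multiple of the
// alignment, so round the request up first.
void* AllocAligned32(size_t nbytes) {
  size_t rounded = (nbytes + 31) & ~static_cast<size_t>(31);
  return std::aligned_alloc(32, rounded);
}

// Check the alignment of a returned pointer.
bool IsAligned32(const void* p) {
  return reinterpret_cast<uintptr_t>(p) % 32 == 0;
}
```

Because the alignment is baked into the allocator rather than passed per call, callers that only see an NDArray still get correctly aligned buffers.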

Would you help comment?