NumPy array to TVM NDArray conversion is slow

Here is my code, tested on the CPU of an i7, with func built for the llvm target.

    import numpy as np
    import tvm
    from time import time

    w = 1280
    h = 720
    c = 3
    target = 'llvm'
    a_np = 255. * np.random.uniform(size=(h, w, c)).astype(np.float32)
    ctx = tvm.context(target, 0)
    tic = time()
    for i in range(10):
        a_tvm = tvm.nd.array(a_np, ctx=ctx)  # copies the NumPy buffer into a TVM NDArray
    print('np to nd: %.4f ms' % ((time() - tic) * 1000 / 10))
    # func(a_tvm, d_tvm)

It prints:

    np to nd: 1.9820 ms

Since llvm still runs on the CPU, this time is much higher than I expected (compare with the cuda target, which needs a CPU-to-GPU copy), and some small models compute in less than 2 ms.

Are there any suggestions?

If you really care about copy performance, you might want to use zero-copy from NumPy to a TVM NDArray.

Thanks for the reply! Would you mind giving some tips on how to do zero copy to TVM? I didn't find an API for it. @junrushao1994

There is tvm.nd.numpyasarray, which returns a TVMArray. I must admit I don't fully understand the lifetime handling involved in making an NDArray from it.

Note that 10 iterations is really not that many, and I get a significantly faster time per loop when I do 100. Just to see what the speed should be like, going through PyTorch's zero-copy conversion and DLPack:

    import torch
    import torch.utils.dlpack

    a_pt = torch.from_numpy(a_np)              # zero-copy view of the NumPy buffer
    a_dl = torch.utils.dlpack.to_dlpack(a_pt)  # wrap it as a DLPack capsule
    a_tvm = tvm.nd.from_dlpack(a_dl)           # zero-copy import into TVM

This is about 90x as fast as copying for me.
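For reference, here is a minimal self-contained sketch that times both paths side by side. It assumes the older TVM API used above (tvm.context and the ctx argument), and the absolute numbers will of course vary by machine:

    import numpy as np
    import torch
    import torch.utils.dlpack
    import tvm
    from time import time

    a_np = 255. * np.random.uniform(size=(720, 1280, 3)).astype(np.float32)
    ctx = tvm.context('llvm', 0)
    n = 100

    tic = time()
    for _ in range(n):
        a_tvm = tvm.nd.array(a_np, ctx=ctx)        # full copy into a TVM NDArray
    print('copy:      %.4f ms' % ((time() - tic) * 1000 / n))

    tic = time()
    for _ in range(n):
        a_pt = torch.from_numpy(a_np)              # zero-copy view
        a_dl = torch.utils.dlpack.to_dlpack(a_pt)  # DLPack capsule
        a_tvm = tvm.nd.from_dlpack(a_dl)           # zero-copy into TVM
    print('zero-copy: %.4f ms' % ((time() - tic) * 1000 / n))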


@t-vi Thanks a lot!!! This really helps! Now the data conversion time is almost negligible.

BTW, I tested tvm.nd.numpyasarray and found that what it returns cannot be used directly, even though it is fast, as shown below:

    >>> y = tvm.nd.array(x)
    >>> type(y)
    <class 'tvm.runtime.ndarray.NDArray'>
    >>> z = tvm.nd.numpyasarray(x)
    >>> type(z)
    <class 'tuple'>
    >>> z
    (<tvm._ffi.runtime_ctypes.TVMArray object at 0x7fef341a3400>, <tvm._ffi.base.c_long_Array_1 object at 0x7fedf3624510>)

So only the latter solution can be used. I really wish there were a TVM-native API for fast zero-copy conversion, instead of relying on torch and DLPack!
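For now I just wrap the detour in a small helper (a sketch only; np_to_tvm is a made-up name, and it assumes PyTorch is installed):

    import torch
    import torch.utils.dlpack
    import tvm

    def np_to_tvm(a_np):
        # Zero-copy NumPy -> TVM NDArray via PyTorch and DLPack.
        # The NumPy array must stay alive as long as the returned NDArray is used.
        a_pt = torch.from_numpy(a_np)
        return tvm.nd.from_dlpack(torch.utils.dlpack.to_dlpack(a_pt))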

Also, it looks like we cannot specify a ctx in the from_dlpack API, so we might need to use copyto to move the data onto the target device? Then only llvm works well in this case, and other targets would still cost a lot of time. Do you have any suggestions? Thanks a lot!
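To be concrete, I mean something like this (a sketch only, assuming a cuda-enabled build; copyto does the explicit host-to-device transfer that tvm.nd.array did implicitly before):

    a_tvm = tvm.nd.from_dlpack(torch.utils.dlpack.to_dlpack(torch.from_numpy(a_np)))
    a_gpu = a_tvm.copyto(tvm.gpu(0))  # explicit host-to-device copy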

So PyTorch thinks there is only one CPU, but for GPUs the tensor necessarily stays on the device it is on (I hope they count the devices in the same way). Someone more knowledgeable than I am would have to chime in on whether tvm.nd.numpyasarray is the right function or whether there should be something that gives an NDArray. But again, I think the trickiest part is memory ownership.
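One quick way to see why ownership matters: the DLPack route really does share memory, so the NumPy array has to outlive the NDArray. A small sketch (continuing the snippets above):

    import numpy as np
    import torch
    import torch.utils.dlpack
    import tvm

    a_np = np.zeros((2, 2), dtype=np.float32)
    a_tvm = tvm.nd.from_dlpack(torch.utils.dlpack.to_dlpack(torch.from_numpy(a_np)))

    a_np[0, 0] = 42.0             # write through NumPy
    print(a_tvm.asnumpy()[0, 0])  # prints 42.0 -- same underlying buffer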