Here is my code, tested on an i7 CPU with the function built for the llvm target.
import numpy as np
import tvm
from time import time

w = 1280
h = 720
c = 3
target = 'llvm'
a_np = 255. * np.random.uniform(size=(h, w, c)).astype(np.float32)
ctx = tvm.context(target, 0)

tic = time()
for i in range(10):
    a_tvm = tvm.nd.array(a_np, ctx=ctx)
print('np to nd: %.4f ms' % ((time() - tic) * 1000 / 10))
# func(a_tvm, d_tvm)
It prints:
np to nd: 1.9820 ms
Since the llvm target still runs on the CPU, no host-to-device transfer should be needed, so this time is much higher than I expected (compare with the cuda target, which does require a CPU-to-GPU copy). For reference, some small models finish their entire computation in less than 2 ms, so this conversion alone can dominate the runtime.
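For a rough baseline of where the ~2 ms might go, here is a pure-NumPy sketch (no TVM involved; timings are machine-dependent). It compares allocating a fresh ~11 MB buffer and copying into it on every iteration, which is roughly what calling `tvm.nd.array` in a loop does, against preallocating one buffer and only copying:

```python
import numpy as np
from time import perf_counter

h, w, c = 720, 1280, 3
a_np = 255. * np.random.uniform(size=(h, w, c)).astype(np.float32)

# Fresh allocation + copy each iteration (roughly what repeated
# tvm.nd.array calls do on the host side).
tic = perf_counter()
for _ in range(10):
    fresh = np.array(a_np)  # allocate ~11 MB and memcpy
alloc_ms = (perf_counter() - tic) * 1000 / 10

# Preallocate once, then only copy into the existing buffer.
buf = np.empty_like(a_np)
tic = perf_counter()
for _ in range(10):
    np.copyto(buf, a_np)
copy_ms = (perf_counter() - tic) * 1000 / 10

print('alloc+copy: %.4f ms, copy only: %.4f ms' % (alloc_ms, copy_ms))
```

If the copy-only time is noticeably lower, one option (assuming the TVM version in use exposes it) is to create the NDArray once and refill it with `NDArray.copyfrom(a_np)` inside the loop instead of calling `tvm.nd.array` each time.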
Are there any suggestions?