Arm CPU performance is much slower than Mali GPU


I am currently trying to run VGG-16 inference on an Arm CPU.

import tvm
import tvm.relay as relay
from tvm.contrib import graph_runtime
import numpy as np
import topi
from tvm.relay.testing.temp_op_attr import TempOpAttr

target_arm_cpu = 'llvm -device=arm_cpu -target=aarch64-linux-gnu'
ctx_arm_cpu =  tvm.runtime.cpu()
batch_size = 1
num_class = 1000
image_shape = (3, 224, 224)
data_shape = (batch_size,) + image_shape
out_shape = (batch_size, num_class)
mod, paramsO = relay.testing.vgg.get_workload(
    num_layers=16, batch_size=batch_size, image_shape=image_shape)
opt_level = 3

with relay.build_config(opt_level = opt_level):
    graph, lib, params = relay.build(mod, target_arm_cpu, params=paramsO)

data = tvm.nd.array( np.random.uniform(-1, 1, size=data_shape ).astype("float32") , ctx_arm_cpu )
module = graph_runtime.create(graph, lib, ctx_arm_cpu)
module.set_input("data", data)

timer = module.module.time_evaluator('run',ctx_arm_cpu,number=1,repeat=2)
prof_res = np.array( timer().results )*1000
print("arm CPU -> Mean inference time (std dev): %.2f ms (%.2f ms)" %(np.mean(prof_res), np.std(prof_res)))

When I run the above code, the result is

arm CPU -> Mean inference time (std dev): 1954.49 ms (0.57 ms)

I remember that with an older version of TVM, VGG-16 inference was measured at about 1000 ms.

So the performance seems to have dropped by about a factor of two. Have I misunderstood something, or is my code incorrect?
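One thing that may be worth ruling out before comparing versions is measurement noise: with number=1 and repeat=2, the statistics come from only two single runs. Below is a minimal sketch of the same measurement pattern with warm-up runs and more repeats, written in plain Python so it runs without a device. The helpers `benchmark`, `summarize` and `slowdown` are hypothetical names for illustration, not TVM APIs, and the lambda is only a stand-in for the real `module.run` call.

```python
import statistics
import time


def benchmark(fn, warmup=3, number=5, repeat=10):
    """Return per-call times in ms, one value per repeat.

    Follows the number/repeat convention of time_evaluator: each of the
    `repeat` measurements is the average over `number` back-to-back calls.
    """
    for _ in range(warmup):  # discard warm-up runs (caches, lazy init)
        fn()
    results_ms = []
    for _ in range(repeat):
        start = time.perf_counter()
        for _ in range(number):
            fn()
        elapsed = time.perf_counter() - start
        results_ms.append(elapsed / number * 1000.0)
    return results_ms


def summarize(results_ms):
    """Mean and population standard deviation (same default as np.std)."""
    return statistics.mean(results_ms), statistics.pstdev(results_ms)


def slowdown(new_ms, old_ms):
    """How many times slower the new measurement is than the old one."""
    return new_ms / old_ms


if __name__ == "__main__":
    # Hypothetical stand-in workload; with TVM this would be module.run.
    times = benchmark(lambda: sum(range(10000)))
    mean_ms, std_ms = summarize(times)
    print("Mean inference time (std dev): %.2f ms (%.2f ms)" % (mean_ms, std_ms))
    # Ratio of the numbers quoted in this post (1954.49 ms vs. about 1000 ms).
    print("Slowdown: %.2fx" % slowdown(1954.49, 1000.0))
```

If the roughly 2x gap is still there with more repeats, it is likely a real regression and not a timing artifact.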


Could this be caused by this previous issue:

I was experiencing a similar slowdown on ARM CPUs, which was traced back to the limited Winograd algorithm…

Cheers Robert

Thank you, Robert! This is really useful information.