[BUG][ARM] Significant performance degradation of execution times between TVM revisions

Dear all, when updating to the latest master (from a rather old TVM revision), I started to experience massive execution-time degradation on ARM. Using git bisect, I traced the problem to one specific revision:

  • Last “good” commit : c4c61cb766608fb2f0fd8c9facc480a43afed3f5 (link [Fix] Fix get_valid_count flaky test for cuda (#4901))
  • First “bad” commit : 623dd2087839b76bf7950f0759d5d8746497f2b7 (link [Relay][AutoTVM] Relay op strategy (#4644))
  • Latest master (“bad”) : 38118befc0a7e8a3db87d652b30a9369abb60363 (link [ConvertLayout] Support QNN ops. (#5066))

To give a minimal example for reproducing the issue, I used the Auto-TVM tutorial for ARM.

I made some modifications to the code to adapt it to my environment and board (a dual-core ARM Cortex-A72 platform). Rather than tuning, I commented out Auto-TVM and used a plain RPC runner to keep the example minimal; the regression is still reproducible with multiple networks. Here are my modifications (nothing conspicuous, I think):

...

from tvm import rpc
...

os.environ['TVM_NDK_CC'] = "/home/shared/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc"
... 

use_android = True
...

# Simply commenting out Auto-TVM
# tasks = autotvm.task.extract_from_program(mod["main"], target=target,
#                                           params=params,
#                                           ops=(relay.op.get("nn.conv2d"),))

# # run tuning tasks
# print("Tuning...")
# tune_tasks(tasks, **tuning_opt)

...

# Replacing the tracker with a simple rpc runner
# remote = autotvm.measure.request_remote(device_key, '0.0.0.0', 6999,          
remote = rpc.connect('192.168.1.3', 6999)

...

# Increasing measurement precision
ftimer = module.module.time_evaluator("run", ctx, number=10, repeat=10)
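For reference, with number=10 and repeat=10 the time evaluator runs 10 batches of 10 inferences each and returns one mean latency per batch; the reported mean and standard deviation are then computed across those per-batch means. A minimal pure-Python sketch of that computation (the latency values below are hypothetical, not measured):

```python
import statistics

# Hypothetical per-repeat mean latencies in seconds, as a time evaluator
# would return them: one entry per `repeat`, each already averaged over
# `number` runs.
results = [0.2208, 0.2215, 0.2203, 0.2211, 0.2209,
           0.2214, 0.2206, 0.2210, 0.2207, 0.2212]

# Aggregate across repeats, converting seconds to milliseconds.
mean_ms = statistics.mean(results) * 1000.0
std_ms = statistics.stdev(results) * 1000.0
print("Mean inference time (std dev): %.2f ms (%.2f ms)" % (mean_ms, std_ms))
```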

Next, I ran inference for several of the networks available in the tutorial and measured the runtime: first with the “good” revision of TVM, then after fetching the latest master and rebuilding TVM. In between I did NOT touch the tutorial code. Here are the runtimes:

################
"Good" revision: 
################
resnet-18:
Mean inference time (std dev): 221 ms (0.45 ms)

VGG-16:
Mean inference time (std dev): 1800.93 ms (3.43 ms)

mobilenet:
Mean inference time (std dev): 81.63 ms (0.39 ms)

squeezenet_v1.1:
Mean inference time (std dev): 81.78 ms (0.34 ms)

inception_v3:
Mean inference time (std dev): 1112.22 ms (2.27 ms)

##############
Latest master: 
##############
resnet-18:
Mean inference time (std dev): 1425.33 ms (6.12 ms)

VGG-16:
Mean inference time (std dev): 5268.69 ms (15.55 ms)

mobilenet:
Mean inference time (std dev): 187.90 ms (1.09 ms)

squeezenet_v1.1:
Mean inference time (std dev): 108.64 ms (1.04 ms)

inception_v3:
Mean inference time (std dev): 1500.82 ms (6.01 ms)

As you can see, there is a significant performance degradation, reproducible solely by switching to another revision of TVM. Since I don’t know the TVM code base very well, I’m puzzled about what went wrong in that particular “bad” commit. Maybe the developers of that revision, @kevinthesun @haichen, could help clarify the issue? Or maybe I’m doing something wrong?

Any kind of help and ideas are very much appreciated! Thank you & Best regards!

Thanks for reporting the issue. Could you share the compilation log using the master?

Hello! Thanks for the quick reaction!

First, here is my cmake configuration file (link might expire).

Next, this is my cmake config log (link might expire).

And here is my ‘make VERBOSE=1’ compilation log (link might expire).

Anything suspicious?

Thanks a lot for looking into it! Best regards, Robert

Hi Robert, thanks for the response. I’d like to see the log of the autotvm tutorial scripts, not the TVM compilation log. Could you share that?

Hello! Sorry for the misunderstanding, here is the log:

Extract tasks...
Compile...
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 3, 224, 224), 'float32'), ('TENSOR', (64, 3, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 3, 224, 224), 'float32'), ('TENSOR', (64, 3, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 64, 224, 224), 'float32'), ('TENSOR', (64, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 64, 224, 224), 'float32'), ('TENSOR', (64, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 64, 112, 112), 'float32'), ('TENSOR', (128, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 64, 112, 112), 'float32'), ('TENSOR', (128, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 128, 112, 112), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 128, 112, 112), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 128, 56, 56), 'float32'), ('TENSOR', (256, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 128, 56, 56), 'float32'), ('TENSOR', (256, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 256, 56, 56), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 256, 56, 56), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 256, 28, 28), 'float32'), ('TENSOR', (512, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 256, 28, 28), 'float32'), ('TENSOR', (512, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 512, 28, 28), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 512, 28, 28), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 512, 14, 14), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 512, 14, 14), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense_nopack.x86', ('TENSOR', (1, 4096), 'float32'), ('TENSOR', (1000, 4096), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense_nopack.x86', ('TENSOR', (1, 4096), 'float32'), ('TENSOR', (4096, 4096), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense_nopack.x86', ('TENSOR', (1, 25088), 'float32'), ('TENSOR', (4096, 25088), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
Upload...
Evaluate inference time cost...
Mean inference time (std dev): 5231.24 ms (28.77 ms)

The warnings arise because the default (fallback) schedule is used instead of Auto-TVM-tuned configurations, I think. They also appear with the “working” version of TVM.

Best regards, Robert

Just to be on the safe side, here is the log for the “working” revision. Same application as before.

Extract tasks...
Compile...
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d', (1, 3, 224, 224, 'float32'), (64, 3, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d', (1, 64, 224, 224, 'float32'), (64, 64, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d', (1, 64, 112, 112, 'float32'), (128, 64, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d', (1, 128, 112, 112, 'float32'), (128, 128, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d', (1, 128, 56, 56, 'float32'), (256, 128, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d', (1, 256, 56, 56, 'float32'), (256, 256, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d', (1, 256, 28, 28, 'float32'), (512, 256, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d', (1, 512, 28, 28, 'float32'), (512, 512, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d', (1, 512, 14, 14, 'float32'), (512, 512, 3, 3, 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense', (1, 4096, 'float32'), (1000, 4096, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense', (1, 4096, 'float32'), (4096, 4096, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense', (1, 25088, 'float32'), (4096, 25088, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
Upload...
Evaluate inference time cost...
Mean inference time (std dev): 1807.57 ms (7.68 ms)

Cheers, Rob

One quick question: did you auto-tune your model? You mentioned that you ran the auto-tuning script, and after tuning the model there should not be so many missing-config warnings in the log.

Hello!

No, I did not. If you scroll up a little to my original post, you can see that everything Auto-TVM-related is commented out. I’m just (mis)using the script to perform inference with a chosen network.

Could you apply this patch and see if it fixes the performance issue?

diff --git a/python/tvm/relay/op/strategy/arm_cpu.py b/python/tvm/relay/op/strategy/arm_cpu.py
index 0945f5179..bbb86d8e6 100644
--- a/python/tvm/relay/op/strategy/arm_cpu.py
+++ b/python/tvm/relay/op/strategy/arm_cpu.py
@@ -67,7 +67,7 @@ def conv2d_strategy_arm_cpu(attrs, inputs, out_type, target):
                         wrap_compute_conv2d(topi.arm_cpu.conv2d_nchw_winograd),
                         wrap_topi_schedule(topi.arm_cpu.schedule_conv2d_nchw_winograd),
                         name="conv2d_nchw_winograd.arm_cpu",
-                        plevel=15)
+                        plevel=5)
                     if "nnpack" in target.libs and pt == 1 and pb == 1 and pl == 1 and pr == 1:
                         strategy.add_implementation(
                             wrap_compute_conv2d(topi.arm_cpu.conv2d_nchw_winograd_nnpack),

Hello! Wow! The runtimes dropped back to their previous values!

Extract tasks...
Compile...
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 3, 224, 224), 'float32'), ('TENSOR', (64, 3, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 3, 224, 224), 'float32'), ('TENSOR', (64, 3, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 64, 224, 224), 'float32'), ('TENSOR', (64, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 64, 224, 224), 'float32'), ('TENSOR', (64, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 64, 112, 112), 'float32'), ('TENSOR', (128, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 64, 112, 112), 'float32'), ('TENSOR', (128, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 128, 112, 112), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 128, 112, 112), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 128, 56, 56), 'float32'), ('TENSOR', (256, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 128, 56, 56), 'float32'), ('TENSOR', (256, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 256, 56, 56), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 256, 56, 56), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 256, 28, 28), 'float32'), ('TENSOR', (512, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 256, 28, 28), 'float32'), ('TENSOR', (512, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 512, 28, 28), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 512, 28, 28), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 512, 14, 14), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 512, 14, 14), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense_nopack.x86', ('TENSOR', (1, 4096), 'float32'), ('TENSOR', (1000, 4096), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense_nopack.x86', ('TENSOR', (1, 4096), 'float32'), ('TENSOR', (4096, 4096), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense_nopack.x86', ('TENSOR', (1, 25088), 'float32'), ('TENSOR', (4096, 25088), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
Upload...
Evaluate inference time cost...
Mean inference time (std dev): 1785.88 ms (5.61 ms)

Could you please tell me more about this black magic? :slight_smile: Also: is this a bug? Is it some kind of incompatibility between TVM’s settings and my ARM board? How should we proceed, @haichen?

Thanks for looking into this & Best regards, Rob

This is because previously Relay only used the basic conv2d algorithm unless you explicitly asked it to use the Winograd algorithm. After the op strategy change, Relay prioritizes the Winograd algorithm, as Haichen pointed out: plevel=15 gives the Winograd algorithm a higher priority than the basic algorithm (plevel=10). As a result, the performance regression you observed occurred because the TOPI schedule for the Winograd algorithm is less efficient than the one for the basic conv2d algorithm.

We assigned the Winograd algorithm a higher priority because we expected it to bring higher performance in general, but it turns out that the TOPI schedule for the Winograd algorithm still needs further improvement.
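The effect of the one-line diff can be illustrated with a toy model (this is a pure-Python sketch, not TVM’s actual implementation): when no tuned AutoTVM config exists, the strategy effectively falls back to the implementation with the highest plevel, so demoting Winograd from 15 to 5 makes the basic spatial-pack schedule win again.

```python
# Toy model of plevel-based implementation selection. The implementation
# names match the ones in the logs above; the selection logic here is a
# simplified stand-in for Relay's op strategy.

def pick_implementation(implementations):
    """Return the name of the implementation with the highest plevel."""
    return max(implementations, key=lambda impl: impl["plevel"])["name"]

# Before the patch: Winograd (plevel=15) outranks spatial pack (plevel=10).
before = [
    {"name": "conv2d_nchw_spatial_pack.arm_cpu", "plevel": 10},
    {"name": "conv2d_nchw_winograd.arm_cpu", "plevel": 15},
]

# After the patch: Winograd is demoted to plevel=5, so the basic
# spatial-pack schedule is chosen again.
after = [
    {"name": "conv2d_nchw_spatial_pack.arm_cpu", "plevel": 10},
    {"name": "conv2d_nchw_winograd.arm_cpu", "plevel": 5},
]

print(pick_implementation(before))  # conv2d_nchw_winograd.arm_cpu
print(pick_implementation(after))   # conv2d_nchw_spatial_pack.arm_cpu
```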


Thank you @comaniac for the thorough explanation! My last question (also to @haichen): how should we proceed from here?

  1. Bug scenario: Since I was using the ‘stock’ tutorial example with the latest master, the runtime degradation might be considered a bug in TVM, requiring the prioritization of the Winograd algorithm to be lowered (as Haichen proposed in the diff) until it is fixed.

  2. Unsupported-feature scenario: Since there is currently a limitation in the Winograd algorithm, it should be avoided at the user level. Can this be done explicitly in the tutorial code? If so, the TVM source code would remain unmodified, and the tutorial code would need an update (again, until Winograd is fixed).

What do you think? Best regards, Robert

@haichen’s PR fixing this has been merged (https://github.com/apache/incubator-tvm/issues/5118), so your first point has been addressed.

Oh wow! That is great!! Thank you very much @haichen and @comaniac! The TVM community is simply amazing :slight_smile:

Cheers, Rob