Autotuned inference time is consistently worse than untuned

I have a model where the autotuned inference time is consistently worse than the untuned time on ARM devices.

For example, when run untuned, this ONNX model takes 4376 ms on one of my ARM devices; the autotuned build takes 4667 ms.

When autotuning this model, I use the XGBoost tuner with knob features. The tuned time stays around this value or worse regardless of the number of trials (2000, 5000, 10000); it is always at least 150 ms slower than untuned.
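For reference, here is a minimal sketch of the tuning setup I am describing. The target string, RPC device key, and tracker host/port are placeholders for my actual setup, and the log file name is arbitrary:

```python
import onnx
import tvm
from tvm import relay, autotvm
from tvm.autotvm.tuner import XGBTuner

# Load the ONNX model and convert it to Relay.
onnx_model = onnx.load("model.onnx")
shape_dict = {"input_tens": (1, 3, 224, 224)}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Placeholder ARM target string.
target = "llvm -device=arm_cpu -mtriple=aarch64-linux-gnu"

# Extract tunable tasks from the model.
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

# Measure on the ARM device over RPC (key/host/port are placeholders).
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.RPCRunner("my_arm_device", host="0.0.0.0", port=9190,
                             number=10, repeat=1, timeout=10),
)

for i, task in enumerate(tasks):
    # "XGBoost with knobs": the XGBoost cost model on knob features.
    tuner = XGBTuner(task, loss_type="rank", feature_type="knob")
    tuner.tune(
        n_trial=min(2000, len(task.config_space)),  # also tried 5000 and 10000
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("tuning.log")],
    )
```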

I have not found any other models that show this behavior with the same tuning parameters and device target strings; they always improve.

This has been tested with older and newer versions of TVM.

The model is available here as an ONNX file. It takes a float32 input of shape [1, 3, 224, 224], and the input tensor is named input_tens.
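This is roughly how I compile and time both builds (a sketch; the target string and the tuning log path are placeholders, and the input is random data of the stated shape):

```python
import numpy as np
import onnx
import tvm
from tvm import relay, autotvm
from tvm.contrib import graph_executor

onnx_model = onnx.load("model.onnx")
shape_dict = {"input_tens": (1, 3, 224, 224)}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

target = "llvm -device=arm_cpu -mtriple=aarch64-linux-gnu"  # placeholder

# Untuned build; the tuned build is identical but wrapped in
# `with autotvm.apply_history_best("tuning.log"):`.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input_tens", np.random.rand(1, 3, 224, 224).astype("float32"))

# Average wall-clock time over several runs, reported in ms.
ftimer = module.module.time_evaluator("run", dev, number=1, repeat=10)
times_ms = np.array(ftimer().results) * 1000
print(f"Mean inference time: {times_ms.mean():.1f} ms")
```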

Any idea why this might be happening?