[Autotuner] Incorrect result after tuning MobileNetV2 on ARM CPU

Hi, I am tuning a gesture recognition model based on MobileNetV2, and I find that after tuning with the XGBTuner the final inference result is incorrect. I used the following tuning config:

tuning_option = {
    'log_filename': log_file,
    'tuner': 'xgb',
    'n_trial': 500,
    'early_stopping': 200,

    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(timeout=10),
        runner=autotvm.LocalRunner(number=10, repeat=2, timeout=4, min_repeat_ms=150),
    ),
}
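
For reference, these options are consumed roughly as in the standard AutoTVM tutorials (a minimal sketch, not the exact code from the linked script; mod, params and target are assumed to come from the converted MobileNetV2 model):

# Minimal sketch of how tuning_option is consumed (AutoTVM-tutorial style).
# `mod`, `params`, and `target` are placeholders, not copied from the linked script.
from tvm import autotvm, relay
from tvm.autotvm.tuner import XGBTuner

tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

for i, task in enumerate(tasks):
    tuner = XGBTuner(task, loss_type="rank")
    n_trial = min(tuning_option["n_trial"], len(task.config_space))
    tuner.tune(
        n_trial=n_trial,
        early_stopping=tuning_option["early_stopping"],
        measure_option=tuning_option["measure_option"],
        callbacks=[
            autotvm.callback.progress_bar(n_trial, prefix="[Task %2d/%2d] " % (i + 1, len(tasks))),
            autotvm.callback.log_to_file(tuning_option["log_filename"]),
        ],
    )

# Compile with the best records from the tuning log.
with autotvm.apply_history_best(tuning_option["log_filename"]):
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(mod, target=target, params=params)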

I narrowed the issue down to the following config record:

{"input": ["llvm -device=arm_cpu -target=aarch64-linux-gnu", "depthwise_conv2d_nchw_spatial_pack.arm_cpu", [["TENSOR", [1, 576, 14, 14], "float32"], ["TENSOR", [576, 1, 3, 3], "float32"], [2, 2], [1, 1, 1, 1], [1, 1], "float32"], {}], "config": {"index": 2204455, "code_hash": null, "entity": [["tile_co", "sp", [-1, 2]], ["tile_oh", "sp", [-1, 1]], ["tile_ow", "sp", [-1, 7]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 8, 6, 7]], ["reorder_1", "re", [0, 1, 2, 3, 4, 6, 5]], ["ann_reduce", "an", ["none", "unroll"]], ["ann_spatial", "an", ["unroll", "unroll", "vec"]], ["data_pad_inline", "ot", 4], ["data_vec_inline", "ot", 2], ["conv_inline", "ot", 2]]}, "result": [[6.674246947122407e-05, 6.392590855974291e-05], 0, 2.052468776702881, 1585008564.3448634], "version": 0.2, "tvm_version": "0.7.dev1"}

The inference result is correct again if the above record is removed from the log.
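
For anyone hitting the same thing: the AutoTVM log stores one JSON record per line, so the bad record can be filtered out before compiling. A small ad-hoc sketch (filenames are placeholders, not from the linked script):

# Drop the offending depthwise record (config index 2204455) from the AutoTVM log.
# The log stores one JSON record per line; the filenames below are placeholders.
import json

bad_workload = "depthwise_conv2d_nchw_spatial_pack.arm_cpu"
bad_index = 2204455

with open("mobilenetv2.log") as src, open("mobilenetv2.filtered.log", "w") as dst:
    for line in src:
        rec = json.loads(line)
        if rec["input"][1] == bad_workload and rec["config"]["index"] == bad_index:
            continue  # skip the record that leads to wrong results
        dst.write(line)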

What might be the issue?

Target device: NVIDIA Jetson TX2, ARM CPU

Tuning code : https://github.com/Ragavendrams/temporal-shift-module/blob/master/online_demo/tune_relay.py

I have met this issue before. The problem should be in the compute_at knob of the depthwise schedule: some complex compute_at location combinations produce incorrect results. However, this happens very rarely and only on some hardware platforms. If you switch to another LLVM version or retune that layer, the problem may disappear as well, at least that was the case for my previous issue.

@Rms45 Could you try the latest TVM (you will need to tune the model again)? We have just committed a PR that disables this schedule for now. However, your performance may be downgraded a little.

@FrozenGene: Thank you. It seems to work now, though the performance is affected, as you pointed out: I can only get about 8 GFLOPS for the depthwise tasks and an overall speed of 8 FPS for the whole application. The author of the gesture recognition model claims to have achieved around 25 FPS on the ARM CPU (with TVM).

  1. Do you know anything else I can try to reach this value?
  2. I also noticed that I was able to reach 30-40 GFLOPS easily for these tasks in earlier TVM versions (0.6, I think). Do you know why there is a sudden decrease now?
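
(In case it matters: FPS numbers like the one above are typically taken with TVM's time_evaluator, along the lines of the sketch below; graph, lib, params come from relay.build, and input_name / input_shape are placeholders rather than my exact measurement code.)

# Sketch of a typical end-to-end timing with time_evaluator (0.6/0.7-era graph_runtime API).
# `graph`, `lib`, `params` come from relay.build; `input_name`/`input_shape` are placeholders.
import numpy as np
import tvm
from tvm.contrib import graph_runtime

ctx = tvm.cpu(0)
module = graph_runtime.create(graph, lib, ctx)
module.set_input(**params)
module.set_input(input_name, np.random.uniform(size=input_shape).astype("float32"))

ftimer = module.module.time_evaluator("run", ctx, number=10, repeat=3)
times_ms = np.array(ftimer().results) * 1000  # per-inference time in milliseconds
print("Mean inference: %.2f ms (~%.1f FPS)" % (times_ms.mean(), 1000.0 / times_ms.mean()))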

Thanks again for the help.

I think you could add '-mcpu=cortex-a57' to your target.
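
Something like this, i.e. the same target string as in your log with -mcpu added (newer TVM spells the triple flag -mtriple instead of -target):

# The target from the log, with -mcpu added so LLVM generates code for the TX2's Cortex-A57 cores.
# (Newer TVM versions use -mtriple=aarch64-linux-gnu instead of -target=aarch64-linux-gnu.)
target = "llvm -device=arm_cpu -target=aarch64-linux-gnu -mcpu=cortex-a57"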

0.6 should be using the same depthwise schedule as the latest master does now.

I think you could contact the author and ask how to reproduce the claimed performance (and with which version of TVM).