I am trying to tune the MobileNetV3-minimalistic TF Lite model with TVM, targeting arm_cpu, and I have run into some issues.
1) Configuration fallback
I measured the inference latency of the TF Lite model with and without autoTVM, and the result with autoTVM is worse than without it.
- Inference latency with TF Lite Benchmark : 26.7 ms (0.07 ms)
- Inference latency with autoTVM : 28.62ms (2.49 ms)
I see many warnings about fallback configurations, like the ones below. I suspect these fallbacks cause the performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=snapdragon835 -target=aarch64-linux-android, workload=('conv2d', (1, 3, 225, 225, 'float32'), (16, 3, 3, 3, 'float32'), (2, 2), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=snapdragon835 -target=aarch64-linux-android, workload=('depthwise_conv2d_nchw', (1, 16, 114, 114, 'float32'), (16, 1, 3, 3, 'float32'), (1, 1), (0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=snapdragon835 -target=aarch64-linux-android, workload=('conv2d', (1, 16, 112, 112, 'float32'), (16, 16, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=snapdragon835 -target=aarch64-linux-android, workload=('conv2d', (1, 16, 112, 112, 'float32'), (64, 16, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=snapdragon835 -target=aarch64-linux-android, workload=('depthwise_conv2d_nchw', (1, 64, 113, 113, 'float32'), (64, 1, 3, 3, 'float32'), (2, 2), (0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Since I use a snapdragon855, I use the same target as for pixel2 (which also has a snapdragon), as shown below. I also tried other targets (such as rk3399), but it still falls back. Did I specify the wrong target? If so, what is the proper target for compiling a TF Lite model for arm_cpu?
target = tvm.target.create(
'llvm -device=arm_cpu -model=snapdragon835 -target=aarch64-linux-android')
2) n_trial is no longer respected after the first few tasks
[Task 1/36] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (5/5) | 10.23 s Done.
[Task 2/36] Current/Best: 1.17/ 1.17 GFLOPS | Progress: (5/5) | 11.56 s Done.
[Task 3/36] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (5/5) | 10.30 s Done.
[Task 4/36] Current/Best: 0.49/ 0.49 GFLOPS | Progress: (1/1) | 0.62 s Done.
[Task 5/36] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (1/1) | 10.19 s Done.
[Task 6/36] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (1/1) | 0.20 s Done.
[Task 7/36] Current/Best: 0.29/ 0.29 GFLOPS | Progress: (1/1) | 0.98 s Done.
[Task 8/36] Current/Best: 2.34/ 2.34 GFLOPS | Progress: (1/1) | 9.23 s Done.
[Task 9/36] Current/Best: 0.85/ 0.85 GFLOPS | Progress: (1/1) | 0.71 s Done.
[Task 10/36] Current/Best: 6.45/ 6.45 GFLOPS | Progress: (1/1) | 1.84 s Done.
[Task 11/36] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (1/1) | 0.15 s Done.
[Task 12/36] Current/Best: 0.96/ 0.96 GFLOPS | Progress: (1/1) | 0.75 s Done.
[Task 13/36] Current/Best: 2.01/ 2.01 GFLOPS | Progress: (1/1) | 2.87 s Done.
[Task 14/36] Current/Best: 1.10/ 1.10 GFLOPS | Progress: (1/1) | 1.09 s Done.
[Task 15/36] Current/Best: 0.51/ 0.51 GFLOPS | Progress: (1/1) | 0.56 s Done.
[Task 16/36] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (1/1) | 0.29 s Done.
[Task 17/36] Current/Best: 1.56/ 1.56 GFLOPS | Progress: (1/1) | 1.52 s Done.
[Task 18/36] Current/Best: 0.51/ 0.51 GFLOPS | Progress: (1/1) | 0.49 s Done.
[Task 19/36] Current/Best: 1.15/ 1.15 GFLOPS | Progress: (1/1) | 5.11 s Done.
[Task 20/36] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (1/1) | 0.15 s Done.
[Task 21/36] Current/Best: 0.55/ 0.55 GFLOPS | Progress: (1/1) | 0.54 s Done.
[Task 22/36] Current/Best: 4.15/ 4.15 GFLOPS | Progress: (1/1) | 1.23 s Done.
[Task 23/36] Current/Best: 1.50/ 1.50 GFLOPS | Progress: (1/1) | 1.46 s Done.
[Task 24/36] Current/Best: 0.69/ 0.69 GFLOPS | Progress: (1/1) | 0.67 s Done.
[Task 25/36] Current/Best: 1.26/ 1.26 GFLOPS | Progress: (1/1) | 0.92 s Done.
[Task 26/36] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (1/1) | 0.11 s Done.
[Task 27/36] Current/Best: 0.37/ 0.37 GFLOPS | Progress: (1/1) | 0.55 s Done.
[Task 28/36] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (1/1) | 0.17 s Done.
[Task 29/36] Current/Best: 1.01/ 1.01 GFLOPS | Progress: (1/1) | 0.77 s Done.
[Task 30/36] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (1/1) | 0.08 s Done.
[Task 31/36] Current/Best: 1.22/ 1.22 GFLOPS | Progress: (1/1) | 0.73 s Done.
[Task 32/36] Current/Best: 1.64/ 1.64 GFLOPS | Progress: (1/1) | 0.77 s Done.
[Task 33/36] Current/Best: 2.70/ 2.70 GFLOPS | Progress: (1/1) | 2.52 s Done.
[Task 34/36] Current/Best: 1.39/ 1.39 GFLOPS | Progress: (1/1) | 0.97 s Done.
[Task 35/36] Current/Best: 1.42/ 1.42 GFLOPS | Progress: (1/1) | 0.59 s Done.
[Task 36/36] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (1/1) | 0.19 s Done.
I set n_trial in tuning_option to 5 just to check that the code runs without errors. But as the tuning progress shows, n_trial becomes 1 after Task 3. I found this line:
n_trial = min(n_trial, len(tsk.config_space))
The length of tsk.config_space is 1, which makes n_trial 1. This may be related to Q1, but what makes tsk.config_space have only one entry?
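To illustrate how that line can produce the progress log above, here is a minimal sketch in plain Python (no TVM involved). It assumes the line runs inside the per-task loop and reassigns n_trial itself. The function names and the config-space sizes are hypothetical, chosen only to mimic a single task with a one-entry space.

```python
def trials_with_reassignment(n_trial, space_sizes):
    """Clamp by overwriting n_trial itself, as in the line quoted above."""
    used = []
    for size in space_sizes:
        # Once a task with a tiny space is seen, the smaller value
        # carries over to every later task.
        n_trial = min(n_trial, size)
        used.append(n_trial)
    return used


def trials_with_local_clamp(n_trial, space_sizes):
    """Clamp into a per-task variable so the original budget is kept."""
    used = []
    for size in space_sizes:
        task_trials = min(n_trial, size)
        used.append(task_trials)
    return used


# Hypothetical config-space sizes: only the fourth task has a single entry.
sizes = [100, 100, 100, 1, 100, 100]
print(trials_with_reassignment(5, sizes))  # [5, 5, 5, 1, 1, 1]
print(trials_with_local_clamp(5, sizes))   # [5, 5, 5, 1, 5, 5]
```

If this assumption holds for the tuning loop, a single task with a one-entry config space would be enough to pin every later task to one trial. It would not explain why that one task has a config space of length 1 in the first place.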
3) Layout Problem
The TF Lite model uses the NHWC data layout. I verified that the module returned by from_tflite() has data_layout="NHWC", as shown below.
%0 = nn.pad(%input, pad_width=[[0, 0], [0, 1], [0, 1], [0, 0]]) /* ty=Tensor[(1, 225, 225, 3), float32] */;
%1 = nn.conv2d(%0, %v_param_1, strides=[2, 2], channels=16, kernel_size=[3, 3], data_layout="NHWC", kernel_layout="HWIO") /* ty=Tensor[(1, 112, 112, 16), float32] */;
%2 = nn.bias_add(%1, %v_param_2, axis=3) /* ty=Tensor[(1, 112, 112, 16), float32] */;
But as the fallback log in Q1 shows, autoTVM looks for configs with the NCHW layout. Is there a way to tune ops using only the NHWC layout?
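As a sanity check on how the two sets of shapes relate, the sketch below permutes the NHWC shapes from the Relay snippet into NCHW/OIHW order in plain Python. The helper name permute is hypothetical. The HWIO kernel shape (3, 3, 3, 16) is inferred from kernel_size=[3, 3], the 3 input channels, and channels=16, since the type of %v_param_1 is not printed above.

```python
def permute(shape, src_layout, dst_layout):
    """Reorder a shape tuple from src_layout to dst_layout (e.g. NHWC -> NCHW)."""
    return tuple(shape[src_layout.index(axis)] for axis in dst_layout)


# Output shape of nn.pad in the Relay snippet (NHWC).
data_nhwc = (1, 225, 225, 3)
# Inferred HWIO kernel shape: 3x3 kernel, 3 input channels, 16 output channels.
kernel_hwio = (3, 3, 3, 16)

print(permute(data_nhwc, "NHWC", "NCHW"))    # (1, 3, 225, 225)
print(permute(kernel_hwio, "HWIO", "OIHW"))  # (16, 3, 3, 3)
```

Both results match the first conv2d workload in the fallback log, so the workload being looked up appears to be the NCHW form of the same NHWC convolution.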