Tuning MobileNetV3 TF Lite model on arm_cpu

I am trying to tune a MobileNetV3-minimalistic TF Lite model with TVM, targeting arm_cpu, and I have run into some issues.

1) Configuration fallback

I measured the inference time of the TF Lite model with and without AutoTVM (a sketch of how I time the TVM build follows the numbers below), and the AutoTVM result is worse than the baseline.

  • Inference latency with TF Lite Benchmark : 26.7 ms (0.07 ms)
  • Inference latency with autoTVM : 28.62 ms (2.49 ms)
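
For context, this is roughly how I time the TVM build on the device; a minimal sketch, assuming graph, lib, params, and the device context ctx already exist (the number/repeat values are arbitrary):

import numpy as np
from tvm.contrib import graph_runtime

# Assumes `graph`, `lib`, `params`, and a TVM context `ctx` already exist.
module = graph_runtime.create(graph, lib, ctx)
module.set_input(**params)

# time_evaluator runs the whole graph repeatedly and returns per-repeat timings.
ftimer = module.module.time_evaluator("run", ctx, number=10, repeat=30)
prof_res = np.array(ftimer().results) * 1000  # seconds -> milliseconds
print("Mean inference time: %.2f ms (std %.2f ms)"
      % (np.mean(prof_res), np.std(prof_res)))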

I can see lots of warning logs about fallback configurations, like the ones below. I think these fallbacks cause the performance regression.

Cannot find config for target=llvm -device=arm_cpu -model=snapdragon835 -target=aarch64-linux-android, workload=('conv2d', (1, 3, 225, 225, 'float32'), (16, 3, 3, 3, 'float32'), (2, 2), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=snapdragon835 -target=aarch64-linux-android, workload=('depthwise_conv2d_nchw', (1, 16, 114, 114, 'float32'), (16, 1, 3, 3, 'float32'), (1, 1), (0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=snapdragon835 -target=aarch64-linux-android, workload=('conv2d', (1, 16, 112, 112, 'float32'), (16, 16, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=snapdragon835 -target=aarch64-linux-android, workload=('conv2d', (1, 16, 112, 112, 'float32'), (64, 16, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=snapdragon835 -target=aarch64-linux-android, workload=('depthwise_conv2d_nchw', (1, 64, 113, 113, 'float32'), (64, 1, 3, 3, 'float32'), (2, 2), (0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.

Since my device uses a Snapdragon 855, I used the same target as pixel2 (the Pixel 2 also has a Snapdragon), as shown below. I also tried other targets (like rk3399), but it still falls back. Did I pick the wrong target? If so, what is the proper target for compiling a TF Lite model on arm_cpu?

target = tvm.target.create(
    'llvm -device=arm_cpu -model=snapdragon835 -target=aarch64-linux-android')

2) n_trial stops taking effect after the first few tasks.

[Task  1/36]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (5/5) | 10.23 s Done.
[Task  2/36]  Current/Best:    1.17/   1.17 GFLOPS | Progress: (5/5) | 11.56 s Done.
[Task  3/36]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (5/5) | 10.30 s Done.
[Task  4/36]  Current/Best:    0.49/   0.49 GFLOPS | Progress: (1/1) | 0.62 s Done.
[Task  5/36]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (1/1) | 10.19 s Done.
[Task  6/36]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (1/1) | 0.20 s Done.
[Task  7/36]  Current/Best:    0.29/   0.29 GFLOPS | Progress: (1/1) | 0.98 s Done.
[Task  8/36]  Current/Best:    2.34/   2.34 GFLOPS | Progress: (1/1) | 9.23 s Done.
[Task  9/36]  Current/Best:    0.85/   0.85 GFLOPS | Progress: (1/1) | 0.71 s Done.
[Task 10/36]  Current/Best:    6.45/   6.45 GFLOPS | Progress: (1/1) | 1.84 s Done.
[Task 11/36]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (1/1) | 0.15 s Done.
[Task 12/36]  Current/Best:    0.96/   0.96 GFLOPS | Progress: (1/1) | 0.75 s Done.
[Task 13/36]  Current/Best:    2.01/   2.01 GFLOPS | Progress: (1/1) | 2.87 s Done.
[Task 14/36]  Current/Best:    1.10/   1.10 GFLOPS | Progress: (1/1) | 1.09 s Done.
[Task 15/36]  Current/Best:    0.51/   0.51 GFLOPS | Progress: (1/1) | 0.56 s Done.
[Task 16/36]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (1/1) | 0.29 s Done.
[Task 17/36]  Current/Best:    1.56/   1.56 GFLOPS | Progress: (1/1) | 1.52 s Done.
[Task 18/36]  Current/Best:    0.51/   0.51 GFLOPS | Progress: (1/1) | 0.49 s Done.
[Task 19/36]  Current/Best:    1.15/   1.15 GFLOPS | Progress: (1/1) | 5.11 s Done.
[Task 20/36]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (1/1) | 0.15 s Done.
[Task 21/36]  Current/Best:    0.55/   0.55 GFLOPS | Progress: (1/1) | 0.54 s Done.
[Task 22/36]  Current/Best:    4.15/   4.15 GFLOPS | Progress: (1/1) | 1.23 s Done.
[Task 23/36]  Current/Best:    1.50/   1.50 GFLOPS | Progress: (1/1) | 1.46 s Done.
[Task 24/36]  Current/Best:    0.69/   0.69 GFLOPS | Progress: (1/1) | 0.67 s Done.
[Task 25/36]  Current/Best:    1.26/   1.26 GFLOPS | Progress: (1/1) | 0.92 s Done.
[Task 26/36]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (1/1) | 0.11 s Done.
[Task 27/36]  Current/Best:    0.37/   0.37 GFLOPS | Progress: (1/1) | 0.55 s Done.
[Task 28/36]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (1/1) | 0.17 s Done.
[Task 29/36]  Current/Best:    1.01/   1.01 GFLOPS | Progress: (1/1) | 0.77 s Done.
[Task 30/36]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (1/1) | 0.08 s Done.
[Task 31/36]  Current/Best:    1.22/   1.22 GFLOPS | Progress: (1/1) | 0.73 s Done.
[Task 32/36]  Current/Best:    1.64/   1.64 GFLOPS | Progress: (1/1) | 0.77 s Done.
[Task 33/36]  Current/Best:    2.70/   2.70 GFLOPS | Progress: (1/1) | 2.52 s Done.
[Task 34/36]  Current/Best:    1.39/   1.39 GFLOPS | Progress: (1/1) | 0.97 s Done.
[Task 35/36]  Current/Best:    1.42/   1.42 GFLOPS | Progress: (1/1) | 0.59 s Done.
[Task 36/36]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (1/1) | 0.19 s Done.

I set n_trial in tuning_option to 5 just to check that the code runs without error. But as you can see in the tuning progress above, n_trial drops to 1 after Task 3. I found this line in the tuner:

n_trial = min(n_trial, len(tsk.config_space))

The length of tsk.config_space is 1, which clamps n_trial to 1. Maybe this is a similar problem to Q1, but what makes len(tsk.config_space) equal to 1?
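
For anyone reproducing this, here is a minimal sketch of how the tasks and their config-space sizes can be inspected; it assumes mod and params come from relay.frontend.from_tflite() and target is defined as in Q1, and the op list is just an example:

from tvm import autotvm, relay

# Extract the tuning tasks from the Relay module.
tasks = autotvm.task.extract_from_program(
    mod['main'], target=target, params=params,
    ops=(relay.op.nn.conv2d,))

for i, tsk in enumerate(tasks):
    # A config space of length 1 means there is nothing to tune:
    # the tuner will clamp n_trial to 1 for that task.
    print("Task %2d: %s, config space size = %d"
          % (i, tsk.name, len(tsk.config_space)))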

3) Layout Problem

The TF Lite model uses the NHWC data layout. I verified that the module returned from from_tflite() has data_layout="NHWC", as shown below.

%0 = nn.pad(%input, pad_width=[[0, 0], [0, 1], [0, 1], [0, 0]]) /* ty=Tensor[(1, 225, 225, 3), float32] */;
%1 = nn.conv2d(%0, %v_param_1, strides=[2, 2], channels=16, kernel_size=[3, 3], data_layout="NHWC", kernel_layout="HWIO") /* ty=Tensor[(1, 112, 112, 16), float32] */;
%2 = nn.bias_add(%1, %v_param_2, axis=3) /* ty=Tensor[(1, 112, 112, 16), float32] */;

But as you can see from the fallback log in Q1, AutoTVM tries to find configs for the NCHW layout. Is there a way to tune the ops using only the NHWC layout?

  1. The target should be llvm -device=arm_cpu -model=snapdragon835 -target=aarch64-linux-android -mcpu=kryo -mattr=+neon so that we can benefit from the performance advantages of the Kryo CPU's instructions (see the sketch after this list).

  2. Without seeing your tuning script, I can't answer this cleanly.

  3. We have had NHWC layout auto-tuning since this PR: https://github.com/apache/incubator-tvm/pull/3859. However, we currently have a Legalize pass that converts NHWC to NCHW on ARM CPU (https://github.com/apache/incubator-tvm/pull/3859#issuecomment-569221952). You could try disabling it yourself. Ideally, we shouldn't need that Legalize pass once we have AutoTVM NHWC tuning, but we need more performance data before removing it. cc: @janimesh
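
For item 1, the target would be created like this; a minimal sketch, noting that -mcpu=kryo and -mattr=+neon are forwarded to LLVM, so support depends on the LLVM version TVM was built with:

import tvm

target = tvm.target.create(
    'llvm -device=arm_cpu -model=snapdragon835 '
    '-target=aarch64-linux-android -mcpu=kryo -mattr=+neon')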

Thank you for answering. I would like to disable the Legalize pass, but can you tell me where it is applied?

Also, I found that after attempting to add depthwise_conv_nhwc in relay_integration.py and topi_integration.py, n_trial becomes 1. Since I did not implement a scheduling function for depthwise_conv2d_nhwc_arm, it seems to fall back to default_schedule. I will look into it further.

All right, it seems we now use the ConvertLayout pass in place of the original Legalize pass. Things have become a little more complicated than I expected. @janimesh implemented this pass and should know how to turn it off.

Legalize was removed in favor of AlterOpLayout in this PR - https://github.com/apache/incubator-tvm/pull/4249/files

Currently, AlterOpLayout prefers to convert the network to NCHW. The reason is that we have not done enough performance evaluation of NHWC to be convinced that it is better than NCHW. But if everybody agrees, I can clean up the NHWC-to-NCHW changes in AlterOpLayout. If one really wants NCHW layout, one can call ConvertLayout at the start.
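
A minimal sketch of that last option, assuming the ConvertLayout API from the PR that introduced the pass (it takes the desired layout as a string):

from tvm import relay

# Convert the whole network to NCHW up front, before other optimizations.
# `mod` is the Relay module returned by relay.frontend.from_tflite().
mod = relay.transform.ConvertLayout('NCHW')(mod)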

@Spiraline Please have a look at this file - topi/python/topi/arm_cpu/conv2d.py

and the changes in the PR. You will have to modify alter_op_layout. I can look into this next week, but if you need it urgently, you can give it a try: returning None inside the alter_op_layout function prevents any transformation, so an NHWC conv will stay an NHWC conv.
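
For reference, a rough sketch of a no-op alter-layout hook for arm_cpu; the exact decorator and function signature depend on your TVM revision, so treat this as a guide rather than a drop-in patch:

from topi.nn import conv2d_alter_layout

# Hypothetical override: re-register the arm_cpu hook as a no-op.
@conv2d_alter_layout.register(["arm_cpu"], override=True)
def _alter_conv2d_layout_arm(attrs, inputs, tinfos, F):
    # Returning None keeps the original op untouched,
    # so an NHWC conv2d stays an NHWC conv2d.
    return None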

We have one closed draft (https://github.com/apache/incubator-tvm-site/pull/2) showing that NHWC has better performance than NCHW on ARM CPU, and explaining the reason. However, the quantized implementation there differs from our master branch, and @jackwish has unfortunately left our team, so this work is suspended. IMO, NHWC / NCHWc is the more reasonable choice: in certain operators it is quite handy to be able to trivially vectorize the output computation, which is harder with the NCHW layout. So I will vote for NHWC for ARM CPU, but for Mali GPU (which uses the same alter_op_layout as arm_cpu) I will vote for NCHW.
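
To illustrate the vectorization point with a toy example (not from the draft; just a sketch using the TE API of that era): in NHWC the channel axis is innermost and contiguous, so the output can be vectorized directly along it, while in NCHW the innermost axis is W.

import tvm

# Toy elementwise op over an NHWC tensor.
n, h, w, c = 1, 56, 56, 64
A = tvm.placeholder((n, h, w, c), name='A')
B = tvm.compute((n, h, w, c),
                lambda nn, hh, ww, cc: A[nn, hh, ww, cc] * 2.0,
                name='B')

s = tvm.create_schedule(B.op)
_, _, _, axis_c = s[B].op.axis
s[B].vectorize(axis_c)  # channels map straight onto SIMD lanes
print(tvm.lower(s, [A, B], simple_mode=True))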

I don't have any preference. It would be good to get feedback from other people as well @thierry @yzhliu (please add more people who will be affected by this)

I think we could also cc @ajtulloch @hlu1 @jwfromm

I’m definitely in favor of allowing NHWC models on ARM CPU as I’ve found it to be the highest performing layout for some models. However, I’m not sure that it’s globally superior to NCHW. Maybe we can make altering to NCHW optional instead of mandatory and let users try both.