Poor relative performance from XGBTuner

I’ve been experimenting with auto-tuning for some time now - mostly for Arm CPU. One observation is that the XGBoost tuner seems to take far longer to run than the random tuner without delivering significantly better results. This seems to be due to how long it takes to incrementally train the model.

Some indicative results are below, starting with the random tuner:


[Task  1/27]  Current/Best:    7.81/  56.98 GFLOPS | Progress: (280/1000) | 66.49 s Done.
[Task  2/27]  Current/Best:   23.88/  76.53 GFLOPS | Progress: (784/1000) | 268.12 s Done.
[Task  3/27]  Current/Best:    5.32/  80.03 GFLOPS | Progress: (392/1000) | 150.57 s Done.
[Task  4/27]  Current/Best:   45.82/  71.48 GFLOPS | Progress: (280/1000) | 59.22 s Done.
[Task  5/27]  Current/Best:   64.20/  70.92 GFLOPS | Progress: (448/1000) | 94.77 s Done.

XGB (each ‘Loaded history in’ line gives the time in seconds taken to load the history log):

[Task  1/27]  Current/Best:    8.73/  57.20 GFLOPS | Progress: (504/1000) | 391.25 s Done.
Loaded history in 100.32719016075134
[Task  2/27]  Current/Best:   13.29/  71.15 GFLOPS | Progress: (504/1000) | 466.36 s Done.
Loaded history in 225.86001634597778

---- By this point it had already taken more than twice the total time of the random tuner ----
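
To put the two logs on the same footing, here is a rough sketch that turns a progress line into seconds per trial. The helper name `seconds_per_trial` and the regular expression are only illustrative, and the sketch assumes the elapsed time printed at the end of each line covers all the trials counted in the `Progress` field.

```python
import re

# Matches the "Progress: (280/1000) | 66.49 s" part of a progress line.
PROGRESS = re.compile(r"Progress: \((\d+)/\d+\) \| ([\d.]+) s")


def seconds_per_trial(log_line):
    """Return elapsed seconds divided by completed trials for one progress line."""
    match = PROGRESS.search(log_line)
    if match is None:
        raise ValueError("not a progress line: %r" % log_line)
    trials = int(match.group(1))
    elapsed = float(match.group(2))
    return elapsed / trials


# Task 1 lines copied from the two logs above.
random_task1 = "[Task  1/27]  Current/Best:    7.81/  56.98 GFLOPS | Progress: (280/1000) | 66.49 s Done."
xgb_task1 = "[Task  1/27]  Current/Best:    8.73/  57.20 GFLOPS | Progress: (504/1000) | 391.25 s Done."

print("random: %.3f s/trial" % seconds_per_trial(random_task1))
print("xgb:    %.3f s/trial" % seconds_per_trial(xgb_task1))
```

For task 1 this works out to roughly 0.24 s per trial for the random tuner against roughly 0.78 s per trial for XGB, before counting the time spent loading history.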

For all these tests I’m tuning with min_repeat_ms=10, timeout=1, n_parallel=28, early_stopping=250 and am using two target devices.

Should I be expecting better performance from the XGBoost tuner (are there some relevant options to tweak?), or is it the case that past a certain hardware performance threshold it’s more beneficial to spend the compute time with more random trials than training the XGBoost model?

Here is my understanding, along with some thoughts.

The XGBTuner in AutoTVM uses XGBoost to train a decision-tree-like cost model, and uses that model to predict which configs are likely to perform well next. The search over the model’s predictions is based on simulated annealing. Not surprisingly, the training phase dominates the XGBTuner execution time, which is why XGBTuner takes longer per trial than the random tuner.
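
To make the shape of that loop concrete, here is a toy sketch in plain Python. It is not the AutoTVM code: the cost model is swapped for a trivial nearest-neighbour lookup instead of XGBoost, the config space is just a range of integers, and `model_based_tune`, `measure`, and `batch_size` are made-up names. The only point is that every batch after the first needs a model refit plus an annealing search before anything is measured, which is overhead the random tuner never pays.

```python
import math
import random


def model_based_tune(measure, space_size, n_trial, batch_size=8, seed=0):
    """Toy model-based tuner; returns (best_config, best_score, number_of_refits)."""
    rng = random.Random(seed)
    n_trial = min(n_trial, space_size)  # cannot measure more configs than exist
    measured = {}                       # config index -> measured score (higher is better)
    refits = 0

    def predict(idx):
        # Stand-in cost model: the score of the nearest measured config.
        nearest = min(measured, key=lambda m: abs(m - idx))
        return measured[nearest]

    def anneal(start, steps=50, temp=1.0):
        # Simulated annealing over config indices, maximizing the predicted score.
        current = start
        for step in range(steps):
            t = temp * (1.0 - step / steps) + 1e-9
            cand = (current + rng.choice([-2, -1, 1, 2])) % space_size
            delta = predict(cand) - predict(current)
            if delta >= 0 or rng.random() < math.exp(delta / t):
                current = cand
        return current

    while len(measured) < n_trial:
        want = min(batch_size, n_trial - len(measured))
        proposals = []
        if measured:
            refits += 1  # a real model-based tuner would retrain its model here
            proposals = [anneal(rng.randrange(space_size)) for _ in range(want)]
        # Keep only new configs, then top up with random unmeasured ones.
        batch = []
        for idx in proposals:
            if idx not in measured and idx not in batch:
                batch.append(idx)
        while len(batch) < want:
            idx = rng.randrange(space_size)
            if idx not in measured and idx not in batch:
                batch.append(idx)
        for idx in batch:
            measured[idx] = measure(idx)

    best = max(measured, key=measured.get)
    return best, measured[best], refits


# Hypothetical "hardware": config 70 is the fastest.
best, score, refits = model_based_tune(lambda i: -abs(i - 70), space_size=100, n_trial=40)
print(best, score, refits)
```

With 40 trials in batches of 8, the model is refit four times. If each refit is expensive relative to a measurement, the refits dominate the wall-clock time, which matches what the logs in this thread suggest.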

On the other hand, the simulated-annealing-based approach is known to learn slowly, so the improvement over random search may not be obvious when the number of trials is small. For example, we usually set the number of trials to 3,000 or even 4,000 when tuning on an NVIDIA V100.

In addition, there’s one thing you can try: transfer learning. You may have noticed that you can provide a previous tuning log to XGBTuner to enable transfer learning. This mechanism works not only for the same task but also for different tasks with the same op. In other words, the tuning log for conv task 1 can also be used as training data for conv task 2. Although I haven’t systematically analyzed the benefit of this approach, it should more or less facilitate the learning process.
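
For reference, a minimal sketch of how a previous log is usually fed to the tuner, following the pattern in the TVM tuning tutorials. The wrapper names `tune_with_history` and `trial_budget` are made up for this example, `measure_option` is assumed to be built elsewhere with `autotvm.measure_option(...)`, and the exact `autotvm` signatures may differ between TVM versions, so treat this as an assumption rather than the definitive API.

```python
import os


def trial_budget(n_trial, config_space_size):
    """Cap the requested number of trials at the size of the task's config space."""
    return min(n_trial, config_space_size)


def tune_with_history(task, measure_option, log_file, n_trial=1000, early_stopping=250):
    """Tune one AutoTVM task, seeding XGBTuner from log_file if it already exists."""
    # Imported lazily so trial_budget stays usable without TVM installed.
    from tvm import autotvm

    tuner = autotvm.tuner.XGBTuner(task)
    if os.path.isfile(log_file):
        # Transfer learning: earlier records become training data for the cost model.
        tuner.load_history(autotvm.record.load_from_file(log_file))

    budget = trial_budget(n_trial, len(task.config_space))
    tuner.tune(
        n_trial=budget,
        early_stopping=early_stopping,
        measure_option=measure_option,
        callbacks=[
            autotvm.callback.progress_bar(budget),
            autotvm.callback.log_to_file(log_file),
        ],
    )
```

Because every task appends to the same `log_file`, the history passed to `load_history` grows as tuning proceeds through the task list.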

I have been using transfer learning (sorry, I forgot to mention that explicitly) and it’s responsible for the ‘Loaded history in x’ lines. The time to load the log grows steadily throughout tuning, until loading the history takes about as long as tuning a task. I could try running for more iterations, but it’s very difficult to justify tuning taking almost a day per network.

I saw your selective tuning PR, the results of which seem to imply that there’s probably a subset of ‘good’ configs which come up repeatedly, rather than each workload arriving at a substantively different optimal config. Is there scope for using that idea to do a single large tuning session across a range of representative workloads? Those results could then be used similarly to TopHub but with the workloads mapped by similarity. I think it would be much easier to justify the cost of the tuning if we knew that once it was done no further tuning was required for the operator.

Unfortunately, the current AutoTVM implementation doesn’t have such a mechanism, and that’s one reason I did that PR. However, we may want to re-plan the whole thing to make AutoTVM more general and reasonable, so I am not working on that PR for now.

I also found that XGBTuner may not generate stable results. In one run it found a 231.61 GFLOPS solution, but in another it only reached 96.00 GFLOPS.

This is true. I have also run into unstable results.