I’ve been experimenting with auto-tuning for some time now - mostly for Arm CPUs. One observation is that the XGBoost tuner takes far longer to run than the random tuner without delivering significantly better results. This appears to be due to the time spent incrementally training the cost model.
Some indicative results are below.

Random:

```
[Task 1/27] Current/Best: 7.81/ 56.98 GFLOPS | Progress: (280/1000) | 66.49 s Done.
[Task 2/27] Current/Best: 23.88/ 76.53 GFLOPS | Progress: (784/1000) | 268.12 s Done.
[Task 3/27] Current/Best: 5.32/ 80.03 GFLOPS | Progress: (392/1000) | 150.57 s Done.
[Task 4/27] Current/Best: 45.82/ 71.48 GFLOPS | Progress: (280/1000) | 59.22 s Done.
[Task 5/27] Current/Best: 64.20/ 70.92 GFLOPS | Progress: (448/1000) | 94.77 s Done.
```
XGB (“Loaded history” is the time in seconds taken to load the history log):
```
[Task 1/27] Current/Best: 8.73/ 57.20 GFLOPS | Progress: (504/1000) | 391.25 s Done.
Loaded history in 100.32719016075134
[Task 2/27] Current/Best: 13.29/ 71.15 GFLOPS | Progress: (504/1000) | 466.36 s Done.
Loaded history in 225.86001634597778
```
---- By this point the XGBoost run had already taken more than twice the total time of the random tuner ----
For all these tests I’m tuning with `min_repeat_ms=10`, `timeout=1`, `n_parallel=28`, and `early_stopping=250`, and I’m using two target devices.
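For reference, my tuning loop looks roughly like the sketch below. This is a minimal sketch rather than my exact script: the `tasks` list comes from task extraction elsewhere (e.g. `autotvm.task.extract_from_program`), and the RPC key, tracker host/port, and log filename are placeholders.

```python
from tvm import autotvm

# Measurement options matching the settings above; the RPC key and
# tracker host/port are placeholders for my two target devices.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=1),
    runner=autotvm.RPCRunner(
        "arm-device",      # placeholder RPC key for the target boards
        host="127.0.0.1",  # placeholder tracker address
        port=9190,
        n_parallel=28,
        min_repeat_ms=10,
        timeout=1,
    ),
)

for task in tasks:  # `tasks` produced by task extraction elsewhere
    tuner = autotvm.tuner.XGBTuner(task)  # RandomTuner(task) for the random runs
    # Warm-start the XGBoost cost model from previous results; this is the
    # "Loaded history" step timed in the XGB log above. The random tuner
    # has no model, so it skips this entirely.
    tuner.load_history(autotvm.record.load_from_file("tuning.log"))
    tuner.tune(
        n_trial=1000,
        early_stopping=250,
        measure_option=measure_option,
        callbacks=[
            autotvm.callback.progress_bar(1000),
            autotvm.callback.log_to_file("tuning.log"),
        ],
    )
```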
Should I be expecting better performance from the XGBoost tuner (are there some relevant options to tweak?), or is it the case that past a certain hardware performance threshold it’s more beneficial to spend the compute time on more random trials than on training the XGBoost model?
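For context, these are the `XGBTuner` constructor options I can see that look relevant (my reading of the autotvm API; defaults may differ across TVM versions):

```python
# Hypothetical variation on the tuner construction above; values shown
# reflect my understanding of the defaults, not settings I have verified.
tuner = autotvm.tuner.XGBTuner(
    task,
    plan_size=64,            # configs planned per cost-model retraining round
    feature_type="itervar",  # "knob" and "curve" are cheaper to extract
    loss_type="rank",        # pairwise rank loss; "reg" for regression
    num_threads=None,        # CPU threads for feature extraction/training
)
```

In particular I’m wondering whether a cheaper `feature_type` would cut the model-training overhead, but I haven’t found guidance on this.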