Transfer learning doesn't work (tuner_obj.load_history)

import os

from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner

tmp_log_file = log_filename + ".tmp"
print(len(tasks), 'tasks')
for i, tsk in enumerate(reversed(tasks)):
    prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))

    # create the tuner for this task
    if tuner == 'xgb' or tuner == 'xgb-rank':
        tuner_obj = XGBTuner(tsk, loss_type='rank')
    elif tuner == 'ga':
        tuner_obj = GATuner(tsk, pop_size=100)
    elif tuner == 'random':
        tuner_obj = RandomTuner(tsk)
    elif tuner == 'gridsearch':
        tuner_obj = GridSearchTuner(tsk)
    else:
        raise ValueError("Invalid tuner: " + tuner)

    # warm-start the tuner's cost model from the previous log, if any
    if use_transfer_learning:
        if os.path.isfile(tmp_log_file):
            tuner_obj.load_history(autotvm.record.load_from_file(tmp_log_file))

    # do tuning
    tuner_obj.tune(n_trial=min(n_trial, len(tsk.config_space)),
                   early_stopping=early_stopping,
                   measure_option=measure_option,
                   callbacks=[
                       autotvm.callback.progress_bar(n_trial, prefix=prefix),
                       autotvm.callback.log_to_file(tmp_log_file)])

Here is my code. I had finished auto-tuning 13 of the 23 tasks when I stopped the job, and then I restarted it.

Before I restarted, the log looked like this:

[Task 1/23] Current/Best: 1077.78/1799.04 GFLOPS | Progress: (1160/2000) | 2628.82 s Done.
[Task 2/23] Current/Best: 2632.91/3759.71 GFLOPS | Progress: (1080/2000) | 3420.68 s Done.
[Task 3/23] Current/Best: 86.04/4981.69 GFLOPS | Progress: (1360/2000) | 4965.42 s Done.
[Task 4/23] Current/Best: 6573.18/7858.29 GFLOPS | Progress: (840/2000) | 3622.84 s Done.
[Task 5/23] Current/Best: 2926.18/3372.58 GFLOPS | Progress: (1160/2000) | 4330.86 s Done.
[Task 6/23] Current/Best: 233.01/4516.10 GFLOPS | Progress: (1120/2000) | 4550.76 s Done.
[Task 7/23] Current/Best: 7413.94/8618.58 GFLOPS | Progress: (1320/2000) | 6722.76 s Done.
[Task 8/23] Current/Best: 947.91/6387.67 GFLOPS | Progress: (1240/2000) | 5181.29 s Done.
[Task 9/23] Current/Best: 2199.75/5921.06 GFLOPS | Progress: (1120/2000) | 4584.77 s Done.
[Task 10/23] Current/Best: 0.73/3877.09 GFLOPS | Progress: (2000/2000) | 8261.93 s Done.
[Task 11/23] Current/Best: 630.80/3405.00 GFLOPS | Progress: (1680/2000) | 6506.84 s Done.
[Task 12/23] Current/Best: 7196.21/9322.40 GFLOPS | Progress: (1160/2000) | 5283.66 s Done.
[Task 13/23] Current/Best: 4453.55/5437.99 GFLOPS | Progress: (1160/2000) | 4695.83 s Done.
[Task 14/23] Current/Best: 329.66/4925.87 GFLOPS | Progress: (1560/2000) | 6217.55 s

But after I restarted, the job still began with the first task:

[Task 1/23] Current/Best: 1075.06/1238.92 GFLOPS | Progress: (120/2000) | 278.33 s

Why does it not continue from the 14th task?

Or am I misunderstanding what transfer learning does?

Hoping for replies.

Transfer learning is not used for resuming a task. When you use XGBTuner, the tuner maintains a cost model to predict the next schedule config to explore. Usually the cost model is trained from scratch (a random start, improved over the trials). When transfer learning is enabled with a previous tuning log, the cost model is instead trained on that log, so it is expected to predict a better next schedule config from the beginning. The tuning process itself, however, still starts from the first trial.

See this thread for a discussion on resuming tuning.
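For completeness, here is a rough sketch of one way to resume; this is my own illustration, not an official AutoTVM feature. It counts the records already in the temporary log and skips tasks that have reached their trial budget, assuming the same tasks, n_trial, and tmp_log_file as in the snippet above. Note that early stopping can finish a task with fewer records than n_trial, so a real version would need a looser threshold:

import os
from collections import Counter

from tvm import autotvm

def finished_trial_counts(log_file):
    # Count the measured records per task workload in an AutoTVM log file.
    counts = Counter()
    if os.path.isfile(log_file):
        for inp, _res in autotvm.record.load_from_file(log_file):
            counts[inp.task.workload] += 1
    return counts

done = finished_trial_counts(tmp_log_file)
for i, tsk in enumerate(reversed(tasks)):
    budget = min(n_trial, len(tsk.config_space))
    if done.get(tsk.workload, 0) >= budget:
        # This task already has enough measured records in the log; skip it.
        print("Skipping already-tuned task:", tsk.workload)
        continue
    # ... create the tuner and call tuner_obj.tune() as in the snippet above ...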

OK, thank you. I'll try it.

I had done 13 of the 23 tasks before I killed the job.

Then I restarted it with transfer learning enabled.

I noticed that task 5 reached 3372 GFLOPS in the first tuning run, but in the second run the best is 3369. If transfer learning were working, the result should be at least as good as before. Does this mean the previous log is not being referenced?

Not really. As I explained earlier, transfer learning uses the tuning log to train a cost model, but it does not directly reuse any config in the log, so it is not guaranteed to reach the same or better performance. I have noticed this issue for a while; it is one of many AutoTVM issues that we at AWS are planning to improve in 2020.
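If the goal is simply not to lose the best configs from the first run, one workaround (a sketch of mine, not something proposed in this thread) is to merge the old and new logs and keep only the best record per workload before compiling. autotvm.record.pick_best and autotvm.apply_history_best are existing AutoTVM utilities; the file names and the commented-out build call are placeholders:

from tvm import autotvm

# Concatenate the two tuning logs; each line of an AutoTVM log is an
# independent JSON record, so plain file concatenation is enough.
merged_log = "merged.log"  # placeholder file name
with open(merged_log, "w") as fout:
    for fname in ("old_log.json", "new_log.json"):  # placeholder log names
        with open(fname) as fin:
            fout.write(fin.read())

# Keep only the best measured record for each workload.
autotvm.record.pick_best(merged_log, "best.log")

# Then compile under the best records of both runs, e.g.:
# with autotvm.apply_history_best("best.log"):
#     lib = relay.build(mod, target=target, params=params)

This way, even if the second tuning run never beats the first for some task (as with task 5 above), the compiled model still uses whichever run found the better config.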
