[solved] Can we resume an autotuning session?


#1

Hello,

Is there a “resume autotuning session” mechanism implemented (yet)?
It is quite frustrating to run a long autotuning session overnight, see that it failed on the last task, and not be able to recover the work already done during the session. It would also be nice to be able to stop a session and restart it later.

Thank you for your work.


#2

Assuming you’re saving the best configurations to your log file using autotvm.record.pick_best, an easy way to resume tuning is to tune only the tasks that aren’t found in the record file. Here’s a Python snippet that implements this.

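import os

from tvm import autotvm

def prune_old_tasks(tasks, log_file):
    if os.path.isfile(log_file):
        new_tasks = []
        history = autotvm.record.ApplyHistoryBest(log_file)
        for task in tasks:
            # keep only the tasks that have no record in the log file
            if history._query_inside(task.target, task.workload) is None:
                new_tasks.append(task)
        return new_tasks
    else:
        # no log file yet, so every task still needs tuning
        return tasks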

Apply this function after task extraction to get a list of the remaining untuned tasks.
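
To make the filtering logic easy to try without TVM installed, here is a minimal sketch of the same idea with the history lookup passed in as a callable. The names `prune_tasks` and `is_tuned` are hypothetical and not part of autotvm; in a real script, `is_tuned` would wrap the `history._query_inside(task.target, task.workload)` check from the snippet above.

```python
import os
from typing import Callable, List, TypeVar

T = TypeVar("T")


def prune_tasks(tasks: List[T], log_file: str,
                is_tuned: Callable[[T], bool]) -> List[T]:
    """Return the tasks that still need tuning.

    `is_tuned` is a hypothetical stand-in for the history lookup. It is only
    consulted when `log_file` exists, mirroring prune_old_tasks above.
    """
    if not os.path.isfile(log_file):
        # No record file yet: nothing has been tuned, keep every task.
        return list(tasks)
    # Keep only the tasks the lookup does not recognise as tuned.
    return [task for task in tasks if not is_tuned(task)]
```

As in the original function, a task counts as done as soon as any record for it exists, so the check cannot tell a finished task from one that was interrupted partway through.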


#3

Thank you, happy to see this feature!
However, I think some work is lost if I interrupt an autotuning task and then resume: I interrupted a session before task 1/11 completed, and after resuming it started at 1/10, so part of the first task was skipped.

It would be nice to see some documentation about it.


#5

You’re right that this method only recovers fully completed tasks. It’s a little trickier to resume mid-task but I imagine a similar process could be applied. I’ll think about the best way to integrate this function into the current tuning infrastructure and issue a PR.


#6

I don’t know exactly what the temporary file contains or what is stored during the autotuning process, but maybe I could omit the part of the log file corresponding to the last task, so that I am sure to include only completed tasks.

Is that possible?
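
One way to picture this idea is the sketch below. It assumes the log holds one record per line and that records are appended task by task, so the final contiguous run of lines belongs to the task that was interrupted. The function `drop_last_task` and its `task_key` parameter are hypothetical; `task_key` must be supplied by the caller to extract whatever identifies a task from a log line, since that depends on the log format in use.

```python
from typing import Callable, Hashable, List


def drop_last_task(lines: List[str],
                   task_key: Callable[[str], Hashable]) -> List[str]:
    """Drop the trailing run of records that share the last record's task.

    Assumes records were appended task by task, so the final contiguous run
    belongs to the task that was being tuned when the session stopped.
    """
    records = [ln for ln in lines if ln.strip()]  # ignore blank lines
    if not records:
        return []
    last = task_key(records[-1])
    end = len(records)
    # Walk backwards past every record that belongs to the last task.
    while end > 0 and task_key(records[end - 1]) == last:
        end -= 1
    return records[:end]
```

Note the cost of this approach: if the last task had in fact finished, its records are dropped as well and that task gets tuned again.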


#7

Okay, sorry for not having looked into it enough myself. Thanks to your example I found it was quite easy to implement with existing features; it just needed a modification of the autotuning script.

So basically I am using two temporary files to save the work done:

  • [model].log.tmp: checkpoint written after each completed task; can be resumed without loss
  • [model].log.task.tmp: work done for the task in progress; cannot be resumed without loss (no need to keep it on exit or failure)

import os
import tempfile

from tvm import autotvm

tmp_log = log + '.tmp'  # checkpoint log, holds records of completed tasks only

def tune_tasks(...):
    ...
    for i, tsk in enumerate(reversed(tasks)):
        ...
        # for transfer learning, load history from the completed-tasks log
        if use_transfer_learning and os.path.isfile(tmp_log):
            tuner_obj.load_history(autotvm.record.load_from_file(tmp_log))

        with tempfile.NamedTemporaryFile() as tmp_task_log_file:
            # tune into a blank temporary file
            tuner_obj.tune(..., callbacks=[..., autotvm.callback.log_to_file(tmp_task_log_file.name)])
            # task completed, append the task log to the checkpoint log
            with open(tmp_log, 'a') as tmp_log_file:
                tmp_log_file.write(tmp_task_log_file.read().decode('utf8'))
    # after tuning all tasks, pick the best records from tmp_log and remove tmp_log
    autotvm.record.pick_best(tmp_log, log)
    os.remove(tmp_log)

That way I can resume a tuning session as you described, from the checkpoint file ([model].log.tmp, i.e. tmp_log), without any loss of work.
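
The two-file pattern can also be tried in isolation. The sketch below uses only the standard library and is an illustration, not the tuning script itself: `run_with_checkpoint` and the `run_task(name, path)` callback are hypothetical names, with `run_task` standing in for the `tuner_obj.tune(...)` call that logs to a per-task file.

```python
import os
import tempfile
from typing import Callable, Iterable


def run_with_checkpoint(task_names: Iterable[str],
                        run_task: Callable[[str, str], None],
                        checkpoint_path: str) -> None:
    """Run each task into a scratch file, then append it to the checkpoint.

    `run_task(name, path)` is a hypothetical callback that appends its
    records to `path`; it stands in for the tuner call in the script above.
    """
    for name in task_names:
        # The scratch file exists only for the duration of this task.
        fd, scratch = tempfile.mkstemp(suffix=".task.tmp")
        os.close(fd)
        try:
            run_task(name, scratch)  # may raise; the checkpoint stays untouched
            with open(scratch, "r", encoding="utf8") as src, \
                    open(checkpoint_path, "a", encoding="utf8") as dst:
                dst.write(src.read())  # task finished: commit its records
        finally:
            os.remove(scratch)
```

If a task fails or the run is interrupted, the checkpoint file contains only the records of tasks that finished. The sketch uses `tempfile.mkstemp` and closes the descriptor right away because reopening a still-open `NamedTemporaryFile` by name does not work on Windows.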

Thank you.