Auto-tuning gets stuck


#1

Hi, I am auto-tuning a ResNet-20 on CIFAR-10 iteratively. In epoch x, I set the number of trials in the “tuning options” to x.

However, the program gets stuck. For example:
Tuning…
[Task 1/14] Current/Best: 11.81/ 59.79 GFLOPS | Progress: (27/27) | 70.04 s Done.
[Task 2/14] Current/Best: 5.65/ 61.04 GFLOPS | Progress: (27/27) | 71.72 s Done.
[Task 3/14] Current/Best: 3.45/ 16.33 GFLOPS | Progress: (27/27) | 65.15 s Done.
[Task 4/14] Current/Best: 8.78/ 20.19 GFLOPS | Progress: (27/27) | 69.58 s Done.
[Task 5/14] Current/Best: 8.42/ 8.58 GFLOPS | Progress: (27/27) | 74.90 s Done.
[Task 6/14] Current/Best: 1.18/ 20.31 GFLOPS | Progress: (27/27) | 76.74 s Done.
[Task 7/14] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/27) | 0.00 s

^CProcess ForkPoolWorker-2207:
Process ForkPoolWorker-2208:
Process ForkPoolWorker-2203:
Process ForkPoolWorker-2154:
Process ForkPoolWorker-2152:
Process Process-2155:
Process ForkPoolWorker-2149:

It seems to get stuck in the 27th epoch. I am not sure why. Any suggestions?

Thanks,


#2

Have you checked the size of the ConfigSpace for each task? The total number of configurations may be only 27, which would be smaller than your number of trials.
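A minimal sketch of such a check follows. It assumes each task object exposes a `name` and a `config_space` that supports `len()`; the helper name `report_space_sizes` and the stub tasks are hypothetical and only stand in for real AutoTVM tasks so the snippet runs without TVM installed.

```python
from types import SimpleNamespace


def report_space_sizes(tasks, n_trial):
    """Return (task name, config space size, trials that can actually run)."""
    rows = []
    for tsk in tasks:
        size = len(tsk.config_space)
        # The number of trials can never exceed the number of configurations.
        rows.append((tsk.name, size, min(n_trial, size)))
    return rows


# Hypothetical stand-ins for real tasks; the sizes are made up for illustration.
stub_tasks = [
    SimpleNamespace(name="small_task", config_space=range(20)),
    SimpleNamespace(name="large_task", config_space=range(70_000)),
]

for name, size, effective in report_space_sizes(stub_tasks, n_trial=27):
    print(f"{name}: {size} configs, {effective} trials will run")
```

With real tasks, the same loop would be fed the list of extracted tasks instead of `stub_tasks`.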


#3

Well, I think I use

    tuner_obj.tune(n_trial=min(n_trial, len(tsk.config_space)),
                   early_stopping=early_stopping,
                   measure_option=measure_option,
                   callbacks=[
                       autotvm.callback.progress_bar(n_trial, prefix=prefix),
                       autotvm.callback.log_to_file(tmp_log_file)])

where n_trial = 27 in this case. There is already a “min” call to keep the number of trials from exceeding the config space size.
So I do not think this is caused by a small config space.
Thanks,


#4

I printed the config space sizes. They vary from 70k to 9M across the layers of ResNet-20, so all of them are much larger than 27.


#5

Could you share the full code?
The progress may also stop because of a small early-stopping value, or because the number of trials changes between where it is defined and where it is used during tuning.
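To illustrate the first point, here is a simplified model of an early-stopping loop. It is a sketch, not AutoTVM's actual implementation: the function name `trials_run`, the stopping rule as written, and the scores are all assumptions made for illustration.

```python
def trials_run(scores, n_trial, early_stopping):
    """Count how many trials are measured before the loop ends.

    Simplified rule: stop after `n_trial` trials, or as soon as
    `early_stopping` trials have passed since the last new best score.
    """
    best = float("-inf")
    best_iter = 0
    done = 0
    for i, score in enumerate(scores[:n_trial], start=1):
        done = i
        if score > best:
            best, best_iter = score, i
        if i >= best_iter + early_stopping:
            break  # no improvement for `early_stopping` trials
    return done


# Made-up scores: the best value (9.0) appears at trial 2.
scores = [5.0, 9.0, 3.0, 2.0, 1.0, 4.0, 8.0, 7.0]
print(trials_run(scores, n_trial=8, early_stopping=3))
print(trials_run(scores, n_trial=8, early_stopping=100))
```

In this model a small early-stopping value ends the task before all requested trials have run, while a large one lets every trial run.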


#6

Currently, I am using epochs (= 1, 2, 3, 4, … 100) as the number of trials and (epochs // 2) as the early-stopping number. I do not quite understand what you mean by the number of trials changing between its definition and its use during tuning. Do you mean that problems arise when the config space size and the number of trials are too close?

Any suggestions on setting n_trials and the early-stopping number? I would like to set n_trials=1000 and early stopping to 500, but that is too time-consuming.
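For concreteness, the settings described here can be written as a small helper. The name `settings_for_epoch` is hypothetical; the two formulas are the ones stated in this post.

```python
def settings_for_epoch(epoch):
    """Tuning settings from this post: n_trial = epoch, early stopping = epoch // 2."""
    return {"n_trial": epoch, "early_stopping": epoch // 2}


for epoch in (1, 2, 27, 100):
    print(epoch, settings_for_epoch(epoch))
```

Note that integer division gives an early-stopping value of 0 at epoch 1, and 27 trials with an early-stopping value of 13 at epoch 27.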


#7

You said:

I think you answered your own question: the progress stops at 27 because you set n_trial to 27.

When it comes to tuning the scheduler, it depends on the chosen tuner and the size of the ConfigSpace. If you want the best performance, I think you should pick XGBTuner and set n_trials to at least 500-1000. Since your ConfigSpace holds about 10^5-10^6 configurations, even that will not cover 5% of them.
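The coverage arithmetic can be checked with a short sketch. The helper name `coverage_percent` is hypothetical; the space sizes 70k and 9M are the ones reported earlier in this thread.

```python
def coverage_percent(n_trial, space_size):
    """Percentage of the config space visited by at most `n_trial` distinct trials."""
    return 100.0 * min(n_trial, space_size) / space_size


for space_size in (70_000, 9_000_000):
    for n_trial in (27, 500, 1000):
        pct = coverage_percent(n_trial, space_size)
        print(f"{n_trial} trials / {space_size} configs: {pct:.4f}%")
```

Even 1000 trials on the smallest reported space stay well under 5% coverage.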


#8

Currently, I use a small n_trial just for convenience. I would like to make sure the program is functionally correct first; performance is a later consideration.

So you mean the program may not hang if I use a larger n_trials? What early-stopping number should I use?

That is quite strange. Why does it hang for a small number of trials but not for a larger number?


#9

It doesn’t “hang” for a small number of trials. The number of trials is a parameter that specifies how many configurations will be checked for each task.

So, the progress equals 27 for each task because you set it to 27.

If you would like to change it to another number, just modify n_trials.


#10

Any suggestions on how to avoid the hangs, other than setting n_trial to 500-1000, which is too time-consuming?
Or any suggestions for accelerating the auto-tuning process?

Thanks,