Auto-tuning too slow

Hello,
I’m a new user of TVM, and I recently followed the tutorial to quantize and auto-tune models for an NVIDIA T4. But the auto-tuning program runs very slowly.

Extract tasks...
Tuning...
[Task  1/18]  Current/Best:  249.99/ 291.56 GFLOPS | Progress: (768/1500) | 3791.38 s Done.
[Task  2/18]  Current/Best:    8.29/ 431.15 GFLOPS | Progress: (672/1500) | 7153.01 s Done.
[Task  3/18]  Current/Best:  414.85/ 442.44 GFLOPS | Progress: (720/1500) | 9403.10 s Done.

The program runs on an Intel(R) Xeon(R) Gold 6226 CPU @ 2.70GHz. According to htop, it seems to occupy only 1 physical core.

Is this speed normal? How can I use multiple cores to speed it up?

This is normal. Auto-tuning measures the candidate configs directly on the target device, which is the GPU in your case. Since you have only one GPU, all 1,500 trials are performed sequentially, and the tasks themselves are also run one after another. The only way to speed it up at the moment is to enable distributed tuning with multiple GPUs or multiple machines.
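If you register several GPUs under the same device key, the RPCRunner can spread the measurements over them. As a quick sanity check (this is the standard tracker query command, assuming your tracker listens on port 9190 as in the tutorial), you can list what the tracker currently sees:

python3 -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190

The queue status it prints should show one free server per registered GPU under your key.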

I’m confused that neither the CPU nor the GPUs are busy during tuning.

Actually, I registered five NVIDIA T4s following the steps from the tutorial, but nvidia-smi shows that most of the time the GPUs are idle and only one GPU is used, which looks strange.

Here is my code snippet:

 # imports needed by this snippet (get_network is the helper defined in the tutorial)
 import os
 import numpy as np
 import tvm
 from tvm import autotvm, relay
 from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
 from tvm.contrib.util import tempdir
 from tvm.contrib import graph_runtime as runtime

 #### DEVICE CONFIG ####
 target = tvm.target.cuda()

 #### TUNING OPTION ####
 network = 'aa'
 log_file = "%s.log" % network
 dtype = 'float32'

 tuning_option = {
     'log_filename': log_file,

     'tuner': 'xgb',
     'n_trial': 1500,
     'early_stopping': 400,

     'measure_option': autotvm.measure_option(
         builder=autotvm.LocalBuilder(timeout=10),
         #runner=autotvm.LocalRunner(number=20, repeat=3, timeout=4, min_repeat_ms=150),
         runner=autotvm.RPCRunner(
             'T4',  # change the device key to your key
             '0.0.0.0', 9190,
             number=20, repeat=3, timeout=4, min_repeat_ms=150)
     ),
 }
 def tune_tasks(tasks,
                measure_option,
                tuner='xgb',
                n_trial=1000,
                early_stopping=None,
                log_filename='tuning.log',
                use_transfer_learning=True):

     # create tmp log file
     tmp_log_file = log_filename + ".tmp"
     if os.path.exists(tmp_log_file):
         os.remove(tmp_log_file)

     for i, tsk in enumerate(reversed(tasks)):
         prefix = "[Task %2d/%2d] " %(i+1, len(tasks))

         # create tuner
         if tuner == 'xgb' or tuner == 'xgb-rank':
             tuner_obj = XGBTuner(tsk, loss_type='rank')
         elif tuner == 'ga':
             tuner_obj = GATuner(tsk, pop_size=100)
         elif tuner == 'random':
             tuner_obj = RandomTuner(tsk)
         elif tuner == 'gridsearch':
             tuner_obj = GridSearchTuner(tsk)
         else:
             raise ValueError("Invalid tuner: " + tuner)

         if use_transfer_learning:
             if os.path.isfile(tmp_log_file):
                 tuner_obj.load_history(autotvm.record.load_from_file(tmp_log_file))

         # do tuning
         tuner_obj.tune(n_trial=min(n_trial, len(tsk.config_space)),
                        early_stopping=early_stopping,
                        measure_option=measure_option,
                        callbacks=[
                            autotvm.callback.progress_bar(n_trial, prefix=prefix),
                            autotvm.callback.log_to_file(tmp_log_file)])

     # pick best records to a cache file
     autotvm.record.pick_best(tmp_log_file, log_filename)
     os.remove(tmp_log_file)
 def tune_and_evaluate(tuning_opt):
     # extract workloads from relay program
     print("Extract tasks...")
     mod, params, input_shape, out_shape = get_network(network, batch_size=1)
     with relay.quantize.qconfig(store_lowbit_output=False):
         mod['main'] = relay.quantize.quantize(mod['main'], params=params)
     tasks = autotvm.task.extract_from_program(mod['main'], target=target,
                                             params=params, ops=(relay.op.nn.conv2d,))
     for i in range(len(tasks)):
         tsk = tasks[i]
         input_channel = tsk.workload[1][1]
         output_channel = tsk.workload[1][0]
         if output_channel % 4 == 0 and input_channel % 4 == 0:
             tsk = autotvm.task.create(tasks[i].name, tasks[i].args,
                                       tasks[i].target, tasks[i].target_host, 'int8')
             tasks[i] = tsk


     # run tuning tasks
     print("Tuning...")
     tune_tasks(tasks, **tuning_opt)

     # compile kernels with history best records
     with autotvm.apply_history_best(log_file):
         print("Compile...")
         with relay.build_config(opt_level=3):
             graph, lib, params = relay.build_module.build(
                 mod, target=target, params=params)

         # export library
         tmp = tempdir()
         filename = "net.tar"
         lib.export_library(tmp.relpath(filename))

         # load parameters
         ctx = tvm.context(str(target), 0)
         module = runtime.create(graph, lib, ctx)
         data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
         module.set_input('data', data_tvm)
         module.set_input(**params)

         # evaluate
         print("Evaluate inference time cost...")
         ftimer = module.module.time_evaluator("run", ctx, number=1, repeat=60)
         prof_res = np.array(ftimer().results) * 1000  # convert to millisecond
         print("Mean inference time (std dev): %.2f ms (%.2f ms)" %
               (np.mean(prof_res), np.std(prof_res)))

 tune_and_evaluate(tuning_option)

Maybe the RPC isn’t working properly, I guess?
@kevinthesun do you have any idea?

The program gives me the feeling that it is “frozen”.
The CPU utilization is low and the GPUs do no work. The tuning progress is only updated about every 1000 to 2000 seconds, and in the interval I don’t know what it is doing.

Have you set the TVM_NUM_THREADS environment variable?

I tried setting TVM_NUM_THREADS on my laptop (the server has been running for over 15 hours and I don’t want to disturb it). The program still occupies only 1 physical core.

CPU: Intel® Core™ i7-8750H CPU @ 2.20GHz 6C12T
GPU: GeForce GTX 1060 Mobile

Here is my procedure:

  1. Start an RPC tracker:
python3 -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
  2. Start an RPC server:
python3 -m tvm.exec.rpc_server --tracker=0.0.0.0:9190 --key=1060
  3. Set the environment variable:
export TVM_NUM_THREADS=6
  4. Start tuning:
python3 tune_relay_int8_cuda.py

On the server, I repeated step 2 to register the five T4 GPUs, but I didn’t set the TVM_NUM_THREADS environment variable.

The tutorial says a high-performance CPU can be helpful, so I guess there is a problem on my side.

The tuning needs to compile many programs and extract features from them, so a high-performance CPU is recommended. The sample output listed in the tutorial takes about 4 hours to produce on a 32-thread AMD Ryzen Threadripper, with an NVIDIA 1080 Ti as the tuning target. (You can see some errors during compilation; as long as the tuning is not stuck, it is okay.)

UPDATE
I ran the tuning program on another server, exported TVM_NUM_THREADS=24 in all terminals, and registered two P4 GPUs. But this doesn’t accelerate the tuning: still only one CPU core is busy and the GPUs are not always busy.

Did I do something wrong? :confused:

It is normal that the GPU has low utilization. Kernels are profiled on the GPU sequentially, and each kernel usually doesn’t occupy the whole GPU.

Can I assign a specific GPU to execute the auto-tuning program?

You can run the program with the CUDA_VISIBLE_DEVICES environment variable.
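For example (the device index 2 is just a placeholder), the process then only sees that card:

CUDA_VISIBLE_DEVICES=2 python3 tune_relay_int8_cuda.py

Note that if measurement goes through an RPC server rather than a LocalRunner (as in the snippet above), the variable has to be set on the RPC server process, since that is the process that actually launches the kernels on the GPU.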

That works if I only run one auto-tuning program. But if I run two auto-tuning programs, they are both executed on the first GPU again.

The problem is that I have many models to auto-tune, and the auto-tuning speed is too slow.

How can I auto-tune them in parallel (e.g. one model per GPU card)?
Or how can I use all GPUs to speed up a single auto-tuning run? (The “Scale up measurement by using multiple devices” section in the tutorial doesn’t speed it up for me.)

Thx.

Using multiple GPUs will help, but tuning is also limited by the CPU. You can set up one RPC server per GPU by starting each server with a different env var (e.g. CUDA_VISIBLE_DEVICES).
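A sketch of that setup, assuming the tracker from the earlier steps on port 9190 and the device key ‘T4’ from the snippet above (adjust the key, port, and GPU indices to your machine):

 # one RPC server per GPU, each pinned to its own card
 CUDA_VISIBLE_DEVICES=0 python3 -m tvm.exec.rpc_server --tracker=0.0.0.0:9190 --key=T4 &
 CUDA_VISIBLE_DEVICES=1 python3 -m tvm.exec.rpc_server --tracker=0.0.0.0:9190 --key=T4 &
 CUDA_VISIBLE_DEVICES=2 python3 -m tvm.exec.rpc_server --tracker=0.0.0.0:9190 --key=T4 &

With several free servers under the same key, the RPCRunner in measure_option can measure candidate configs on all of them in parallel. If you want one model per GPU instead, give each server its own key (e.g. T4-0, T4-1; hypothetical names) and point each tuning script’s RPCRunner at the matching key.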

Yeah, I find that the CPU is very important for auto-tuning, but I have no idea how to utilize all CPU cores. I tried setting the TVM_NUM_THREADS env var and using taskset, but as in the screenshot I posted above, only one CPU core is used.
Could you give me some suggestions?
Thx a lot.

Setting the TVM_BIND_THREADS env var solved my problem.
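In case it helps others, a minimal sketch of the workaround (I’m assuming the value 0 here, which tells TVM’s threading backend not to bind worker threads to fixed cores):

 export TVM_BIND_THREADS=0
 python3 tune_relay_int8_cuda.py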

It is strange that the default behavior ends up using only one CPU core, and the dev docs don’t mention this variable at all.

This is strange, since TVM_BIND_THREADS is not related to autotvm. By default all cores should be used (the xgboost tuner starts a process pool using all cores by default), so I suspect there is some issue with the TVM threading backend.
This issue might be related: TVM & OpenCV Compatibility Issue.
I ran into a similar issue before when I started a Relay interpreter and then ran autotvm tuning.

I tried your method, but the tuning speed is not noticeably faster compared to using the CPU or a single GPU. Is that expected?