How to improve the auto-tune performance?

What can I do if the auto-tuned model doesn’t perform well?

It’s slower than the predefined configurations.

How can I tune to get better performance?

I used n_trial = 2000, early_stopping = 600 and got 520 fps.

I modified the tune options to n_trial = 4000, early_stopping = 1200, but the performance didn’t get any better.

Is there no tutorial about this?

The details are here: Performance after auto-tune not so good

According to your original post, the performance is even worse after the tuning. It means the explored configs are worse than either the configs on Tophub or the fallback config defined in TOPI. Here are some possibilities that you can give a shot:

  • Do not set early_stopping but just let it run as many trials as you set.
  • number (20) can be decreased (e.g., to 5) and repeat can be just 1. The measurement is usually stable on GPU, although I am not sure which target you used.
  • min_repeat_ms should be at least 1000 (ms) so that AutoTVM itself decides how many times to run each measurement.
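For reference, here is a rough sketch of how the three suggestions above might map onto AutoTVM's measure options. The helper names are made up for illustration, LocalRunner is an assumption (swap in RPCRunner if you measure over RPC), and the timeout of 10 is only a placeholder. The second function needs TVM installed; the first one is plain Python.

```python
def suggested_runner_settings(min_repeat_ms=1000):
    """Runner keyword arguments reflecting the suggestions above."""
    return {
        "number": 5,                     # fewer runs per repeat than 20
        "repeat": 1,                     # a single repeat
        "min_repeat_ms": min_repeat_ms,  # lower bound on one repeat's duration
        "timeout": 10,                   # placeholder value, adjust as needed
    }


def make_measure_option(settings):
    """Build the measure option; TVM is imported lazily, only when this is called."""
    from tvm import autotvm  # requires a TVM installation

    return autotvm.measure_option(
        builder=autotvm.LocalBuilder(timeout=10),
        runner=autotvm.LocalRunner(**settings),
    )


settings = suggested_runner_settings()
print(settings)
```

The result of make_measure_option(settings) would then be passed as measure_option to tuner.tune(...) together with n_trial, leaving early_stopping at None so that every trial runs.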

Meanwhile, I also suggest analyzing the performance regression after tuning. Specifically, you can search Tophub and try to locate the configs that achieve 700+ fps, and see if those configs were explored in your tuning log. You can even download the Tophub (or you can check if the Tophub log has been downloaded by TVM already in ~/.tvm/tophub) and launch AutoTVM with Tophub log to enable transfer learning.

Thank you so much for your advice!

Some questions:

1. “Number (20) can be decreased (e.g., 5) and repeat can be just 1”: did you mean to shorten the tuning duration by doing this? I’m using a Tesla V100 GPU.

2. Why set min_repeat_ms = 1000? It’s confusing. According to the API manual, “The minimum duration of one repeat in milliseconds. By default, one repeat contains number runs. If this parameter is set, the parameters number will be dynamically adjusted to meet the minimum duration requirement of one repeat.”

It seems that if one forward pass takes 1 ms, then the number of runs in one repeat will be min_repeat_ms / 1 ms = 1000? So the number = 5 parameter will be overwritten to 1000? That really takes a lot of time.

I don’t know if I understand it right…

  1. I meant you don’t have to spend the tuning time on unnecessary parts.

  2. If an op latency is about 1 ms and number is set to 5, then one measurement takes 5 ms. With min_repeat_ms=1000, that measurement is effectively repeated about 200 times, but remember, it still takes only about 1000 ms in total. The purpose of this setting is to guarantee the measurement accuracy, as the error rate when measuring a 1 ms latency could be huge.
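The arithmetic in point 2 can be checked with a few lines of plain Python. This is a simplified model based on the API text quoted in the question (number grows until one repeat lasts at least min_repeat_ms), not TVM's actual adjustment loop, and the function name is made up for illustration. Whether you count it as 200 groups of 5 runs or as number raised to about 1000, the total comes out the same.

```python
import math


def runs_per_repeat(op_latency_ms, number, min_repeat_ms):
    """Smallest run count (at least `number`) whose total time reaches min_repeat_ms."""
    needed = math.ceil(min_repeat_ms / op_latency_ms)
    return max(number, needed)


runs = runs_per_repeat(op_latency_ms=1.0, number=5, min_repeat_ms=1000)
print(runs, "runs, about", runs * 1.0, "ms per repeat")
print(runs // 5, "groups of 5 runs")
```

With min_repeat_ms left at 0 the function simply returns number, which is the case where a 5 ms measurement is all you get.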

You can actually estimate the tuning time, by the way. By default AutoTVM compiles 8 configs in parallel. Assuming it takes 5 secs to compile one config and about 1 sec to measure each one (with min_repeat_ms=1000), building and measuring 8 configs needs 5+8*1=13 seconds in total (the measurement part cannot be run in parallel because you only have one GPU). Then 4,000 trials will take about (4000/8)*13=6,500 seconds, or about 1.8 hours. This tuning time is actually common for AutoTVM on GPUs.
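The same estimate as a small function, in case you want to plug in your own numbers. The defaults are just the assumptions from the paragraph above (8 parallel builds, 5 s per compile, 1 s per measurement); the function name is made up for illustration.

```python
def estimate_tuning_seconds(n_trial, parallel_builds=8, compile_s=5.0, measure_s=1.0):
    """Rough tuning time: builds run in parallel, measurements run one by one."""
    batches = n_trial / parallel_builds
    per_batch = compile_s + parallel_builds * measure_s
    return batches * per_batch


total = estimate_tuning_seconds(4000)
print(total, "seconds, about", round(total / 3600, 1), "hours")
```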

Got it. The min_repeat_ms parameter is meant for ops whose latency is too short. In that situation, the GPU hasn’t reached its best power performance by the time the measurement is done.

Thank you. I’ll give it a shot.

Specifically, you can search Tophub and try to locate the configs that achieve 700+ fps,

I still have a question: are there any docs for the file format of the Tophub log files? I don’t quite know what these params mean. How does TVM find a config for a specific task? By the input tensor and kernel shape, or something else? Thank you!

{
  "i": [
    "cuda -model=tx2",
    "topi_nn_depthwise_conv2d_nchw",
    [
      ["TENSOR", [1, 32, 112, 112], "float32"],
      ["TENSOR", [32, 1, 3, 3], "float32"],
      [1, 1], [1, 1], [1, 1],
      "float32"
    ],
    {},
    [
      "depthwise_conv2d_nchw",
      [1, 32, 112, 112, "float32"],
      [32, 1, 3, 3, "float32"],
      [1, 1], [1, 1], [1, 1],
      "float32"
    ],
    {
      "i": 4725392, "c": null,
      "e": [
        ["tile_f", "sp", [32, 1, 1, 1]],
        ["tile_y", "sp", [4, 1, 4, 7]],
        ["tile_x", "sp", [2, 1, 56, 1]],
        ["auto_unroll_max_step", "ot", 256],
        ["unroll_explicit", "ot", 1]
      ],
      "t": "direct"
    }
  ],
  "r": [
    [0.00012950566279069768],
    0, 1.7964670658111572,
    1538703580.0487945
  ],
  "v": 0.1
}

Tuning log is not supposed to be read by users so I’m afraid there’s no doc for it. The easiest way is tracing the encoder/decoder directly.

Same reason for the mechanism of determining a config given a task. You could refer to load_reference_log.
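Since there is no doc, here is a small stdlib-only sketch that pulls the record above apart. The field meanings are my reading of this one record and of the record encoder, not documented behaviour, so treat them as assumptions: "i" looks like (target, task name, args, kwargs, workload, config) and "r" like (costs, error number, total cost, timestamp). My understanding is that a config is matched to a task by target plus workload, and the workload carries the op name, shapes, strides, padding and dtype.

```python
import json

# The record from the question above, on one line as it would appear in a log file.
LINE = (
    '{"i": ["cuda -model=tx2", "topi_nn_depthwise_conv2d_nchw", '
    '[["TENSOR", [1, 32, 112, 112], "float32"], '
    '["TENSOR", [32, 1, 3, 3], "float32"], [1, 1], [1, 1], [1, 1], "float32"], '
    '{}, '
    '["depthwise_conv2d_nchw", [1, 32, 112, 112, "float32"], '
    '[32, 1, 3, 3, "float32"], [1, 1], [1, 1], [1, 1], "float32"], '
    '{"i": 4725392, "c": null, "e": [["tile_f", "sp", [32, 1, 1, 1]], '
    '["tile_y", "sp", [4, 1, 4, 7]], ["tile_x", "sp", [2, 1, 56, 1]], '
    '["auto_unroll_max_step", "ot", 256], ["unroll_explicit", "ot", 1]], '
    '"t": "direct"}], '
    '"r": [[0.00012950566279069768], 0, 1.7964670658111572, 1538703580.0487945], '
    '"v": 0.1}'
)


def summarize(line):
    """Split one log line into the parts that matter when comparing logs."""
    rec = json.loads(line)
    # Assumed field order, see the note above.
    target, task_name, args, kwargs, workload, config = rec["i"]
    costs, error_no, all_cost, timestamp = rec["r"]
    return {
        "target": target,
        "task": task_name,
        "workload": workload,
        "knobs": {name: value for name, _kind, value in config["e"]},
        "mean_cost_s": sum(costs) / len(costs),
        "error_no": error_no,
    }


info = summarize(LINE)
print(info["task"], info["mean_cost_s"], info["knobs"])
```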

I tried this on 2 specific ops. Using min_repeat_ms = 150 and early_stopping = 600, I got 80 GFLOPS for op a and 690 GFLOPS for op b.

I then set n_trial = 2000, early_stopping = None, and min_repeat_ms = 1000, ran auto-tune for the 2 tasks, and got nearly the same performance (81 GFLOPS and 702 GFLOPS respectively).

Do I still need to run auto-tune on the whole model? The model consists of 20 ops.

It seems 150 ms is enough for the GPU to reach its best performance? I see this in the API manual:

min_repeat_ms can dynamically adjusts number, so it is recommended. The typical value for NVIDIA GPU is 150 ms.

A model with 20 ops doesn’t mean it has 20 tuning tasks. AutoTVM generates only one task for repeated ops during extraction. If AutoTVM extracts 20 tasks from your model, then you probably need to tune all of them.

You can definitely set the min_repeat_ms to 150 if you feel comfortable. AutoTVM currently doesn’t have adaptive mechanisms to determine these configurations wisely so it highly relies on user experience.

Well… I didn’t express myself well: mobilenet actually has 20 tasks to tune. It seems I’d better follow your advice.

Another question: how can I use transfer learning with TVM’s pre-configure for mobilenet?

It seems I need to find the corresponding config in Tophub and apply it to the tuner, but I have no idea how to actually do it.

Do I have to read the source code and trace how the tuner finds the match for each convolution?

I noticed that during auto-tuning my GPU power usage never reaches the full 250 W. The highest is about 130 W. Does it matter?

I asked questions about tuning mobilenet. At the same time, I’m using another PC with a V100 GPU to auto-tune resnet50.

I set min_repeat_ms = 3000; the GFLOPS is always 0, the GPU is always in a low power state, and the GPU usage is never higher than 10%.

Is that normal?

here is my configuration.

    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(timeout=10),
        runner=autotvm.RPCRunner(
            'V100',
            '0.0.0.0', 9190,
            number=5, repeat=1, timeout=10, min_repeat_ms=3000)
    ),

  • Does the “pre-configure” mean Tophub? If so, AutoTVM currently won’t use Tophub as training data (although it’s a reasonable idea). You have to manually download the Tophub file (or it might have been downloaded already to ~/.tvm) and just treat it as a tuning log as usual.

  • It’s fine. AutoTVM will use GPU for measurement. Since the op latency is usually short (less than a ms), GPU utilization will not be that high.

  • See my reply at Why auto tune always 0.0GFLPS?
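To make "treat it as a tuning log" a bit more concrete, here is a hedged sketch. The first helper only lists what is in the Tophub cache directory, assuming the files there end in .log; the second feeds an existing log to a tuner through load_history and record.load_from_file, which is how the tutorials warm-start a tuner such as XGBTuner. Both helper names are made up for illustration, and the second one needs TVM installed.

```python
import glob
import os


def find_tophub_logs(tophub_dir=None):
    """Return the log files found in the Tophub cache directory, sorted by name."""
    if tophub_dir is None:
        tophub_dir = os.path.join(os.path.expanduser("~"), ".tvm", "tophub")
    return sorted(glob.glob(os.path.join(tophub_dir, "*.log")))


def warm_start(tuner, log_path):
    """Feed existing records to a tuner before tuning (requires TVM)."""
    from tvm import autotvm  # imported lazily, only when this is called

    tuner.load_history(autotvm.record.load_from_file(log_path))


print(find_tophub_logs())
```

You would call warm_start(tuner, path) once per task, right after creating the tuner and before tuner.tune(...).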

Sorry to bother you again… but I’m really confused.

After I changed my LLVM version, the mobilenet tuning seems better, and the GFLOPS for each task is higher.

But when all the work is done and I run inference with the tuned model, the performance doesn’t get any better.

Are there any tools to check this?

I want to compare my tuned model with another, faster tuned model (it was tuned by someone else with the same .pb file) to see where the difference is.

Specifically, I want to print out the cost of each op and check which op causes the performance difference.

Here I compared the two logs of the 20 tuning tasks of mobilenet, two records per row: the left one is from the 680 fps model and the right one is from my 530 fps model.

It seems that on most ops mine is faster, so why does the whole model get worse?

This float number, 7.296479683567062e-06, is the forward cost, isn’t it?

{“v”: 0.1, “r”: [[7.296479683567062e-06], 0, 31.188103199005127, 1557114554.860901], “i”:│ {“v”: 0.1, “r”: [[6.209928335255671e-06], 0, 1.7554395198822021, 1578469121.1661108], "i

{“v”: 0.1, “r”: [[7.25566608118034e-06], 0, 24.90612006187439, 1557116969.1149082], “i”: │ {“v”: 0.1, “r”: [[3.957661777224805e-06], 0, 1.8577919006347656, 1578471887.4135346], "i

{“v”: 0.1, “r”: [[1.8337903959294006e-05], 0, 19.064162254333496, 1557119485.236], “i”: [│ {“v”: 0.1, “r”: [[9.10231216983461e-06], 0, 1.9765150547027588, 1578474854.435663], “i”:

{“v”: 0.1, “r”: [[1.6040891496989785e-05], 0, 16.555699586868286, 1557120860.9099872], "i│ {“v”: 0.1, “r”: [[4.405155762965572e-06], 0, 1.5967164039611816, 1578477002.8818524], "i

{“v”: 0.1, “r”: [[1.3899782663896583e-05], 0, 14.676212310791016, 1557123288.6902747], "i│ {“v”: 0.1, “r”: [[1.049456236086878e-05], 0, 3.158353090286255, 1578479213.0481741], “i”

{“v”: 0.1, “r”: [[8.9185654084282e-06], 0, 5.638719081878662, 1557124607.3493454], “i”: [│ {“v”: 0.1, “r”: [[3.9061464433796365e-06], 0, 1.9471542835235596, 1578481321.9651315], "

{“v”: 0.1, “r”: [[2.3905133306402747e-05], 0, 9.28469204902649, 1557125807.656742], “i”: │ {“v”: 0.1, “r”: [[1.8082496841155234e-05], 0, 3.263965368270874, 1578483993.1352448], "i

{“v”: 0.1, “r”: [[4.46257793120303e-06], 0, 5.401535511016846, 1557127366.481157], “i”: [│ {“v”: 0.1, “r”: [[3.6464536667874836e-06], 0, 1.8057799339294434, 1578487042.5014699], "

{“v”: 0.1, “r”: [[1.577759098217875e-05], 0, 18.750241994857788, 1557128877.2554607], "i"│ {“v”: 0.1, “r”: [[1.154055362727663e-05], 0, 2.2011630535125732, 1578490298.6760223], "i

{“v”: 0.1, “r”: [[4.082864761074439e-06], 0, 35.93831777572632, 1557130264.4687445], “i”:│ {“v”: 0.1, “r”: [[3.6962382330306295e-06], 0, 1.8576269149780273, 1578493364.439681], "i

{“v”: 0.1, “r”: [[2.864121934161341e-05], 0, 24.57156205177307, 1557131686.2528434], “i”:│ {“v”: 0.1, “r”: [[2.1775893125783536e-05], 0, 2.5706865787506104, 1578495158.0991685], "

{“v”: 0.1, “r”: [[3.5151498093213187e-06], 0, 27.85609459877014, 1557133048.0614216], "i"│ {“v”: 0.1, “r”: [[3.7069885571309425e-06], 0, 1.8315861225128174, 1578497518.9582584], "

{“v”: 0.1, “r”: [[2.0601218188258213e-05], 0, 3.117086887359619, 1557134090.3906138], "i"│ {“v”: 0.1, “r”: [[1.5957815344293542e-05], 0, 2.5050199031829834, 1578499825.8923466], "

{“v”: 0.1, “r”: [[3.189113317964178e-06], 0, 33.99994683265686, 1557135397.6238716], “i”:│ {“v”: 0.1, “r”: [[3.594054648184421e-06], 0, 2.2073705196380615, 1578502196.904582], “i”

{“v”: 0.1, “r”: [[3.773763388494878e-05], 0, 36.667195320129395, 1557136513.0227191], "i"│ {“v”: 0.1, “r”: [[3.099785244774477e-05], 0, 1.8263132572174072, 1578504357.696405], “i”

{“v”: 0.1, “r”: [[3.2446967265672527e-06], 0, 10.421549797058105, 1557137769.7692177], "i│ {“v”: 0.1, “r”: [[3.6159243494874748e-06], 0, 1.870417594909668, 1578506905.0080385], "i

{“v”: 0.1, “r”: [[4.1641094791666666e-05], 0, 18.9304940700531, 1557139387.259979], “i”: │ {“v”: 0.1, “r”: [[2.4386994165694284e-05], 0, 2.4521265029907227, 1578509300.884637], "i

{“v”: 0.1, “r”: [[3.166289999868194e-06], 0, 22.750754833221436, 1557140479.9633937], "i"│ {“v”: 0.1, “r”: [[3.6789659826638027e-06], 0, 2.034290313720703, 1578510777.7544754], "i

{“v”: 0.1, “r”: [[8.745910212919524e-05], 0, 27.56096053123474, 1557141756.825347], “i”: │ {“v”: 0.1, “r”: [[5.1079029651425985e-05], 0, 1.908388614654541, 1578512517.3867476], "i

{“v”: 0.1, “r”: [[3.8376435901534526e-05], 0, 8.293994665145874, 1557143148.589255], “i”:│ {“v”: 0.1, “r”: [[2.2977878777589136e-05], 0, 8.308621644973755, 1578514626.8704207], "i
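A quick way to read the comparison above is to compute the ratio per task. The sketch below uses only the first three rows, copied by hand; the rest can be added the same way. As far as I can tell, the first number in "r" is the measured run time of that single kernel in seconds, not of the whole forward pass, so summing these numbers is only a rough indicator: a task can appear in several layers, and ops that were never tuned do not show up in the log at all.

```python
# (680 fps log, my 530 fps log): kernel cost in seconds,
# copied from the first three rows of the comparison above.
PAIRS = [
    (7.296479683567062e-06, 6.209928335255671e-06),
    (7.25566608118034e-06, 3.957661777224805e-06),
    (1.8337903959294006e-05, 9.10231216983461e-06),
]


def speedups(pairs):
    """Ratio reference/mine per task; above 1 means my kernel is faster."""
    return [ref / mine for ref, mine in pairs]


for (ref, mine), ratio in zip(PAIRS, speedups(PAIRS)):
    print(f"{ref * 1e6:8.2f} us vs {mine * 1e6:8.2f} us  ratio {ratio:.2f}")

print("sum ref :", sum(ref for ref, _ in PAIRS))
print("sum mine:", sum(mine for _, mine in PAIRS))
```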

You can use graph runtime debug mode to dump a breakdown of each CUDA function generated from each op and analyze the bottleneck. You could refer to a previous response for enabling graph runtime debugger (the topic is for CPU, but the profiling approach is the same for all platforms): Profiling a TVM run on CPU

My workmate auto-tuned mobilenet about 6 months ago and got a tuned model and a Mobilenet.log with 20 lines, one line per task.

I recently auto-tuned the model myself and got a slower model, with a Mobilenet.log as well.

Mine takes 2 ms per run and his takes 1.4 ms (500 fps vs 680 fps), so mine is much slower.

But when I compared the tuning logs, I could see that on most of the ops mine is faster.

And when I compiled the model with his earlier log (using autotvm.apply_history_best), the resulting model is as slow as mine.

Can you help analyze this?

1. Does the Mobilenet.log generated after the tuning process contain the full information of this auto-tuning? If I get a log and just apply it directly, I should expect the same performance as the old tuned model, right?

2. There is no exception when I apply the log. If the new version of TVM finds the log incompatible, will it use a random config, throw an error, or return something? It just runs OK and generates a model, but the model is slow, just the same as mine: 2 ms.

I guess the problem is in the part of TVM that applies the log, because I used a totally different log and got a model with the same performance.

Any help? Thank you!

The log comparison is below:

There is something wrong with the code I cloned a month ago.

I pulled the newest code and it conflicts with TensorFlow 1.2.

I tried some old versions and found that tag v0.6.0 works.

I used v0.6.0, applied the log I auto-tuned before, compiled, and got a mobilenet model running at up to 840 fps. That’s much better than the previous 530 fps with the same mobilenet.log.

So there seems to be something wrong with the apply_history API?

Anyway, I gained some experience solving this problem. I’ll keep this copy of TVM until a more stable version comes out.

Have you used OpenCV in your project? I think apply_history should be ok.

It seems there is no OpenCV.

Let me describe it briefly.

1. I auto-tuned mobilenet with an older version about 1 month ago, and I got a slow model (530 fps) and a mobilenet.log containing the configurations for each task.

2. I used apply_history to simply compile the model with the mobilenet.log, and the model was still at 530 fps.

3. I built the v0.6.0 tag and used the mobilenet.log to compile the model again, and got 840 fps.

Step 2 and 3 have no auto-tuning process.