How to improve the auto-tune performance?

Well… I didn’t express myself well: mobilenet actually has 20 tasks to tune. It seems I’d better follow your advice.

Another question: how can I use transfer learning with TVM’s pre-configured tuning results for mobilenet?

It seems I need to find the corresponding config in TopHub and apply it to the tuner, but I have no idea how to actually do that.

Do I have to read the source code and trace how the tuner finds the match for each convolution?

I noticed that during auto-tuning my GPU power usage never reaches the 250 W limit; the highest is about 130 W. Does that matter?

I asked the questions above about tuning mobilenet. At the same time, I’m using another PC with a V100 GPU to auto-tune resnet50.

I set min_repeat_ms = 3000, but the GFLOPS is always 0, the GPU stays in a low power state, and GPU usage never goes above 10%.

Is that normal?

Here is my configuration:

    from tvm import autotvm

    tuning_option = {  # enclosing dict name assumed; only 'measure_option' was shown
        'measure_option': autotvm.measure_option(
            builder=autotvm.LocalBuilder(timeout=10),
            runner=autotvm.RPCRunner(
                'V100',            # device key registered with the RPC tracker
                '0.0.0.0', 9190,   # tracker host and port
                number=5, repeat=1, timeout=10, min_repeat_ms=3000)
        ),
    }
  • Does “pre-configured tuning results” mean TopHub? If so, AutoTVM currently won’t use TopHub as training data automatically (although it’s a reasonable idea). You have to manually download the TopHub file (it may already have been downloaded to ~/.tvm) and just treat it as a tuning log as usual (see the sketch after this list).

  • It’s fine. AutoTVM will use the GPU for measurement, but since op latency is usually short (less than a millisecond), GPU utilization will not be that high.

  • See my reply at Why auto tune always 0.0GFLPS?
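For the first bullet, something like the following should work: a minimal sketch of seeding the tuner with an existing log before tuning, assuming `tasks` is the list of tasks already extracted for mobilenet, `tuning_option` is the dict shown above, and the TopHub file name/path under ~/.tvm/tophub is only an example.

    from tvm import autotvm
    from tvm.autotvm.tuner import XGBTuner

    # An existing log to learn from: an old mobilenet.log, or the TopHub file
    # already downloaded under ~/.tvm/tophub (exact file name may differ).
    history_log = '/home/user/.tvm/tophub/cuda_v0.04.log'

    for task in tasks:  # tasks extracted from the mobilenet model beforehand
        tuner = XGBTuner(task, loss_type='rank')
        # Seed the cost model with the previous records (transfer learning),
        # then tune as usual.
        tuner.load_history(autotvm.record.load_from_file(history_log))
        tuner.tune(n_trial=1000,
                   measure_option=tuning_option['measure_option'],
                   callbacks=[autotvm.callback.log_to_file('mobilenet_tuned.log')])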

Sorry to bother you again… but I’m really confused.

After I changed my LLVM version, the mobilenet tuning seems better, and the GFLOPS for each task is higher.

But when all the tuning is done and I run inference with the tuned model, the performance doesn’t get any better.

Are there any tools to check this?

I want to compare my tuned model with another, faster tuned model (tuned by someone else from the same .pb file) to see where the difference is.

Specifically, I want to print out the cost of each op and check which op causes the performance difference.

Here I compared the two logs of the 20 tuning tasks of mobilenet; each row below shows the two records for one task side by side. The left one is from the 680 fps model and the right one is from my 530 fps model.

On most ops mine seems faster, so why does the whole model get worse?

This float number, 7.296479683567062e-06, means the forward cost, doesn’t it?

    {"v": 0.1, "r": [[7.296479683567062e-06], 0, 31.188103199005127, 1557114554.860901], "i":│ {"v": 0.1, "r": [[6.209928335255671e-06], 0, 1.7554395198822021, 1578469121.1661108], "i
    {"v": 0.1, "r": [[7.25566608118034e-06], 0, 24.90612006187439, 1557116969.1149082], "i": │ {"v": 0.1, "r": [[3.957661777224805e-06], 0, 1.8577919006347656, 1578471887.4135346], "i
    {"v": 0.1, "r": [[1.8337903959294006e-05], 0, 19.064162254333496, 1557119485.236], "i": [│ {"v": 0.1, "r": [[9.10231216983461e-06], 0, 1.9765150547027588, 1578474854.435663], "i":
    {"v": 0.1, "r": [[1.6040891496989785e-05], 0, 16.555699586868286, 1557120860.9099872], "i│ {"v": 0.1, "r": [[4.405155762965572e-06], 0, 1.5967164039611816, 1578477002.8818524], "i
    {"v": 0.1, "r": [[1.3899782663896583e-05], 0, 14.676212310791016, 1557123288.6902747], "i│ {"v": 0.1, "r": [[1.049456236086878e-05], 0, 3.158353090286255, 1578479213.0481741], "i"
    {"v": 0.1, "r": [[8.9185654084282e-06], 0, 5.638719081878662, 1557124607.3493454], "i": [│ {"v": 0.1, "r": [[3.9061464433796365e-06], 0, 1.9471542835235596, 1578481321.9651315], "
    {"v": 0.1, "r": [[2.3905133306402747e-05], 0, 9.28469204902649, 1557125807.656742], "i": │ {"v": 0.1, "r": [[1.8082496841155234e-05], 0, 3.263965368270874, 1578483993.1352448], "i
    {"v": 0.1, "r": [[4.46257793120303e-06], 0, 5.401535511016846, 1557127366.481157], "i": [│ {"v": 0.1, "r": [[3.6464536667874836e-06], 0, 1.8057799339294434, 1578487042.5014699], "
    {"v": 0.1, "r": [[1.577759098217875e-05], 0, 18.750241994857788, 1557128877.2554607], "i"│ {"v": 0.1, "r": [[1.154055362727663e-05], 0, 2.2011630535125732, 1578490298.6760223], "i
    {"v": 0.1, "r": [[4.082864761074439e-06], 0, 35.93831777572632, 1557130264.4687445], "i":│ {"v": 0.1, "r": [[3.6962382330306295e-06], 0, 1.8576269149780273, 1578493364.439681], "i
    {"v": 0.1, "r": [[2.864121934161341e-05], 0, 24.57156205177307, 1557131686.2528434], "i":│ {"v": 0.1, "r": [[2.1775893125783536e-05], 0, 2.5706865787506104, 1578495158.0991685], "
    {"v": 0.1, "r": [[3.5151498093213187e-06], 0, 27.85609459877014, 1557133048.0614216], "i"│ {"v": 0.1, "r": [[3.7069885571309425e-06], 0, 1.8315861225128174, 1578497518.9582584], "
    {"v": 0.1, "r": [[2.0601218188258213e-05], 0, 3.117086887359619, 1557134090.3906138], "i"│ {"v": 0.1, "r": [[1.5957815344293542e-05], 0, 2.5050199031829834, 1578499825.8923466], "
    {"v": 0.1, "r": [[3.189113317964178e-06], 0, 33.99994683265686, 1557135397.6238716], "i":│ {"v": 0.1, "r": [[3.594054648184421e-06], 0, 2.2073705196380615, 1578502196.904582], "i"
    {"v": 0.1, "r": [[3.773763388494878e-05], 0, 36.667195320129395, 1557136513.0227191], "i"│ {"v": 0.1, "r": [[3.099785244774477e-05], 0, 1.8263132572174072, 1578504357.696405], "i"
    {"v": 0.1, "r": [[3.2446967265672527e-06], 0, 10.421549797058105, 1557137769.7692177], "i│ {"v": 0.1, "r": [[3.6159243494874748e-06], 0, 1.870417594909668, 1578506905.0080385], "i
    {"v": 0.1, "r": [[4.1641094791666666e-05], 0, 18.9304940700531, 1557139387.259979], "i": │ {"v": 0.1, "r": [[2.4386994165694284e-05], 0, 2.4521265029907227, 1578509300.884637], "i
    {"v": 0.1, "r": [[3.166289999868194e-06], 0, 22.750754833221436, 1557140479.9633937], "i"│ {"v": 0.1, "r": [[3.6789659826638027e-06], 0, 2.034290313720703, 1578510777.7544754], "i
    {"v": 0.1, "r": [[8.745910212919524e-05], 0, 27.56096053123474, 1557141756.825347], "i": │ {"v": 0.1, "r": [[5.1079029651425985e-05], 0, 1.908388614654541, 1578512517.3867476], "i
    {"v": 0.1, "r": [[3.8376435901534526e-05], 0, 8.293994665145874, 1557143148.589255], "i":│ {"v": 0.1, "r": [[2.2977878777589136e-05], 0, 8.308621644973755, 1578514626.8704207], "i
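For reference, the "r" field of each record is [costs, error_no, all_cost, timestamp], and the first float is the measured run time in seconds of that single task’s kernel. A minimal sketch of decoding one line (the "i" field is truncated above, so an empty placeholder is used here):

    import json

    line = ('{"v": 0.1, "r": [[7.296479683567062e-06], 0, '
            '31.188103199005127, 1557114554.860901], "i": []}')
    record = json.loads(line)
    costs, error_no, all_cost, timestamp = record["r"]
    print("mean measured kernel time (s):", sum(costs) / len(costs))
    print("total build+measure wall time (s):", all_cost)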

You can use the graph runtime debug mode to dump a breakdown of each CUDA function generated from each op and analyze the bottleneck. You could refer to a previous response on enabling the graph runtime debugger (that topic is for CPU, but the profiling approach is the same on all platforms): Profiling a TVM run on CPU
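A minimal sketch of using the debug graph runtime to get the per-op breakdown (assumes TVM was built with USE_GRAPH_RUNTIME_DEBUG ON and that `graph`, `lib`, and `params` come from relay.build; the input tensor name and data are assumptions):

    import tvm
    from tvm.contrib.debugger import debug_runtime

    ctx = tvm.gpu(0)
    # Same interface as the normal graph runtime; per-op timings and traces
    # are written under dump_root and a summary table is printed on run().
    m = debug_runtime.create(graph, lib, ctx, dump_root='/tmp/tvmdbg')
    m.set_input(**params)
    m.set_input('input', input_data)  # input tensor name/data are assumptions
    m.run()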

My workmate auto-tuned mobilenet about 6 months ago and got a tuned model and a Mobilenet.log with 20 lines, one line per task.

I recently auto-tuned the model myself, got a slow model, and have my own Mobilenet.log as well.

Mine runs in 2 ms per inference and his runs in 1.4 ms (500 fps vs. 680 fps), so mine is much slower.

But when I compare the tuning logs, I can see that on most of the ops mine is faster.

And when I compile the model with his earlier log (using autotvm.apply_history_best), his model turns out as slow as mine.

Can you help analyze this?

1. Does the Mobilenet.log generated by the tuning process contain the full information from that auto-tuning? If I have a log and just apply it directly, I should get the same performance as the old tuned model, right?

2. There is no exception when I apply the log. If the new-version TVM finds the log incompatible, will it fall back to a random config, throw an error, or return something? It just runs fine and generates a model, but the model is slow, the same as mine: 2 ms.

I guess the problem is in the part of TVM that applies the log, because I used a totally different log and got a model with the same performance.

Any help? Thank you!

The log comparison is shown above (the 20 side-by-side records).

There seems to be something wrong with the TVM code I cloned a month ago.

I pulled the newest code, and it conflicts with TensorFlow 1.12.

I tried some older versions and found that the v0.6.0 tag works.

I used v0.6.0, applied the log I had auto-tuned before, compiled, and got a mobilenet model running at up to 840 fps. That’s much better than the previous 530 fps with the same mobilenet.log.

So it seems there is something wrong with the apply_history API?

Anyway, I gained some experience solving this problem. I’ll keep this copy of TVM until a more stable version comes out.

Have you used OpenCV in your project? I think apply_history should be ok.

No, it seems OpenCV is not used.

Let me describe it shortly.

1. I auto-tuned mobilenet with an older TVM version about 1 month ago, and got a slow model (530 fps) and a mobilenet.log containing a configuration for each task.

2. I used apply_history to simply compile the model with that mobilenet.log, and the model was still 530 fps.

3. I built the v0.6.0 tag, used the same mobilenet.log to compile the model again, and got 840 fps.

Steps 2 and 3 involve no auto-tuning at all; they only recompile with the existing log, roughly as sketched below.
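For reference, the log-only recompile of steps 2 and 3 looks roughly like this (a minimal sketch; loading the .pb into `mod`/`params` via the TensorFlow frontend is assumed and elided):

    import tvm
    from tvm import autotvm, relay

    # mod, params assumed to come from relay.frontend.from_tensorflow(...)
    target = 'cuda'
    with autotvm.apply_history_best('mobilenet.log'):
        with relay.build_config(opt_level=3):
            graph, lib, params = relay.build(mod, target=target, params=params)
    lib.export_library('deploy_lib.so')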

Besides apply_history, another possible reason is that the schedule has changed. Even if you reuse the old tuning log, you could still get better performance.

I have another log and lib running at 680 fps, generated by my workmate 5 months ago.

With the 1-month-old code, I applied history with the 680 fps log and still got 530 fps from the resulting deploy_*.so and .params files. That is not normal, right?

Then I checked out tag v0.6.0 and recompiled the model with the 680 fps log, and got 771.36 fps. That should be the result of the schedule change.

The contradiction is as follows:

The slower log produced 680 fps with the TVM of about 5 months ago.

Using the 1-month-old TVM, the faster log only gets 530 fps, and the slower log also gets 530 fps.

Using the v0.6.0 tag, they get 771 and 840 fps respectively.

I think it should be the schedule change. I can’t be sure whether it is caused by this PR: https://github.com/apache/incubator-tvm/pull/4511. Asymmetric padding makes the padding in our workload 4D now (previously it was 2D), so a previous log may not work correctly. Have you tried the latest TVM (tune from scratch and run)? It should be ok.
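To make the 2D vs. 4D point concrete (illustrative values only, not copied from a real log): the conv2d workload serialized in each record stores a padding entry, and after PR #4511 it has four values instead of two, so old records no longer match the workloads extracted by newer code and the tuned configs are not picked up.

    # Illustrative only: the padding entry inside a serialized conv2d workload.
    old_padding = (1, 1)        # before PR #4511: symmetric 2-value padding
    new_padding = (1, 1, 1, 1)  # after PR #4511: asymmetric 4-value padding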

I am using v0.6.0, and its most recent commit is:

    commit c6f8c23c349f3ef8bacceaf3203f7cc08e6529de
    Author: Thierry Moreau <moreau@uw.edu>
    Date:   Tue Nov 26 19:21:56 2019 -0800

        [VTA][HotFix] Relay->VTA quantization fix (#4433)

        * relay -> vta fix

        * setting optlevel to 3 for quantization to fold batchnorm

That PR seems to be from only 7 days ago, so it should have no relation to our discussion.

I cloned the latest code from GitHub, and it crashes with the log below. I’m using TF 1.12.0. Any advice?

    Traceback (most recent call last):
      File "from_tf.py", line 17, in <module>
        import tvm.relay.testing.tf as tf_testing
      File "/home/cephfs/data/tvm/python/tvm/relay/testing/tf.py", line 34, in <module>
        tf_compat_v1 = tf.compat.v1
    AttributeError: module 'tensorflow._api.v1.compat' has no attribute 'v1'

This error should not be related to our issue. You could work around it by just replacing tf.py with the v0.6.0 version and then proceed.
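An alternative minimal patch (a sketch, not the official fix) is to guard the tf.compat.v1 access in python/tvm/relay/testing/tf.py so that older TensorFlow releases such as 1.12 still import:

    import tensorflow as tf

    try:
        tf_compat_v1 = tf.compat.v1  # newer TF exposes the v1 compat module
    except AttributeError:
        tf_compat_v1 = tf            # older TF (e.g. 1.12) has no tf.compat.v1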

1. You mean to replace tf.py with the v0.6.0 version and otherwise use the latest code? I tried this, and it seems to be ok.

2. If I use the latest code, is the older log no longer usable?

I used the latest code to apply the older log, and it’s 530 fps again. That same log gives 840 fps with v0.6.0.

Also, when I run the latest version to build a model with the pre-defined configuration from TopHub, it crashes:

    Traceback (most recent call last):
      File "from_tf.py", line 445, in <module>
        tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
      File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
        _sys.exit(main(argv))
      File "from_tf.py", line 395, in main
        eval_tvm(image)
      File "from_tf.py", line 189, in eval_tvm
        module.run()
      File "/home/hadoop-hdp/cephfs/data/yuweilong/tvm/python/tvm/contrib/graph_runtime.py", line 169, in run
        self._run()
      File "/home/hadoop-hdp/cephfs/data/yuweilong/tvm/python/tvm/_ffi/_ctypes/function.py", line 207, in __call__
        raise get_last_ffi_error()
    tvm._ffi.base.TVMError: Traceback (most recent call last):
      [bt] (8) /home/hadoop-hdp/cephfs/data/yuweilong/tvm/build/libtvm.so(TVMFuncCall+0x61) [0x7fef40a1ca61]
      [bt] (7) /home/hadoop-hdp/cephfs/data/yuweilong/tvm/build/libtvm.so(tvm::runtime::GraphRuntime::Run()+0x47) [0x7fef40a66167]
      [bt] (6) /home/hadoop-hdp/cephfs/data/yuweilong/tvm/build/libtvm.so(+0x12040d7) [0x7fef40a660d7]
      [bt] (5) /home/hadoop-hdp/cephfs/data/yuweilong/tvm/build/libtvm.so(+0x11a6a47) [0x7fef40a08a47]
      [bt] (4) /home/hadoop-hdp/cephfs/data/yuweilong/dl-benchmark/models/MobilenetV1/deploy_lib.tar.so(fused_nn_conv2d_19+0x2fe) [0x7fef30261d0e]
      [bt] (3) /home/hadoop-hdp/cephfs/data/yuweilong/dl-benchmark/models/MobilenetV1/deploy_lib.tar.so(+0x7168) [0x7fef30262168]
      [bt] (2) /home/hadoop-hdp/cephfs/data/yuweilong/tvm/build/libtvm.so(TVMBackendGetFuncFromEnv+0x60) [0x7fef40a1c970]
      [bt] (1) /home/hadoop-hdp/cephfs/data/yuweilong/tvm/build/libtvm.so(tvm::runtime::ModuleNode::GetFuncFromEnv(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x3da) [0x7fef40a1701a]
      [bt] (0) /home/hadoop-hdp/cephfs/data/yuweilong/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7fef401f8662]
      File "/home/hadoop-hdp/cephfs/data/yuweilong/tvm/src/runtime/module.cc", line 123
      File "/home/hadoop-hdp/cephfs/data/yuweilong/tvm/src/runtime/library_module.cc", line 91
    TVMError: Check failed: ret == 0 (-1 vs. 0) : Check failed: f != nullptr: Cannot find function tvm.contrib.cudnn.conv2d.forward in the imported modules or global registry

2. If I use the latest code, is the older log no longer usable?

Yes, because of the asymmetric padding PR I mentioned: the padding becomes 4D, so you had better re-tune.

Also, when I run the latest version to build a model with the pre-defined configuration from TopHub, it crashes.

The error is likely because you didn’t enable cuDNN when building TVM.
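As an illustration of where such a call can come from (the target strings below are assumptions, not taken from your script): a module compiled with the cuDNN contrib target calls tvm.contrib.cudnn.* at run time, which requires libtvm.so to be built with USE_CUDNN enabled, whereas a plain CUDA target has no such dependency.

    # Illustrative targets only (assumption about the compile script):
    target_with_cudnn = 'cuda -libs=cudnn'  # built lib calls tvm.contrib.cudnn.*,
                                            # so TVM must be built with USE_CUDNN ON
    target_plain_cuda = 'cuda'              # no cuDNN dependency at run time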

Well, I don’t know why cuDNN is being called.

I’m using the following code:

When I compile mobilenet without auto-tuning, I can get the pre-defined config. This code works with an older version of TVM, and I can get a 720 fps mobilenet.

Anyway, I’d better use v0.6.0 for now. It’s too hard for me to debug these problems; I don’t understand TVM well enough yet.