autotvm.RPCRunner and TVM_NUM_THREADS


#1

I noticed that the tune_relay_x86.py tutorial recommends setting the TVM_NUM_THREADS environment variable to the number of physical CPU cores on your machine to speed up tuning.

Also, the x86 tutorial uses autotvm.LocalRunner, not RPC.

Is it possible to use multiple threads with autotvm.RPCRunner?
If yes, how exactly?


#2

Hi,

I have a related question. I have seen that, besides speeding up tuning, this variable also affects inference time. What is its default value?

The problem is that I have been taking measurements on a quad-core machine, but the goal is to use a single core in the final deployment. Apparently my initial measurements reflect multicore performance, and I was not aware of that.

The point is that this variable does not only speed up tuning; it also changes the number of cores the runtime uses during inference.

In addition, is it fine to use more cores during tuning and fewer cores for deployment? Are the tuning results still valid for fewer cores?


#3

The most straightforward way is to set this environment variable on the remote machine.


#4

Tuning on multiple cores and deploying on fewer cores will still achieve decent performance. However, tuning on a single core sometimes gives better performance when deploying on a small number of cores than tuning on multiple cores does. The performance difference usually won’t be more than 10%.


#5

So, to get the best performance, should I tune with the actual number of cores used in deployment, or with a single core?


#6

If you have a larger time budget, you can tune with a single core, which should give performance no worse than tuning on multiple cores.


#7

Do you mean I need to set TVM_NUM_THREADS before running the RPC server on the edge device?

export TVM_NUM_THREADS=4
python3 -m tvm.exec.rpc_server --tracker=... --key=... --no-fork

Do I also need to export TVM_NUM_THREADS on the main host where I run the autotuning code (tune_kernel, tune_graph)?


#8

You don’t need to export it on the host, just on the remote.
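
The remote-side setup discussed above could also be scripted. This is only a minimal sketch: `rpc_server_launch` is a hypothetical helper, not part of TVM, and the tracker address and device key are placeholders you must supply yourself. It builds the same command line as the shell snippet in post #7 and puts TVM_NUM_THREADS into the environment of the server process, since that is the process whose runtime executes the kernels.

```python
import os
import sys


def rpc_server_launch(num_threads, tracker, key, base_env=None):
    """Return (argv, env) for starting the TVM RPC server with a fixed thread count.

    Hypothetical helper for illustration; `tracker` and `key` are whatever
    values you already pass to --tracker and --key.
    """
    # Copy the environment so the caller's own process is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # The variable must be set in the server's environment, not on the tuning host.
    env["TVM_NUM_THREADS"] = str(num_threads)
    argv = [
        sys.executable, "-m", "tvm.exec.rpc_server",
        f"--tracker={tracker}",
        f"--key={key}",
        "--no-fork",
    ]
    return argv, env


# Usage on the edge device (placeholders, not real values):
#   import subprocess
#   argv, env = rpc_server_launch(4, tracker="<host:port>", key="<device-key>")
#   subprocess.run(argv, env=env, check=True)
```

Returning the command and environment instead of starting the process keeps the helper easy to inspect before anything is launched.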


#9

On remote:

  • I think setting TVM_NUM_THREADS on the remote edge device (the RPC side) makes no sense; by my logic it doesn’t help, and I can’t find it anywhere in the RPC code. The device receives one sample kernel, tests it (using the multicore CPU or the GPU), then sends the measurement results back. I can’t see what could run in parallel on the RPC side, in the code or otherwise. The kernel under test may itself run in parallel, but only one kernel (test case) runs on the device at a time.
  • If you want a parallel search over remote RPC, you have to use multiple physical devices. Every device registered with the tracker receives test kernels at the same time, so N devices reduce the tuning time to roughly Time / N.

On host:

  • Yes, it matters a lot. You can observe it during the XGBoost steps (internal xgb feature re-processing):
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                        
19199 cbalint   20   0   24.2g 700932 112244 R 181.8   4.3   0:30.97 tune-mali.py                                                                   
19198 cbalint   20   0   24.2g 700808 112120 R 154.5   4.3   0:28.81 tune-mali.py                                                                   
19197 cbalint   20   0   24.2g 700932 112244 R 136.4   4.3   0:31.74 tune-mali.py                                                                   
19193 cbalint   20   0   24.2g 700796 112108 R 127.3   4.3   0:29.89 tune-mali.py                                                                   
19195 cbalint   20   0   24.2g 700916 112228 S 127.3   4.3   0:30.40 tune-mali.py                                                                   
19194 cbalint   20   0   24.2g 700872 112184 S  18.2   4.3   0:29.98 tune-mali.py                                                                   
19196 cbalint   20   0   24.2g 700924 112236 S   9.1   4.3   0:33.25 tune-mali.py                                                                                  

I think the tutorial sets it to 1 (a safe demo value for any target).


#10

Setting TVM_NUM_THREADS before starting the RPC server on the edge device does not affect the speed of the tuning process. It does affect the model evaluation time (ms).

Below are 3 examples of tuning with different TVM_NUM_THREADS values: unset, 1, and 4.
As you can see, the elapsed time is the same in all cases, about 122 s:

#TVM_NUM_THREADS - time
unset - [Task  1/20]  Current/Best:    0.42/   1.95 GFLOPS | Progress: (48/88) | 121.26 s
1     - [Task  1/20]  Current/Best:    0.73/   1.45 GFLOPS | Progress: (48/88) | 122.30 s
4     - [Task  1/20]  Current/Best:    1.07/   2.30 GFLOPS | Progress: (48/88) | 121.97 s

But setting TVM_NUM_THREADS before running the RPC server makes a difference for model evaluation:

#TVM_NUM_THREADS - time
unset - Mean inference time (std dev): 203.02 ms (0.05 ms)  (top + 1 shows that runtime uses 2 cores out of 4)
1 - Mean inference time (std dev): 394.82 ms (0.03 ms) 
4 - Mean inference time (std dev): 104.85 ms (0.03 ms) 
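
To put the evaluation numbers above in perspective, here is a small sketch that computes the speedup of each setting relative to the single-thread run. It only uses the three mean times reported in this post; the function name is made up for illustration.

```python
# Mean inference times (ms) from the measurements above, keyed by TVM_NUM_THREADS.
mean_ms = {"unset": 203.02, "1": 394.82, "4": 104.85}


def speedup(setting, baseline="1"):
    """Speedup of `setting` relative to the `baseline` setting."""
    return mean_ms[baseline] / mean_ms[setting]


for setting in ("unset", "4"):
    print(f"TVM_NUM_THREADS={setting}: {speedup(setting):.2f}x vs. one thread")
```

The unset case comes out close to 2x, which is consistent with the observation that the runtime used 2 of the 4 cores, and the 4-thread case comes out a little under 4x.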

#11

In autotvm.measure_option, maybe we should add an option named thread_mod that calls runtime.config_threadpool, so that we can control how many cores are used and whether they are big or little CPU cores. CC @merrymercy


#12

@apivovarov,

You are right.

  • I have not looked into the exported shared object on the remote, but it seems that the runtime depends on TVM_NUM_THREADS.
  • That’s strange. I thought kernels were exported already compiled and configured with a preset thread count (which seems the most logical to me), so that they would spread across the remote cores on their own.
  • I have been tuning a lot with OpenCL (remote), and I think the GPU is not influenced by TVM_NUM_THREADS.

@FrozenGene,

  • Yes, it is a good idea to add such an explicit option; relying on TVM_NUM_THREADS on the remote side is confusing.
  • Or perhaps num_cores could be queried beforehand by a small function (cpuinfo-like).

#13

Hmm… I don’t think querying num_cores beforehand matters much, because when thread_mod is not set, all cores are used. Users could set thread_mod to (kBig, 2) (note: kBig is 1 and kLittle is -1, as described in runtime.config_threadpool). Imagine our board has 2xA72 + 4xA53: AutoTVM would then use the 2 big CPUs (2xA72). With (kLittle, 1), AutoTVM would use 1 little CPU (1xA53). num_cores could be used just to check whether the requested number makes sense.
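
As a rough illustration of the (mode, nthreads) idea, the sketch below looks up `runtime.config_threadpool` on a remote session and calls it with the two values. Everything here is an assumption layered on this post: `configure_threadpool` is a hypothetical helper, `remote` is assumed to be an RPC session object exposing `get_function`, and the packed function is assumed to accept exactly (mode, nthreads) with kBig = 1 and kLittle = -1 as stated above.

```python
K_BIG = 1      # big-core mode, value taken from the post above
K_LITTLE = -1  # little-core mode, value taken from the post above


def configure_threadpool(remote, mode, nthreads):
    """Ask the remote runtime to run on `nthreads` cores of the given kind.

    `remote` is assumed to be an RPC session with a `get_function` method.
    """
    if mode not in (K_BIG, K_LITTLE):
        raise ValueError("mode must be K_BIG (1) or K_LITTLE (-1)")
    if nthreads < 1:
        raise ValueError("nthreads must be at least 1")
    # Look the function up on the remote side, so the setting applies there.
    config = remote.get_function("runtime.config_threadpool")
    config(mode, nthreads)


# On the hypothetical 2xA72 + 4xA53 board:
#   configure_threadpool(remote, K_BIG, 2)     # the two A72 cores
#   configure_threadpool(remote, K_LITTLE, 1)  # one A53 core
```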


#14

See now:

And an example of how to configure the remote from the local side: