autotvm.RPCRunner and TVM_NUM_THREADS

apivovarov · July 30, 2019, 1:57am

I noticed that the tune_relay_x86.py tutorial recommends setting the TVM_NUM_THREADS environment variable to the number of physical CPU cores on your machine to speed up tuning.

https://github.com/dmlc/tvm/blob/786c49f36da2368829667c51758fe4b9017dddbd/tutorials/autotvm/tune_relay_x86.py#L96


log_file = "%s.log" % model_name
graph_opt_sch_file = "%s_graph_opt.log" % model_name


# Set the input name of the graph
# For ONNX models, it is typically "0".
input_name = "data"


# Set number of threads used for tuning based on the number of
# physical CPU cores on your machine.
num_threads = 1
os.environ["TVM_NUM_THREADS"] = str(num_threads)




#################################################################
# Configure tensor tuning settings and create tasks
# -------------------------------------------------
# To get better kernel execution performance on x86 CPU,
# we need to change data layout of convolution kernel from
# "NCHW" to "NCHWc". To deal with this situation, we define
# conv2d_NCHWc operator in topi. We will tune this operator
# instead of plain conv2d.
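
The setting in the excerpt above can be sketched as a small helper. This is only an illustration: `set_tvm_num_threads` is a hypothetical name, not part of TVM, and the sketch assumes the variable is exported before the TVM runtime starts its thread pool. Note that `os.cpu_count()` reports logical CPUs, so on a machine with hyper-threading the physical core count has to be passed in by hand.

```python
import os


def set_tvm_num_threads(physical_cores=None):
    """Export TVM_NUM_THREADS and return the value that was set.

    Hypothetical helper for illustration; not part of TVM.
    """
    if physical_cores is None:
        # os.cpu_count() counts logical CPUs and may return None;
        # fall back to 1, the value the tutorial ships with.
        physical_cores = os.cpu_count() or 1
    if physical_cores < 1:
        raise ValueError("thread count must be at least 1")
    os.environ["TVM_NUM_THREADS"] = str(physical_cores)
    return physical_cores


# Example: a machine known to have 4 physical cores.
set_tvm_num_threads(4)
print(os.environ["TVM_NUM_THREADS"])
```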

Also, the x86 tutorial uses autotvm.LocalRunner, not autotvm.RPCRunner.

Is it possible to use multiple threads with autotvm.RPCRunner?
If yes, how exactly do I do it?
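
For context, a minimal sketch of what swapping LocalRunner for RPCRunner might look like on the host. The device key, tracker address, and the `number` and `timeout` values are placeholders chosen for illustration, and both helper names are hypothetical. Nothing in this sketch sets a thread count on the device.

```python
def rpc_runner_kwargs(device_key, tracker_host, tracker_port):
    """Collect the arguments passed to autotvm.RPCRunner (illustrative values)."""
    return {
        "key": device_key,
        "host": tracker_host,
        "port": tracker_port,
        "number": 5,     # runs averaged per measurement (placeholder)
        "timeout": 10,   # seconds before a measurement is abandoned (placeholder)
    }


def make_measure_option(device_key, tracker_host, tracker_port):
    # Imported lazily so rpc_runner_kwargs can be used without TVM installed.
    from tvm import autotvm

    runner = autotvm.RPCRunner(
        **rpc_runner_kwargs(device_key, tracker_host, tracker_port)
    )
    return autotvm.measure_option(builder=autotvm.LocalBuilder(), runner=runner)


# Placeholder key and tracker address.
kwargs = rpc_runner_kwargs("my-device", "0.0.0.0", 9190)
print(kwargs)
```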

tico · July 30, 2019, 6:13am

Hi,

I have a related question. I have seen that, besides speeding up tuning, this variable also affects inference time. What is its default value?

The problem is that I took my measurements on a quad-core machine, but the goal is to use a single core in the final deployment. Apparently my initial measurements reflect multi-core performance, which I was not aware of.

The point is that this variable is not only used to speed up tuning; it also changes the number of cores the runtime uses during inference.

In addition, is it fine to use more cores during tuning and fewer cores for deployment? Are the tuning results still valid for fewer cores?

kevinthesun · July 30, 2019, 6:11am

The most straightforward way is to set this environment variable on the remote machine.

kevinthesun · July 30, 2019, 6:15am

Tuning on multiple cores and deploying on fewer cores will still achieve decent performance. However, tuning on a single core sometimes gives better performance for deployment on a small number of cores than tuning on multiple cores. The performance difference usually won’t be more than 10%.

tico · July 30, 2019, 6:21am

So you mean that I should tune using the actual number of cores used for the deployment or using a single core to get the best performance results?

kevinthesun · July 30, 2019, 6:39am

If you have a larger time budget, you can tune with a single core, which should give results no worse than tuning on multiple cores.

apivovarov · July 30, 2019, 9:59pm

Do you mean I need to set TVM_NUM_THREADS before starting the RPC server on the edge device?

export TVM_NUM_THREADS=4
python3 -m tvm.exec.rpc_server --tracker=... --key=... --no-fork

Do I also need to export TVM_NUM_THREADS on the main host where I run the autotuning code (tune_kernel, tune_graph)?

kevinthesun · July 30, 2019, 10:16pm

You don’t need to export it on the host, just on the remote.
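
The same launch can be scripted from Python on the device. The sketch below only builds the command and environment; it reuses the flags quoted above, while the tracker address, device key, and the helper name `rpc_server_launch` are made up for illustration.

```python
import os
import sys


def rpc_server_launch(tracker, key, num_threads, base_env=None):
    """Return (command, environment) for starting the RPC server.

    Hypothetical helper; it does not start anything by itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    # The variable must be present in the server process's environment.
    env["TVM_NUM_THREADS"] = str(num_threads)
    cmd = [
        sys.executable, "-m", "tvm.exec.rpc_server",
        "--tracker=%s" % tracker,
        "--key=%s" % key,
        "--no-fork",
    ]
    return cmd, env


# Placeholder tracker address and device key.
cmd, env = rpc_server_launch("tracker-host:9190", "my-device", 4, base_env={})
print(env["TVM_NUM_THREADS"])
# On the device itself: subprocess.run(cmd, env=env, check=True)
```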

cbalint13 · July 31, 2019, 12:18am

On remote:

I think setting TVM_NUM_THREADS on the remote edge device (RPC) makes no sense; by my logic it doesn’t help. I also can’t find it anywhere in the RPC code. The device receives one sample kernel, tests it (using the multi-core CPU or GPU), then sends the measurement results back. I can’t see what could run in parallel on the RPC side, in the code or otherwise. The kernel under test may itself run parallelized, but only one kernel (test case) runs at a time on the edge device.
If you want parallel searching over remote RPC, you have to use multiple physical edge devices. Each one registered with the tracker receives test kernels at the same time, so N devices cut the tuning time to roughly Time / N.

On host:

Yes, it matters a lot. It can be observed during the xgboost steps (internal xgb feature re-processing):

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
19199 cbalint   20   0   24.2g 700932 112244 R 181.8   4.3   0:30.97 tune-mali.py
19198 cbalint   20   0   24.2g 700808 112120 R 154.5   4.3   0:28.81 tune-mali.py
19197 cbalint   20   0   24.2g 700932 112244 R 136.4   4.3   0:31.74 tune-mali.py
19193 cbalint   20   0   24.2g 700796 112108 R 127.3   4.3   0:29.89 tune-mali.py
19195 cbalint   20   0   24.2g 700916 112228 S 127.3   4.3   0:30.40 tune-mali.py
19194 cbalint   20   0   24.2g 700872 112184 S  18.2   4.3   0:29.98 tune-mali.py
19196 cbalint   20   0   24.2g 700924 112236 S   9.1   4.3   0:33.25 tune-mali.py

I think the tutorial sets it to 1 as a safe demo value for any target.

apivovarov · August 1, 2019, 2:10am

Setting TVM_NUM_THREADS before starting the RPC server on the edge device does not affect the speed of the tuning process. But it does affect the model evaluation time (ms).

Below are three examples of tuning with different TVM_NUM_THREADS values: unset, 1, and 4.
As you can see, the elapsed times are the same, about 122 s.

#TVM_NUM_THREADS - time
unset - [Task  1/20]  Current/Best:    0.42/   1.95 GFLOPS | Progress: (48/88) | 121.26 s
1     - [Task  1/20]  Current/Best:    0.73/   1.45 GFLOPS | Progress: (48/88) | 122.30 s
4     - [Task  1/20]  Current/Best:    1.07/   2.30 GFLOPS | Progress: (48/88) | 121.97 s
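
To put a number on "the same", the elapsed times can be pulled out of the three progress lines above and compared. This is a small sketch; the regular expression assumes the line ends with `| <seconds> s`, as in the lines quoted here.

```python
import re

# Progress lines copied from the three runs above.
LOG_LINES = {
    "unset": "[Task  1/20]  Current/Best:    0.42/   1.95 GFLOPS | Progress: (48/88) | 121.26 s",
    "1": "[Task  1/20]  Current/Best:    0.73/   1.45 GFLOPS | Progress: (48/88) | 122.30 s",
    "4": "[Task  1/20]  Current/Best:    1.07/   2.30 GFLOPS | Progress: (48/88) | 121.97 s",
}

# Matches the trailing "| 121.26 s" field.
ELAPSED = re.compile(r"\|\s*([0-9.]+) s$")


def elapsed_seconds(line):
    match = ELAPSED.search(line)
    if match is None:
        raise ValueError("no elapsed time in line: %r" % line)
    return float(match.group(1))


times = {setting: elapsed_seconds(line) for setting, line in LOG_LINES.items()}
# Relative gap between the slowest and fastest run.
spread = (max(times.values()) - min(times.values())) / min(times.values())
print(times)
print("relative spread: %.2f%%" % (100 * spread))
```

The spread comes out below 1%, so the thread setting on the device makes no visible difference to tuning time in these runs.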

But setting TVM_NUM_THREADS before starting the RPC server makes a difference for model evaluation:

#TVM_NUM_THREADS - time
unset - Mean inference time (std dev): 203.02 ms (0.05 ms)  (top, after pressing 1, shows that the runtime uses 2 cores out of 4)
1 - Mean inference time (std dev): 394.82 ms (0.03 ms) 
4 - Mean inference time (std dev): 104.85 ms (0.03 ms)
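
The scaling implied by these numbers can be worked out directly. The sketch below uses only the three mean inference times reported above, with the single-thread run as the baseline.

```python
# Mean inference times (ms) from the three runs above.
MEAN_MS = {"unset": 203.02, "1": 394.82, "4": 104.85}

baseline = MEAN_MS["1"]
speedup = {setting: baseline / ms for setting, ms in MEAN_MS.items()}

for setting, factor in speedup.items():
    print("TVM_NUM_THREADS=%-5s  %.2fx relative to one thread" % (setting, factor))
```

The unset run is roughly 1.9x faster than one thread, which matches the observation that it uses 2 cores, and the 4-thread run is roughly 3.8x faster.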

FrozenGene · August 1, 2019, 2:13am

In the AutoTVM measure_option, maybe we should add an option named thread_mod that calls runtime.config_threadpool, so we could control how many cores are used and whether they are big or little CPU cores. CC @merrymercy

cbalint13 · August 1, 2019, 2:48am

@apivovarov,

You are right.

I have not looked into the exported shared object on the remote, but it seems that the runtime depends on TVM_NUM_THREADS.
That’s strange. I thought kernels were exported already compiled and configured with a preset thread count (which seems the most logical to me), so that they would spread across the remote cores on their own.
I have been training a lot on OpenCL (remote), which I think is not influenced by TVM_NUM_THREADS on the GPU.

@FrozenGene,

Yes, it is a good idea to add such an explicit option; TVM_NUM_THREADS on the remote side is confusing.
Or perhaps num_core could be queried beforehand by a small function (cpuinfo-like).

FrozenGene · August 1, 2019, 3:04am

Hmm… I don’t think querying num_cores beforehand adds much, because when we don’t set thread_mod, all cores are used. Users could then set thread_mod to (kBig, 2), and AutoTVM would use the 2 big CPUs (2xA72); with (kLittle, 1), AutoTVM would use 1 little CPU (1xA53). (Note: kBig is 1 and kLittle is -1, as described in runtime.config_threadpool; imagine our board has 2xA72 + 4xA53.) num_cores could just be used to check whether the requested number makes sense.
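
The sanity check mentioned at the end could look roughly like this. thread_mod is only a proposal in this thread, so everything here is a sketch: `check_thread_mod` and the `BOARD` table are hypothetical, and only the kBig/kLittle values and the 2xA72 + 4xA53 layout come from the post.

```python
K_BIG = 1      # value given in the post for big cores
K_LITTLE = -1  # value given in the post for little cores

# Core counts for the example board: 2x A72 (big) + 4x A53 (little).
BOARD = {K_BIG: 2, K_LITTLE: 4}


def check_thread_mod(mode, nthreads, board=BOARD):
    """Return True if (mode, nthreads) fits the cores of that type."""
    if mode not in board:
        raise ValueError("mode must be K_BIG (1) or K_LITTLE (-1)")
    return 1 <= nthreads <= board[mode]


print(check_thread_mod(K_BIG, 2))     # two A72 cores: fits
print(check_thread_mod(K_LITTLE, 1))  # one A53 core: fits
print(check_thread_mod(K_BIG, 3))     # only two big cores exist
```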

cbalint13 · August 1, 2019, 3:34am

See now:

https://github.com/dmlc/tvm/issues/1541

And here is an example of how to control the remote from the local host:

Number of threads used during auto-tunning

XinchengHan · November 28, 2019, 3:20am

Does AutoTVM have any interface to configure the core type or thread count on the remote (as @FrozenGene describes)?

If not, will the AutoTVM runner choose remote cores at random when measuring? When there are two different core types on the remote (e.g., A73 and A53), the best config reported by AutoTVM will not be what we want, since the measurements may be a mix from both types of cores.

tico · November 28, 2019, 8:31am

I would also be interested to know if there is any way to set the number of threads when autoTVM is used in RPC on a remote target. Any ideas @FrozenGene?

XinchengHan · November 28, 2019, 8:38am

One available way is to set them in run_through_rpc (in measure_methods.py) with runtime.config_threadpool.

But task-level parallelism will be affected and the tuning process will be slow.
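
A rough sketch of the call described in this post, assuming the function is looked up by name on the RPC session and invoked with a mode and a thread count. `configure_remote_threadpool` is a hypothetical helper, and `FakeRemote` stands in for a real session so the example runs without a device; where exactly to place the call inside run_through_rpc is left open.

```python
K_BIG = 1      # big cores, as described earlier in the thread
K_LITTLE = -1  # little cores


def configure_remote_threadpool(remote, mode, nthreads):
    """Call runtime.config_threadpool on the device behind an RPC session."""
    config_threadpool = remote.get_function("runtime.config_threadpool")
    config_threadpool(mode, nthreads)


class FakeRemote:
    """Stand-in for an RPC session; records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def get_function(self, name):
        def packed_func(*args):
            self.calls.append((name, args))
        return packed_func


remote = FakeRemote()
configure_remote_threadpool(remote, K_LITTLE, 1)  # one little core
print(remote.calls)
```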

tico · November 28, 2019, 8:54am

What we want is to tune for the number of cores that will actually be used on the target, which is not necessarily all available cores.

Of course, I understand that this makes the process slower, but we don’t want to optimize for, say, 4 cores on an ARM target and then use only 1 core in the final deployment, as a requirement might dictate.