How does AutoTVM distribute resources for tuning and runtime measurements on one single CPU?

Can anyone explain how AutoTVM runs when it’s auto-tuning an operator on a CPU? Even better, point me to documentation if there is any.

Specifically, my questions are:

  1. By default does the xgboost python library run on one single CPU core? W/ or w/o multithreading? Is there any way to speed up the auto-tuning process by e.g. building from source with libraries like OpenMP? (I’m using Ubuntu 16.04.)

  2. When AutoTVM is tuning an operator, how is the CPU usage shared between training and kernel runtime measurements? Does it affect the measurement accuracy if training and measurements are done on the same CPU chip, since in that case the measurement job doesn’t get the full resources? Should I always set up the RPC tracker so that training is done on one host CPU and measurements are done on another?

Thanks in advance!


By default does the xgboost python library run on one single CPU core? W/ or w/o multithreading? Is there any way to speed up the auto-tuning process by e.g. building from source with libraries like OpenMP? (I’m using Ubuntu 16.04.)

XGBoost will use multiple threads to train the model, so training usually takes very little time. The real bottleneck is measurement, because AutoTVM measures the performance of each config sequentially, several times per config, to reduce measurement error.
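For reference, how many times each config is measured is controlled through the runner passed in `measure_option`. A minimal sketch (the specific parameter values here are illustrative, not recommendations):

```python
from tvm import autotvm

# Each candidate config is compiled by the builder (in parallel) and then
# timed by the runner (sequentially). `number` is the runs per timing and
# `repeat` is the timings per config; results are averaged to reduce noise.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=10),
    runner=autotvm.LocalRunner(number=10, repeat=3, timeout=4),
)
# This dict is then passed to tuner.tune(..., measure_option=measure_option).
```

Lowering `number`/`repeat` speeds up tuning at the cost of noisier measurements.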

When AutoTVM is tuning an operator, how is the CPU usage shared between training and kernel runtime measurements? Does it affect the measurement accuracy if training and measurements are done on the same CPU chip, since in that case the measurement job doesn’t get the full resources? Should I always set up the RPC tracker so that training is done on one host CPU and measurements are done on another?

You can definitely set up an RPC tracker to separate building from measurement, but in my experience it’s not really necessary. AutoTVM builds a batch of configs (default size 8) with multiple threads and then evaluates their performance sequentially. Compilation and evaluation are not pipelined, so there’s no resource-conflict issue. Also, since a model inference running on a CPU usually cannot claim all CPU resources anyway, I don’t think isolating a CPU for measurement during tuning would give you better accuracy.
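If you do want to separate the two, the runner side can point at an RPC tracker instead of running locally. A sketch, assuming a device has registered with a hypothetical key `"my-device"` to a tracker at `127.0.0.1:9190` (both placeholders for your own setup):

```python
from tvm import autotvm

# Build on the host, but dispatch timing jobs to a remote device that
# registered itself with the RPC tracker.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=10),
    runner=autotvm.RPCRunner(
        "my-device",       # key the device used when registering (placeholder)
        host="127.0.0.1",  # tracker address (placeholder)
        port=9190,         # tracker port (placeholder)
        number=10,
        repeat=3,
        timeout=4,
    ),
)
```

With this setup the model training and compilation stay on the host machine, while only the timing runs consume the target device's CPU.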

Thank you so much! That’s a very clear answer.

Since you mentioned multithreading here, could you also take a look at another question I posted, AutoTVM repetitive outputs? I wonder if it’s caused by the multithreading.