A few questions regarding autotvm and tune_relay_x86


#1

Hi!

I am trying to understand autotvm using the tune_relay_x86 example. I did some testing and made some changes, and based on that I have a handful of questions for which I couldn’t find a clear answer anywhere in the docs/sources. I’d very much appreciate any help.

  1. Does the number of threads (TVM_NUM_THREADS) affect the inference (evaluation) run in any way? Or does this value matter only for tuning?
  2. Could someone explain what a “task” means in TVM? In tune_kernels, task appears once in the prefix variable and again in the comment # converting conv2d tasks to conv2d_NCHWc tasks. The documentation describes a task as: “Task is a tunable composition of template functions.” Looking at all of this, including the fact that resnet-18 produces 12 tasks, I got a bit confused about how I should understand it. For example, when I ran resnet-50 (I just changed 18 to 50 in the script), I got 20 tasks.
  3. One of the lines produced by the script may look like this:
    [Task 12/12] Current/Best: 215.34/2271.11 GFLOPS | Progress: (400/400) | 2128.74 s Done. What are those values in Progress?
    3a. In python/tvm/autotvm/tuner/callback.py it looks like the second number might be the total number of trials. From the code I can see that this value is passed to the tune method and is equal to len(task.config_space), and on each trial the first value gets incremented somehow. Is it this line: ctx.ct += len(inputs)? If so, what exactly are inputs?
    3b. As a follow-up, what is config_space, and where can I find its definition?

Thanks in advance for any help.


#2

Unfortunately, some of the runtime and tuning parameters can be confusing, but I’ll try to clarify them here:

  1. This controls the number of threads that will be used by the kernel implementation of each operator, not the amount of parallelism used during tuning. You should set this to the number of physical cores on your machine, or the number of physical cores that you want to use to perform inference. You should observe a near-linear speedup as you increase this number.
  2. A task can be thought of as a specific instance of an operator (e.g., conv2d, dense) with a specific shape (input size, window size, stride, etc.). Even though ResNet-18 has 18 layers, some layers may be identical from this perspective, so we can have fewer “tasks” than layers. x86 is a special case, since a data layout transformation is also defined for it. The comment you are referring to is about converting the data layout of the 12 conv2d resnet-18 tasks to 12 conv2d tasks with a different data layout (in this case NCHWc).
  3. Progress is the number of measurements tried on real hardware out of the total number of measurements allotted (this is the n_trial parameter in tuning_option). The last value is the amount of wall-clock time elapsed while tuning this task.
    3a. See above.
    3b. The configuration space (different for each hardware backend) can be found in topi/python/your_hardware_backend. In this case it would be topi/python/x86.
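To make the notion of a task from answer 2 concrete, here is a toy Python sketch (not actual TVM code; the layer configurations below are made up for illustration) showing why identical layers collapse into a single tunable task:

```python
# Hypothetical sketch (not TVM source): a "task" is roughly an operator
# plus its concrete shape/parameters, so layers that share the same
# configuration collapse into one task to tune.

# Imaginary per-layer conv2d descriptions for a small network:
# (op, in_channels, out_channels, kernel_size, stride)
layers = [
    ("conv2d", 3,   64,  7, 2),
    ("conv2d", 64,  64,  3, 1),
    ("conv2d", 64,  64,  3, 1),   # identical to the previous layer
    ("conv2d", 64,  128, 3, 2),
    ("conv2d", 128, 128, 3, 1),
    ("conv2d", 128, 128, 3, 1),   # another duplicate
]

# Deduplicating the layer configurations yields the distinct tasks.
tasks = sorted(set(layers))
print(len(layers), "layers ->", len(tasks), "unique tasks")
# -> 6 layers -> 4 unique tasks
```

This is why ResNet-18 can produce only 12 tasks while having 18 layers, and why ResNet-50 produces 20: the task count depends on the number of distinct operator/shape combinations, not on the layer count.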

#3

Thanks, that helps a lot.

I probably mixed something up in bullet 3a, sorry about that.

What I wanted to ask about was not the number after “Progress”, but the first number inside the “Progress” bracket: (number1/number2). As you already explained, number2 is the total number of measurements allotted. I’d like to understand how number1 is updated in that case. From the source I believe this line is responsible: ctx.ct += len(inputs), but I don’t understand what inputs means here. Are those the inputs of the task? You explained that a task is a specific instance of an operator.


#4

Inputs here refers to the configurations that were tried on real hardware.
At a high level, the autotuning loop looks like this:
while (true) {
    propose new configurations
    select the top k configurations by querying the machine learning cost model
    run the top k configurations on hardware and measure their performance
    count (ctx.ct) += k
    if ctx.ct > n_trial, break
    refit the cost model with the new data
}
So here an input is just one configuration, or variant, of the kernel run on real hardware. Usually we will try a batch of 32 or 64 different configurations together for more parallelism when we have multiple devices for measuring performance; it is not related to the inputs of the task/operator itself.
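The counting behavior behind ctx.ct += len(inputs) can be sketched in plain Python (a toy simulation, not the actual autotvm source; make_batches, ProgressContext, and the numbers used are made-up stand-ins):

```python
# Toy simulation of the autotvm trial counter (not real TVM code).
# Each element of a batch ("inputs") is one candidate configuration
# that gets measured on real hardware.

def make_batches(config_space_size, batch_size):
    """Yield batches of candidate configuration indices."""
    for start in range(0, config_space_size, batch_size):
        yield list(range(start, min(start + batch_size, config_space_size)))

class ProgressContext:
    def __init__(self):
        self.ct = 0  # number of configurations measured so far (number1)

n_trial = 100  # total measurement budget (number2 in the Progress bar)
ctx = ProgressContext()

for inputs in make_batches(config_space_size=10_000, batch_size=64):
    # ... each configuration in `inputs` would be measured on hardware ...
    ctx.ct += len(inputs)  # number1 advances by the batch size
    print(f"Progress: ({ctx.ct}/{n_trial})")
    if ctx.ct >= n_trial:
        break
```

So number1 jumps by the measurement batch size on each iteration rather than by 1, which is why the progress counter can appear to advance in steps.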


#5

Thanks! That explains it.


#6

Hello, can you tell me how to set the target for an x86 CPU in this tutorial: https://docs.tvm.ai/tutorials/autotvm/tune_relay_x86.html?