Questions on auto-tuning

yvn · February 22, 2019, 7:42am

Hi, I tried tuning a network from a TFLite model, and I have a few questions:

Are the extracted tasks the operations between the data and the corresponding kernels?
In every iteration, is one task sent for tuning, with the log file filled with each config and its performance?
So tuning is currently intra-layer, and only one task is tuned in each iteration over the tasks?
Has anybody tried inter-layer tuning, i.e. scheduling the operations across layers? I would guess this could give more benefit, especially when the data dimensions are small but the network has many layers.
Thanks

eqy · February 22, 2019, 9:16pm

Kernels will operate on data, yes.
Yes.
Yes.
There are some early efforts at graph-level optimizations (mainly data layout such as NCHWc[X]) for x86. This is very tricky as it is likely to be very expensive given our current approach and also requires a reasonable definition of a graph-level search space.
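
To make the per-task loop confirmed in the first three answers concrete, here is a toy sketch in plain Python. It is not the AutoTVM API: the task names, the single `tile` knob, and the `measure` cost function are all hypothetical stand-ins, and a real tuner would compile and time each candidate on the target device instead.

```python
import json
import random

# Hypothetical stand-ins: each "task" is one operator workload (for example a
# conv2d with fixed data and kernel shapes). These names are not AutoTVM's.
TASKS = {
    "conv2d_a": {"tile": [1, 2, 4, 8]},
    "conv2d_b": {"tile": [1, 2, 4, 8, 16]},
}

def measure(task_name, config, rng):
    """Fake cost model; a real tuner would build and time the kernel."""
    best_tile = 4 if task_name == "conv2d_a" else 8
    return abs(config["tile"] - best_tile) + rng.random() * 0.1

def tune_all(tasks, n_trial, seed=0):
    """Tune tasks one at a time and log every (config, cost) measurement."""
    rng = random.Random(seed)
    log = []
    for name, space in tasks.items():        # one task is tuned at a time
        for _ in range(n_trial):
            config = {"tile": rng.choice(space["tile"])}
            cost = measure(name, config, rng)
            log.append({"task": name, "config": config, "cost": cost})
    return log

def best_per_task(log):
    """Pick the lowest-cost record for each task from the log."""
    best = {}
    for rec in log:
        cur = best.get(rec["task"])
        if cur is None or rec["cost"] < cur["cost"]:
            best[rec["task"]] = rec
    return best

log = tune_all(TASKS, n_trial=20)
for rec in best_per_task(log).values():
    print(json.dumps(rec))
```

Each task is searched in isolation, which is why the log only ever records per-operator configs and says nothing about how neighbouring layers interact.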

yvn · February 23, 2019, 8:21am

Where can I find more information about the NCHWc[X] data layout? While tuning, I looked at the Python implementations of the optimized schedules in topi, but I could not tell from the code why the data is transformed from NCHW into NCHWc[X].
Thanks!

eqy · February 23, 2019, 8:35am

If you are using an ARM CPU target, you should not be affected by NCHWc, as this layout is not used in the ARM CPU topi. However, if you are interested in why data layout in general may be useful, here is some relevant reading:
https://docs.tvm.ai/tutorials/optimize/opt_gemm.html#array-packing
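
As a rough illustration of what the NCHW to NCHWc[X] repacking does, the sketch below uses nested Python lists rather than TVM tensors. The helper names `nchw_to_nchwc` and `flatten` are made up for this example. The point it tries to show: in NCHW, neighbouring channels of the same pixel are H*W elements apart in memory, whereas in NCHWc a block of X channels is stored contiguously, so an innermost loop over that block reads consecutive addresses.

```python
def nchw_to_nchwc(data, x):
    """Repack a nested-list NCHW tensor into shape (N, C // x, H, W, x)."""
    n_, c_ = len(data), len(data[0])
    h_, w_ = len(data[0][0]), len(data[0][0][0])
    assert c_ % x == 0, "channel count must be divisible by the block size"
    return [[[[[data[n][co * x + ci][h][w] for ci in range(x)]
               for w in range(w_)]
              for h in range(h_)]
             for co in range(c_ // x)]
            for n in range(n_)]

def flatten(t):
    """Flatten nested lists in row-major order, i.e. the memory order."""
    if isinstance(t, list):
        return [v for sub in t for v in flatten(sub)]
    return [t]

# Tiny tensor with N=1, C=4, H=1, W=2; each value is 10 * channel + width index.
data = [[[[10 * c + w for w in range(2)]] for c in range(4)]]
packed = nchw_to_nchwc(data, x=2)

print(flatten(data))    # channels 0 and 1 of pixel w=0 are W elements apart
print(flatten(packed))  # channels 0 and 1 of pixel w=0 are now adjacent
```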

yvn · February 25, 2019, 6:13am

thanks, will go through it!

yvn · March 8, 2019, 5:49am

@eqy with reference to inter-layer optimization: are we able to send two tasks at a time to the tuner? Then we would be able to use the compute_inline or compute_at operations for better cache utilization.

eqy · March 8, 2019, 11:15pm

In general, inter-layer optimization is tricky and can take on many forms. In this case, are you referring to fusing two layers together as a single kernel? This type of optimization is very difficult in the case of two conv layers because the layers themselves may have very different ideal scheduling/threading patterns, which requires a global synchronization step before the next conv layer can run. (How do we know when the input data required for the next layer’s op is ready, and how would we communicate this at the granularity of individual CUDA threads or OpenCL work items?) In general, for most CUDA hardware devices and all OpenCL devices I know of, this global synchronization step is not supported, and practically the only way to achieve this synchronization is via a separate kernel invocation or separate tasks.

For simpler operators, we already do fusion e.g., doing conv2d, batchnorm, and relu all in the same kernel.
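
A minimal sketch of what fusing these simpler operators means, written in plain Python on 1-D lists instead of real conv2d kernels. Batchnorm is modelled here as an inference-time scale and shift, which is an assumption made for brevity, and the function names are illustrative only. The unfused version materializes a full intermediate after every stage, while the fused version applies all three steps to each output element inside one loop.

```python
def conv1d(x, k):
    """Valid 1-D convolution (correlation form) on Python lists."""
    return [sum(x[i + j] * k[j] for j in range(len(k)))
            for i in range(len(x) - len(k) + 1)]

def unfused(x, k, scale, shift):
    conv = conv1d(x, k)                         # stage 1: full intermediate
    bn = [scale * v + shift for v in conv]      # stage 2: full intermediate
    return [max(v, 0.0) for v in bn]            # stage 3: relu

def fused(x, k, scale, shift):
    out = []
    for i in range(len(x) - len(k) + 1):        # one loop, no intermediates
        acc = sum(x[i + j] * k[j] for j in range(len(k)))
        out.append(max(scale * acc + shift, 0.0))
    return out

x = [1.0, -2.0, 3.0, -4.0, 5.0]
k = [1.0, 0.5]
print(unfused(x, k, 2.0, 1.0))
print(fused(x, k, 2.0, 1.0))
```

Fusion is straightforward here because batchnorm and relu are elementwise: each output element depends only on the conv result at the same index, so no synchronization between output elements is needed.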

yvn · March 9, 2019, 2:16am

Won’t this just be a producer-consumer problem? Say we have tensor t1 with kernel k1, and tensor t2 with kernel k2. Each element of t2 is the result of some operation on t1, so it can be written as
t1 = conv(k1, previous_layer)
t2 = conv(k2, t1)
so t2 won’t be calculated until the required elements of t1 have been calculated. We can force that using compute_at. Of course, compute_all will be the same as the work already being done now (task-wise). This is how we can achieve synchronization. Please correct me if I am wrong.
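
The producer-consumer idea above can be sketched in plain Python with two stacked 1-D convolutions. This is not TVM code, and the names `two_convs_root` and `two_convs_at` are hypothetical; the second function only mimics the loop nest that computing t1 at a tile of t2 would give on a single thread. It does not address synchronization across GPU threads. What it does show is one cost of the approach: neighbouring tiles of t2 need overlapping slices of t1, so those elements are recomputed.

```python
def conv1d(x, k):
    """Valid 1-D convolution (correlation form) on Python lists."""
    return [sum(x[i + j] * k[j] for j in range(len(k)))
            for i in range(len(x) - len(k) + 1)]

def two_convs_root(x, k1, k2):
    """Compute all of t1 first, then all of t2 (two separate stages)."""
    t1 = conv1d(x, k1)
    return conv1d(t1, k2), len(t1)     # second value: t1 elements computed

def two_convs_at(x, k1, k2, tile):
    """Compute only the slice of t1 each tile of t2 needs, inside t2's loop."""
    n_t2 = len(x) - len(k1) - len(k2) + 2
    t2, t1_computed = [], 0
    for start in range(0, n_t2, tile):
        stop = min(start + tile, n_t2)
        # t2[start:stop] reads t1[start : stop + len(k2) - 1] (tile plus halo),
        # which in turn reads x[start : stop + len(k2) + len(k1) - 2].
        t1_lo, t1_hi = start, stop + len(k2) - 1
        t1_tile = conv1d(x[t1_lo:t1_hi + len(k1) - 1], k1)
        t1_computed += len(t1_tile)
        t2.extend(conv1d(t1_tile, k2))
    return t2, t1_computed

x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
k1 = [1, 2, 1]
k2 = [1, 0, -1]

root_out, root_work = two_convs_root(x, k1, k2)
at_out, at_work = two_convs_at(x, k1, k2, tile=2)
print(root_out == at_out, root_work, at_work)
```

Both versions produce the same t2, but the tiled version computes more t1 elements in total because of the halo overlap, so whether it wins depends on how much the better locality is worth on the target.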

eqy · March 11, 2019, 6:16am

You are welcome to try this and report the results.