Tune Relay for Intel Graphics

sebap · April 17, 2019, 8:48am

Hi,

I’ve modified the tune_relay_cuda tutorial example for the intel_graphics (OpenCL) target. You can find it on my GitHub.

I thought it would be simple and straightforward. Unfortunately, it’s not. I can run it successfully, but the results are useless. First of all, when tuning runs, the progress stops at 1, no matter how many trials I choose. The output looks something like this:

[Task 1/12] Current/Best: 58.42/ 58.42 GFLOPS | Progress: (1/20) | 3.50 s Done.
[Task 2/12] Current/Best: 65.10/ 65.10 GFLOPS | Progress: (1/20) | 3.39 s Done.
[Task 3/12] Current/Best: 54.75/ 54.75 GFLOPS | Progress: (1/20) | 3.81 s Done.
[Task 4/12] Current/Best: 71.56/ 71.56 GFLOPS | Progress: (1/20) | 3.38 s Done.

The logs seem to be OK: INFO:autotvm:Get devices for measurement successfully!

Additionally, performance with and without autotvm seems to be exactly the same. Finally, I don’t get the warning that autotvm cannot find a config for the target.

Could you please explain how such a tuning script for OpenCL should work?

sebap · April 16, 2019, 4:01pm

One indicator is a problem with this block in tune_tasks:

    for i in range(len(tasks)):
        try:  # try winograd template
            tsk = autotvm.task.create(tasks[i].name, tasks[i].args,
                                      tasks[i].target, tasks[i].target_host, 'winograd')
            input_channel = tsk.workload[1][1]
            if input_channel >= 64:
                tasks[i] = tsk
        except Exception:
            pass

For the intel_graphics target, tsk is None. I am not exactly sure why this is happening.

As I understand it, the workload is required later. Is my understanding correct that the workload is somehow connected to winograd? Can I use the winograd template with OpenCL at all?

eqy · April 17, 2019, 1:28am

Currently, for tuning to be useful, the topi definition must be a template that defines a schedule search space, not a concrete schedule with hardcoded parameters. The current intel graphics schedules use the older hardcoded style: https://github.com/dmlc/tvm/blob/master/topi/python/topi/intel_graphics/conv2d.py. If the parameters are hardcoded, then there is only one configuration in the search space. We welcome contributions if you want to take a stab at lifting the existing schedule to a search space :). You can look at other OpenCL operators in topi to see what that looks like.
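To make the "only one configuration" point concrete, here is a rough sketch in plain Python (not TVM code). The knob names and candidate values are made up for illustration. It only counts how many distinct configurations a set of knobs yields, on the assumption that a tuner cannot measure more distinct configurations than the space contains.

```python
def search_space_size(knobs):
    """Count distinct configurations for a dict of knob name -> candidate values."""
    size = 1
    for candidates in knobs.values():
        size *= len(candidates)
    return size


# Hypothetical hardcoded schedule: every parameter has exactly one value.
hardcoded = {"tile_x": [8], "tile_y": [4], "unroll": [True]}

# Hypothetical template: the same parameters, each with several candidates.
template = {
    "tile_x": [1, 2, 4, 8, 16],
    "tile_y": [1, 2, 4, 8],
    "unroll": [False, True],
}

n_trial = 20  # number of trials requested by the user
for name, knobs in [("hardcoded", hardcoded), ("template", template)]:
    size = search_space_size(knobs)
    # Only min(n_trial, size) distinct configurations can be tried.
    print(name, "configs:", size, "distinct trials possible:", min(n_trial, size))
```

With the hardcoded variant the space has a single point, which would be consistent with the progress counter stopping at 1 regardless of the number of trials requested.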

merrymercy · April 17, 2019, 8:52am

We don’t support tuning for the intel graphics target right now. You could try using only “opencl” in the target string.

target = tvm.target.create('opencl')

Then it will use the CUDA templates for tuning. These templates also work for OpenCL. Hopefully, you can get better performance.

sebap · April 17, 2019, 10:53am

@merrymercy thanks for the suggestion, but performance with this approach is much worse than without any change.

My guess is that I’ll need to understand how to add tuning for Intel Graphics to see any reasonable numbers.

sebap · April 17, 2019, 2:17pm

@eqy could you recommend how to start adding such a template?

I’ve checked the tune_conv2d_cuda tutorial and opt_conv_cuda (I have a problem running the latter, but that’s a different story). Both of them only work for CUDA, and just changing the target to intel_graphics or target.create('opencl') does not help.

I can see some ideas in https://github.com/dmlc/tvm/tree/master/topi/python/topi/cuda but I strongly doubt that these can be easily converted to intel_graphics, just as I cannot simply change the target in the tutorials mentioned above. For now I’m focusing on understanding conv2d_direct.py from CUDA, since this one seems to be easy.

eqy · April 17, 2019, 6:56pm

At a high level, without much understanding of the schedule, you can identify the “tunable” points by looking at branches in the code. For example, if you see an if statement that depends on some input or weight shape in the operator, it is likely that this was a handpicked value that you can cover in a search space definition. If these are loop split factors, you can replace these branches with an autotvm split. You can also identify other choices that seem to be data-dependent in the schedule and lift them to the search space. Check out https://docs.tvm.ai/tutorials/autotvm/tune_simple_template.html for an example tutorial.
Note that there is a little bit of boilerplate associated with registering the template with autotvm, but the CUDA schedules you are already looking at should be a good reference point that you can just pattern match.
As a rule of thumb, I would start by adding the splits first, checking the correctness of the new template as more knobs are added.
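As a rough illustration of what "replace the branch with a split" buys you, here is a plain-Python sketch (not TVM code). The function names and the example branch are hypothetical. It assumes a split knob searches over ordered integer factorizations of the loop extent, which loosely models a factor-based split. The real autotvm split has its own policies and filtering, so treat this as a toy model.

```python
def hardcoded_split(out_channels):
    """Hypothetical hand-picked inner factor, chosen by a branch on the shape."""
    return 16 if out_channels >= 64 else 4


def split_candidates(extent, num_outputs=2):
    """All ordered tuples of `num_outputs` positive integers whose product is `extent`."""
    if num_outputs == 1:
        return [(extent,)]
    results = []
    for factor in range(1, extent + 1):
        if extent % factor == 0:
            for rest in split_candidates(extent // factor, num_outputs - 1):
                results.append((factor,) + rest)
    return results


extent = 64
inner = hardcoded_split(extent)
candidates = split_candidates(extent, num_outputs=2)

# The hand-picked value is just one point of the search space.
print("hand-picked:", (extent // inner, inner))
print("2-way candidates:", len(candidates))
print("3-way candidates:", len(split_candidates(extent, num_outputs=3)))
```

Because the hand-picked factor is always one of the candidates, a template built this way should be able to reproduce the original schedule, which makes it easier to check correctness as each knob is added.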

sebap · April 18, 2019, 1:35pm

I started testing and playing with different tutorials. I managed to get a template (based on cuda/conv2d_direct.py) that produces some results. However, with autotvm.apply_history_best(log_file): does not seem to pick up the best value, because whatever I change in my schedule template, the final performance is always the same. And I can see that a value is picked up by ApplyHistoryBest.

This also made me wonder where with autotvm.apply_history_best(log_file): takes its values from in that case. The result is similar to the CPU one. Is there a place where it just assumes a CPU configuration for any schedule when Intel Graphics is set as the target?

Another question, which I asked in one of my previous posts, concerns the absence of WARNING:autotvm:Cannot find config for target=... messages when running my schedule. I feel this is all related, as if my changes to the schedule don’t affect what is used during compilation at all.

sebap · April 19, 2019, 12:59pm

After some more digging I’ve figured out most of the details regarding scheduling. The difficult part was figuring out how and when the fallback to the generic schedule happens. I also noticed that different layouts were used during tuning and compilation, which was causing problems in picking up the tuned values.
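To show why a layout mismatch would hide tuned values, here is a toy model in plain Python. It is not TVM's implementation, and every name and value in it is hypothetical. It assumes tuned configs are looked up by a key that includes the workload description, layout included, and that a miss silently returns a fallback config.

```python
# Toy model only: NOT how TVM is implemented. All names are hypothetical.
FALLBACK = {"tile_x": 1, "source": "fallback"}


def pick_config(history, target, workload, fallback=FALLBACK):
    """Return the tuned config for (target, workload), or the fallback on a miss."""
    return history.get((target, workload), fallback)


# Workload recorded while tuning; the layout string is part of the key.
tuned = ("conv2d", (1, 64, 56, 56), (64, 64, 3, 3), "NCHW")
history = {("intel_graphics", tuned): {"tile_x": 8, "source": "log"}}

# The same convolution at compile time, but described with another layout.
at_compile = ("conv2d", (1, 64, 56, 56), (64, 64, 3, 3), "NHWC")

print(pick_config(history, "intel_graphics", tuned)["source"])
print(pick_config(history, "intel_graphics", at_compile)["source"])
```

In this model the second lookup misses even though the log contains an entry for the same convolution, so edits to the template would have no visible effect on the final performance.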

Probably the biggest remaining question now is: where did all of the numbers come from? Was it a trial-and-error approach, or did you use some documentation? What I’ve figured out so far is that copying the schedule template from CUDA as is, without changes, is pointless: the numbers are really bad.

A follow-up to this question: where did t->thread_warp_size = 16; come from?

dollphintear · May 9, 2019, 2:27am

Is there a plan to support Intel Graphics? Thanks.