OpenCL scheduler: kernel_vec related question

Hi @Laurawly, I am trying to understand the OpenCL scheduler. I’ve read the OpenCL AutoTVM questions but still cannot figure out a few things. I’d appreciate some answers:

  1. In the NCHW scheduler, do you convert the kernel to NCHW16c?
  2. Why is the kernel, in the kernel_vec operation, divided into blocks along the first axis (num_filter) rather than the second one (channel)?
  3. What is the purpose of channels in the convolution operation (in python/relay/op/nn.py)? When could it be useful?
  4. How was the if statement below created? Are these numbers tuned for a specific iGPU? If so, which one?
block_w = 1
block_h = 1
if stride_h == 2:
    if num_filter + kernel_h == 515:
        block_h = 4
        block_w = 4
    else:
        block_h = 4
        block_w = 5
elif kernel_h == 3:
    if num_filter == 512:
        block_h = 2
        block_w = 7
    else:
        block_h = 2
        block_w = 14
elif kernel_h == 7 and padding == 3 and stride == 1:
    block_h = 3
    block_w = 4
else:
    block_h = 1
    block_w = 16
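
If I read it right, the 515 check singles out 3x3, stride-2 layers with 512 filters, since 512 + 3 == 515. A quick trace of that case (my own illustration, not TVM code):

num_filter, kernel_h, stride_h = 512, 3, 2   # hypothetical ResNet-style layer
if stride_h == 2 and num_filter + kernel_h == 515:
    block_h, block_w = 4, 4                  # the first branch fires
print(block_h, block_w)                      # -> 4 4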

Thanks in advance.

Hi @Ajja,

  1. For the NCHW16c question: I used the TVM graph tuner here: https://github.com/dmlc/tvm/blob/master/topi/python/topi/intel_graphics/conv2d.py#L55 to pick the best NCHWxC layout.
  2. In kernel_vec, the first axis is the output channel.
  3. I’m not quite clear on your question. Do you mean the role channel plays in convolution, as explained here? (http://machinelearninguru.com/computer_vision/basics/convolution/convolution_layer.html)
  4. The statement was created for Intel HD Graphics 530, but we tested it and it turns out to work well on other Intel HD Graphics GPUs as well.

Thank you very much, @Laurawly, for the answer. Based on your post, I have a couple more questions:

  1. I think we were talking about different parts of the code. My concern about NCHW16c was in this part.

I thought that by dividing out_channel by nv you were trying to convert the data to the NCHWc format, but now I see that you are just splitting the output channels. Are you creating subgroups in this part? And why do you do that split in the compute definition rather than in the schedule using the split method? What difference does it make to divide it there? (I sketch the two options I have in mind right after this list.)

  2. I read your comment about the alter_layout function and don’t quite understand how it works. It isn’t used in AutoTVM because it is only enabled when opt_level = 3, is that right?

So, do you use it to replace conv2d’s compute (e.g. https://github.com/dmlc/tvm/blob/master/topi/python/topi/intel_graphics/conv2d.py#L323) with https://github.com/dmlc/tvm/blob/master/topi/python/topi/intel_graphics/conv2d.py#L55? If so, how exactly does this alter_layout method work? Does it contain some implicit conversion of the input from a given data layout to NCHWc and a conversion back to that layout?
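
To make sure I understand the first question, here is how I picture the two options, as a minimal sketch (using today’s tvm.te names and hypothetical 1-D shapes; nothing here is the actual conv2d code):

import tvm
from tvm import te

N, nv = 64, 16                      # hypothetical: 64 output channels, blocks of 16
x = te.placeholder((N,), name="x")

# (a) Blocking in the compute definition: the result tensor itself is 2-D,
# so every consumer sees the blocked layout.
y_blk = te.compute((N // nv, nv), lambda co, ci: x[co * nv + ci] + 1.0,
                   name="y_blk")

# (b) Blocking in the schedule: the tensor stays 1-D and only the loop
# nest is split; the stored layout does not change.
y = te.compute((N,), lambda c: x[c] + 1.0, name="y")
s = te.create_schedule(y.op)
co, ci = s[y].split(y.op.axis[0], factor=nv)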

  1. I’m not trying to create subgroups here. It’s better to reference https://github.com/dmlc/tvm/blob/master/topi/python/topi/intel_graphics/conv2d.py#L163, which uses alter_layout for graph tuning. The code you referenced is a default schedule that splits the output channel.
  2. Yes, it’s only enabled with opt_level >= 3. It uses an implicit conversion that you define, such as https://github.com/dmlc/tvm/blob/master/topi/python/topi/intel_graphics/conv2d.py#L75. Then it searches for the best combination. @yzhliu could give more detailed instructions on how this works.
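
Roughly, the conversion looks like this (a numpy sketch of the idea only, not the actual TVM code; shapes are hypothetical):

import numpy as np

n, c, h, w, bc = 1, 32, 8, 8, 16    # hypothetical shapes; c must be divisible by bc
x = np.random.rand(n, c, h, w).astype("float32")

# NCHW -> NCHW16c: split the channel axis into (c // bc, bc) blocks and
# move the inner block to the innermost position.
x_nchwc = x.reshape(n, c // bc, bc, h, w).transpose(0, 1, 3, 4, 2)

# NCHW16c -> NCHW: the inverse transform.
x_back = x_nchwc.transpose(0, 1, 4, 2, 3).reshape(n, c, h, w)
assert np.array_equal(x, x_back)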

Thank you very much for the answer, @Laurawly. Could you also explain why block_h and block_w are used to add more bottom and right padding?

Is it connected with the zero-padding explained here?
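
For context, here is my mental model of what that padding does, as a minimal sketch (numbers hypothetical):

out_h, out_w = 56, 56        # hypothetical conv output extent
block_h, block_w = 3, 4      # per-work-item tile, e.g. the 7x7 branch above

# Round each extent up to a multiple of the block size; the remainder is
# the extra bottom/right padding, so edge tiles need no bounds checks.
pad_h = (block_h - out_h % block_h) % block_h
pad_w = (block_w - out_w % block_w) % block_w
print(pad_h, pad_w)          # -> 1 0, i.e. pad 56x56 to 57x56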