Improved Direct + Winograd NCHWc CPU implementation, with ResNet-50 results


@tqchen sure - I sent out a WIP PR for NCHWc x86 Winograd. That is the main incremental contribution from this post.


Previously, you spatially pack everything before you run conv2d. This causes problems if the input is very large and the intermediate data does not fit into cache. Instead, you can divide the input by height and width into smaller regions and run the spatial pack on each region sequentially.

What about padding? I worry that if we divide the input into segments, we will pad it incorrectly.

Also, could you point me to a code reference showing how to divide the data by height / width and then run spatial_pack / conv2d? Thanks in advance.


The rough idea is to divide the computation into conv2d calls on smaller input regions. Imagine you are doing conv2d on a 224x224 input: you can divide it into four 112x112 conv2d workloads, then iterate over the 2x2 grid in the outer loop.

Then each small workload is 112x112, and we can reuse the spatial pack data for it.
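
To make the region split and the padding question concrete, here is a minimal pure-Python sketch. It is not TVM code and does no spatial packing itself; the helper names (conv2d_valid, pad2d, conv2d_tiled) are made up for illustration, and it assumes a single channel, stride 1, and cross-correlation as in deep learning conv2d. The padding is applied once to the whole input, and each region is read with a (KH-1, KW-1) halo, so the tile boundaries never see extra padding.

```python
def conv2d_valid(x, k):
    """Direct 'valid' conv2d (stride 1, single channel) on nested lists."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]


def pad2d(x, p):
    """Zero-pad a 2-D nested list by p on every side."""
    w = len(x[0]) + 2 * p
    rows = [[0] * p + list(r) + [0] * p for r in x]
    return [[0] * w for _ in range(p)] + rows + [[0] * w for _ in range(p)]


def conv2d_tiled(x, k, pad, tile_h, tile_w):
    """Same result as conv2d_valid(pad2d(x, pad), k), computed tile by tile."""
    kh, kw = len(k), len(k[0])
    xp = pad2d(x, pad)                      # pad once, on the whole input
    oh, ow = len(xp) - kh + 1, len(xp[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for h0 in range(0, oh, tile_h):         # outer loop over output tiles
        for w0 in range(0, ow, tile_w):
            th, tw = min(tile_h, oh - h0), min(tile_w, ow - w0)
            # Input region for this output tile, including the halo rows/cols.
            region = [row[w0:w0 + tw + kw - 1] for row in xp[h0:h0 + th + kh - 1]]
            tile = conv2d_valid(region, k)  # small conv2d on one region
            for i in range(th):
                out[h0 + i][w0:w0 + tw] = tile[i]
    return out


# 8x8 input split into a 2x2 grid of 4x4 output tiles (a small stand-in
# for splitting 224x224 into four 112x112 workloads).
x = [[(3 * i + 5 * j) % 7 for j in range(8)] for i in range(8)]
k = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
assert conv2d_tiled(x, k, pad=1, tile_h=4, tile_w=4) == conv2d_valid(pad2d(x, 1), k)
```

In a real schedule the per-region work would be the spatial pack plus the packed conv2d, but the indexing of the regions would follow the same pattern.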


@tqchen Thanks for replying. Do you mean that we don’t change any of the spatial pack logic, and just change the logic of

tvm.compute(oshape, lambda n, c, h, w: tvm.sum(data_vec[n][c][h][w] * kernel[n][c][h][w]))?

Our graph infer_shape pass checks the output shape, so in my view we cannot simply change it.

So could you use this code to express your thought? Maybe code would express it more directly. Thanks.


I have a question about the packing of the input. In the, the shape of packed input is determined. As I understand it, the input tensor is organized so that the input region corresponding to each ‘VH*VW’ spatial region of the output tensor is packed together. If so, I think the dimension should be (KH+HSTR*(VH-1), KW+WSTR*(VW-1)) instead of (VH*HSTR + KH-1, VW*WSTR + KW-1).
Could you please help me with this question?


You are right. Would you mind sending a patch?

But I believe the final generated code will be the same, because the current code always uses a larger shape and the bound inference part of the TVM compiler will correct it to the minimal required shape.
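
As a quick sanity check of the two formulas, here is a small sketch that enumerates the input rows actually read by VH consecutive output rows and compares the span with both expressions. The function names are hypothetical and this is plain Python, not TVM; it only covers one axis (the width axis works the same way with KW, WSTR, VW).

```python
def touched_extent(k, stride, v):
    """Span of input rows read by v consecutive outputs for kernel size k."""
    rows = {o * stride + a for o in range(v) for a in range(k)}
    return max(rows) - min(rows) + 1


def tight_extent(k, stride, v):
    """Proposed formula: KH + HSTR * (VH - 1)."""
    return k + stride * (v - 1)


def current_extent(k, stride, v):
    """Formula in the existing code: VH * HSTR + KH - 1."""
    return v * stride + k - 1


for k, stride, v in [(3, 1, 4), (3, 2, 4), (7, 2, 8), (1, 2, 4)]:
    # The proposed formula matches the rows that are really accessed.
    assert touched_extent(k, stride, v) == tight_extent(k, stride, v)
    # The existing formula over-allocates by exactly (stride - 1).
    assert current_extent(k, stride, v) - tight_extent(k, stride, v) == stride - 1
```

So the two shapes coincide for stride 1 and differ by HSTR-1 (resp. WSTR-1) otherwise, which is the slack that bound inference is expected to trim.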


Hi, thanks for your reply. I read the PR, but I found that the direct convolution does not use spatial packing (topi/python/topi/x86/ L434:L498).