Improved Direct + Winograd NCHWc CPU implementation, with ResNet-50 results

compute_inline can sometimes trigger a bug in measurement due to incorrect initialization.

It is fixed by https://github.com/dmlc/tvm/pull/1956. Please check it out.

For large data sizes, one possible strategy is per-segment spatial packing: spatially partition the input data, and within each small block run the spatial pack and then conv2d. This will likely keep the spatial-pack data in cache between packing and execution.

What is the base frequency of your processor? We measured single-thread performance on Amazon EC2 C5 (Intel Skylake running at 3.0 GHz); TVM with the necessary layout transformations gets 65.09 ms, while the latest MXNet with MKL-DNN gets 50.87 ms.

@ajtulloch To eliminate the layout_transform ops, did you keep all tile_ic and tile_oc values consistent, just like the current _fall_back_schedule? If that's the case, the graph tuner might further improve e2e performance by smartly balancing the tradeoff between layout transforms and fast kernels.

Also, did you use the MXNet master branch for the MKL-DNN testing? The Intel team recently introduced fusion into mxnet-mkldnn and got a performance boost.

65.09 ms on Amazon EC2 C5 is achieved with the graph tuner. With the current _fall_back_schedule, the time is around 90 ms.

Could you explain more? What does per-segment mean? It seems very similar to NCHWc, but I don't understand how to spatially partition the input data.

Previously, you spatially pack everything before you run conv2d. This causes problems if the input is very large and the intermediate data does not fit into cache. What you can do instead is divide your input by height and width into smaller regions, and run the spatial pack on each region sequentially.
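
For a rough illustration of what the spatial packing step looks like (a minimal numpy sketch with toy sizes and made-up names, not the actual TOPI code), each VH x VW output tile gets its own contiguous input block; doing this packing per region, right before the conv2d on that region, is what keeps the packed data in cache:

```python
import numpy as np

# Toy sizes, just for illustration: H x W output, KH x KW kernel,
# VH x VW output tile per packed block, stride 1.
H, W, KH, KW, VH, VW = 8, 8, 3, 3, 4, 4
x = np.random.randn(H + KH - 1, W + KW - 1).astype(np.float32)  # padded input

# Pack the input so that each (VH x VW) output tile has its own contiguous
# (VH + KH - 1) x (VW + KW - 1) input block.
data_vec = np.empty((H // VH, W // VW, VH + KH - 1, VW + KW - 1), dtype=np.float32)
for ho in range(H // VH):
    for wo in range(W // VW):
        data_vec[ho, wo] = x[ho * VH:ho * VH + VH + KH - 1,
                             wo * VW:wo * VW + VW + KW - 1]
```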

@masahi The generated schedules were surprising; in all cases V was fully materialized, which was a bit weird. I suspect that could be improved by a better loop ordering (which I didn't really tune here). Another thing to try is the minimal methods (e.g. https://github.com/ajtulloch/tvm/blob/4cebfb323b72dfc7d0f4d5573ea28d87dd99ad24/tensorize/simple_winograd.py#L269-L345), which might allow something like F(6x6, 3x3) to be faster in this implementation (F(4x4, 3x3) seems to be almost always faster here), but the massive amount of unrolling required is pretty annoying. Perhaps an explicit Mul + Transpose + Mul would be better, needing only alpha^2 unrolling instead of alpha^4 unrolling.
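
To illustrate the Mul + Transpose + Mul point, here is a minimal numpy sketch of just the algebra (using the F(2x2, 3x3) input transform for brevity; this is not the TVM implementation discussed here):

```python
import numpy as np

# Input-transform matrix B^T for F(2x2, 3x3), alpha = 4 (Lavin & Gray).
BT = np.array([
    [1,  0, -1,  0],
    [0,  1,  1,  0],
    [0, -1,  1,  0],
    [0,  1,  0, -1],
], dtype=np.float32)

d = np.random.randn(4, 4).astype(np.float32)  # one alpha x alpha input tile

# V = B^T d B as a single contraction; fully unrolling it means
# alpha^2 outputs with alpha^2 terms each, i.e. alpha^4 multiplies.
V_unrolled = np.einsum("ij,jk,lk->il", BT, d, BT)

# Mul + Transpose + Mul: two small matmuls, each of which only needs
# alpha^2-style unrolling.
V_two_matmuls = (BT @ d) @ BT.T

assert np.allclose(V_unrolled, V_two_matmuls, atol=1e-5)
```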

Thanks @merrymercy, I'll check that out. I didn't notice any surprises going from AutoTVM to e2e, but I'll rebase; perhaps I'll see faster convergence or similar.

I noticed some interesting design questions during this work, around cases where modifying AutoTVM knobs can change the cache footprint of the algorithm. For example, tuning the compute_at location, or precomputing Winograd transforms.

If during AutoTVM tuning we allow a program to fully occupy the cache, and don't flush the cache between iterations, we will (naturally) find solutions that fully occupy the cache. This then causes problems in e2e testing, since arrays that we assume live in cache (i.e. weights) are evicted during e2e runs, which leads to lower performance. One such case is F(6x6, 3x3) Winograd, where our weight footprint increases by a factor of (8 / 3)^2.
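
As a quick sanity check on that factor (back-of-the-envelope arithmetic only, ignoring any blocking or layout of the transformed weights):

```python
# For F(m x m, r x r) Winograd, each r x r kernel tile is pre-transformed into
# an alpha x alpha tile, with alpha = m + r - 1.
m, r = 6, 3
alpha = m + r - 1            # 8 for F(6x6, 3x3)
blowup = (alpha / r) ** 2
print(blowup)                # ~7.1x more weight bytes resident than the direct kernel
```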

I can think of a few candidate approaches to work around this, such as:

  1. During AutoTVM profiling, evict everything from cache between iterations, to get a pessimistic lower bound (a rough sketch of this is below the list).
  2. During AutoTVM profiling, evict just the activations from cache between iterations, to get a perhaps less pessimistic bound
  3. Enable Intel CAT or similar to restrict the amount of cache available during AutoTVM profiling.
  4. Examine the schedule to explicitly compute the amount of cache traffic generated (I believe with some assumptions you can do this analytically), and use this as part of an overall cost metric.
  5. Perhaps with the graph tuning stuff, use a full e2e evaluation process to select between a restricted subset of algorithms (with different cache effects), to avoid being too greedy too early at the local stage?
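
For option 1, this is roughly the kind of flushing helper I have in mind (a sketch only; the LLC size and how it gets invoked between timed runs are assumptions, not an existing AutoTVM hook):

```python
import numpy as np

LLC_BYTES = 32 * 1024 * 1024                    # assumed last-level cache size; per-target
_scratch = np.zeros(2 * LLC_BYTES // 8, dtype=np.float64)

def flush_cache():
    """Stream over a buffer ~2x the LLC so previously cached arrays
    (weights, packed activations, ...) are evicted before the next timing run."""
    global _scratch
    _scratch += 1.0

# Hypothetical use between timed iterations:
#   for _ in range(n_trials):
#       flush_cache()
#       cost = time_f(*args)                    # the usual time_evaluator-style call
```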

Curious what you folks think, or if you’ve run into similar problems before?

@yidawang this is a 2GHz base clock GCE Skylake, synthetic GEMM benchmarks (e.g. https://github.com/Mysticial/Flops) hit about 118GFLOP/s. This was using the mxnet-mkl=1.3.0 package from PyPI. I’ll re-run with the master MXNet version.

@kevinthesun this was with mxnet-mkl=1.3.0 from PyPI, so I haven't rebuilt from master. Good idea to try, thanks. WRT layout_transform, I did indeed fix tile_ic/tile_oc at the native vector width. graph tuner would definitely help, that's on our list to try, thanks for that work.

I think there's also an interesting opportunity (as I mentioned above in my reply to @merrymercy) to leverage graph tuner to select between different algorithm instantiations with different cache behaviour, to better model the e2e occupancy of weights, etc. For example, while we hit ~200 GFLOP/s for the Winograd kernel on e.g. a 512 -> 512 3x3 conv vs ~105 GFLOP/s for the direct kernel, it turns out to be faster to use the direct kernel e2e because of the much larger cost of loading the pre-transformed Winograd weights.

We have run into similar problems when trying Winograd on some ARM CPUs.
We also suspected the reason is the increased size of the weight tensor.

I think (1 or 2) + 5 can be the solution.
We can evaluate 1 and 2 and pick the better one to fix our profiling.
Something like 5 is orthogonal to 1 or 2, so we can always add it as a final pass.

@ajtulloch can you open an issue in the tvm repo with checklists, so that we can track the progress and also keep all the developers informed?

@tqchen sure - I sent out a WIP PR for NCHWc x86 Winograd in https://github.com/dmlc/tvm/pull/2111. That is the main incremental contribution from this post.

Previously, you spatially pack everything before you run conv2d. This causes problems if the input is very large and the intermediate data does not fit into cache. What you can do instead is divide your input by height and width into smaller regions, and run the spatial pack on each region sequentially.

How about padding? I worry that if we divide the input into segments, we will handle the padding incorrectly.

Also, could you give me a code reference for how to divide the data by height/width and then run spatial_pack / conv2d? Thanks in advance.

The rough idea is to divide the computation into conv2ds with smaller input regions. Imagine you are doing conv2d on a 224x224 input: you can divide it into four 112x112 conv2ds, then iterate over the 2x2 grid in the outer loop.

Then each small workload is a 112x112 conv2d, and we can reuse the spatial-pack data for it.
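
Here is a minimal numpy sketch of that splitting (single channel, stride 1, a naive direct conv just for checking; names are made up). Note that padding is applied once on the full input, and neighbouring segments simply read overlapping halos of KH - 1 rows/columns, which also addresses the padding concern:

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive single-channel stride-1 'valid' conv, just for illustration."""
    H, W = x.shape
    KH, KW = k.shape
    out = np.zeros((H - KH + 1, W - KW + 1), dtype=x.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + KH, j:j + KW] * k)
    return out

H = W = 224
KH = KW = 3
x = np.random.randn(H + 2, W + 2).astype(np.float32)   # input, already padded by 1
k = np.random.randn(KH, KW).astype(np.float32)

ref = conv2d_valid(x, k)                                # one big 224x224 conv2d
out = np.zeros_like(ref)
seg = 112
for i in range(0, H, seg):                              # 2x2 outer loop over segments
    for j in range(0, W, seg):
        # Each segment reads a (seg + KH - 1) x (seg + KW - 1) input region; the
        # halos of neighbouring segments overlap, so no extra padding is needed.
        tile = x[i:i + seg + KH - 1, j:j + seg + KW - 1]
        # ... here you would spatial-pack `tile` and call the packed conv2d ...
        out[i:i + seg, j:j + seg] = conv2d_valid(tile, k)

assert np.allclose(ref, out)
```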

@tqchen Thanks for replying. Do you mean that we don't change any of the spatial pack logic, and just change the logic of

tvm.compute(oshape, lambda n, c, h, w: tvm.sum(data_vec[n][c][h][w] * kernel[n][c][h][w]))?

Our graph infer_shape pass will check the output shape, so from my view we cannot simply change it.

So could you use code to express your thought? Code might express it more directly. Thanks.

I have a question about the packing of the input. In unet_conv2d_NCHWc_direct.py:90, the shape of the packed input is determined. In my understanding, we organize the input tensor so that the input region corresponding to a 'VH*VW' spatial region of the output tensor is packed together. If that is the case, I think the dimensions should be (KH + HSTR*(VH-1), KW + WSTR*(VW-1)) instead of (VH*HSTR + KH - 1, VW*WSTR + KW - 1).
Could you please help me with this question?
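
To make the difference concrete with made-up numbers (KH = 3, HSTR = 2, VH = 8):

```python
KH, HSTR, VH = 3, 2, 8                   # hypothetical kernel size, stride, output tile height
rows_needed    = KH + HSTR * (VH - 1)    # 3 + 2*7 = 17 input rows actually read
rows_allocated = VH * HSTR + KH - 1      # 16 + 2  = 18 rows in the current formula
print(rows_needed, rows_allocated)       # the two only coincide when HSTR == 1
```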

You are right. Would you mind sending a patch?

But I believe the final generated code will be the same, because the current code always uses the larger shape, and the bound inference pass in the TVM compiler will correct it to the minimal required shape.

Hi, thanks for your reply. I read the PR (https://github.com/dmlc/tvm/pull/2111/files), but I found that the direct convolution does not use spatial packing (topi/python/topi/x86/conv2d.py L434:L498).