[RFC][Tensor Core] Optimization of Winograd conv2d on Tensor Core

Introduction

We optimized the Winograd conv2d algorithm for Tensor Core with the NHWC layout. The Winograd algorithm consists of four modules: feature-map transform, kernel transform, inverse transform, and batched GEMM (bgemm).
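For reference, these modules come from the standard per-tile Winograd formulation. Writing d for an input tile, g for a 3x3 kernel, and B, G, A for the Winograd transform matrices, the output tile is

$$
Y = A^{T}\left[\,(G\,g\,G^{T}) \odot (B^{T} d B)\,\right] A
$$

where ⊙ denotes element-wise multiplication; these element-wise products, batched over tiles and channels, are what bgemm computes.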

The following major functions were added:

1. conv2d_nhwc_winograd_tensorcore: in this module, bgemm is implemented on Tensor Core. Both fp16 and fp32 inputs and outputs are supported (a usage sketch follows this list).

2. conv2d_nhwc_winograd_direct: in this module, bgemm is implemented with a direct (non-Tensor Core) method. The Winograd algorithm falls back to this module when the input shapes are not supported by Tensor Core.

3. conv2d_nhwc_winograd_without_weight_transform: a module that implements Winograd convolution in NHWC layout with the kernel transform performed ahead of time.
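Below is a minimal, hypothetical usage sketch showing how an NHWC fp16 conv2d might be built through Relay so that the CUDA strategy can select a Tensor Core schedule; the shapes, dtypes, and target string are assumptions for illustration, not part of this RFC.

```python
# Illustrative only: NHWC fp16 conv2d built through Relay on a CUDA target.
# Shapes and the target string (sm_75 = T4) are assumptions for this sketch.
import tvm
from tvm import relay

data = relay.var("data", shape=(8, 14, 14, 256), dtype="float16")      # NHWC
weight = relay.var("weight", shape=(3, 3, 256, 256), dtype="float16")  # HWIO

out = relay.nn.conv2d(
    data, weight,
    padding=(1, 1),
    channels=256,
    kernel_size=(3, 3),
    data_layout="NHWC",
    kernel_layout="HWIO",
    out_dtype="float16",
)

mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda -arch=sm_75")
```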

We thank Siyuan Feng @Hzfengsy for his discussions and advice on these optimizations.

Optimization tricks for Winograd

In the data transform, kernel transform, and inverse transform modules, shared memory is used to rearrange the data, which enables coalesced global memory access on the GPU. A generic sketch of this staging pattern is shown below.
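As a rough illustration of the idea (not the actual schedule), here is a generic TVM te sketch that stages a tensor through shared memory with cooperative, coalesced loads; the compute, tile sizes, and names are placeholders.

```python
# Generic sketch: stage a tensor through shared memory so that consecutive
# threads load consecutive addresses (coalesced). Not the actual schedule;
# the compute, tile sizes, and names are placeholders.
import tvm
from tvm import te

n, c = 1024, 256
A = te.placeholder((n, c), name="A", dtype="float32")
B = te.compute((n, c), lambda i, j: A[i, j] * 2.0, name="B")

s = te.create_schedule(B.op)
AS = s.cache_read(A, "shared", [B])           # shared-memory staging buffer

bx, tx = s[B].split(B.op.axis[0], factor=64)
s[B].bind(bx, te.thread_axis("blockIdx.x"))
s[B].bind(tx, te.thread_axis("threadIdx.x"))

s[AS].compute_at(s[B], bx)                    # one cooperative load per block
# Bind the innermost copy axis to threadIdx.x so adjacent threads read
# adjacent global addresses, i.e. coalesced access.
co, ci = s[AS].split(s[AS].op.axis[-1], factor=64)
s[AS].bind(ci, te.thread_axis("threadIdx.x"))
```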

Vectorized data loading is used in bgemm, and the shared-memory offsets are tuned by AutoTVM to avoid bank conflicts, as sketched below.
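A sketch of what these two tricks might look like as schedule code, assuming a shared-memory stage AS, a schedule s, and an AutoTVM config cfg (all placeholders, not the actual implementation):

```python
# Sketch only: vectorized loads plus an AutoTVM-tuned shared-memory padding
# (via storage_align) to avoid bank conflicts. `cfg` is an AutoTVM config,
# `s`/`AS` a schedule and its shared-memory stage; the row length and the
# candidate knob values are illustrative assumptions.
def schedule_shared_load(cfg, s, AS, row_len=64):
    cfg.define_knob("offset", [0, 8])          # padding in elements
    cfg.define_knob("vec", [1, 2, 4, 8])       # vector load width

    # Pad each shared-memory row so that successive rows start in different
    # banks, removing conflicts between threads of a warp.
    pad = row_len + cfg["offset"].val
    s[AS].storage_align(AS.op.axis[-2], pad - 1, pad)

    # Make the innermost part of the copy loop a vector load (e.g. float4).
    fused = s[AS].fuse(*s[AS].op.axis)
    outer, inner = s[AS].split(fused, factor=cfg["vec"].val)
    s[AS].vectorize(inner)
```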

Performance

The benchmarks below were run on a T4 GPU (16 GB, 70 W). Latency is reported in milliseconds (ms).

| Batch size | Winograd (original), ms | Winograd (Tensor Core), ms | conv2d (Tensor Core), ms |
|---|---|---|---|
| 8 | 0.3039 | 0.1829 | 0.2968 |
| 16 | 0.5013 | 0.264 | 0.3495 |
| 32 | 0.944 | 0.4506 | 0.4852 |
| 256 | 7.8573 | 3.9432 | 2.8367 |

Table 1. 3x3 convolution with kernel shape 3x3x256x256; the input feature map shape is 14x14x256.
Note: the layout used for Winograd (original) is NCHW, while the others use NHWC.
Note: conv2d (Tensor Core) means running the unit test and ResNet-50 benchmark with the Tensor Core-enabled conv2d; Winograd (Tensor Core) means running them with the Tensor Core-enabled Winograd.

| Batch size | Winograd (original), ms | Winograd (Tensor Core), ms | conv2d (Tensor Core), ms |
|---|---|---|---|
| 8 | 0.2938 | 0.1979 | 0.3022 |
| 16 | 0.6412 | 0.3453 | 0.3771 |
| 32 | 1.185 | 0.6613 | 0.5431 |
| 256 | 9.5977 | 5.5837 | 3.8693 |

Table 2. 3x3 convolution with kernel shape 3x3x64x64; the input feature map shape is 56x56x64.

Tables 1 and 2 show that Winograd with Tensor Core outperforms the original Winograd algorithm for all batch sizes, with speedups in the range of 1.5x to 2.1x. However, Winograd performs worse than conv2d for large batch sizes when Tensor Core is enabled for both.

| Batch size | conv2d (Tensor Core), ms | Winograd (Tensor Core), ms | Speedup |
|---|---|---|---|
| 8 | 11.99 | 9.59 | 1.250 |
| 16 | 16.29 | 13.76 | 1.184 |
| 32 | 22.84 | 22.89 | 0.998 |
| 256 | 148.85 | 166.59 | 0.893 |

Table 3. ResNet-50 end-to-end latency (ms) on T4.
Note: only the 3x3 convolutions were run with the Winograd algorithm.

Table 3 presents the performance of ResNet-50 on Tensor Core with and without the Winograd algorithm. The performance improvements are significant for batch sizes 8 and 16 when the Winograd algorithm is used. However, Winograd performs worse than conv2d for batch sizes 32 and 256.

Limitations

1. The loops of the Winograd bgemm are split into blocks that are consumed by Tensor Core. The dimensions of the blocks fed into Tensor Core relate to P (the number of rows of the transformed feature-map matrix), the input channel, and the output channel. The input shapes of the Tensor Core units must be (8, 16, 32), (16, 16, 16), or (32, 16, 8) for fp16, hereafter denoted (t1, t2, t3). The current implementation requires that P, the input channel, and the output channel be divisible by t1, t2, and t3, respectively (a divisibility check is sketched below).
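A hypothetical helper to illustrate the constraint; the function name, the example shapes, and the fallback decision are assumptions for illustration only.

```python
# Hypothetical illustration of the current constraint: P, input channel, and
# output channel must be divisible by t1, t2, and t3 of some supported fp16
# fragment shape; otherwise the schedule falls back to the direct version.
TENSOR_CORE_SHAPES = [(8, 16, 32), (16, 16, 16), (32, 16, 8)]  # (t1, t2, t3)

def can_use_tensorcore(P, in_channel, out_channel):
    """True if some fragment shape (t1, t2, t3) evenly tiles the bgemm."""
    return any(
        P % t1 == 0 and in_channel % t2 == 0 and out_channel % t3 == 0
        for t1, t2, t3 in TENSOR_CORE_SHAPES
    )

# Shapes that fail the check fall back to conv2d_nhwc_winograd_direct.
print(can_use_tensorcore(128, 256, 256))  # True
print(can_use_tensorcore(98, 256, 256))   # False -> direct fallback
```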

Open questions:

1. The Winograd optimization on Tensor Core works well for small batch sizes such as 8 and 16. How can performance be improved for large batch sizes?


For a Winograd implementation with large batch sizes, you may want to refer to this paper: https://dl.acm.org/doi/pdf/10.1145/3332466.3374520.
They implement an assembler for the Volta/Turing architectures and use the CHWN layout for the large-batch Winograd algorithm.


Hi xiaocenxiaocen,

Thanks. I will follow up on this paper.

Best wishes,
Shawn Wu

Hi Shawn_Inspur, this RFC does not support int8. How can I make it work with int8? Thanks.