We optimized the Winograd algorithm of conv2d for Tensor Core with the NHWC layout. The Winograd algorithm consists of four modules: feature map transform, kernel transform, inverse transform, and batched GEMM (bgemm).
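To make the four modules concrete, here is a minimal pure-Python sketch of Winograd F(2x2, 3x3) on a handful of tiles. It is an illustration only, not the code in this PR: the function names are hypothetical, the tile size of 2 is an assumption, and the transform matrices are the standard ones for F(2x2, 3x3).

```python
# Standard transform matrices for Winograd F(2x2, 3x3).
B_T = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
A_T = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def sandwich(m, x):
    """Return m @ x @ m^T."""
    m_t = [list(row) for row in zip(*m)]
    return matmul(matmul(m, x), m_t)


def winograd_conv_tiles(tiles, kernels):
    """tiles[p][ci] is a 4x4 input tile, kernels[co][ci] is a 3x3 filter.

    Returns out[p][co], a 2x2 output tile (hypothetical helper).
    """
    P, CI, CO = len(tiles), len(tiles[0]), len(kernels)
    # 1. Feature map transform: V = B^T d B, one 4x4 result per (p, ci).
    V = [[sandwich(B_T, tiles[p][ci]) for ci in range(CI)] for p in range(P)]
    # 2. Kernel transform: U = G g G^T, one 4x4 result per (co, ci).
    U = [[sandwich(G, kernels[co][ci]) for ci in range(CI)] for co in range(CO)]
    # 3. Batched GEMM: for each of the 16 positions (i, j), multiply a
    #    (P x CI) matrix by a (CI x CO) matrix. This is the part that the
    #    text maps onto Tensor Core.
    M = [[[[sum(V[p][ci][i][j] * U[co][ci][i][j] for ci in range(CI))
            for j in range(4)] for i in range(4)]
          for co in range(CO)] for p in range(P)]
    # 4. Inverse transform: Y = A^T M A, one 2x2 output tile per (p, co).
    return [[sandwich(A_T, M[p][co]) for co in range(CO)] for p in range(P)]


def direct_conv_tile(tile, kernel):
    """Reference 3x3 correlation of one 4x4 tile, giving a 2x2 output."""
    return [[sum(tile[y + r][x + s] * kernel[r][s]
                 for r in range(3) for s in range(3))
             for x in range(2)] for y in range(2)]
```

Summing `direct_conv_tile` over the input channels should reproduce `winograd_conv_tiles` up to floating-point rounding, which is a convenient way to check each transform in isolation.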
The following major functions were added:
1, Conv2d_nhwc_winograd_tensorcore: In this module, bgemm is implemented on Tensor Core. Both fp16 and fp32 inputs and outputs are supported.
2, Conv2d_nhwc_winograd_direct: In this module, bgemm is implemented with the direct method, without Tensor Core. The Winograd algorithm falls back to this module when the input shapes are not supported by Tensor Core.
3, Conv2d_nhwc_winograd_without_weight_transform: A module that implements the kernel transform with the NHWC layout.
We thank Siyuan Feng @Hzfengsy for his discussions and advice on the optimizations.
Tricks of optimization on Winograd
In the feature map transform, kernel transform, and inverse transform modules, shared memory is used to rearrange the data, which facilitates coalesced memory access on the GPU.
Vectorized data loading is used in bgemm, and the offsets in shared memory were tuned by AutoTVM to avoid bank conflicts.
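Why a shared memory offset helps can be seen with a small model. The sketch below assumes a simplified GPU with 32 banks of 4-byte words and one warp of 32 threads reading down a column of a row-major tile; `max_bank_conflict` and the `offset` parameter are hypothetical names for illustration, and the real tuning picks the offset from measured run time rather than from a model like this.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide


def max_bank_conflict(row_stride, offset=0, num_threads=32):
    """Worst-case number of threads that hit the same bank.

    Thread t reads word t * (row_stride + offset), i.e. one column of a
    row-major tile whose rows are padded by `offset` words.
    """
    stride = row_stride + offset
    banks = Counter((t * stride) % NUM_BANKS for t in range(num_threads))
    return max(banks.values())


def best_offset(row_stride, candidates):
    """Pick the candidate padding with the fewest conflicts in this model."""
    return min(candidates, key=lambda off: max_bank_conflict(row_stride, off))
```

With a row stride of 32 words and no padding, every thread in the model lands in the same bank; padding each row by one word spreads the accesses over all 32 banks.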
The benchmarks below were run on a T4 GPU (16 GB, 70 W). Latency is reported in ms.
|batch size|Winograd (Original)|Winograd (TensorCore)|conv2d (TensorCore)|
Table 1. 3x3 convolution with a kernel shape of 3x3x256x256. The input feature map shape is 14x14x256. Note: the layout in Winograd (Original) is NCHW, while the others use NHWC.
Note: conv2d (TensorCore) denotes the unit test and ResNet50 benchmark run with Tensor Core enabled conv2d. Winograd (TensorCore) denotes the same runs with Tensor Core enabled Winograd.
|batch size|Winograd (Original)|Winograd (TensorCore)|conv2d (TensorCore)|
Table 2. 3x3 convolution with a kernel shape of 3x3x64x64. The input feature map shape is 56x56x64.
Tables 1 and 2 show that Winograd with Tensor Core outperforms the original Winograd algorithm for all batch sizes, with a speedup between 1.5x and 2.1x. However, when Tensor Core is enabled for both, Winograd is slower than conv2d for large batch sizes.
|batch size|conv2d (TensorCore)|Winograd (TensorCore)|Speedup|
Table 3. ResNet50 on T4.
Note: Only the 3x3 convolutions were run with the Winograd algorithm.
Table 3 compares the performance of ResNet50 on Tensor Core with and without the Winograd algorithm. Winograd improves performance for batch sizes 8 and 16, but it is slower than conv2d for batch sizes 32 and 256.
1, The loops of the Winograd bgemm are split into blocks that are consumed by Tensor Core. The dimensions of the blocks fed into Tensor Core relate to P (the number of rows in the converted feature map matrix), the input channel, and the output channel. The input shape of a Tensor Core unit must be (8, 16, 32), (16, 16, 16), or (32, 16, 8) for fp16, hereafter written as (t1, t2, t3). The current implementation requires P, the input channel, and the output channel to be divisible by t1, t2, and t3, respectively.
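The divisibility requirement above can be sketched as a small shape check that decides between the Tensor Core and direct bgemm paths. This is a hedged illustration, not the dispatch code in this PR: the helper names are hypothetical, and it assumes P equals the batch size times the number of output tiles, with an output tile size `tile_size` that is itself an assumption.

```python
# (t1, t2, t3) shapes for fp16, as listed in the text.
TENSORCORE_SHAPES = [(8, 16, 32), (16, 16, 16), (32, 16, 8)]


def num_tile_rows(batch, out_h, out_w, tile_size):
    """Assumed definition of P: batch * ceil(H / m) * ceil(W / m)."""
    n_h = (out_h + tile_size - 1) // tile_size
    n_w = (out_w + tile_size - 1) // tile_size
    return batch * n_h * n_w


def matching_shapes(p, in_channels, out_channels):
    """Return every (t1, t2, t3) that divides (P, CI, CO) element-wise."""
    return [(t1, t2, t3) for (t1, t2, t3) in TENSORCORE_SHAPES
            if p % t1 == 0 and in_channels % t2 == 0 and out_channels % t3 == 0]


def choose_bgemm(p, in_channels, out_channels):
    """Pick the Tensor Core path if any shape fits, else the direct path."""
    if matching_shapes(p, in_channels, out_channels):
        return "tensorcore"
    return "direct"
```

Under these assumptions, a 14x14 output with 256 input and output channels and a tile size of 2 gives P = 392 at batch size 8, which fits only the (8, 16, 32) shape, while batch size 1 gives P = 49 and falls back to the direct path.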
1, The Winograd optimizations on Tensor Core work well for small batch sizes such as 8 or 16. How can performance be improved for large batch sizes?