[RFC] Use auto-tuner to improve conv2d_gemm performance

Introduction and motivations

In the past few weeks, we introduced quite a few optimizations for AArch64 targets:

The aim of those optimizations was to have a “good” enough out-of-the-box performance. We will use this RFC to track down auto-tuner optimizations related to quantized convolution for AArch64 targets (with NHWC layout).

Tuning entities

In the following paragraphs we provide a list of the tuning entities we use in our conv2d schedule.

Unrolling and vectorizing matrix transform

After the im2col operation we need to interleave the input matrix into a [rows//4, cols//16, 4, 16] shape. The two innermost loops, over the 4 and 16 elements, are perfect candidates for an annotation with a try_unroll_vec policy.
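To make the transform concrete, here is a small NumPy model of the interleaving (the helper name `interleave_4x16` is illustrative, not the actual TVM compute; in the schedule the two innermost axes of this layout are what `cfg.define_annotate(..., policy="try_unroll_vec")` lets the tuner unroll and/or vectorize):

```python
import numpy as np

def interleave_4x16(a):
    """Reshape a [rows, cols] matrix into [rows//4, cols//16, 4, 16] blocks.

    Illustrative NumPy model of the post-im2col interleaving; rows must be
    a multiple of 4 and cols a multiple of 16 (the real schedule pads first).
    """
    rows, cols = a.shape
    assert rows % 4 == 0 and cols % 16 == 0
    # Split rows into (rows//4, 4) and cols into (cols//16, 16),
    # then bring the block indices to the front.
    return a.reshape(rows // 4, 4, cols // 16, 16).transpose(0, 2, 1, 3)

a = np.arange(8 * 32).reshape(8, 32)
b = interleave_4x16(a)
print(b.shape)        # (2, 2, 4, 16)
print(b[1, 0, 2, 3])  # element a[4*1 + 2, 16*0 + 3] == a[6, 3] == 195
```

The loops over the trailing 4 and 16 axes are small and of fixed extent, which is why letting the tuner try unrolling, vectorizing, or both is cheap and effective.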

Reordering gemm

When we calculate gemm we run the computation over a shape of [M//4, N//4, 4, 4] and parallelize over the outermost dimension, which by default is M//4. However, M might be too small to offer enough parallelism. The idea is to reorder the two outer dimensions [M//4, N//4] through a (0,1) or (1,0) reordering (default is (0,1)), and then parallelize over whichever dimension ends up outermost.

Unrolling gemm_quantized intrinsic

The inner loop of GEMM is done through a gemm_quantized_4_4 intrinsic. This is a hand-written piece of AArch64 assembly. In order to introduce loop unrolling we add a boolean knob unroll and use it within the implementation. In the following snippet we show how the unroll knob will be used:

if unroll:
    k = int(K // 16)
    for l in range(0, k):
        cc_code += main_loop
else:
    cc_code += main_loop
    cc_code += """ "subs %w[k], %w[k], #1\\n"
                   "cbnz %w[k], 1b\\n" """

In the future, we might either use more sophisticated policies within the intrinsic implementation (easy) or we might try to move the loop outside the intrinsic (i.e., using normal TIR IterVars) and use standard TVM annotation entities (this will be hard, because of the final accumulation “after” the loop, see [RFC] Improve quantized convolution performance for armv8 architectures).

Parallel pipelines

As described in the software optimization guides of many recent Arm processors (e.g., the Neoverse N1), instructions like uadalp and umull can be issued to different functional units, which calls for slightly different instruction scheduling.

If we have a look at the original intrinsic implementation (for the higher part of the first half), we see the following:

// Higher part of a0 * {b0,b1,b2,b3}
"umull v8.8h, v0.8b, v4.8b\\n"
"umull v9.8h, v0.8b, v5.8b\\n"
"umull v10.8h, v0.8b, v6.8b\\n"
"umull v11.8h, v0.8b, v7.8b\\n"
// Higher part of a1 * {b0,b1,b2,b3}
"umull v12.8h, v1.8b, v4.8b\\n"
"umull v13.8h, v1.8b, v5.8b\\n"
"umull v14.8h, v1.8b, v6.8b\\n"
"umull v15.8h, v1.8b, v7.8b\\n"
// Accumulate
"uadalp v16.4s, v8.8h\\n"
"uadalp v17.4s, v9.8h\\n"
"uadalp v18.4s, v10.8h\\n"
"uadalp v19.4s, v11.8h\\n"
"uadalp v20.4s, v12.8h\\n"
"uadalp v21.4s, v13.8h\\n"
"uadalp v22.4s, v14.8h\\n"
"uadalp v23.4s, v15.8h\\n"

So the umull and uadalp instructions are batched together (first a round of umulls, then a round of uadalps). Depending on the latencies of the micro-architecture in question, different implementations will behave differently. An interleaved version of the same sequence looks like this:

// First half
// Higher part of a0 * {b0,b1,b2,b3} and accumulate
"umull v8.8h, v0.8b, v4.8b\\n"
"uadalp v16.4s, v8.8h\\n"
"umull v9.8h, v0.8b, v5.8b\\n"
"uadalp v17.4s, v9.8h\\n"
"umull v10.8h, v0.8b, v6.8b\\n"
"uadalp v18.4s, v10.8h\\n"
"umull v11.8h, v0.8b, v7.8b\\n"
"uadalp v19.4s, v11.8h\\n"
 
// Higher part of a1 * {b0,b1,b2,b3} and accumulate
"umull v12.8h, v1.8b, v4.8b\\n"
"uadalp v20.4s, v12.8h\\n"
"umull v13.8h, v1.8b, v5.8b\\n"
"uadalp v21.4s, v13.8h\\n"
"umull v14.8h, v1.8b, v6.8b\\n"
"uadalp v22.4s, v14.8h\\n"
"umull v15.8h, v1.8b, v7.8b\\n"
"uadalp v23.4s, v15.8h\\n"

Interleaving those instructions improves pipeline utilization and the speed of the convolution (since the umull and uadalp instructions can execute in parallel on different functional units). Instead of hard-coding one implementation or the other, we introduce an interleave boolean knob to switch between the batched and interleaved intrinsics. This lets the auto-tuner choose the best implementation for the underlying micro-architecture.
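The knob's effect on instruction order can be modeled with a small sketch (function name and abbreviated instruction strings are illustrative; the real knob selects between two hand-written assembly bodies):

```python
def schedule_mul_acc(pairs, interleave):
    """Order a list of (umull, uadalp) instruction pairs.

    interleave=False batches all multiplies first, then all accumulates;
    interleave=True alternates them so the multiply and accumulate
    functional units can overlap.  Illustrative model of the boolean knob.
    """
    if interleave:
        return [ins for pair in pairs for ins in pair]
    return [m for m, _ in pairs] + [a for _, a in pairs]

pairs = [("umull v8, v0, v4", "uadalp v16, v8"),
         ("umull v9, v0, v5", "uadalp v17, v9")]
print(schedule_mul_acc(pairs, interleave=False))
# ['umull v8, v0, v4', 'umull v9, v0, v5', 'uadalp v16, v8', 'uadalp v17, v9']
print(schedule_mul_acc(pairs, interleave=True))
# ['umull v8, v0, v4', 'uadalp v16, v8', 'umull v9, v0, v5', 'uadalp v17, v9']
```

Note that each uadalp only depends on the umull that feeds it, so both orderings are correct; which one is faster depends on issue width and instruction latencies, which is exactly why we leave the choice to the tuner.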

Results

We tested these changes against a few well-known networks (instead of focusing on a single one). The following table shows the results (running on a Neoverse N1 device with 4 threads):

| Network      | Speed-up (TFLite time / TVM time) |
| ------------ | --------------------------------- |
| inception V3 | 1.176                             |
| inception V4 | 1.128                             |
| resnet 50    | 1.338                             |
| squeezenet   | 1.476                             |
| vgg 16       | 1.103                             |

A few things to note:

  • We are now between 10% and 47% faster than TFlite on AArch64
  • Tuning time is at most 10 minutes, even for the very big networks
  • We didn’t evaluate mobilenet networks, as those rely on depthwise convolution optimizations. We are carrying out those optimizations separately and will try to evaluate mobilenet at a later stage

PR

The PR for this RFC is available here


cc @ramana-arm @anijain2305 @FrozenGene