TVM auto-tuning: Winograd conv2D implementation for int8 gives wrong classification results

First, I applied int8 quantization without auto-tuning to ResNet-50, following this tutorial, and found that the classification accuracy is only slightly reduced compared to the FP32 result.
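For reference, here is a minimal sketch of that quantization step, assuming a ResNet-50 Relay module `(mod, params)` has already been imported; the qconfig values are illustrative, not the exact tutorial settings:

```python
import tvm
from tvm import relay

# Quantize the FP32 model to int8 (illustrative global-scale calibration).
with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
    mod = relay.quantize.quantize(mod, params=params)

# Compile for CUDA without applying any tuning log.
with tvm.transform.PassContext(opt_level=3):
    graph, lib, graph_params = relay.build(mod, target="cuda", params=params)
```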

So far so good. However, if the int8 quantization is followed by auto-tuning, the classification accuracy drops to almost zero for ResNet-50. I was eventually able to trace this issue to the Winograd implementation configurations of int8 conv2D layers: for some int8 conv2D layers the auto-tuner optimizes both the direct implementation and the Winograd implementation, and both configurations are saved as "best configurations" in the tuning .log file (via autotvm.record.pick_best; see the usage sketch after the log entries). Below is an example of such a configuration pair from the tuning .log file:

```json
{"input": ["cuda -model=unknown", "conv2d_nchw_winograd.cuda", [["TENSOR", [1, 128, 28, 28], "int8"], ["TENSOR", [128, 128, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "int32"], {}], "config": {"index": 255798, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 8, 4, 4]], ["tile_x", "sp", [-1, 2, 49, 1]], ["tile_rc", "sp", [-1, 32]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 0]]}, "result": [[3.169832654949121e-05], 0, 1.5491132736206055, 1593433510.9009185], "version": 0.2, "tvm_version": "0.7.dev1"}

{"input": ["cuda -model=unknown", "conv2d_NCHWc_int8.cuda", [["TENSOR", [1, 128, 28, 28], "int8"], ["TENSOR", [128, 128, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "int32"], {}], "config": {"index": 155223637, "code_hash": null, "entity": [["tile_n", "sp", [-1, 1, 1, 1]], ["tile_f", "sp", [-1, 2, 1, 4]], ["tile_y", "sp", [-1, 1, 4, 1]], ["tile_x", "sp", [-1, 2, 14, 1]], ["fuse_yx", "ot", 0], ["tile_rc", "sp", [-1, 4]], ["tile_ry", "sp", [-1, 1]], ["tile_rx", "sp", [-1, 1]], ["reorder_inner", "re", [0, 1, 2]], ["AA_double_buffer", "ot", 0], ["WW_double_buffer", "ot", 1], ["auto_unroll_max_step", "ot", 512]]}, "result": [[8.578269310344828e-05], 0, 1.6151304244995117, 1593433548.2312698], "version": 0.2, "tvm_version": "0.7.dev1"}
```
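For context, the records above were extracted with autotvm.record.pick_best, which keeps the fastest record per workload; since the direct and Winograd kernels count as different workloads, one record of each kind can survive for the same layer. A minimal sketch (file names are placeholders):

```python
from tvm import autotvm

# Keep only the fastest record for each workload in the tuning log.
# Direct and Winograd implementations are distinct workloads, so both
# can end up in the "best" log for the same conv2D layer.
autotvm.record.pick_best("resnet50_int8_tuning.log", "resnet50_int8_best.log")
```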

Since some of the Winograd configurations were slightly faster in my case, they were chosen at compile time and produce incorrect network output. However, if I delete the Winograd int8 configurations for these layers from the .log file, so that only the configurations for the direct implementation remain, the network output is correct again.
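A minimal sketch of this filtering step, assuming the log format shown above (file names are placeholders):

```python
import json

# Drop every conv2d_nchw_winograd.cuda record whose tensors are int8,
# keeping the direct-implementation configurations untouched.
with open("resnet50_int8_best.log") as src, \
        open("resnet50_int8_filtered.log", "w") as dst:
    for line in src:
        rec = json.loads(line)
        task_name = rec["input"][1]
        args = rec["input"][2]
        has_int8_tensor = any(
            isinstance(arg, list) and arg and arg[0] == "TENSOR" and arg[2] == "int8"
            for arg in args
        )
        if task_name == "conv2d_nchw_winograd.cuda" and has_int8_tensor:
            continue  # skip the Winograd int8 configs that break the output
        dst.write(line)
```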

Consequently, I wonder whether there is a bug in the Winograd conv2D implementation for int8. If so, can it be fixed in the near future?

The problem might be related to this GitHub issue.

Yes, we had a bug on ARM for the quantized + Winograd case you mentioned. I was under the impression that Winograd is not supposed to be enabled for int8 under the CUDA target, but if this is happening with auto-tuning, it sounds like a bug.

cc @vinx13 @anijain2305

For now, you can add a dtype == int8 check in this condition. This is the same fix that was applied to ARM (see the PR above).
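For readers landing here later, a hedged sketch of what such a guard could look like, assuming the Winograd implementation is registered in conv2d_strategy_cuda in python/tvm/relay/op/strategy/cuda.py; the surrounding condition is paraphrased from TVM 0.7-era code and may differ in your checkout:

```python
# Excerpt-style sketch, not a drop-in patch: only register the Winograd
# implementation when the input data type is not int8.
if (
    2 < kh < 8 and 2 < kw < 8 and kh == kw
    and stride_h == 1 and stride_w == 1
    and dilation_h == 1 and dilation_w == 1
    and data.dtype != "int8"  # added guard, mirroring the ARM fix
):
    strategy.add_implementation(
        wrap_compute_conv2d(topi.cuda.conv2d_nchw_winograd),
        wrap_topi_schedule(topi.cuda.schedule_conv2d_nchw_winograd),
        name="conv2d_nchw_winograd.cuda",
        plevel=5,
    )
```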
