Skipping the first layer in 8-bit quantization

In the tutorial on deploying a pretrained vision model on VTA (https://docs.tvm.ai/vta/tutorials/frontend/deploy_vision_on_vta.html#build-the-inference-graph-runtime), why are the first conv layer and the dense layers skipped for 8-bit quantization?


Mostly because we offload the first layer to the CPU, and quantization does not really improve performance for this layer on the CPU compared to fp32.
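
For reference, the VTA tutorial expresses this skip through the quantization config. Here is a minimal sketch of that step (the variable names `mod` and `params` are assumed to come from a frontend import, not copied from the tutorial):

    from tvm import relay

    # Sketch: quantize everything to int8 except conv layer 0, which stays
    # in fp32 and runs on the CPU; "mod" is a Relay IRModule and "params"
    # its parameter dict from the frontend importer.
    with relay.quantize.qconfig(global_scale=8.0, skip_conv_layers=[0]):
        mod = relay.quantize.quantize(mod, params=params)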

Why is the first layer being offloaded to the CPU?

VTA performs matrix-vector multiplication, which requires a minimum channel depth of 16 (by default). The first layer has only 3 input channels, so we would need expensive layout transformations and would mostly pad channels with 0s, leading to ineffective computation. It is therefore more cost-effective to run layer 0 on the CPU.
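
To make the padding cost concrete, here is a back-of-the-envelope calculation (my own illustration, not from the thread): padding 3 real channels up to a block of 16 means most of the loaded lanes carry zeros.

    # Illustration: fraction of useful work when 3 input channels are
    # padded to VTA's default input-channel block size of 16.
    BLOCK_IN = 16
    real_channels = 3
    print(f"useful lanes: {real_channels / BLOCK_IN:.1%}")  # ~18.8%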


I see. Thanks for your replies.

Does TVM make this decision based on the per-layer computation efficiency of CPU vs. VTA, or only for the first layer? Does it affect all types of layers or only conv?

That was a manual decision; TVM does not make that decision automatically at the moment, but having such a feature (for heterogeneous execution) would most definitely be valuable.

This only affects the first layer of resnet.

Thanks for your explanation. I was wondering if your conclusion holds for GPU-based computation.

In the TVM framework, the first conv layer (and dense layers) are skipped by default in the quantization process; see the skip_conv_layers and skip_dense_layer entries below.

    _node_defaults = {
        "nbit_input": 8,
        "nbit_weight": 8,
        "nbit_activation": 32,
        "dtype_input": "int8",
        "dtype_weight": "int8",
        "dtype_activation": "int32",
        "calibrate_mode": "global_scale",
        "global_scale": 8.0,
        "weight_scale": "power2",
        "skip_dense_layer": True,
        "skip_conv_layers": [0],
        "do_simulation": False,
        "round_for_shift": True,
        "debug_enabled_ops": None,
        "rounding": "UPWARD",
        "calibrate_chunk_by": -1,
        "partition_conversions": "disabled",
    }

source: https://github.com/apache/tvm/blob/main/python/tvm/relay/quantize/quantize.py
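
If the goal is to quantize the first conv layer as well (e.g., on a CUDA target, where VTA's 16-channel constraint does not apply), these defaults can be overridden through qconfig. A minimal sketch, assuming `mod` and `params` hold the Relay module and its weights:

    from tvm import relay

    # Sketch (assumption): override the defaults shown above so that no conv
    # or dense layer is skipped during quantization.
    with relay.quantize.qconfig(skip_conv_layers=[], skip_dense_layer=False):
        quantized_mod = relay.quantize.quantize(mod, params=params)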

What is the reason for this design choice (in CUDA-based optimization)?