Improving quantization accuracy with more precise bias


#1

I’m working on int8 calibration and have some observations on the choice of the precision of bias.

I implemented per-layer activation scales (similar to this PR: https://github.com/dmlc/tvm/pull/2753) and used a simple calibration method: for each layer, take the power-of-2 scale of the maximum absolute value of its output over a small calibration dataset. This approach works well on some models such as resnet-50 and vgg-16 (ImageNet absolute top-1 accuracy drop of ~1%).
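
A minimal sketch of this kind of max-abs, power-of-2 calibration, assuming NumPy and that layer outputs were already collected per calibration batch. The helper names `power2_scale` and `calibrate` are hypothetical, and rounding the scale up to the next power of two is an assumption (the post does not say which rounding was used):

```python
import numpy as np

def power2_scale(max_abs):
    """Round max_abs up to the next power of two (assumed rounding direction)."""
    if max_abs <= 0:
        return 1.0
    return float(2.0 ** np.ceil(np.log2(max_abs)))

def calibrate(layer_outputs):
    """layer_outputs maps a layer name to a list of arrays, one per batch."""
    scales = {}
    for name, batches in layer_outputs.items():
        # Largest absolute activation seen over the whole calibration set.
        max_abs = max(float(np.abs(b).max()) for b in batches)
        scales[name] = power2_scale(max_abs)
    return scales

# Toy "calibration set" with two layers.
outputs = {
    "conv1": [np.array([0.3, -2.7]), np.array([1.9, 0.1])],
    "fc": [np.array([5.0, -6.0])],
}
scales = calibrate(outputs)
print(scales)
```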

However, on resnet-101 the accuracy loss is larger. I tried quantizing the bias term (the rhs constant of add) using 16 bits (32 bits would make the lhs overflow after the left shift) and obtained good accuracy. Specifically, in add_rewrite I marked the rhs constant as QAnnotateKind.BIAS and modified these lines https://github.com/dmlc/tvm/blob/31ba01399e6f9f0b4146afee685f16c5ddc68f91/python/tvm/relay/quantize/quantize.py#L243 as:

if kind == QAnnotateKind.BIAS:
  const_params[ndom_scale] = _make_const(scale / (2**15))
  const_params[nclip_min] = _make_const(- ((2**31) - 1))
  const_params[nclip_max] = _make_const(((2**31) - 1))

My patch changes the scale selected in AddRealize. Previously, the lhs scale was selected in almost all cases, because the lhs scale comes from conv2d and is the product of the input and weight scales (smaller than the rhs scale), so the bias was shifted left.
After my patch, the bias scale is selected and the conv2d result is shifted left.
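
To make the scale selection concrete, here is a toy NumPy sketch (not the actual AddRealize code) of adding two integer tensors with different power-of-2 scales: the smaller scale is kept and the other operand is shifted left. The function name `unify_and_add` and all values are made up for illustration:

```python
import numpy as np

def unify_and_add(lhs_q, lhs_scale, rhs_q, rhs_scale):
    """Add int32 tensors whose real values are q * scale (power-of-2 scales).

    The operand with the larger scale is shifted left so that both operands
    share the smaller scale before the integer add.
    """
    if lhs_scale <= rhs_scale:
        shift = int(round(np.log2(rhs_scale / lhs_scale)))
        return lhs_q + np.left_shift(rhs_q, shift), lhs_scale
    shift = int(round(np.log2(lhs_scale / rhs_scale)))
    return np.left_shift(lhs_q, shift) + rhs_q, rhs_scale

conv_q = np.array([100, -40], dtype=np.int32)    # conv2d result, scale 2**-10

# Coarse bias (scale 2**-7): the bias is shifted left, the lhs scale is kept.
out_a, scale_a = unify_and_add(conv_q, 2.0 ** -10,
                               np.array([3, 5], dtype=np.int32), 2.0 ** -7)

# Finer bias (scale 2**-15): the conv2d result is shifted left instead.
out_b, scale_b = unify_and_add(conv_q, 2.0 ** -10,
                               np.array([768, 1280], dtype=np.int32), 2.0 ** -15)

print(out_a * scale_a, out_b * scale_b)
```

Both calls represent the same real values here because the example bias is exactly representable at either scale; the difference only shows up for bias values that need the finer grid.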

A left shift in either case doesn’t lead to overflow and should not cause precision loss. But it is possible that bias values are shared per channel, so more bits help.
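
As a rough illustration of why extra bias bits can matter, the sketch below fake-quantizes a made-up bias vector at 8 and 16 bits with the same power-of-2 scale and compares the worst rounding error. The `fake_quantize` helper is hypothetical and assumes symmetric quantization with step `scale / 2**(bits - 1)`:

```python
import numpy as np

def fake_quantize(x, scale, bits):
    """Quantize to a signed integer grid and map back to float."""
    step = scale / 2 ** (bits - 1)      # real value of one integer step
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / step), -qmax, qmax)
    return q * step

bias = np.array([0.8, -0.013, 0.0042, 0.27])   # made-up bias values
scale = 1.0

err8 = np.abs(fake_quantize(bias, scale, 8) - bias).max()
err16 = np.abs(fake_quantize(bias, scale, 16) - bias).max()
print(err8, err16)
```

With 8 bits the step is 2**-7, so small bias entries are comparable to the rounding error; with 16 bits the step shrinks to 2**-15.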

I would like to discuss the observation here and see how we can improve the quantization accuracy.

Some of my experiment numbers:
resnet101 top-1/top-5 accuracy on ImageNet (first 3k images):

  • power-of-2 scale weight & activation + 8bit bias (current TVM): 0.6876/0.8887
  • power-of-2 scale weight & activation + 9bit bias: 0.7633/0.9276
  • power-of-2 scale weight & activation + 16bit bias: 0.7697/0.9353
  • float32: 0.7777/0.9387

cc @eqy @ziheng @tqchen


#2

Thanks for the great work!

I’m currently working on fully automatic calibration that just does the most commonly used method: picking the domain scale that minimizes the L2 norm of the quantization error. We use this method to implement calibration for both per-layer and optionally per-channel scales, and potentially weights as well. This also provides some improvements to the current quantization accuracy results, so I am interested to see what we can get with the combination of all the improvements.
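
A minimal sketch of L2-based scale selection, assuming a fixed list of power-of-2 candidates and symmetric 8-bit quantization. The helper names and the synthetic data are made up for illustration; this is not the calibration code being discussed:

```python
import numpy as np

def fake_quantize(x, scale, bits=8):
    """Quantize to a signed integer grid and map back to float."""
    step = scale / 2 ** (bits - 1)
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / step), -qmax, qmax) * step

def best_l2_scale(x, candidates, bits=8):
    """Pick the candidate scale with the smallest L2 quantization error."""
    errors = [np.linalg.norm(x - fake_quantize(x, s, bits)) for s in candidates]
    return candidates[int(np.argmin(errors))]

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10000)
x[0] = 40.0                                    # a single large outlier
candidates = [2.0 ** k for k in range(-2, 7)]  # 0.25 ... 64

scale = best_l2_scale(x, candidates)
print(scale)
```

Unlike a max-abs rule, this search is free to clip a rare outlier when that lowers the total error over the bulk of the values.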

By the way, have you tried your pass on the v2 versions of resnet in the mxnet model zoo? I see catastrophic accuracy drops with the current pass on those models, and I wonder if it is due to problems with the bias.


#3

Yes, the resnet models mentioned above are v2. Resnet-50 and resnet-101 v2 have a ~1% drop.
I observed a significant accuracy drop on resnet18 v2 when using power-of-2 activation scales instead of a global scale, but the accuracy is normal after setting skip_k_conv = 2.

By the way, I’m also working on a KL-divergence-based scale. I assume it will give results similar to the L2-norm-based ones.


#4

By the way, I noticed that 2723 added a skip callback https://github.com/dmlc/tvm/blob/d39a4ea000d6d2a1879c0a6a7aa819de4e16eb27/python/tvm/relay/build_module.py#L183 which breaks CSE (does not crunch down matching int32 casts) for my per-channel quantization pass. Is there another use case that requires this callback?


#5

The use case is quantized resnet.
Consider a residual add case:

data
|     |
bn    |
|     |
relu  |
|     |
conv  | 
|    /
add

After quantization, there will be duplicated simulated_quantization(data, INPUT) in two branches.

Writing and reading int32 results to/from global memory can be slow, so we use stop_fusion to ensure that each subfunction outputs int8 results. We don’t combine the cast(i32) ops, so that cast(i32) is done in each consumer subfunction.


#6

Why is stop_fusion not orthogonal to removing duplicated int32 casts?


#7

stop_fusion only breaks fusion of cast(i8) and cast(i32). We want to keep the duplicated i32 casts.


#8

I am not sure I understand the situation, can you annotate where the casts are in your example?

Right now I think the current fskip implementation is too brittle and has consequences downstream. In my quantization use case skipping cast(i32) causes every identity branch to be exhaustively recomputed because CSE stops at that step.
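
A toy model of this effect, not the Relay pass itself: expressions are nested tuples, CSE merges structurally equal nodes, and a hypothetical `fskip` predicate exempts some ops from merging. Because parents are keyed on their children's ids, an unmerged cast also keeps everything downstream of it from merging:

```python
def count_nodes_after_cse(expr, fskip=lambda op: False):
    """Count distinct nodes left after CSE on a nested-tuple expression."""
    table = {}       # structural key -> node id
    next_id = [0]

    def fresh():
        next_id[0] += 1
        return next_id[0] - 1

    def visit(e):
        if isinstance(e, str):               # leaf variable
            key = ("leaf", e)
        else:
            op, *args = e
            key = (op,) + tuple(visit(a) for a in args)
            if fskip(op):
                return fresh()               # skipped ops are never merged
        if key not in table:
            table[key] = fresh()
        return table[key]

    visit(expr)
    return next_id[0]

# Residual block: both branches cast the same int8 tensor to int32.
q = ("cast_i8", "x")
expr = ("add", ("conv", ("cast_i32", q)), ("cast_i32", q))

merged = count_nodes_after_cse(expr)
skipped = count_nodes_after_cse(expr, fskip=lambda op: op == "cast_i32")
print(merged, skipped)
```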


#9

The above example after annotation:

data
|                       |
sim_quantize(QINPUT)    sim_quantize(QINPUT)
|                       |
add(bn_bias)            |
|                       |
...                    /
|                     /
add

data is usually the output of the previous conv2d. There are duplicated simulated_quantize ops. The following add in each branch converts the int8 to int32, so simulated_quantize + add in each branch is translated to right_shift + cast(i8) + cast(i32).
We use stop_fusion to ensure that the previous conv2d result is cast to int8 before it is saved to global memory.
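
A small NumPy sketch of the right_shift + cast(i8) + cast(i32) pattern, to show what ends up in memory. The shift amount and values are made up, and the plain arithmetic shift without rounding is a simplification that may differ from what the realize pass actually emits:

```python
import numpy as np

def requantize_to_int8(acc_i32, shift):
    """right_shift + clip + cast(i8): what the producer writes to memory."""
    shifted = np.right_shift(acc_i32, shift)
    return np.clip(shifted, -127, 127).astype(np.int8)

acc = np.array([5000, -1234, 70000], dtype=np.int32)   # conv2d accumulator
stored = requantize_to_int8(acc, 6)                     # int8 in global memory

# Each consumer branch redoes the cheap cast(i32) on the int8 tensor.
branch_a = stored.astype(np.int32)
branch_b = stored.astype(np.int32)

print(stored, stored.nbytes, acc.nbytes)
```

The stored tensor takes a quarter of the bytes of the int32 accumulator, at the cost of one extra cast per consumer.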

You will see the difference running quantized ResNet-50 v2.


#10

So I think the issue is that we have somewhat different use cases :). I am prototyping per-channel quantization on CPU, where the compute:bandwidth ratio is lower, so the difference is probably not as apparent. However, in my situation, preventing the casts from being removed also explodes even resnet-18 to over 3000 intermediate values, which is far worse than the bandwidth overhead. I wonder whether modifying the annotate pass to treat the adds differently here would work.


#11

This annotation https://github.com/dmlc/tvm/blob/a6d04b8daaa1e75b00b61755260f8ea17f07ba7c/python/tvm/relay/quantize/_annotate.py#L239 might be unnecessary in your case (but useful for current CUDA quantization). If you remove this line, there won’t be duplicated simulated_quantize.