[Quantization] kl_divergence fails when feeding large inputs

Dear community, I’m using kl_divergence to quantize a fairly large in-house network. I’ve implemented a mechanism to feed it pickle input frames which I generate from the reference implementation. Since the network inputs are quite large, the resulting (binary-encoded) pickle files grow to around 14 MB per frame. Currently I’m feeding around 157 frames (around 2.2 GB in total), at which point the quantizer fails with the following error:

tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (5) /home/buecs/tvm/build/libtvm.so(TVMFuncCall+0x65) [0x7f969a57db25]
  [bt] (4) /home/buecs/tvm/build/libtvm.so(+0x402c34) [0x7f9699d55c34]
  [bt] (3) /home/buecs/tvm/build/libtvm.so(+0x402aa7) [0x7f9699d55aa7]
  [bt] (2) /home/buecs/tvm/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule const&, tvm::transform::PassContext const&) const+0x389) [0x7f9699d557d9]
  [bt] (1) /home/buecs/tvm/build/libtvm.so(tvm::transform::ModulePassNode::operator()(tvm::IRModule const&, tvm::transform::PassContext const&) const+0x10f) [0x7f9699d549af]
  [bt] (0) /home/buecs/tvm/build/libtvm.so(+0xc25f8b) [0x7f969a578f8b]
  File "/home/buecs/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 78, in cfun
    rv = local_pyfunc(*pyargs)
  File "/home/buecs/tvm/python/tvm/relay/quantize/_calibrate.py", line 191, in wrapped_func
    input_scale_func = _kl_scale(mod, dataset)
  File "/home/buecs/tvm/python/tvm/relay/quantize/_calibrate.py", line 102, in _kl_scale
    scales += list(pool.map(_find_scale_by_kl, samples))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
    put(task)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
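For context: in Python 3.6, multiprocessing.connection packs the length of each pickled task payload with struct format "!i" (a signed 32-bit integer), so a single chunk of 2 GiB or more cannot be sent to a worker process. A minimal illustration of the limit behind this error:

```python
import struct

# multiprocessing.connection (Python 3.6) packs the pickled payload
# length with "!i", a signed 32-bit int, so payloads >= 2**31 bytes fail.
limit = 2**31 - 1          # largest length the header can encode

header = struct.pack("!i", limit)   # fine
try:
    struct.pack("!i", limit + 1)    # one byte too many
except struct.error as err:
    print("overflow:", err)
```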

I tried playing with the calibrate_chunk_by parameter, but so far none of the values I tried removes this error.

Has anyone encountered a similar error before? If so, how could I solve or mitigate it? An expert opinion from, for example, @vinx13 would be much appreciated! :slight_smile:

Thank you in advance & Best regards, Robert

It seems a single sample in the samples list is too large. You can try a single-process version by editing relay/quantize/_calibrate.py:

scales = [_find_scale_by_kl(sample) for sample in samples]

Hi @vinx13, thank you very much for your suggestion. Avoiding multiprocessing indeed mitigates the problem of sending overly large data chunks from the parent to a child process. Of course the calibration is slower, but that was expected.

However, I’m now hitting the system’s memory limit (64 GB RAM + 64 GB swap!) with only ~250 pickle input calibration frames.

When looking into the calibration mechanism in TVM, the flow is as follows (please correct me if I’m wrong):

  • All (pickle) calibration frames are loaded initially at once
  • The scales are calculated per frame

Would it be possible to modify this flow so that it requires less peak memory, roughly like this:

  • Load the 1st (pickle) calibration frame
  • Calculate scales
  • Load the 2nd (pickle) calibration frame
  • Calculate scales

I guess this would require modifying this loop accordingly.
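As a purely illustrative sketch of that streaming flow (not TVM API — the frame paths and the find_scale callback are hypothetical placeholders for the real loading and KL-scale logic):

```python
import pickle

def stream_scales(frame_paths, find_scale):
    """Load one pickle frame at a time so peak memory stays near one frame.

    `find_scale` is a placeholder for the per-frame scale computation.
    """
    scales = []
    for path in frame_paths:
        with open(path, "rb") as f:
            frame = pickle.load(f)        # load a single calibration frame
        scales.append(find_scale(frame))  # frame is freed after this iteration
    return scales
```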

Thank you very much in advance & Best regards, Robert

I think what you are trying to do is exactly what the chunk_by parameter is for. Try a smaller number than 250. If you calibrate on CUDA, the overhead from interleaved feature generation and scale calculation should be negligible.

Dear @masahi, thank you very much for your answer! I performed a couple of tests now, although I currently work on x86 (no CUDA). Nevertheless:

Peak memory consumption:

“calibrate_chunk_by” = 1000

  • RAM: 62.2GB
  • SWP: 46.7GB

“calibrate_chunk_by” = 100

  • RAM: 62.2GB
  • SWP: 47.0GB

So basically not much difference…hm… Best regards, Robert

Try a smaller number. It seems your input is so big that RAM is already saturated with 100 inputs.

I suggest starting with chunk_by = the number of cores and enabling multiprocessing. This should give maximal parallelism while keeping memory usage low. Then gradually make it bigger while observing how memory usage grows.

There is not much disadvantage to using a small chunk_by, other than more recompute. chunk_by = 1000 doesn’t make anything faster and is a bad idea. The overhead from the extra recompute can be hidden if you use CUDA for calibration (because a single inference is so cheap). Note that even if your final target is x86, you can still use the CUDA target for calibration.

Thank you @masahi for the great advice! With “calibrate_chunk_by” = 16 (the number of cores) the peak memory demand became significantly lower (~30 GB, no swap) while calibration became much faster (due to no swapping).

Thank you very much & Best regards, Robert

Hm… Actually, after kl_divergence is executed, I get the following error with low “calibrate_chunk_by” values (<= 45):

ValueError: Traceback (most recent call last):                                                                                                                                                              
  [bt] (4) /home/buecs/tvm/build/libtvm.so(TVMFuncCall+0x65) [0x7f356df85245]
  [bt] (3) /home/buecs/tvm/build/libtvm.so(+0x446d74) [0x7f356d699d74]
  [bt] (2) /home/buecs/tvm/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x30d) [0x7f356d69890d]
  [bt] (1) /home/buecs/tvm/build/libtvm.so(tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x118) [0x7f356d699938]
  [bt] (0) /home/buecs/tvm/build/libtvm.so(+0xd2de6b) [0x7f356df80e6b]
  File "/home/buecs/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 78, in cfun
    rv = local_pyfunc(*pyargs)
  File "/home/buecs/tvm/python/tvm/relay/quantize/_calibrate.py", line 190, in wrapped_func
    input_scale_func = _kl_scale(mod, dataset)
  File "/home/buecs/tvm/python/tvm/relay/quantize/_calibrate.py", line 99, in _kl_scale
    for samples in collect_stats(mod, dataset, chunk_by):
  File "/home/buecs/tvm/python/tvm/relay/quantize/_calibrate.py", line 92, in collect_stats
    yield [np.concatenate(output).reshape(-1) for output in outputs]
  File "/home/buecs/tvm/python/tvm/relay/quantize/_calibrate.py", line 92, in <listcomp>
    yield [np.concatenate(output).reshape(-1) for output in outputs]
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: need at least one array to concatenate

Hm… weird. Cheers, Robert

Upon some debugging, I converted the following list comprehension:

        yield [np.concatenate(output).reshape(-1) for output in outputs]

Into a more debuggable/printable form:

        retlist = []
        for output in outputs:
            elem = np.concatenate(output).reshape(-1)                                                                                                                                                                   
            retlist.append(elem)
        yield retlist

At one point, outputs is a list of empty lists: [[], [], [], [], [], []]

Super weird, because this only happens with chunk_by values of 45 or smaller…
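The error itself is easy to reproduce in isolation: np.concatenate raises exactly this ValueError when given an empty sequence, which matches the empty inner lists above:

```python
import numpy as np

# An empty inner list gives np.concatenate nothing to join, producing
# the same ValueError seen in the traceback above.
try:
    np.concatenate([]).reshape(-1)
except ValueError as err:
    print(err)  # need at least one array to concatenate
```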

Dear @masahi, dear @vinx13, after further debugging I believe that either I have found a bug in TVM, or I’m using the quantizer in a way that is not supported (but also not prevented). The problem can be reproduced with the off-the-shelf TVM tutorial! If you make the following change to this tutorial:

def quantize(mod, params, data_aware):
    if data_aware:
-        with relay.quantize.qconfig(calibrate_mode='kl_divergence', weight_scale='max'):
+        with relay.quantize.qconfig(calibrate_mode='kl_divergence', weight_scale='max', calibrate_chunk_by=46):
            mod = relay.quantize.quantize(mod, params, dataset=calibrate_dataset())
    else:
        with relay.quantize.qconfig(calibrate_mode='global_scale', global_scale=8.0):
            mod = relay.quantize.quantize(mod, params)
    return mod
  • You can use any other value below 47 to reproduce the problem. Why 47: because this is the value of num_outputs in the relevant loop in _calibrate.py.

Could anyone please comment on this issue @vinx13, @masahi? Thank you very much in advance & Best regards, Robert

Not sure if this is a bug. You are responsible for setting the right calibrate_chunk_by. Setting calibrate_chunk_by=46 when num_outputs is 47 doesn’t make sense. Please read the code to understand the purpose of this parameter.

For the expected usage, see

Hi @masahi, thank you for your reply. I do understand, but something is not right here: integrating exactly this test setup into the stock tutorial code leads me to the same error:

def quantize(mod, params, data_aware):
    if data_aware:
-       with relay.quantize.qconfig(calibrate_mode='kl_divergence', weight_scale='max'):
+       num_cpu = multiprocessing.cpu_count()  # requires `import multiprocessing`
+       with relay.quantize.qconfig(calibrate_mode='kl_divergence', weight_scale='max', calibrate_chunk_by=num_cpu):
            mod = relay.quantize.quantize(mod, params, dataset=calibrate_dataset())
    else:
        with relay.quantize.qconfig(calibrate_mode='global_scale', global_scale=8.0):
            mod = relay.quantize.quantize(mod, params)
    return mod

Just to give you the value range: in my case num_cpu=10 while num_outputs=47. Would you be so kind as to try it out in your environment? It would be great if someone could reproduce the issue… I’d really appreciate it!

Best regards, Robert

Dear @masahi, dear @vinx13, I was hoping you could try out the change in my previous comment to reproduce the issue. Please give it a try if you find a bit of free time.

Thank you very much in advance! Best regards, Robert

Ok, I took a look at your problem. I got the error you saw too, but this is not a bug in the calibration code; it is due to the calibration dataset used in the tutorial.

Since the dataset is defined as a generator, you can only consume it once. But if you use the chunk_by param, multiple passes over the calibration dataset are needed, so the dataset should be a list or another data structure that can be traversed multiple times.
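The underlying Python behavior in a nutshell: a generator is exhausted after one pass, while a list can be iterated repeatedly:

```python
# A generator can be traversed only once; a second pass yields nothing,
# which is why `outputs` ended up as a list of empty lists.
gen = (x * x for x in range(3))
print(list(gen))  # [0, 1, 4]
print(list(gen))  # [] -- exhausted

# A list supports repeated traversal:
data = [x * x for x in range(3)]
print(list(data))  # [0, 1, 4]
print(list(data))  # [0, 1, 4]
```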

If you replace the calibrate_dataset() function in the tutorial with the version below, it should work.

def calibrate_dataset():
    val_data, batch_fn = get_val_data()
    val_data.reset()
    calib_data = []
    for i, batch in enumerate(val_data):
        data, _ = batch_fn(batch)
        calib_data.append({'data': data})
        if i > calibration_samples:
            break
    # Returning a list (not a generator) allows multiple passes.
    return calib_data