[Quantization] Format of the quantized tensors

alopez_13 · February 7, 2020, 5:28pm

After quantizing a model using relay.quantize.quantize and passing it to relay.build, I wanted to look at the quantized tensors in the params object returned by relay.build. The params object is a dictionary of {tensor_name: tensor_values}. The size of params is ~1/4 of the non-quantized one because only INT8 values are written into it, or so I thought.

Looking at the params object, I notice that some entries are listed as dtype=float32, which makes no sense to me since I assume that only INT8 operations are allowed in the final graph. Is there any documentation on what is stored, such as min, max, zero point (zp), and scale? Or are those computed during the quantization process and never stored? I explored the params object without serializing its contents.
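A quick way to see how the entries split by dtype is to tally them. The following is only a minimal sketch: it assumes each value in the dictionary exposes `.dtype` and `.shape`, and both `summarize_dtypes` and the `FakeTensor` stand-in are hypothetical names used for illustration, not part of TVM.

```python
from collections import Counter, namedtuple
from math import prod

# Hypothetical stand-in for a tensor; in practice you would pass the
# params dict returned by relay.build instead of this example data.
FakeTensor = namedtuple("FakeTensor", ["dtype", "shape"])


def summarize_dtypes(params):
    """Return {dtype_name: (tensor_count, element_count)} for a name -> tensor mapping."""
    counts = Counter()
    elements = Counter()
    for tensor in params.values():
        dtype = str(tensor.dtype)  # str() so dtype objects and plain strings both work
        counts[dtype] += 1
        elements[dtype] += prod(tensor.shape)  # prod(()) == 1 for scalars
    return {dtype: (counts[dtype], elements[dtype]) for dtype in counts}


example = {
    "p0": FakeTensor("int8", (32, 3, 3, 3)),
    "p1": FakeTensor("float32", (32,)),
    "p2": FakeTensor("int8", (64, 32, 1, 1)),
}
summary = summarize_dtypes(example)
for dtype, (n_tensors, n_elements) in sorted(summary.items()):
    print(f"{dtype}: {n_tensors} tensors, {n_elements} elements")
```

Sorting the result by element count makes it easy to see whether the float32 entries are a handful of small tensors or a large share of the model.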

Also, I notice that the relay.save_param_dict(params) and relay.load_param_dict(param_bytes) functions do not appear to check types, and now I wonder whether type information gets lost when the parameters are serialized to a file and read back. When I write the params file and then read it back using these functions, I see more FP32 tensors than in the case described above.
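To pin down whether a save/load round trip really changes any dtype, the two dictionaries can be compared entry by entry. This is a hedged sketch: `dtype_changes` and `FakeTensor` are hypothetical names, and the example data stands in for the params from relay.build and for the result of relay.load_param_dict(relay.save_param_dict(params)).

```python
from collections import namedtuple

# Hypothetical stand-in; any object with a .dtype attribute works.
FakeTensor = namedtuple("FakeTensor", ["dtype"])


def dtype_changes(before, after):
    """List (name, dtype_before, dtype_after) for entries that differ or are missing."""
    changes = []
    for name in sorted(set(before) | set(after)):
        old = str(before[name].dtype) if name in before else None
        new = str(after[name].dtype) if name in after else None
        if old != new:
            changes.append((name, old, new))
    return changes


saved = {"p0": FakeTensor("int8"), "p1": FakeTensor("float32")}
reloaded = {"p0": FakeTensor("int8"), "p1": FakeTensor("float32")}
print(dtype_changes(saved, reloaded))  # an empty list means every dtype survived
```

A non-empty result names the exact tensors to investigate, which is more reliable than comparing overall FP32 counts by eye.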

Any pointers as to where I should start looking in the code base?

Thanks!

janimesh · February 7, 2020, 5:54pm

@vinx13 You might have some idea about this.

vinx13 · February 7, 2020, 6:32pm

min, max, zp, and scale are computed during quantization, but they are not stored. They are used to generate operations such as mul / add for quantize/requantize, and only part of the quantization params (min, max, zp, scale) are kept, as inputs to these operations. The param names in the param dict give some hint as to what each fp32 param is.
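As a rough illustration of how a scale and zero point turn into mul / add style operations, here is a generic affine quantization sketch. It is not TVM's exact realization; the function names, the int8 range, and the example scales are assumptions chosen for illustration.

```python
def quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Float -> int8 range: divide by scale, round, add zero point, clip."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point=0):
    """Int -> float: subtract zero point, multiply by scale."""
    return (q - zero_point) * scale


def requantize(q, scale_in, scale_out, qmin=-128, qmax=127):
    """Re-express an integer value under a new scale (zero point assumed to be 0)."""
    q_new = round(q * (scale_in / scale_out))
    return max(qmin, min(qmax, q_new))


scale = 0.25  # assumed example scale; appears as a float constant in the graph
q = quantize(1.0, scale)
print(q, dequantize(q, scale))
print(quantize(100.0, scale))  # saturates at qmax
print(requantize(q, scale_in=0.25, scale_out=0.5))
```

In a scheme like this the scale only ever appears as a constant operand of a multiply, which is consistent with a few float values surviving next to the INT8 weights.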

The types are serialized in relay.save_param_dict:

https://github.com/apache/incubator-tvm/blob/master/src/relay/backend/param_dict.cc#L65


    dmlc::MemoryStringStream strm(&bytes);
    dmlc::Stream* fo = &strm;
    uint64_t header = kTVMNDArrayListMagic, reserved = 0;
    fo->Write(header);
    fo->Write(reserved);
    fo->Write(names);
    {
      uint64_t sz = static_cast<uint64_t>(arrays.size());
      fo->Write(sz);
      for (size_t i = 0; i < sz; ++i) {
        tvm::runtime::SaveDLTensor(fo, arrays[i]);
      }
    }
    TVMByteArray arr;
    arr.data = bytes.c_str();
    arr.size = bytes.length();
    *rv = arr;
  });


TVM_REGISTER_GLOBAL("tvm.relay._load_param_dict")
.set_body([](TVMArgs args, TVMRetValue *rv) {

alopez_13 · February 7, 2020, 6:42pm

Thanks! I’ll look into that. The thing is that the original tensor names are lost after the build process; instead I get sequentially numbered names (e.g. p1, p2, …). I guess I can map those back to the original names, since they are generated top-down as you traverse the graph.

alopez_13 · February 7, 2020, 7:59pm

I think I understand what is going on. I generated the Relay IR for my test case (mobilenet_v2) and found that some of the operands are FP tensors. I used the “Deploy a Quantized Model on Cuda” tutorial as the starting point for my quantization, so perhaps I am missing some other parameters that tell TVM to fully quantize all of the tensors so that all operations are done using integer arithmetic.

In any event, that does explain why some of the parameters are in FP format.

adb · February 10, 2020, 8:04pm

Not all operators have a quantize implementation, for example nn.bias_add. If your graph contains these ops, they will be left as float32.

alopez_13 · February 11, 2020, 5:39pm

Thanks a lot, that does explain what I see in the Relay IR. I’ll start looking more in depth at the quantization code to fully understand where things stand.

adb · February 11, 2020, 6:00pm

Keep in mind this isn’t a limitation of TVM; quantization is still a work in progress. These ops will likely have quantize implementations in the future.

alopez_13 · February 11, 2020, 6:05pm

Yes, I know TVM is actively working on quantization. I actually want to contribute to that effort but have some time constraints, which is why I was looking into what is done and what is missing. If I can align my current work with the needs of TVM, I’ll start sending some PRs.

tico · March 4, 2020, 1:12pm

@adb @vinx13 Where can I find the current list of ops with a quantization implementation?

vinx13 · March 4, 2020, 3:55pm

The rewrite rules are in https://github.com/apache/incubator-tvm/blob/master/python/tvm/relay/quantize/_annotate.py
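One way to turn that file into a list of ops is to scan its text for rule registrations. The sketch below assumes the rules are registered through a function named `register_annotate_function` taking the op name as its first string argument; check this against the file itself. `annotated_ops` is a hypothetical helper, and the `sample` text is illustrative, not copied from the repository.

```python
import re

# Matches register_annotate_function("op.name" ...), whether it is used
# as a decorator or as a plain call. The function name is an assumption.
PATTERN = re.compile(r'register_annotate_function\(\s*["\']([\w.]+)["\']')


def annotated_ops(source_text):
    """Return the sorted, de-duplicated op names that have a registered rule."""
    return sorted(set(PATTERN.findall(source_text)))


# Illustrative text only; in practice read the contents of _annotate.py.
sample = '''
@register_annotate_function("nn.conv2d")
def conv2d_rewrite(ref_call, new_args, ctx):
    pass

register_annotate_function("clip", identity_rewrite)
'''
print(annotated_ops(sample))
```

Ops in your graph that do not show up in such a list are candidates for being left as float32.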