I just realized that when code is generated for an architecture that does not support fast int8 operations, the compiler will generate 16-bit weights in the .params file. The idea is that doing otherwise would produce a sub-optimal code schedule and possibly invalid or inefficient memory access.
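To make the storage cost concrete, here is a minimal NumPy sketch of what the widening means for file size. This is not TVM code: the tensor shape and variable names are made up for illustration, and it only shows that int8 values widened to int16 keep the same values while taking twice the bytes.

```python
import numpy as np

# Hypothetical quantized weight tensor; the shape is arbitrary.
rng = np.random.default_rng(0)
w_int8 = rng.integers(-128, 128, size=(64, 3, 3, 3), dtype=np.int8)

# Widening to int16 preserves every value but doubles the storage.
w_int16 = w_int8.astype(np.int16)

print("int8 bytes: ", w_int8.nbytes)
print("int16 bytes:", w_int16.nbytes)
print("values identical:", np.array_equal(w_int8, w_int16))
```

Since the widening is lossless, the int16 weights carry no extra information over the int8 ones, which is why the larger .params file seems avoidable in principle.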
Out of curiosity, I commented out the checks is_fast_int8_on_arm() and is_fast_int8_on_x86(), along with the expected return values, in tvm/python/tvm/relay/qnn/op/legalizations.py. While I was able to produce running code, it was indeed much slower on both ARM and x86. I also noticed that one of my test cases produced an error during compilation. No surprises there, I guess.
Clearly I am forcing the compiler into doing what it should not, but I wonder if there is a cleaner way of generating int8 storage even if that would yield a sub-optimal schedule. I can see several use cases in which storing a .params file as 8-bit integers takes priority over inferences/second.
Thanks for any hints on how to do this, or even on whether it's feasible without messing up subsequent optimization phases.