MXNet FP16 model cannot be used

ydy · August 22, 2019, 8:36am

I have an MXNet FP16 model. When I use from_mxnet, every nn.batch_norm fails, like this:

%493 = nn.batch_norm(%492, meta[relay.Constant][548], meta[relay.Constant][549], meta[relay.Constant][550], meta[relay.Constant][551], epsilon=2e-05f) unable to unify: Tensor[(64), float16] and Tensor[(64), float32]; unable to unify: Tensor[(64), float16] and Tensor[(64), float32]; unable to unify: Tensor[(64), float16] and Tensor[(64), float32]; unable to unify: Tensor[(64), float16] and Tensor[(64), float32];

What should I do to solve this? Thanks a lot!

haichen · August 22, 2019, 4:48pm

Something is probably wrong in src/relay/pass/simplify_inference.cc, where batch_norm is expanded. Could you take a look at it?

ydy · August 23, 2019, 1:54am

I tried that, and the problem doesn't seem to be there. I added a lot of debug output to that file, but none of it was printed, so the program crashes before reaching it.

Traceback (most recent call last):

  File "from_mxnet.py", line 135, in <module>
    graph, lib, params = relay.build(func, target, params=relay_params)

  File "/home/nvidia/tvm/python/tvm/relay/build_module.py", line 207, in build
    graph_json, mod, params = bld_mod.build(func, target, target_host, params)

  File "/home/nvidia/tvm/python/tvm/relay/build_module.py", line 108, in build
    self._build(func, target, target_host)

  File "/home/nvidia/tvm/python/tvm/_ffi/_ctypes/function.py", line 210, in __call__
    raise get_last_ffi_error()

tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (8) /home/nvidia/tvm/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::relay::backend::RelayBuildModule::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<tvm::runtime::ModuleNode> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#3}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0x2c) [0x7f2100d78c]
  [bt] (7) /home/nvidia/tvm/build/libtvm.so(tvm::relay::backend::RelayBuildModule::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<tvm::runtime::ModuleNode> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#3}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0xb8c) [0x7f2100d354]
  [bt] (6) /home/nvidia/tvm/build/libtvm.so(tvm::relay::backend::RelayBuildModule::BuildRelay(tvm::relay::Function, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::NDArray, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::NDArray> > > const&)+0x1b8) [0x7f2100bb00]
  [bt] (5) /home/nvidia/tvm/build/libtvm.so(tvm::relay::ModuleNode::FromExpr(tvm::relay::Expr const&, tvm::Map<tvm::relay::GlobalVar, tvm::relay::Function, void, void> const&, tvm::Map<tvm::relay::GlobalTypeVar, tvm::relay::TypeData, void, void> const&)+0x1e8) [0x7f210d79c0]
  [bt] (4) /home/nvidia/tvm/build/libtvm.so(tvm::relay::ModuleNode::Add(tvm::relay::GlobalVar const&, tvm::relay::Function const&, bool)+0x668) [0x7f210d6b38]
  [bt] (3) /home/nvidia/tvm/build/libtvm.so(tvm::relay::InferType(tvm::relay::Function const&, tvm::relay::Module const&, tvm::relay::GlobalVar const&)+0x338) [0x7f2131b1c0]
  [bt] (2) /home/nvidia/tvm/build/libtvm.so(tvm::relay::TypeInferencer::Infer(tvm::relay::Expr)+0x7c) [0x7f2131a5ac]
  [bt] (1) /home/nvidia/tvm/build/libtvm.so(tvm::relay::ErrorReporter::RenderErrors(tvm::relay::Module const&, bool)+0x1304) [0x7f210a483c]
  [bt] (0) /home/nvidia/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4c) [0x7f20c9add4]
  File "/home/nvidia/tvm/src/relay/ir/error.cc", line 133
TVMError:
Error(s) have occurred. The program has been annotated with them:

In `main`:
v0.0.3
fn (%data: Tensor[(1, 3, 224, 224), float16]) {
  %0 = cast(%data, dtype="float16");
  %1 = nn.batch_norm(%0, meta[relay.Constant][0], meta[relay.Constant][1], meta[relay.Constant][2], meta[relay.Constant][3], epsilon=2e-05f, scale=False) unable to unify: `Tensor[(3), float16]` and `Tensor[(3), float32]`; unable to unify: `Tensor[(3), float16]` and `Tensor[(3), float32]`; unable to unify: `Tensor[(3), float16]` and `Tensor[(3), float32]`; unable to unify: `Tensor[(3), float16]` and `Tensor[(3), float32]`; ;
  %2 = %1.0;

haichen · August 23, 2019, 6:36pm

Could you share minimal code that reproduces this error?

ydy · August 26, 2019, 1:56am

Sorry, I gave some incorrect information before. After I updated TVM (I may have modified some of the source code earlier), the current error message is:

  File "from_mxnet.py", line 128, in <module>
    arg_params=args, aux_params=auxs, dtype="float16")

  File "/root/workspace/tvm/python/tvm/relay/frontend/mxnet.py", line 1210, in from_mxnet
    shape, dtype = _update_shape_dtype(shape, dtype, params)

  File "/root/workspace/tvm/python/tvm/relay/frontend/mxnet.py", line 1156, in _update_shape_dtype
    "%s: dtype not expected %s vs %s" % (k, dtype, v.dtype))

ValueError: resnetv10_batchnorm0_gamma: dtype not expected float16 vs float32

I can see that this is a type mismatch, but I trained the FP16 model with the MXNet example script, so I don't think the model itself is the problem. I'm a little confused about what exactly went wrong. Could you give me some hints?
Thanks a lot! @haichen
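
Editor's note: the traceback above suggests that the frontend compares each parameter's dtype against the single dtype string passed to from_mxnet and raises on the first mismatch. The sketch below is not the TVM source; it is a hypothetical helper, `find_dtype_mismatches`, that lists every mismatching parameter at once. It uses NumPy arrays as stand-ins for the real parameters, and the shapes and the `resnetv10_conv0_weight` name are made up for illustration.

```python
import numpy as np


def find_dtype_mismatches(params, expected_dtype):
    """Return (name, actual_dtype) for every parameter whose dtype differs.

    Hypothetical helper, not part of TVM or MXNet. `params` maps names to
    objects exposing a NumPy-compatible `.dtype` attribute.
    """
    mismatches = []
    for name, value in params.items():
        actual = str(np.dtype(value.dtype))
        if actual != expected_dtype:
            mismatches.append((name, actual))
    return mismatches


# Stand-in parameters: shapes and the conv weight name are illustrative only.
params = {
    "resnetv10_conv0_weight": np.zeros((64, 3, 7, 7), dtype="float16"),
    "resnetv10_batchnorm0_gamma": np.ones((64,), dtype="float32"),
    "resnetv10_batchnorm0_beta": np.zeros((64,), dtype="float32"),
}

for name, actual in find_dtype_mismatches(params, "float16"):
    print("%s: expected float16, found %s" % (name, actual))
```

With a real checkpoint, the same loop could presumably be run over the merged arg_params and aux_params after converting each entry to NumPy, to see whether only the batch-norm parameters are affected.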

cchung100m · August 28, 2019, 8:59am

Hi @ydy

Could you check resnetv10_batchnorm0_gamma in your model using Netron?

nicklhy · March 25, 2020, 9:40am

Any updates here? I ran into a similar problem when converting an FP16 MXNet model to TVM. By default, MXNet converts the weight/bias of conv/dense layers to float16 but keeps gamma/beta/mean/var in BN layers as float32. I guess this is why TVM fails during conversion.
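
Editor's note: the mixed-dtype pattern described here can be checked, and one possible workaround tried, with a few lines of NumPy. This is only a sketch under assumptions: `group_by_dtype` and `cast_float_params` are hypothetical helpers, the parameter names and shapes are invented, and NumPy arrays stand in for the real MXNet parameters. Nothing in this thread confirms that casting resolves the TVM error, and casting batch-norm statistics to float16 may reduce accuracy.

```python
from collections import defaultdict

import numpy as np


def group_by_dtype(params):
    """Map each dtype name to the list of parameter names that use it."""
    groups = defaultdict(list)
    for name, value in params.items():
        groups[str(np.dtype(value.dtype))].append(name)
    return dict(groups)


def cast_float_params(params, target_dtype="float16"):
    """Return a new dict with every floating-point array cast to target_dtype.

    Non-floating arrays are passed through unchanged.
    """
    casted = {}
    for name, value in params.items():
        if np.issubdtype(value.dtype, np.floating):
            casted[name] = value.astype(target_dtype)
        else:
            casted[name] = value
    return casted


# Illustrative stand-ins for an FP16 checkpoint with float32 BN parameters.
params = {
    "conv0_weight": np.zeros((64, 3, 7, 7), dtype="float16"),
    "dense0_bias": np.zeros((10,), dtype="float16"),
    "batchnorm0_gamma": np.ones((64,), dtype="float32"),
    "batchnorm0_running_var": np.full((64,), 1e-3, dtype="float32"),
}

for dtype_name, names in group_by_dtype(params).items():
    print(dtype_name, names)

uniform = cast_float_params(params, "float16")
print(group_by_dtype(uniform))
```

If the cast dictionary were split back into arg_params and aux_params, every parameter would at least share the dtype passed to from_mxnet. Whether the rest of the conversion then succeeds is an open question here.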

FrozenGene · March 25, 2020, 3:50pm

I think this is important for MXNet AMP (Automatic Mixed Precision) models.

tiandiao123 · January 8, 2022, 1:49am

Has anyone fixed this problem? I got the same error.