[Heterogeneous execution] "device_type need to be 2" error with an MXNet model

Hello,

I am trying to use heterogeneous execution (placing conv2d on the GPU) with a MobileNet+SSD V1 imported from an MXNet model (.params and .json files).

I followed the example in #3621 for annotating with a visitor. Compilation goes well and the graph looks OK (nodes are copied and assigned to different devices).

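But I can't execute the model. Here is what I run:
import tvm
from tvm import relay
# Assumed import: `runtime` is taken to be the graph runtime module named in the traceback.
from tvm.contrib import graph_runtime as runtime

target = {"gpu": "cuda", "cpu": "llvm"}
with relay.build_config(opt_level=3, fallback_device=tvm.cpu(0)):
    graph, lib, params = relay.build(net, target=target, params=params)
ctx = [tvm.cpu(0), tvm.context("cuda")]
mod = runtime.create(graph, lib, ctx)
mod.set_input(**params)
mod.run()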

and I get the error:
    mod.run()
  File "/home/renault/tvm/python/tvm/contrib/graph_runtime.py", line 168, in run
    self._run()
  File "tvm/_ffi/_cython/./function.pxi", line 310, in tvm._ffi._cy3.core.FunctionBase.__call__
  File "tvm/_ffi/_cython/./function.pxi", line 245, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./function.pxi", line 234, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 170, in tvm._ffi._cy3.core.CALL
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (3) /home/renault/tvm/build/libtvm.so(TVMFuncCall+0x61) [0x7f30c19ba901]
  [bt] (2) /home/renault/tvm/build/libtvm.so(tvm::runtime::GraphRuntime::Run()+0x47) [0x7f30c1a0d897]
  [bt] (1) /home/renault/tvm/build/libtvm.so(+0x138d6c7) [0x7f30c1a0f6c7]
  [bt] (0) /home/renault/tvm/build/libtvm.so(+0x1346ac0) [0x7f30c19c8ac0]
  File "/home/renault/tvm/src/runtime/module_util.cc", line 73
TVMError: Check failed: ret == 0 (-1 vs. 0) : Assert fail: (dev_type == 2), device_type need to be 2
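
For readers decoding the assertion: the number in `dev_type == 2` is a DLPack device type code, where 1 is the CPU and 2 is the CUDA GPU. The failing kernel was therefore compiled for CUDA but received a tensor that lives on another device. A minimal sketch of that lookup follows. The table lists only a few DLPack codes, and the helper name `explain_device_assert` is made up for illustration, not part of TVM.

```python
import re

# A few DLPack device type codes (not an exhaustive list).
DEVICE_TYPE_NAMES = {
    1: "cpu",
    2: "gpu (cuda)",
    4: "opencl",
    8: "metal",
    10: "rocm",
}


def explain_device_assert(message):
    """Return (code, name) for the device type an assert message expects.

    Hypothetical helper: it only parses text shaped like
    'Assert fail: (dev_type == 2), device_type need to be 2'.
    Returns None when the message has no such pattern.
    """
    match = re.search(r"dev_type == (\d+)", message)
    if match is None:
        return None
    code = int(match.group(1))
    return code, DEVICE_TYPE_NAMES.get(code, "unknown")


msg = "Assert fail: (dev_type == 2), device_type need to be 2"
print(explain_device_assert(msg))
```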

Do you have any idea what I'm doing wrong? Thanks!
@zhiics

It looks like the device type of some weights is not propagated properly. Can you try the fix in [Relay] Homogenous Compilation Example to see if it works? I tried it on a P2 instance and it worked fine. I will probably spend some time debugging it next week.

Thanks @zhiics it works!

I'm using an Intel Xeon E5-2690 and an NVIDIA RTX 2060.

My inference time is 85 ms without tuning and 75 ms with CUDA conv2d tuning.
The odd thing is that when I don't use heterogeneous compilation and execution (target=tvm.target.cuda(model="2060") and context=tvm.context("cuda")), my inference time is 11.4 ms without tuning and 6.6 ms with tuning.
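
To put the gap in numbers, here is a small sketch that computes the slowdown from the timings reported above. The function name `slowdown` is only illustrative.

```python
def slowdown(heterogeneous_ms, gpu_only_ms):
    """Ratio of heterogeneous latency to GPU-only latency."""
    return heterogeneous_ms / gpu_only_ms


# Timings reported in this post, in milliseconds.
untuned = slowdown(85.0, 11.4)
tuned = slowdown(75.0, 6.6)
print(f"untuned: {untuned:.1f}x, tuned: {tuned:.1f}x")
```

So the heterogeneous run is roughly 7 to 11 times slower than the GPU-only run on this model.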

Do you have any idea why heterogeneous execution performs so much worse? Is some annotation pass run automatically when I don't use heterogeneous compilation and execution? I ask because the inference time seems low, and I thought NMS takes some time on a GPU.

Thanks!

I think there are at least two reasons:

  1. AlterLayout transformation is not performed under heterogeneous execution.
  2. Data has to be copied back and forth between devices. There will be many such copies in this case, since only the conv2d ops are executed on the GPU.

You can probably measure how much AlterLayout contributes by executing all ops on the GPU with this pass disabled.
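
To illustrate the second point, here is a toy count of device copies for a linear chain of ops. This is only a sketch under simplifying assumptions: the graph is a straight chain, fusion and constant folding are ignored, and the input and output both live on the CPU. The function name `count_device_copies` and the op list are made up for illustration.

```python
def count_device_copies(ops, gpu_ops=("conv2d",)):
    """Count transfers needed when only `gpu_ops` run on the GPU.

    `ops` is a linear chain of operator names; every other op runs on the CPU.
    One copy is needed each time two consecutive stages sit on different
    devices. The input starts on the CPU and the output is read on the CPU.
    """
    devices = ["gpu" if op in gpu_ops else "cpu" for op in ops]
    placement = ["cpu"] + devices + ["cpu"]
    return sum(1 for a, b in zip(placement, placement[1:]) if a != b)


chain = ["conv2d", "batch_norm", "relu"] * 3

# Only conv2d on the GPU: one copy in and one copy out per convolution.
print(count_device_copies(chain))
# Everything on the GPU: one copy in and one copy out for the whole chain.
print(count_device_copies(chain, gpu_ops=("conv2d", "batch_norm", "relu")))
```

In this toy chain the conv2d-only placement needs 6 copies against 2 for the all-GPU placement, and the gap grows with the number of convolutions.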

I ran into this issue just now when adding my own backend. @zhiics's fix worked for me.

I think this should be pulled into master. I'm not sure how heterogeneous execution can work without it. @tqchen

Here is @zhiics's code with slight changes, since the API has changed:
src/relay/pass/device_annotation.cc

  void FillPropagation(int out_dev_type) {
    for (const auto& it : post_visitor_.post_dfs_order_) {
      Expr expr = GetRef<Expr>(it.first);
      if (!it.second) device_map_.Set(expr, out_dev_type);
      if (const auto* call = expr.as<CallNode>()) {
        // Give variable and constant arguments the same device as the call that uses them.
        for (const auto& arg : call->args) {
          if (arg->IsInstance<VarNode>() || arg->IsInstance<ConstantNode>())
            device_map_.Set(arg, device_map_[expr]);
        }
      }
    }
  }
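
For readers who do not have the TVM sources at hand, the idea of the loop above can be sketched as a toy model in plain Python. This is not TVM code: the `Node` class and `fill_propagation` function are hypothetical, and I am assuming that the boolean in each pair means "this node already has a device", which is my reading of the snippet rather than something stated in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a Relay expression."""
    name: str
    kind: str          # "var", "const", or "call"
    args: tuple = ()   # argument nodes, used by calls only


def fill_propagation(post_dfs_order, device_map, out_dev_type):
    """Toy version of the patched loop.

    post_dfs_order: list of (node, already_assigned) pairs, arguments first.
    device_map: dict mapping node -> device type, updated in place.
    """
    for node, already_assigned in post_dfs_order:
        if not already_assigned:
            device_map[node] = out_dev_type
        if node.kind == "call":
            # Variables and constants follow the device of the call using them.
            for arg in node.args:
                if arg.kind in ("var", "const"):
                    device_map[arg] = device_map[node]
    return device_map


CPU, GPU = 1, 2
data = Node("data", "var")
weight = Node("weight", "const")
conv = Node("conv2d", "call", (data, weight))

order = [(data, False), (weight, False), (conv, True)]
device_map = fill_propagation(order, {conv: GPU}, out_dev_type=CPU)
print(device_map[weight])
```

In this example the weight first receives the fallback device and is then moved to the device of the conv2d call that consumes it, which is the behaviour whose absence produced the `dev_type == 2` assertion.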

@adb I think I wrote a patch locally several months ago that cleans this up and fixes the algorithm, but I forgot to upstream it. I will find it and send a PR once I have time.

Sounds good. Thanks, @zhiics