[Heterogeneous Execution] [Error] Variable placement for different devices

Hello. I found a small issue with heterogeneous execution when using the same variable on different devices via the annotation API.
The code I used is as follows.

import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

dshape = (5,5)
data1 = relay.var("data1", shape=dshape)
data2 = relay.var("data2", shape=dshape)
add1 = relay.add(data1, data2)
add2 = relay.add(add1, data2)

dev1 = tvm.context(1)  # device_type 1: CPU
dev2 = tvm.context(2)  # device_type 2: CUDA GPU
_add_1 = relay.annotation.on_device(add1, dev1)  # place add1 on the CPU
_add_2 = relay.annotation.on_device(add2, dev2)  # place add2 on the GPU

func = relay.Function([data1, data2],
                       relay.Tuple(tvm.convert([_add_1, _add_2, add2])))
func = relay.ir_pass.infer_type(func)
func = relay.ir_pass.rewrite_annotated_ops(func,
                                           tvm.context("cpu").device_type)
func = relay.ir_pass.infer_type(func)
func = relay.Function(relay.ir_pass.free_vars(func.body[2]), func.body[2])

d1 = np.random.uniform(size=dshape).astype('float32')
d2 = np.random.uniform(size=dshape).astype('float32')
config = {"opt_level": 1}
target = {"cpu": "llvm", "cuda": "cuda"}
params = {}
with relay.build_config(**config):
    graph, lib, params = relay.build(
        func,
        target,
        params = params)
contexts = [tvm.cpu(0), tvm.context("cuda")]
mod = graph_runtime.create(graph, lib, contexts)
mod.set_input(**params)
mod.set_input("data1",d1)
mod.set_input("data2",d2)
mod.run()
result = mod.get_output(0).asnumpy()

The error is as follows.

TVMError: [16:52:34] /home/morinaga/tvm/tvm/src/runtime/module_util.cc:53: Check failed: ret == 0 (-1 vs. 0) Assert fail: (2 == tvm_struct_get(arg1, 0, 10)), Argument arg1.device_type has an unsatisfied constraint

This happens because no device_copy is inserted: data2 is consumed by both add1 (on the CPU) and add2 (on the GPU), but the second use never gets copied to the GPU, so the CUDA kernel for add2 receives a CPU tensor and the device_type check (expecting 2) fails. We can avoid the issue by inserting the copy explicitly, as in the following example.

## Example
data1 = relay.var("data1", shape=dshape)
data2 = relay.var("data2", shape=dshape)
add1 = relay.add(data1, data2)
# Explicitly copy data2 to the GPU before its second use.
_data2 = relay.device_copy(data2, dev1, dev2)
add2 = relay.add(add1, _data2)
_add_1 = relay.annotation.on_device(add1, dev1)
_add_2 = relay.annotation.on_device(add2, dev2)

In my opinion, the user should only have to think about operator placement when using the annotation API, without calling device_copy manually, so this should be resolved while compiling the Relay graph. It does not seem difficult to insert device_copy in relay.ir_pass, as sketched below. However, inserting device_copy between data2 and add2 might perform differently from inserting it between data2 and add1. Another option is to treat this as a usage mistake and raise an error.
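For illustration, here is a minimal sketch of how such an insertion pass could look, written as an ExprMutator. Note that device_of is an assumed helper, not an existing API; a real pass would have to recover each expression's context from the on_device annotations and the fallback device.

from tvm import relay
from tvm.relay.expr_functor import ExprMutator

class InsertDeviceCopy(ExprMutator):
    # Sketch only: device_of is an assumed helper that returns the
    # context an expression is annotated to run on (or the fallback).
    def __init__(self, device_of):
        super().__init__()
        self.device_of = device_of

    def visit_call(self, call):
        dst = self.device_of(call)
        new_args = []
        for arg in call.args:
            new_arg = self.visit(arg)
            src = self.device_of(arg)
            if src.device_type != dst.device_type:
                # Cross-device edge: insert an explicit copy so the
                # consumer kernel receives a tensor on its own device.
                new_arg = relay.device_copy(new_arg, src, dst)
            new_args.append(new_arg)
        return relay.Call(call.op, new_args, call.attrs, call.type_args)

A per-consumer rewrite like this would give the second use of data2 its own copy on the GPU, i.e. it corresponds to inserting device_copy between data2 and add2 rather than between data2 and add1.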

What would be the cleverest way to handle this?