How to use heterogeneous execution?

Hi all. I tried heterogeneous execution with the following code, based on tests/python/relay/test_pass_annotation.py.

import tvm
import tvm.relay as relay
import tvm.contrib.graph_runtime
import numpy as np

R""" The network is as following:                                                                                                                         
           x     y                                                                                                                                        
            \   /                                                                                                                                         
             add                                                                                                                                          
            /   \                                                                                                                                         
         sqrt   log                                                                                                                                       
            \   /                                                                                                                                         
          subtract                                                                                                                                        
              |                                                                                                                                           
             exp                                                                                                                                          
"""

fallback_device = tvm.context("cpu")
target = {"cpu": "llvm", "cuda": "cuda"}
dev_ctx = tvm.context("cuda")
cpu_ctx = fallback_device

x = relay.var("x", shape=(1, 10))
y = relay.var("y", shape=(10, 10))
add = relay.add(x, y)
sqrt = relay.sqrt(add)
_sqrt = relay.annotation.on_device(sqrt, dev_ctx)
log = relay.log(add)
subtract = relay.subtract(sqrt, log)
exp = relay.exp(subtract)
_exp = relay.annotation.on_device(exp, dev_ctx)

# Group the annotated exprs with the real output so the rewrite pass can collect them.
func = relay.Function([x, y], relay.Tuple(tvm.convert([_sqrt, _exp, exp])))
func = relay.ir_pass.infer_type(func)
# Replace the on_device annotations with device_copy nodes; un-annotated ops
# fall back to the device type given as the second argument (CPU here).
func = relay.ir_pass.rewrite_annotated_ops(func, cpu_ctx.device_type)
func = relay.ir_pass.infer_type(func)
# Keep only the third tuple field (exp) as the actual function output.
func = relay.Function(relay.ir_pass.free_vars(func.body[2]), func.body[2])
print(func)

x_data = np.random.rand(1, 10).astype('float32')
y_data = np.random.rand(10, 10).astype('float32')
params = {"x": x_data, "y": y_data}

with relay.build_config(opt_level=1):
    graph, lib, params = relay.build(func, target=target, params=params)

module = tvm.contrib.graph_runtime.create(graph, lib, [cpu_ctx, dev_ctx])
module.set_input(**params)
module.run()
module.get_output(0).asnumpy()

What I expected was:

  • CPU executes ‘add’, ‘log’, and ‘subtract’.
  • GPU executes ‘sqrt’ and ‘exp’.

However, the result shows that all the operators are executed on the GPU.

$ nvprof --print-gpu-summary python hetero_execution.py
fn (%x: Tensor[(1, 10), float32],
    %y: Tensor[(10, 10), float32]) {
  %0 = add(%x, %y) # ty=Tensor[(10, 10), float32]
  %1 = device_copy(%0, meta[relay.attrs.DeviceCopyAttrs][0]) # ty=Tensor[(10, 10), float32]
  %2 = sqrt(%1) # ty=Tensor[(10, 10), float32]
  %3 = device_copy(%2, meta[relay.attrs.DeviceCopyAttrs][1]) # ty=Tensor[(10, 10), float32]
  %4 = log(%0) # ty=Tensor[(10, 10), float32]
  %5 = subtract(%3, %4) # ty=Tensor[(10, 10), float32]
  %6 = device_copy(%5, meta[relay.attrs.DeviceCopyAttrs][2]) # ty=Tensor[(10, 10), float32]
  %7 = exp(%6) # ty=Tensor[(10, 10), float32]
  %7          
}             
# meta data omitted. you can use show_meta_data=True to include meta-data
              
==19110== NVPROF is profiling process 19110, command: python hetero_execution.py
==19110== Profiling application: python hetero_execution.py
==19110== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   26.90%  4.8640us         3  1.6210us  1.3120us  2.0800us  [CUDA memcpy DtoD]
                   19.12%  3.4560us         1  3.4560us  3.4560us  3.4560us  fused_add_kernel0
                   15.58%  2.8160us         2  1.4080us  1.2160us  1.6000us  [CUDA memcpy HtoD]
                   11.33%  2.0480us         1  2.0480us  2.0480us  2.0480us  fused_log_subtract_kernel0
                   10.62%  1.9200us         1  1.9200us  1.9200us  1.9200us  fused_sqrt_kernel0
                    8.50%  1.5360us         1  1.5360us  1.5360us  1.5360us  fused_exp_kernel0
                    7.96%  1.4400us         1  1.4400us  1.4400us  1.4400us  [CUDA memcpy DtoH]

Is this a bug or am I missing something?

I am not sure whether this insight helps resolve the issue, but it seems that this happens when the func goes through ir_pass.fuse_ops (called in https://github.com/dmlc/tvm/blob/master/python/tvm/relay/build_module.py#L269).

import tvm
import tvm.relay as relay
import numpy as np

fallback_device = tvm.context("cpu")
target = {"cpu": "llvm", "cuda": "cuda"}
dev_ctx = tvm.context("cuda")
cpu_ctx = fallback_device

x = relay.var("x", shape=(1, 10))
y = relay.var("y", shape=(10, 10))
add = relay.add(x, y)
sqrt = relay.sqrt(add)
_sqrt = relay.annotation.on_device(sqrt, dev_ctx)
log = relay.log(add)
subtract = relay.subtract(sqrt, log)
exp = relay.exp(subtract)
_exp = relay.annotation.on_device(exp, dev_ctx)

func = relay.Function([x, y], relay.Tuple(tvm.convert([_sqrt, _exp, exp])))
func = relay.ir_pass.infer_type(func)
func = relay.ir_pass.rewrite_annotated_ops(func, cpu_ctx.device_type)
func = relay.ir_pass.infer_type(func)
func = relay.Function(relay.ir_pass.free_vars(func.body[2]), func.body[2])

# Storage_device_info seems correct.
# Each value is [[storage_id], [device_type]].
Storage_device_info = tvm.relay.backend._backend.GraphPlanMemory(func)
for k, [[sid], [dev]] in Storage_device_info.items():
    print(k)
    print(dev)
free_var %x: Tensor[(1, 10), float32]
free_var %y: Tensor[(10, 10), float32]
%0 = add(%x, %y) # ty=Tensor[(10, 10), float32]
%0

1
free_var %x: Tensor[(1, 10), float32]
free_var %y: Tensor[(10, 10), float32]
%0 = add(%x, %y) # ty=Tensor[(10, 10), float32]
%1 = device_copy(%0, meta[relay.attrs.DeviceCopyAttrs][0]) # ty=Tensor[(10, 10), float32]
%2 = sqrt(%1) # ty=Tensor[(10, 10), float32]
%3 = device_copy(%2, meta[relay.attrs.DeviceCopyAttrs][1]) # ty=Tensor[(10, 10), float32]
%4 = log(%0) # ty=Tensor[(10, 10), float32]
%5 = subtract(%3, %4) # ty=Tensor[(10, 10), float32]
%6 = device_copy(%5, meta[relay.attrs.DeviceCopyAttrs][2]) # ty=Tensor[(10, 10), float32]
%6
# meta data omitted. you can use show_meta_data=True to include meta-data

2
free_var %y: Tensor[(10, 10), float32]
%y

1
free_var %x: Tensor[(1, 10), float32]
free_var %y: Tensor[(10, 10), float32]
%0 = add(%x, %y) # ty=Tensor[(10, 10), float32]
%1 = device_copy(%0, meta[relay.attrs.DeviceCopyAttrs][0]) # ty=Tensor[(10, 10), float32]
%1
# meta data omitted. you can use show_meta_data=True to include meta-data

2
free_var %x: Tensor[(1, 10), float32]
%x

1
free_var %x: Tensor[(1, 10), float32]
free_var %y: Tensor[(10, 10), float32]
%0 = add(%x, %y) # ty=Tensor[(10, 10), float32]
%1 = device_copy(%0, meta[relay.attrs.DeviceCopyAttrs][0]) # ty=Tensor[(10, 10), float32]
%2 = sqrt(%1) # ty=Tensor[(10, 10), float32]
%3 = device_copy(%2, meta[relay.attrs.DeviceCopyAttrs][1]) # ty=Tensor[(10, 10), float32]
%4 = log(%0) # ty=Tensor[(10, 10), float32]
%5 = subtract(%3, %4) # ty=Tensor[(10, 10), float32]
%5
# meta data omitted. you can use show_meta_data=True to include meta-data

1
free_var %x: Tensor[(1, 10), float32]
free_var %y: Tensor[(10, 10), float32]
%0 = add(%x, %y) # ty=Tensor[(10, 10), float32]
%1 = device_copy(%0, meta[relay.attrs.DeviceCopyAttrs][0]) # ty=Tensor[(10, 10), float32]
%2 = sqrt(%1) # ty=Tensor[(10, 10), float32]
%3 = device_copy(%2, meta[relay.attrs.DeviceCopyAttrs][1]) # ty=Tensor[(10, 10), float32]
%3
# meta data omitted. you can use show_meta_data=True to include meta-data

1
free_var %x: Tensor[(1, 10), float32]
free_var %y: Tensor[(10, 10), float32]
%0 = add(%x, %y) # ty=Tensor[(10, 10), float32]
%1 = device_copy(%0, meta[relay.attrs.DeviceCopyAttrs][0]) # ty=Tensor[(10, 10), float32]
%2 = sqrt(%1) # ty=Tensor[(10, 10), float32]
%2
# meta data omitted. you can use show_meta_data=True to include meta-data

2
free_var %x: Tensor[(1, 10), float32]
free_var %y: Tensor[(10, 10), float32]
%0 = add(%x, %y) # ty=Tensor[(10, 10), float32]
%1 = device_copy(%0, meta[relay.attrs.DeviceCopyAttrs][0]) # ty=Tensor[(10, 10), float32]
%2 = sqrt(%1) # ty=Tensor[(10, 10), float32]
%3 = device_copy(%2, meta[relay.attrs.DeviceCopyAttrs][1]) # ty=Tensor[(10, 10), float32]
%4 = log(%0) # ty=Tensor[(10, 10), float32]
%5 = subtract(%3, %4) # ty=Tensor[(10, 10), float32]
%6 = device_copy(%5, meta[relay.attrs.DeviceCopyAttrs][2]) # ty=Tensor[(10, 10), float32]
%7 = exp(%6) # ty=Tensor[(10, 10), float32]
%7
# meta data omitted. you can use show_meta_data=True to include meta-data

2
free_var %x: Tensor[(1, 10), float32]
free_var %y: Tensor[(10, 10), float32]
%0 = add(%x, %y) # ty=Tensor[(10, 10), float32]
%1 = log(%0) # ty=Tensor[(10, 10), float32]
%1

1
func = relay.ir_pass.infer_type(func)
func = relay.ir_pass.fuse_ops(func, 1)
func = relay.ir_pass.infer_type(func)

# Storage_device_info seems wrong (only the free vars are on device 1).
Storage_device_info = tvm.relay.backend._backend.GraphPlanMemory(func)
for k, [[sid], [dev]] in Storage_device_info.items():
    print(k)
    print(dev)
free_var %x: Tensor[(1, 10), float32]
free_var %y: Tensor[(10, 10), float32]
%0 = fn(%p0: Tensor[(1, 10), float32],
        %p1: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %1 = add(%p0, %p1) # ty=Tensor[(10, 10), float32]
  %1
}
%2 = %0(%x, %y) # ty=Tensor[(10, 10), float32]
%3 = fn(%p01: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %4 = device_copy(%p01, meta[relay.attrs.DeviceCopyAttrs][0]) # ty=Tensor[(10, 10), float32]
  %4
}
%5 = %3(%2) # ty=Tensor[(10, 10), float32]
%5
# meta data omitted. you can use show_meta_data=True to include meta-data

2
free_var %x: Tensor[(1, 10), float32]
free_var %y: Tensor[(10, 10), float32]
%0 = fn(%p0: Tensor[(1, 10), float32],
        %p1: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %1 = add(%p0, %p1) # ty=Tensor[(10, 10), float32]
  %1
}
%2 = %0(%x, %y) # ty=Tensor[(10, 10), float32]
%2

2
free_var %x: Tensor[(1, 10), float32]
free_var %y: Tensor[(10, 10), float32]
%0 = fn(%p0: Tensor[(1, 10), float32],
        %p1: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %1 = add(%p0, %p1) # ty=Tensor[(10, 10), float32]
  %1
}
%2 = %0(%x, %y) # ty=Tensor[(10, 10), float32]
%3 = fn(%p01: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %4 = device_copy(%p01, meta[relay.attrs.DeviceCopyAttrs][0]) # ty=Tensor[(10, 10), float32]
  %4
}
%5 = %3(%2) # ty=Tensor[(10, 10), float32]
%6 = fn(%p02: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %7 = sqrt(%p02) # ty=Tensor[(10, 10), float32]
  %7
}
%8 = %6(%5) # ty=Tensor[(10, 10), float32]
%9 = fn(%p03: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %10 = device_copy(%p03, meta[relay.attrs.DeviceCopyAttrs][1]) # ty=Tensor[(10, 10), float32]
  %10
}
%11 = %9(%8) # ty=Tensor[(10, 10), float32]
%12 = fn(%p04: Tensor[(10, 10), float32],
         %p11: Tensor[(10, 10), float32])
         -> Tensor[(10, 10), float32] {
  %13 = log(%p11) # ty=Tensor[(10, 10), float32]
  %14 = subtract(%p04, %13) # ty=Tensor[(10, 10), float32]
  %14
}
%15 = %12(%11, %2) # ty=Tensor[(10, 10), float32]
%16 = fn(%p05: Tensor[(10, 10), float32])
         -> Tensor[(10, 10), float32] {
  %17 = device_copy(%p05, meta[relay.attrs.DeviceCopyAttrs][2]) # ty=Tensor[(10, 10), float32]
  %17
}
%18 = %16(%15) # ty=Tensor[(10, 10), float32]
%19 = fn(%p06: Tensor[(10, 10), float32])
         -> Tensor[(10, 10), float32] {
  %20 = exp(%p06) # ty=Tensor[(10, 10), float32]
  %20
}
%21 = %19(%18) # ty=Tensor[(10, 10), float32]
%21
# meta data omitted. you can use show_meta_data=True to include meta-data

2
free_var %y: Tensor[(10, 10), float32]
%y

1
free_var %x: Tensor[(1, 10), float32]
free_var %y: Tensor[(10, 10), float32]
%0 = fn(%p0: Tensor[(1, 10), float32],
        %p1: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %1 = add(%p0, %p1) # ty=Tensor[(10, 10), float32]
  %1
}
%2 = %0(%x, %y) # ty=Tensor[(10, 10), float32]
%3 = fn(%p01: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %4 = device_copy(%p01, meta[relay.attrs.DeviceCopyAttrs][0]) # ty=Tensor[(10, 10), float32]
  %4
}
%5 = %3(%2) # ty=Tensor[(10, 10), float32]
%6 = fn(%p02: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %7 = sqrt(%p02) # ty=Tensor[(10, 10), float32]
  %7
}
%8 = %6(%5) # ty=Tensor[(10, 10), float32]
%9 = fn(%p03: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %10 = device_copy(%p03, meta[relay.attrs.DeviceCopyAttrs][1]) # ty=Tensor[(10, 10), float32]
  %10
}
%11 = %9(%8) # ty=Tensor[(10, 10), float32]
%11
# meta data omitted. you can use show_meta_data=True to include meta-data

2
free_var %x: Tensor[(1, 10), float32]
free_var %y: Tensor[(10, 10), float32]
%0 = fn(%p0: Tensor[(1, 10), float32],
        %p1: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %1 = add(%p0, %p1) # ty=Tensor[(10, 10), float32]
  %1
}
%2 = %0(%x, %y) # ty=Tensor[(10, 10), float32]
%3 = fn(%p01: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %4 = device_copy(%p01, meta[relay.attrs.DeviceCopyAttrs][0]) # ty=Tensor[(10, 10), float32]
  %4
}
%5 = %3(%2) # ty=Tensor[(10, 10), float32]
%6 = fn(%p02: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %7 = sqrt(%p02) # ty=Tensor[(10, 10), float32]
  %7
}
%8 = %6(%5) # ty=Tensor[(10, 10), float32]
%9 = fn(%p03: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %10 = device_copy(%p03, meta[relay.attrs.DeviceCopyAttrs][1]) # ty=Tensor[(10, 10), float32]
  %10
}
%11 = %9(%8) # ty=Tensor[(10, 10), float32]
%12 = fn(%p04: Tensor[(10, 10), float32],
         %p11: Tensor[(10, 10), float32])
         -> Tensor[(10, 10), float32] {
  %13 = log(%p11) # ty=Tensor[(10, 10), float32]
  %14 = subtract(%p04, %13) # ty=Tensor[(10, 10), float32]
  %14
}
%15 = %12(%11, %2) # ty=Tensor[(10, 10), float32]
%16 = fn(%p05: Tensor[(10, 10), float32])
         -> Tensor[(10, 10), float32] {
  %17 = device_copy(%p05, meta[relay.attrs.DeviceCopyAttrs][2]) # ty=Tensor[(10, 10), float32]
  %17
}
%18 = %16(%15) # ty=Tensor[(10, 10), float32]
%18
# meta data omitted. you can use show_meta_data=True to include meta-data

2
free_var %x: Tensor[(1, 10), float32]
free_var %y: Tensor[(10, 10), float32]
%0 = fn(%p0: Tensor[(1, 10), float32],
        %p1: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %1 = add(%p0, %p1) # ty=Tensor[(10, 10), float32]
  %1
}
%2 = %0(%x, %y) # ty=Tensor[(10, 10), float32]
%3 = fn(%p01: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %4 = device_copy(%p01, meta[relay.attrs.DeviceCopyAttrs][0]) # ty=Tensor[(10, 10), float32]
  %4
}
%5 = %3(%2) # ty=Tensor[(10, 10), float32]
%6 = fn(%p02: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %7 = sqrt(%p02) # ty=Tensor[(10, 10), float32]
  %7
}
%8 = %6(%5) # ty=Tensor[(10, 10), float32]
%8
# meta data omitted. you can use show_meta_data=True to include meta-data

2
free_var %x: Tensor[(1, 10), float32]
free_var %y: Tensor[(10, 10), float32]
%0 = fn(%p0: Tensor[(1, 10), float32],
        %p1: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %1 = add(%p0, %p1) # ty=Tensor[(10, 10), float32]
  %1
}
%2 = %0(%x, %y) # ty=Tensor[(10, 10), float32]
%3 = fn(%p01: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %4 = device_copy(%p01, meta[relay.attrs.DeviceCopyAttrs][0]) # ty=Tensor[(10, 10), float32]
  %4
}
%5 = %3(%2) # ty=Tensor[(10, 10), float32]
%6 = fn(%p02: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %7 = sqrt(%p02) # ty=Tensor[(10, 10), float32]
  %7
}
%8 = %6(%5) # ty=Tensor[(10, 10), float32]
%9 = fn(%p03: Tensor[(10, 10), float32])
        -> Tensor[(10, 10), float32] {
  %10 = device_copy(%p03, meta[relay.attrs.DeviceCopyAttrs][1]) # ty=Tensor[(10, 10), float32]
  %10
}
%11 = %9(%8) # ty=Tensor[(10, 10), float32]
%12 = fn(%p04: Tensor[(10, 10), float32],
         %p11: Tensor[(10, 10), float32])
         -> Tensor[(10, 10), float32] {
  %13 = log(%p11) # ty=Tensor[(10, 10), float32]
  %14 = subtract(%p04, %13) # ty=Tensor[(10, 10), float32]
  %14
}
%15 = %12(%11, %2) # ty=Tensor[(10, 10), float32]
%15
# meta data omitted. you can use show_meta_data=True to include meta-data

2
free_var %x: Tensor[(1, 10), float32]
%x

1

It looks like this is a bug. I will look into it. Thanks for reporting it.

Hi @kazum @zhiics
I know most of the code is from tests/python/relay/test_pass_annotation.py, but I don't really understand the intuition of how the code does what it is supposed to do. Would you mind commenting on my questions?

x = relay.var("x", shape=(1, 10))
y = relay.var("y", shape=(10, 10))
add = relay.add(x, y)
sqrt = relay.sqrt(add)
_sqrt = relay.annotation.on_device(sqrt, dev_ctx)
log = relay.log(add)
subtract = relay.subtract(sqrt, log)
  1. About that last line: why isn't it subtract = relay.subtract(_sqrt, log)? I would guess that _sqrt is just a copy of the original expression with the annotation added, so why not give it as the input to subtract?

  2. What does the line func = relay.Function([x, y], relay.Tuple(tvm.convert([_sqrt, _exp, exp]))) do? Is this basically saying “replace the copies of _sqrt and _exp in the graph with output exp and create a new function”?

  3. Why do you call relay.ir_pass.rewrite_annotated_ops(func, cpu_ctx.device_type) and not relay.ir_pass.rewrite_annotated_ops(func, dev_ctx.device_type)? My intuition would have been that rewrite_annotated_ops would require the dev_ctx device, not the cpu_ctx.

  4. How does relay.build() handle the fact that now target is a dictionary?

Thanks a lot :slight_smile:

@kazum, @imorinaga The problem is that I stepped into the fused ops and appended them to the post_dfs_order list. I shouldn't have done that, because we should only consider the call nodes and check the device_copy node in the callee function. I will fix it. Sorry for the inconvenience.

Best,
Zhi

@aca88 Thanks for asking.

  1. Yes, we could do it the way you are mentioning here and replace the annotation nodes with copy nodes, but then users have to re-connect the AST manually. That might not be convenient when the network is large, so I let users annotate the expr and reconnect it in the program later (the alternative wiring is sketched after this list).

  2. This is related to your question 1. Yes, if we do it the way you mentioned above, we can rewrite the program instead of annotating it. Otherwise, we need to pass the annotation nodes along to make sure we can collect them when we traverse the tree from the exit node.

  3. The second argument in rewrite_annotated_ops is the fallback device type. It could be any device. In the example, I let the nodes that are not specifically annotated fall back to the cpu.

  4. Please refer to the code here: https://github.com/dmlc/tvm/blob/master/python/tvm/relay/build_module.py#L264
    Only the target parameter of the build interface is changed. The fallback device is passed through the build config.
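
To make points 1, 3, and 4 concrete, here is a minimal, untested sketch combining the alternative wiring from question 1 with the dictionary target and the config-based fallback. The fallback_device option of relay.build_config is my assumption based on the test code of that era, so double-check it against your TVM version:

import tvm
import tvm.relay as relay

dev_ctx = tvm.context("cuda")
cpu_ctx = tvm.context("cpu")

x = relay.var("x", shape=(1, 10))
y = relay.var("y", shape=(10, 10))
add = relay.add(x, y)
# Connect the annotated expressions directly instead of keeping them on the side.
sqrt = relay.annotation.on_device(relay.sqrt(add), dev_ctx)
log = relay.log(add)
subtract = relay.subtract(sqrt, log)
exp = relay.annotation.on_device(relay.exp(subtract), dev_ctx)

func = relay.Function([x, y], exp)
func = relay.ir_pass.infer_type(func)
# Ops without an annotation fall back to the device type given as the second argument.
func = relay.ir_pass.rewrite_annotated_ops(func, cpu_ctx.device_type)

# `target` maps device names to compilation targets; the fallback device is
# passed through the build config (fallback_device is assumed to exist here).
target = {"cpu": "llvm", "cuda": "cuda"}
with relay.build_config(opt_level=1, fallback_device=cpu_ctx):
    graph, lib, params = relay.build(func, target=target)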

Hopefully this helps.

Thanks for the fast reply :slight_smile:

Yes, this really helped a lot.

I will try to look into the code and see if I come up with any other questions.

@zhiics, thanks for your fix in https://github.com/dmlc/tvm/pull/2622. It looks like it solves the problem. :slight_smile:

I have another question. Can we remove “relay.ir_pass.rewrite_annotated_ops” from my example? I think relay.build will apply the pass at https://github.com/dmlc/tvm/blob/master/python/tvm/relay/build_module.py#L352, but that code doesn't seem to work as expected.

I tried the following code and nvprof said that GPU is not used at all.

import tvm
import tvm.relay as relay
import tvm.contrib.graph_runtime
import numpy as np

fallback_device = tvm.context("cpu")
target = {"cpu": "llvm", "cuda": "cuda"}
dev_ctx = tvm.context("cuda")
cpu_ctx = fallback_device

x = relay.var("x", shape=(1, 10))
y = relay.var("y", shape=(10, 10))
add = relay.add(x, y)
sqrt = relay.sqrt(add)
_sqrt = relay.annotation.on_device(sqrt, dev_ctx)
log = relay.log(add)
subtract = relay.subtract(sqrt, log)
exp = relay.exp(subtract)
_exp = relay.annotation.on_device(exp, dev_ctx)

func = relay.Function([x, y], relay.Tuple(tvm.convert([_sqrt, _exp, exp])))
func = relay.Function(relay.ir_pass.free_vars(func.body[2]), func.body[2])

x_data = np.random.rand(1, 10).astype('float32')
y_data = np.random.rand(10, 10).astype('float32')
params = {"x": x_data, "y": y_data}

with relay.build_config(opt_level=1):
    graph, lib, params = relay.build(func, target=target, params=params)

module = tvm.contrib.graph_runtime.create(graph, lib, [cpu_ctx, dev_ctx])
module.set_input(**params)
module.run()
module.get_output(0).asnumpy()

Yes, this is actually related to the question @aca88 asked. If you pass it that way without connecting the nodes, the annotation nodes go away. You can print the func before you call build and you will see it is the same as the original graph. Or you can add:

func = ir_pass.infer_type(func)
func = expr.Function(ir_pass.free_vars(func.body[-1]), func.body[-1])
# or, in a better way, if you only have one output:
func = ir_pass.infer_type(func)
if isinstance(func.body, (list, tuple, expr.Tuple)):
    func = expr.Function(ir_pass.free_vars(func.body[-1]), func.body[-1])

before device_map at line 355 in https://github.com/dmlc/tvm/blob/master/python/tvm/relay/build_module.py,
and you can then pass func = relay.Function([x, y], relay.Tuple(tvm.convert([_sqrt, _exp, exp]))). I think you would then get the same result. The reason I didn't put these two lines there is that it might cause problems when you have multiple outputs. I planned to send a PR to support this, but I haven't had much time recently. You are welcome to work on it if you are interested.
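
A quick way to check whether the annotations survived, as suggested above (a small sketch, assuming the func from the snippet you posted):

func = relay.ir_pass.infer_type(func)
# If no device_copy(...) calls appear in the dump, the on_device annotations
# were dropped and everything will run on the fallback device.
print(func)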

@zhiics, thanks for your explanation. I’ve understood how heterogeneous execution is implemented in TVM. The current flow to use heterogeneous execution looks reasonable to me. :slight_smile:

Thanks a lot!
