How to map nn.conv2d to VTA?

I’m studying the VTA design and how it is mapped from TVM. The resnet18 tutorial is good; however, resnet18 itself is too complicated to follow. Instead, I’m trying a simple nn.conv2d + nn.relu network, as below:

from tvm import relay

def conv2d(data, weight=None, **kwargs):
    """Wrap relay.nn.conv2d, creating a named weight variable when none is given."""
    name = kwargs.pop("name")
    if weight is None:
        weight = relay.var(name + "_weight")
    return relay.nn.conv2d(data, weight, **kwargs)

def conv_block(data, name, channels, kernel_size=(3, 3), strides=(1, 1),
               padding=(1, 1)):
    """A simple conv2d + relu block (no batch_norm, so the tutorial's epsilon is dropped)."""
    conv = conv2d(
        data=data,
        channels=channels,
        kernel_size=kernel_size,
        strides=strides,
        padding=padding,
        data_layout='NCHW',
        name=name + '_conv')
    act = relay.nn.relu(data=conv)
    return act
...
data_shape = (1, 3, 224, 224)
kernel_shape = (32, 3, 3, 3)
dtype = "float32"
data = relay.var("data", shape=data_shape, dtype=dtype)
act = conv_block(data, "graph", 32, strides=(2, 2))
func = relay.Function([data], act)
net = relay.frontend.common.infer_type(func)
...
mod = IRModule.from_expr(net)
...
with relay.build_config(opt_level=3, disabled_pass={"AlterOpLayout"}):
    with vta.build_config(debug_flag=1):
        graph, lib, params = relay.build(
            relay_prog, target, params=params, target_host=env.target_host)
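For completeness, target and env in the snippet above come from the VTA environment, roughly as in the VTA tutorials (RPC/bitstream setup elided):

import vta

env = vta.get_env()    # reads the active vta_config.json (sim, pynq, ...)
target = env.target    # the "ext_dev" VTA target passed to relay.build above
# env.target_host is the matching host (ARM CPU / llvm) target.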

However, I noticed that the conv2d is always mapped to the arm_cpu side instead of the VTA/FPGA side. Tracing into vta_conv2d.py shows this is because of the data_layout='NCHW' specified in the conv2d definition. How can I change the definition so that this conv2d is mapped to the VTA side? Thanks. The relevant strategy code (from vta/python/vta/top/op.py) is:

@_strategy.conv2d_strategy.register("vta")
def conv2d_strategy_vta(attrs, inputs, out_type, target):
    """conv2d vta strategy"""
    strategy = OpStrategy()
    kernel = inputs[1]
    dilation = topi.util.get_const_tuple(attrs.dilation)
    groups = attrs.groups
    layout = attrs.data_layout

    assert dilation == (1, 1), "support for dilation limited to (1, 1)"
    if is_packed_layout(layout):
        if groups == 1:
            env = get_env()
            assert env.LOG_INP_WIDTH == 3, "only support 8bit inp for now"
            assert env.LOG_WGT_WIDTH == 3, "only support 8bit wgt for now"
            assert kernel.dtype == "int8"

            strategy.add_implementation(
                _strategy.wrap_compute_conv2d(conv2d_packed, True),
                _strategy.wrap_topi_schedule(schedule_conv2d_packed),
                name="conv2d_packed.vta")
        else: # group_conv2d
            strategy.add_implementation(
                _strategy.wrap_compute_conv2d(group_conv2d_packed, has_groups=True),
                _strategy.wrap_topi_schedule(schedule_group_conv2d_packed),
                name="group_conv2d_packed.vta")
        return strategy

    # If it's not packed, run on ARM CPU
    print("vta/python/vta/top/op.py: conv2d layout not packed, run on ARM CPU.")
    arm_tgt = tvm.target.arm_cpu(target.model)
    return _strategy.arm_cpu.conv2d_strategy_arm_cpu(attrs, inputs, out_type, arm_tgt)
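As the strategy above shows, only a packed data layout reaches the VTA implementations; plain NCHW falls through to the ARM CPU branch at the bottom. For reference, here is a minimal sketch of what the packed layout strings look like for the default VTA configuration (BATCH=1, BLOCK_IN=BLOCK_OUT=16); the exact format strings are my own reading of graph_pack and may differ:

import vta

env = vta.get_env()

# Data is packed as NCHWnc and weights as OIHWoi, so the layout string carries
# the inner tensorization factors. Only such layouts pass is_packed_layout().
data_layout = "NCHW%dn%dc" % (env.BATCH, env.BLOCK_IN)       # e.g. "NCHW1n16c"
kernel_layout = "OIHW%do%di" % (env.BLOCK_OUT, env.BLOCK_IN)  # e.g. "OIHW16o16i"
print(data_layout, kernel_layout)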

I modified my source code to mimic deploy_classification.py, adding the quantization and graph_pack() steps (a sketch of what I added is further below, after the error trace). Now compilation goes well until it starts lowering the conv2d + relu function:


tvm/python/tvm/relay/backend/compile_engine.py, select_implementation(), op.name= nn.conv2d
  valid implementation  0 :  conv2d_packed.vta plevel= 10
  selected best_plevel_implementation:  conv2d_packed.vta
tvm/python/tvm/relay/backend/compile_engine.py, select_implementation(), op.name= nn.relu
  valid implementation  0 :  injective.cpu plevel= 10
  selected best_plevel_implementation:  injective.cpu
tvm/python/tvm/relay/backend/_backend.py: lower function:  fused_nn_conv2d_nn_relu
lower phase 0
lower phase 1
Traceback (most recent call last):
...
  [bt] (1) /work/git_repo/tvm/build/libtvm.so(tvm::tir::CopyIntrinInjector::VisitStmt_(tvm::tir::AttrStmtNode const*)+0x1b8) [0x7fa6cd7d6308]
  [bt] (0) /work/git_repo/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4a) [0x7fa6cd3ce070]
  T_relu[(((ax2.outer*896) + (ax3.outer*16)) + ax5)] = max(res[ax5], 0)
  File "/work/git_repo/tvm/src/tir/pass/inject_copy_intrin.cc", line 49
  File "tvm/_ffi/_cython/./packed_func.pxi", line 54, in tvm._ffi._cy3.core.tvm_callback
  File "/work/git_repo/tvm/python/tvm/relay/backend/_backend.py", line 62, in lower
    raise RuntimeError(msg)
...
  [bt] (1) /work/git_repo/tvm/build/libtvm.so(tvm::tir::CopyIntrinInjector::VisitStmt_(tvm::tir::AttrStmtNode const*)+0x1b8) [0x7fa6cd7d6308]
  [bt] (0) /work/git_repo/tvm/build/libtvm.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4a) [0x7fa6cd3ce070]
  T_relu[(((ax2.outer*896) + (ax3.outer*16)) + ax5)] = max(res[ax5], 0)
  File "/work/git_repo/tvm/src/tir/pass/inject_copy_intrin.cc", line 49
TVMError: Check failed: MatchCopyPattern(op->body, &ret): Cannot match copy pattern of for (ax5, 0, 16) {
}

Apparently, nn.conv2d has been mapped to VTA (conv2d_packed.vta), but nn.relu stays on the CPU side (injective.cpu), so a copy intrinsic has to be inserted between the two. However, the pass complains “Cannot match copy pattern of for (ax5, 0, 16) { }”, which leaves me with no clue. I would appreciate it if anyone could provide a hint.
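For reference, the quantize + graph_pack part of my modified script looks roughly like this. It is a sketch following deploy_classification.py; the global_scale, skip_conv_layers and start_name/stop_name values are my own guesses for this tiny conv + relu graph, so they may themselves be part of the problem:

from tvm import relay
import vta
from vta.top import graph_pack

env = vta.get_env()

# Quantize to int8 so the conv2d satisfies the dtype asserts in the VTA strategy.
with relay.quantize.qconfig(global_scale=8.0, skip_conv_layers=[]):
    mod = relay.quantize.quantize(mod, params=params)

# Pack data/weights into VTA's tensorized layout so that is_packed_layout()
# is satisfied and conv2d_packed.vta gets selected.
assert env.BLOCK_IN == env.BLOCK_OUT
relay_prog = graph_pack(
    mod["main"],
    env.BATCH,
    env.BLOCK_OUT,
    env.WGT_WIDTH,
    start_name="nn.conv2d",   # guesses for this tiny graph
    stop_name="nn.relu")

# relay_prog then goes into relay.build() under vta.build_config(),
# as in the build snippet earlier in this thread.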

Another question: why can’t the nn.relu be mapped to the VTA ALU?

@jinchenglee You might be interested in looking into test_vta_insn.py for how relu is mapped to the ALU, and into test_benchmark_topi_conv2d.py for how a simple conv2d can be offloaded onto the VTA hardware.
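Roughly speaking, the pattern in that test looks like the sketch below. This is from memory rather than the exact test code; the key points are that the relu stage lives in the accumulator scope and is tagged with the env.alu pragma, while loads and stores are tagged with env.dma_copy:

import tvm
from tvm import te
import vta

env = vta.get_env()

# Shapes are already in VTA's packed layout: (n_outer, c_outer, BATCH, BLOCK_OUT).
m, n = 4, 8
a = te.placeholder((m, n, env.BATCH, env.BLOCK_OUT), name="a", dtype=env.acc_dtype)

# DRAM -> accumulator SRAM copy stage.
a_buf = te.compute((m, n, env.BATCH, env.BLOCK_OUT), lambda *i: a(*i), "a_buf")
# relu as an element-wise max with 0, to be executed on the vector ALU.
relu_buf = te.compute((m, n, env.BATCH, env.BLOCK_OUT),
                      lambda *i: te.max(a_buf(*i), 0), "relu_buf")
# Narrow back to the input dtype and copy SRAM -> DRAM.
res = te.compute((m, n, env.BATCH, env.BLOCK_OUT),
                 lambda *i: relu_buf(*i).astype(env.inp_dtype), "res")

s = te.create_schedule(res.op)
s[a_buf].set_scope(env.acc_scope)                  # stage in accumulator SRAM
s[relu_buf].set_scope(env.acc_scope)
s[a_buf].pragma(a_buf.op.axis[0], env.dma_copy)    # load via DMA
s[relu_buf].pragma(relu_buf.op.axis[0], env.alu)   # run on the VTA ALU
s[res].pragma(res.op.axis[0], env.dma_copy)        # store via DMA

mod = vta.build(s, [a, res], "ext_dev", env.target_host, name="relu")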

@liangfu, thanks for your reply. Those examples use tensor expressions directly to construct the compute and the schedule, then call vta.build(schedule, …). I want to use relay.build() to compile Relay IR directly, which is closer to the neural-network import flow.

Any idea?


Any update on this? I’m facing a similar error.
