Hello TVM/VTA experts.
I’m trying to run VTA on my Xilinx/Avnet Ultra96-v2 with the Ultra96-PYNQ v2.4 image.
I was able to run the VTA tutorials test_program_rpc.py and test_benchmark_topi_conv2d.py successfully, but test_vta_insn.py failed.
The failure seems to depend on the VTALoadBuffer2D instruction, but I couldn’t find out why the instruction doesn’t work. Could you give me any ideas on how to fix this?
My Environment:
- Ubuntu 16.04
- Xilinx Vivado 2018.3
- TVM 588523d (25th Feb 2020)
- Avnet Ultra96-v2 board
- PYNQ 2.4 with a slight modification (the ".set. 0xFF41A040 = 0x3" setting) – *1
Build:
- Build the PYNQ 2.4 image from https://github.com/Avnet/Ultra96-PYNQ/releases plus https://github.com/Xilinx/PYNQ , with the “.set. 0xFF41A040 = 0x3” patch
- Write Ultra96-2.4.img to SD card with balenaEtcher
- Boot up my ultra96-v2 with the SD card
- Build the VTA runtime on Ultra96-PYNQ (with vta/config/ultra96_sample.json as vta_config.json, plus “USE_VTA_FPGA ON”)
- Build TVM on Ubuntu16.04 Host PC (with vta/config/ultra96_sample.json)
Test:
- Run start_rpc_server.sh on Ultra96-PYNQ
- Run python test on Host PC
Result:
- test_runtime_array()@test_vta_insn.py
- test_save_load_out()@test_vta_insn.py
- test_padded_load()@test_vta_insn.py
- test_gemm()@test_vta_insn.py
- test_alu()@test_vta_insn.py
- test_relu()@test_vta_insn.py
- test_shift_and_scale()@test_vta_insn.py
- test_benchmark_topi_dense.py
- test_benchmark_topi_conv2d.py
- test_benchmark_topi_group_conv2d.py
- test_benchmark_topi_conv2d_transpose.py
When test_padded_load()@test_vta_insn.py failed, I saw this message:
File "tests/python/unittest/test_vta_insn.py", line 154, in _run
check_padded_load([2, 0, 0, 0], [0, 0, 0, 0], test_name="Y0")
To check how the matrix was wrong, I wrote the code below. From the results, no x_pad/y_pad was applied on the Ultra96-PYNQ device.
I tried rebuilding my own bitstream (vta.bit), but the result did not change.
I also read load_pad_2d() in hardware/xilinx/src/vta.cc but couldn’t find any reason why padding doesn’t work, so please let me know if you have any idea how to resolve this issue.
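For reference, this is my understanding of what a padded 2D load is supposed to do, written as a hedged NumPy sketch (not the actual HLS code; each element here stands in for one BATCH×BLOCK_OUT tile, and the parameter names follow my reading of the runtime API):

```python
import numpy as np

def load_pad_2d_ref(src, x_size, y_size, x_stride,
                    x_pad_before, y_pad_before, x_pad_after, y_pad_after):
    """NumPy model of a padded 2D load: copy y_size rows of x_size
    elements from src, placed inside a zero-filled destination buffer
    with the requested padding around them."""
    src = np.asarray(src).reshape(-1)
    out_rows = y_pad_before + y_size + y_pad_after
    out_cols = x_pad_before + x_size + x_pad_after
    dst = np.zeros((out_rows, out_cols), dtype=src.dtype)  # pad regions stay zero
    for y in range(y_size):
        row = src[y * x_stride : y * x_stride + x_size]
        dst[y_pad_before + y, x_pad_before : x_pad_before + x_size] = row
    return dst

# The parameters from my test case: x_size=1, y_size=2, x_stride=1,
# x_pad_before=1, y_pad_before=0, x_pad_after=0, y_pad_after=1.
print(load_pad_2d_ref([5, 7], 1, 2, 1, 1, 0, 0, 1))
```

With the two data tiles written as 5 and 7, this produces the zero column before the data and the zero row after it, i.e. the same pattern as the expected y_numpy result further below.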
Code:
from __future__ import absolute_import, print_function
import os
import tvm
import vta
import numpy as np
from tvm import rpc
from tvm.contrib import util
env = vta.get_env()
host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.174.6")
port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))
remote = rpc.connect(host, port)
vta.reconfig_runtime(remote)
vta.program_fpga(remote, None)
import topi
m = 1
o = 2
pad_before = (0,1,0,0)
pad_after = (1,0,0,0)
x = tvm.placeholder((o, m, env.BATCH, env.BLOCK_OUT), name="x", dtype=env.acc_dtype)
x_buf = topi.nn.pad(x, pad_before, pad_after, name="x_buf")
y_buf = tvm.compute((o + pad_before[0] + pad_after[0],
                     m + pad_before[1] + pad_after[1],
                     env.BATCH, env.BLOCK_OUT),
                    lambda *i: x_buf(*i) >> 0, "y_buf")
y = tvm.compute((o + pad_before[0] + pad_after[0],
                 m + pad_before[1] + pad_after[1],
                 env.BATCH, env.BLOCK_OUT),
                lambda *i: y_buf(*i).astype(env.inp_dtype), "y")
s = tvm.create_schedule(y.op)
s[x_buf].set_scope(env.acc_scope)
s[x_buf].pragma(x_buf.op.axis[0], env.dma_copy)
s[y_buf].set_scope(env.acc_scope)
s[y_buf].pragma(y_buf.op.axis[0], env.alu)
s[y].pragma(y.op.axis[0], env.dma_copy)
print(vta.lower(s, [x, y], simple_mode=True)) # ---*2
with vta.build_config():
my_pad_load = vta.build(s, [x, y], "ext_dev", env.target_host)
temp = util.tempdir()
my_pad_load.save(temp.relpath("pad_load.o"))
remote.upload(temp.relpath("pad_load.o"))
f = remote.load_module("pad_load.o")
ctx = remote.ext_dev(0)
x_np = np.random.randint(1,10, size=(o, m, env.BATCH, env.BLOCK_OUT)).astype(x.dtype)
y_np = np.pad(x_np,
              np.vstack([np.array(pad_before), np.array(pad_after)]).T
              ).astype(y.dtype)
x_nd = tvm.nd.array(x_np, ctx)
y_nd = tvm.nd.empty(y_np.shape, ctx=ctx, dtype=y_np.dtype)
f(x_nd, y_nd)
print("y_numpy:", y_np)
print("y_VTA:", y_nd.asnumpy())
The result is below. It seems no padding was applied in the Ultra96’s FPGA memory, even though the x_pad/y_pad parameters of VTALoadBuffer2D() were set properly (*2).
y_numpy: [[[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
[[2 3 6 4 2 1 8 2 6 8 5 4 7 6 5 7]]]
[[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
[[7 8 9 4 1 7 4 1 6 3 4 6 8 1 2 4]]]
[[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]]]
y_VTA: [[[[2 3 6 4 2 1 8 2 6 8 5 4 7 6 5 7]]
[[7 8 9 4 1 7 4 1 6 3 4 6 8 1 2 4]]]
[[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]]
[[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]]]
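To make the symptom concrete: the observed y_VTA is exactly what a load that ignores the pad parameters would produce, with the data rows copied contiguously from buffer offset 0 and only the remainder zero-filled. A hedged NumPy model of this buggy behavior (each scalar again standing in for one 1×16 tile):

```python
import numpy as np

def load_no_pad(src, x_size, y_size, x_stride, out_rows, out_cols):
    """Model of the behavior I observe on the board: the pad offsets are
    ignored, the data rows land at the start of the destination buffer,
    and the rest of the buffer stays zero."""
    src = np.asarray(src).reshape(-1)
    dst = np.zeros(out_rows * out_cols, dtype=src.dtype)
    for y in range(y_size):
        dst[y * x_size : (y + 1) * x_size] = src[y * x_stride : y * x_stride + x_size]
    return dst.reshape(out_rows, out_cols)

# Same parameters as my test case, with out shape (3, 2).
print(load_no_pad([5, 7], 1, 2, 1, 3, 2))
```

This reproduces the y_VTA pattern above (the two data tiles first, then zeros), which is why I suspect the padding path of the LOAD instruction rather than the TVM/IR side.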
*1 https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18842098/Zynq+UltraScale+MPSoC+Cache+Coherency Without this setting, test_benchmark_topi_conv2d.py also fails (Conv2d on the FPGA never returns).
*2 According to the VTA schedule printout, the x_pad/y_pad parameters of VTALoadBuffer2D() seem to be set properly:
// attr [x_buf] storage_scope = "local.acc_buffer"
// attr [iter_var(vta, , vta)] coproc_scope = 2
produce x_buf {
VTALoadBuffer2D(tvm_thread_context(VTATLSCommandHandle()), x, 0, 1, 2, 1, 1, 0, 0, 1, 0, 3)
}
// attr [iter_var(vta, , vta)] coproc_scope = 2
// attr [iter_var(vta, , vta)] coproc_uop_scope = "VTAPushALUOp"
produce y_buf {
VTAUopLoopBegin(6, 1, 1, 0)
VTAUopPush(1, 0, 0, 0, 0, 3, 1, 0)
VTAUopLoopEnd()
}
vta.coproc_dep_push(2, 3)
// attr [iter_var(vta, , vta)] coproc_scope = 3
vta.coproc_dep_pop(2, 3)
produce y {
VTAStoreBuffer2D(tvm_thread_context(VTATLSCommandHandle()), 0, 4, y, 0, 6, 1, 6)
}
vta.coproc_sync()
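For reference, here is how I map the positional arguments of that VTALoadBuffer2D call to parameter names; the names are an assumption based on my reading of the runtime signature in vta/src/runtime.cc, so please correct me if they are wrong:

```python
# Arguments after the command handle and the src pointer (x) in:
# VTALoadBuffer2D(cmd, x, 0, 1, 2, 1, 1, 0, 0, 1, 0, 3)
args = [0, 1, 2, 1, 1, 0, 0, 1, 0, 3]
# Parameter names assumed from the VTA runtime signature.
names = ["src_elem_offset", "x_size", "y_size", "x_stride",
         "x_pad_before", "y_pad_before", "x_pad_after", "y_pad_after",
         "dst_sram_index", "dst_memory_type"]
call = dict(zip(names, args))
print(call)
```

Under that reading, x_pad_before=1 and y_pad_after=1 match pad_before=(0,1,0,0) and pad_after=(1,0,0,0), so the IR side looks correct and the pads appear to be dropped on the device.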