[VTA] VTALoadBuffer2D() padding does not seem to work on Xilinx/Avnet Ultra96v2

Hello TVM/VTA experts.

I’m trying to run VTA on my Xilinx/Avnet Ultra96-v2 with the Ultra96-PYNQ v2.4 image.

I was able to run test_program_rpc.py and test_benchmark_topi_conv2d.py from the VTA tutorial successfully, but test_vta_insn.py fails.

The failure seems to come from the VTALoadBuffer2D instruction, but I couldn’t find out why the instruction doesn’t work. Could you suggest how to fix it?

My Environment:

  • Ubuntu 16.04
  • Xilinx Vivado 2018.3
  • TVM 588523d (25th Feb 2020)
  • Avnet Ultra96-v2 board
  • PYNQ 2.4 with slight modification (".set. 0xFF41A040 = 0x3" setting) – *1

Build:

  • Build the PYNQ 2.4 image from https://github.com/Avnet/Ultra96-PYNQ/releases plus https://github.com/Xilinx/PYNQ , with the “.set. 0xFF41A040 = 0x3” patch
  • Write Ultra96-2.4.img to SD card with balenaEtcher
  • Boot up my ultra96-v2 with the SD card
  • Build VTA runtime on Ultra96-PYNQ (with vta/config/ultra96_sample.json + “USE_VTA_FPGA ON” of vta_config.json)
  • Build TVM on Ubuntu16.04 Host PC (with vta/config/ultra96_sample.json)

Test:

  • Run start_rpc_server.sh on Ultra96-PYNQ
  • Run python test on Host PC

Result:

  • test_runtime_array()@test_vta_insn.py OK
  • test_save_load_out()@test_vta_insn.py OK
  • test_padded_load()@test_vta_insn.py FAILED
  • test_gemm()@test_vta_insn.py OK
  • test_alu()@test_vta_insn.py OK
  • test_relu()@test_vta_insn.py OK
  • test_shift_and_scale()@test_vta_insn.py OK
  • test_benchmark_topi_dense.py OK
  • test_benchmark_topi_conv2d.py OK
  • test_benchmark_topi_group_conv2d.py OK
  • test_benchmark_topi_conv2d_transpose.py FAILED

When test_padded_load()@test_vta_insn.py failed, I saw this message:

  File "tests/python/unittest/test_vta_insn.py", line 154, in _run
    check_padded_load([2, 0, 0, 0], [0, 0, 0, 0], test_name="Y0")

To see how the matrix was wrong, I wrote the code below. The results show that no x_pad/y_pad was applied on the Ultra96-PYNQ device.

I also tried building my own bitstream (vta.bit), but the result did not change.

I read load_pad_2d() in hardware/xilinx/src/vta.cc but couldn’t find any reason why padding doesn’t work, so please let me know if you have any idea how to resolve this issue.

Code:

from __future__ import absolute_import, print_function
import os
import tvm
import vta
import numpy as np
from tvm import rpc
from tvm.contrib import util

env = vta.get_env()
host = os.environ.get("VTA_PYNQ_RPC_HOST", "192.168.174.6")
port = int(os.environ.get("VTA_PYNQ_RPC_PORT", "9091"))

remote = rpc.connect(host, port)
vta.reconfig_runtime(remote)
vta.program_fpga(remote, None)

import topi
m = 1
o = 2
pad_before = (0,1,0,0)
pad_after = (1,0,0,0)
x = tvm.placeholder((o, m, env.BATCH, env.BLOCK_OUT), name="x", dtype=env.acc_dtype)
x_buf = topi.nn.pad(x, pad_before, pad_after, name="x_buf")
y_buf = tvm.compute((o+pad_before[0]+pad_after[0],
                     m+pad_before[1]+pad_after[1],
                     env.BATCH, env.BLOCK_OUT), lambda *i: x_buf(*i) >> 0, "y_buf")
y = tvm.compute((o+pad_before[0]+pad_after[0],
                     m+pad_before[1]+pad_after[1],
                     env.BATCH, env.BLOCK_OUT), lambda *i: y_buf(*i).astype(env.inp_dtype), "y")
s = tvm.create_schedule(y.op)

s[x_buf].set_scope(env.acc_scope)
s[x_buf].pragma(x_buf.op.axis[0], env.dma_copy)
s[y_buf].set_scope(env.acc_scope)
s[y_buf].pragma(y_buf.op.axis[0], env.alu)
s[y].pragma(y.op.axis[0], env.dma_copy)

print(vta.lower(s, [x, y], simple_mode=True)) # ---*2

with vta.build_config():
    my_pad_load = vta.build(s, [x, y], "ext_dev", env.target_host)
temp = util.tempdir()
my_pad_load.save(temp.relpath("pad_load.o"))
remote.upload(temp.relpath("pad_load.o"))
f = remote.load_module("pad_load.o")
ctx = remote.ext_dev(0)
x_np = np.random.randint(1,10, size=(o, m, env.BATCH, env.BLOCK_OUT)).astype(x.dtype)
y_np = np.pad(x_np,
              np.vstack([np.array(pad_before), np.array(pad_after)]).T,
              mode="constant"  # explicit zero padding; NumPy < 1.17 requires the mode argument
             ).astype(y.dtype)
x_nd = tvm.nd.array(x_np, ctx)
y_nd = tvm.nd.empty(y_np.shape, ctx=ctx, dtype=y_np.dtype)
f(x_nd, y_nd)
print("y_numpy:", y_np)
print("y_VTA:", y_nd.asnumpy())

The result is below. It seems no padding was applied in the Ultra96’s FPGA memory, even though the x_pad/y_pad arguments of VTALoadBuffer2D() were set properly (*2).

y_numpy: [[[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
  [[2 3 6 4 2 1 8 2 6 8 5 4 7 6 5 7]]]
 [[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
  [[7 8 9 4 1 7 4 1 6 3 4 6 8 1 2 4]]]
 [[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
  [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]]]
y_VTA: [[[[2 3 6 4 2 1 8 2 6 8 5 4 7 6 5 7]]
  [[7 8 9 4 1 7 4 1 6 3 4 6 8 1 2 4]]]
 [[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
  [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]]
 [[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
  [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]]]
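One way to read the output above: y_VTA looks like x copied contiguously from offset 0, with the padding positions skipped entirely. Below is a minimal NumPy-only sketch to check that reading, using the values printed above. The helper name starts_with_unpadded_input is made up for illustration, and the zero tail in y_VTA may just be whatever happened to be in the buffer, so the helper only checks the prefix.

```python
import numpy as np

def starts_with_unpadded_input(x_np, y_dev):
    """True if y_dev, read flat, begins with x_np copied contiguously (no padding)."""
    flat = np.asarray(y_dev).reshape(-1)
    src = np.asarray(x_np).reshape(-1)
    return bool(np.array_equal(flat[:src.size], src))

# Values copied from the printed output above.
a = [2, 3, 6, 4, 2, 1, 8, 2, 6, 8, 5, 4, 7, 6, 5, 7]
b = [7, 8, 9, 4, 1, 7, 4, 1, 6, 3, 4, 6, 8, 1, 2, 4]
x_np = np.array([a, b], dtype=np.int32).reshape(2, 1, 1, 16)

# Expected result: pad_before=(0,1,0,0), pad_after=(1,0,0,0).
y_expected = np.pad(x_np, [(0, 1), (1, 0), (0, 0), (0, 0)], mode="constant")

# Observed device result: the two input rows first, zeros after.
y_observed = np.zeros((3, 2, 1, 16), dtype=np.int32)
y_observed[0, 0, 0] = a
y_observed[0, 1, 0] = b

print("observed is unpadded copy:", starts_with_unpadded_input(x_np, y_observed))
print("expected is unpadded copy:", starts_with_unpadded_input(x_np, y_expected))
```

If the first line prints True and the second False, the data itself arrives intact and only the padding offsets are missing, which points at the load path rather than at the DMA transfer.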

*1 https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18842098/Zynq+UltraScale+MPSoC+Cache+Coherency Without this setting, test_benchmark_topi_conv2d.py also fails (conv2d on the FPGA never returns).

*2 According to the printed VTA schedule, the x_pad/y_pad parameters of VTALoadBuffer2D() seem to be set properly.

// attr [x_buf] storage_scope = "local.acc_buffer"
// attr [iter_var(vta, , vta)] coproc_scope = 2
produce x_buf {
  VTALoadBuffer2D(tvm_thread_context(VTATLSCommandHandle()), x, 0, 1, 2, 1, 1, 0, 0, 1, 0, 3)
}
// attr [iter_var(vta, , vta)] coproc_scope = 2
// attr [iter_var(vta, , vta)] coproc_uop_scope = "VTAPushALUOp"
produce y_buf {
  VTAUopLoopBegin(6, 1, 1, 0)
  VTAUopPush(1, 0, 0, 0, 0, 3, 1, 0)
  VTAUopLoopEnd()
}
vta.coproc_dep_push(2, 3)
// attr [iter_var(vta, , vta)] coproc_scope = 3
vta.coproc_dep_pop(2, 3)
produce y {
  VTAStoreBuffer2D(tvm_thread_context(VTATLSCommandHandle()), 0, 4, y, 0, 6, 1, 6)
}
vta.coproc_sync()
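To make the VTALoadBuffer2D call in the dump above easier to read, here is a small sketch that labels its numeric arguments. The parameter order and the memory type IDs are my reading of the VTALoadBuffer2D declaration in the VTA runtime header and of the VTA_MEM_ID_* defines, so please treat them as assumptions and check them against your checkout. decode_load is just a hypothetical helper, not part of VTA.

```python
# Assumed parameter order after the command handle and DRAM pointer
# (verify against the runtime header in your TVM checkout).
LOAD_ARGS = (
    "src_elem_offset", "x_size", "y_size", "x_stride",
    "x_pad_before", "y_pad_before", "x_pad_after", "y_pad_after",
    "dst_sram_index", "dst_memory_type",
)

# Assumed VTA_MEM_ID_* values.
MEM_ID = {0: "UOP", 1: "WGT", 2: "INP", 3: "ACC", 4: "OUT"}

def decode_load(args):
    """Label the numeric arguments of a VTALoadBuffer2D call from a schedule dump."""
    if len(args) != len(LOAD_ARGS):
        raise ValueError("expected %d arguments, got %d" % (len(LOAD_ARGS), len(args)))
    decoded = dict(zip(LOAD_ARGS, args))
    decoded["dst_memory_type"] = MEM_ID.get(decoded["dst_memory_type"], "unknown")
    return decoded

# Numbers taken from the dump: VTALoadBuffer2D(..., x, 0, 1, 2, 1, 1, 0, 0, 1, 0, 3)
decoded = decode_load((0, 1, 2, 1, 1, 0, 0, 1, 0, 3))
for name in LOAD_ARGS:
    print(name, "=", decoded[name])
```

Under these assumptions, the call requests x_pad_before=1 and y_pad_after=1 with the accumulator buffer as destination, which is consistent with pad_before=(0,1,0,0) and pad_after=(1,0,0,0) in the test script.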

For reference, my patch for PYNQ is below.

kuroy@kuroy-VirtualBox:~/git/PYNQ$ git diff
diff --git a/sdbuild/Makefile b/sdbuild/Makefile
index 7106a1a..0e9cfe0 100644
--- a/sdbuild/Makefile
+++ b/sdbuild/Makefile
@@ -130,7 +130,7 @@ $$(PL_PROJ_$1): $$(BSP_TARGET_$1)
        petalinux-config --oldconfig -p $$(PL_PROJ_$1)
 
 $$(BOOT_ROOT_$1)/BOOT.BIN : $$(BOOT_DEPENDS_$1) $$(BOOT_BITSTREAM_$1) | $$(BOOT_ROOT_$1)
-       cd $$(BOOT_ROOT_$1) && petalinux-package --boot --fpga $$(BITSTREAM_ABS_$1) --u-boot -p $$(PL_PROJ_$1) --force
+       cd $$(BOOT_ROOT_$1) && petalinux-package --boot --fpga $$(BITSTREAM_ABS_$1) --u-boot -p $$(PL_PROJ_$1) --force --bif-attribute init --bif-attribute-value $$(BUILD_ROOT)/regs.init
        cp -f $$(PL_PROJ_$1)/images/linux/BOOT.BIN $$(BOOT_ROOT_$1)
 
 $$(BOOT_ROOT_$1)/image.ub : $$(BUILD_ROOT_$1)/image.its $$(BUILD_ROOT_$1)/system.dtb $$(BUILD_ROOT_$1)/$$(KERNEL_$$(ARCH_$1)) | $$(BOOT_ROOT_$1)

And $(BUILD_ROOT)/regs.init is a plain text file that contains only the one line below.

.set. 0xFF41A040 = 0x3;

The patch above only affects BOOT.bin, so you can try the patched image by overwriting BOOT.bin on an SD card created from the original PYNQ image.

Compiling PYNQ takes a long time, so I uploaded the BOOT.bin files I built for PYNQ 2.4/2.5 on Ultra96v2. The base PYNQ 2.4/2.5 image for Ultra96v2 is available from here.

BOOT.bin for Ultra96v2 PYNQ 2.4(my google drive)

BOOT.bin for Ultra96v2 PYNQ 2.5

@kuroy, check_padded_load uses VTALoadBuffer2D to load an accumulator buffer, but in vta.cc we can see that for VTA_MEM_ID_ACC the corresponding loading function is load_2d instead of load_pad_2d. If you are interested, you can try changing load_2d to load_pad_2d, which should fix the VTALoadBuffer2D issue. Feel free to upstream the related patch.
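To illustrate the difference between the two load paths, here is a simplified NumPy model. It is only a sketch and not the HLS code from vta.cc: the function names model_load_2d and model_load_pad_2d, the tile-per-row layout, and the -1 fill value are all assumptions chosen for the demonstration.

```python
import numpy as np

def model_load_2d(sram, dram, x_size, y_size, x_stride):
    """Plain 2D load: copy y_size rows of x_size tiles, ignoring any padding."""
    sram_idx = dram_idx = 0
    for _ in range(y_size):
        sram[sram_idx:sram_idx + x_size] = dram[dram_idx:dram_idx + x_size]
        sram_idx += x_size
        dram_idx += x_stride

def model_load_pad_2d(sram, dram, x_size, y_size, x_stride,
                      x_pad_before, y_pad_before, x_pad_after, y_pad_after):
    """Padded 2D load: zero the whole padded region, then copy rows at padded offsets."""
    x_width = x_pad_before + x_size + x_pad_after
    total = (y_pad_before + y_size + y_pad_after) * x_width
    sram[:total] = 0
    sram_idx = y_pad_before * x_width
    dram_idx = 0
    for _ in range(y_size):
        sram_idx += x_pad_before
        sram[sram_idx:sram_idx + x_size] = dram[dram_idx:dram_idx + x_size]
        sram_idx += x_size + x_pad_after
        dram_idx += x_stride

# Two input tiles of 16 elements, same shape as the test in this thread.
dram = np.arange(1, 33, dtype=np.int32).reshape(2, 16)
plain = np.full((6, 16), -1, dtype=np.int32)   # -1 marks untouched SRAM rows
padded = np.full((6, 16), -1, dtype=np.int32)

model_load_2d(plain, dram, x_size=1, y_size=2, x_stride=1)
model_load_pad_2d(padded, dram, x_size=1, y_size=2, x_stride=1,
                  x_pad_before=1, y_pad_before=0, x_pad_after=0, y_pad_after=1)

print("plain rows holding data: ", [i for i in range(6) if (plain[i] > 0).any()])
print("padded rows holding data:", [i for i in range(6) if (padded[i] > 0).any()])
```

In this model the plain load puts the two tiles in rows 0 and 1, as in the y_VTA output reported above, while the padded load puts them in rows 1 and 3, as in y_numpy.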