[VTA][Pynq][Chisel] Creating bitstream from Chisel source instead of HLS for Pynq-Z1

Question: I’m trying to generate a bitstream for Pynq-Z1 from the Chisel source code instead of from HLS, which is what the tutorial uses.

Things I have already done:

Generated VTA.DefaultPynqConfig.v from Chisel source.

For now I’m using the GUI of Vivado 2018.3. I have created an IP for the top-level XilinxShell, making sure that the part number for the Pynq is correct.

For reference, I have also created the bitstream using vivado_hls. After opening that project, I noticed in the block design that the structure of the HLS VTA differs from the Chisel one (e.g., the HLS VTA does not contain VCR or VME).

I’m facing issues trying to interface the Chisel-generated VTA hardware. Any solutions or tips on this would be of great help.

This is an interesting topic.

However, we don’t support end-to-end resnet18 inference yet, and I think that’s the reason we don’t have a script for compiling Chisel VTA for PYNQ.

The good news is that we are actively working towards end-to-end support for resnet18. Once that is done, I think the effort to bring scripts to compile Chisel VTA for PYNQ would be more meaningful.

An easy workaround would be to create a base module and wrap it around the XilinxShell.

@vegaluis @thierry Please correct me if I’m wrong.

That’s correct. There are some issues with the AXI4 master in Chisel that can cause erroneous writes if a write crosses a 4 KB boundary. @vegaluis can comment on this issue so we can keep track of it (perhaps follow up with an issue so it’s publicly documented?)
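For context on the 4 KB issue: the AXI4 specification does not allow a single burst to cross a 4 KB address boundary, so a master has to split any transfer that would. The sketch below illustrates that rule in Python. It is not the Chisel VTA implementation; `crosses_4k` and `split_at_4k` are hypothetical helper names, and the sketch ignores the separate limit on burst length.

```python
BOUNDARY = 4096  # AXI4 bursts must not cross a 4 KB address boundary


def crosses_4k(addr, nbytes):
    """Return True if the byte range [addr, addr + nbytes) spans two 4 KB pages."""
    if nbytes <= 0:
        return False
    return addr // BOUNDARY != (addr + nbytes - 1) // BOUNDARY


def split_at_4k(addr, nbytes):
    """Split a transfer into (addr, nbytes) chunks that each stay within one 4 KB page."""
    chunks = []
    while nbytes > 0:
        room = BOUNDARY - (addr % BOUNDARY)  # bytes left before the next boundary
        size = min(room, nbytes)
        chunks.append((addr, size))
        addr += size
        nbytes -= size
    return chunks


if __name__ == "__main__":
    # A 32-byte write starting 16 bytes below a boundary must become two writes.
    print(crosses_4k(0x0FF0, 32))
    print(split_at_4k(0x0FF0, 32))
```

A check like `crosses_4k` can also be used on the host side to see whether the buffers in a failing test happen to straddle a boundary.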

@liangfu and @thierry, thanks for your input. I have been able to build the Chisel code for the Pynq-Z1, and I have tested it with the existing matrix_multiply.py example after reducing the size of the input matrices. It runs, but there are a couple of issues.

What I have done so far:

  • I had to reduce instQueueEntries in core/Configs.scala because the design did not fit on the Z1. I believe the de10 would not have the same issue, since it is a bigger board.
  • I had to reduce the matrix size from 16x16 to 8x8 (BLOCKIN and BLOCKOUT from 16 to 8). This was mostly due to timing constraints; I don’t know how to customize fpga_clk (that is, increase the clock period from the existing 10 ns).
  • I changed the VTADevice class in pynq_driver.cc to be similar to the one in de10nano_driver.cc, since the Chisel code is being used instead of the existing HLS-generated hardware.
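To put the changes above in numbers, here is a small Python sketch. It assumes the GEMM core uses one multiplier per (BLOCKIN, BLOCKOUT) pair with a batch size of 1, which is an assumption about the configuration rather than something confirmed in this thread. The helper names are hypothetical.

```python
def mac_count(block_in, block_out, batch=1):
    """Multipliers needed for one GEMM step, assuming one per (in, out) pair."""
    return batch * block_in * block_out


def period_ns_to_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


if __name__ == "__main__":
    print(mac_count(16, 16))      # multipliers for the original 16x16 block
    print(mac_count(8, 8))        # multipliers for the reduced 8x8 block
    print(period_ns_to_mhz(10))   # frequency of the existing 10 ns clock
    print(period_ns_to_mhz(20))   # frequency if the period were relaxed to 20 ns
```

Under that assumption, going from 16x16 to 8x8 cuts the multiplier count by a factor of four, and a 10 ns period corresponds to 100 MHz.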

The issue we are facing is that the hardware only occasionally returns a result. When it does return, the test passes. It seems that the hardware gets stuck and does not set the finish flag, but I am not sure where it gets stuck. I would like to know your opinion on this.
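One way to narrow down a hang like this is to poll the finish flag with a timeout instead of waiting indefinitely, so that a stuck run can be detected and the register state inspected. A minimal Python sketch follows. `read_reg`, the register offset, and the finish bit mask are all hypothetical placeholders, not the real VTA register map.

```python
import time


def wait_for_finish(read_reg, offset=0x0, finish_mask=0x2,
                    timeout_s=1.0, poll_s=0.001):
    """Poll a status register until the finish bit is set or the timeout expires.

    read_reg is any callable that takes a register offset and returns an int.
    Returns True if the finish bit was seen, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if read_reg(offset) & finish_mask:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


if __name__ == "__main__":
    # Stand-in for real hardware: the flag appears on the third read.
    reads = {"n": 0}

    def fake_read_reg(offset):
        reads["n"] += 1
        return 0x2 if reads["n"] >= 3 else 0x0

    print(wait_for_finish(fake_read_reg))
```

When the function returns False, dumping the other control registers at that point may show which stage stopped making progress.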

On de10nano, we only need 49% logic utilization for the default VTA design (although all the multipliers have been implemented in logic resources instead of DSP slices). Therefore, I think PYNQ-Z1 should have sufficient resources to deploy Chisel VTA.