[VTA][Pynq][Chisel] Creating bitstream from Chisel source instead of HLS for Pynq-Z1

Question: I’m trying to generate a bitstream for the Pynq-Z1 from the Chisel source code instead of the HLS flow described in the tutorial.

Things I have already done:

Generated VTA.DefaultPynqConfig.v from Chisel source.

For now I’m using the GUI of Vivado 2018.3. I have created an IP for the top-level XilinxShell, ensuring that the Pynq part number is correct.

For reference, I have also created the bitstream using vivado_hls. After opening that project, I noticed in the block design that the structure of the HLS VTA differs from the Chisel one (e.g., the HLS VTA does not contain VCR or VME modules).

I’m facing issues trying to interface the Chisel-generated VTA hardware. Any solutions or tips regarding this would be of great help.

This is an interesting topic.

However, we don’t support end-to-end resnet18 inference yet, and I think that’s the reason we don’t have a script for compiling Chisel VTA for PYNQ.

The good news is that we are actively working towards end-to-end support for resnet18. Once that is done, I think the effort of adding scripts to compile Chisel VTA for PYNQ would be more meaningful.

An easy workaround would be to create a base module and wrap it in the XilinxShell.

@vegaluis @thierry Please correct me if I’m wrong.

That’s correct. There are some issues with the AXI4 master in Chisel that can cause erroneous writes when a write burst crosses a 4 KB boundary. @vegaluis can comment on this issue so we can keep track of it (perhaps follow up with a GitHub issue so it’s publicly documented?).
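For context, the AXI4 protocol forbids a single burst from crossing a 4 KiB address boundary, so a master that does not split its bursts at that boundary can issue corrupt writes. The boundary check itself is simple; here is a plain-Scala sketch (the function name is mine, purely for illustration):

```scala
// AXI4 rule: a burst starting at `addr` and spanning `bytes` bytes must not
// cross a 4 KiB (0x1000) boundary. It crosses iff the first and last byte
// fall in different 4 KiB pages.
def crosses4k(addr: Long, bytes: Long): Boolean =
  (addr >> 12) != ((addr + bytes - 1) >> 12)

println(crosses4k(0xFF0L, 32)) // true: 0xFF0..0x100F spans two pages
println(crosses4k(0xF00L, 32)) // false: stays within one page
```

A compliant master must split such a transfer into two bursts at the page edge.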

@liangfu and @thierry, thanks for your input. I have been able to build the Chisel code for the Pynq-Z1 and test it with the existing matrix_multiply.py example code after reducing the size of the input matrices. It runs, but there are a couple of issues.

What I have done so far:

  • I had to reduce instQueueEntries in core/Configs.scala, since the design didn’t fit on the Z1. I believe the de10, being a bigger board, wouldn’t have the same issue.
  • I had to reduce the matrix size from 16x16 to 8x8 (BLOCKIN and BLOCKOUT from 16 to 8). This was mostly due to timing constraints, and I don’t know how to customize fpga_clk (increase the period from the existing 10 ns).
  • Changed the VTADevice class in pynq_driver.cc to be similar to the one in de10nano_driver.cc, since the Chisel code is being used instead of the existing HLS-generated hardware.
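
For concreteness, the knobs above can be sketched as a plain-Scala mirror of the parameters in core/Configs.scala (the field names here are assumptions based on this thread, not the exact VTA source):

```scala
// Hypothetical mirror of the VTA core parameters discussed in this thread.
case class CoreParams(
  blockIn: Int          = 16, // BLOCKIN: GEMM input tile dimension
  blockOut: Int         = 16, // BLOCKOUT: GEMM output tile dimension
  instQueueEntries: Int = 512 // depth of each instruction queue
)

// Reduced values that made the design fit and close timing on the Z1.
val p = CoreParams(blockIn = 8, blockOut = 8, instQueueEntries = 256)

// Instruction-queue storage: 128-bit instructions x entries, replicated
// across the Load, Compute and Store modules.
val instBits = 128L * p.instQueueEntries * 3
println(s"inst queue storage: ${instBits / 8 / 1024} KiB") // prints: inst queue storage: 12 KiB
```

At the default depth of 512 the same arithmetic gives 24 KiB of queue storage, which is why mapping it to BRAM rather than LUTs/FFs matters so much on a small part.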

The issue we are facing is that the hardware only occasionally returns a result. When it does return, the test passes. At this point I am not sure exactly what is happening; it seems that the hardware gets stuck and never sets the finish flag, but I can’t tell where it is getting stuck. I would like to know your opinion on this.

On the de10nano, the default VTA design needs only 49% of logic utilization (although all multipliers are implemented in logic resources instead of DSP slices). Therefore, I think the PYNQ-Z1 should have a sufficient amount of resources to deploy Chisel VTA.

Yes, it is true that the Pynq-Z1 should have enough resources if the de10nano supports the design that comfortably. Could it be that the memory required by the instruction queues (inst_q) in Load, Compute and Store (which is quite large: 128 bits x 512 entries x 3 modules) is being successfully mapped to BRAMs by Quartus but not by Vivado? I ask because reducing the instruction queue size from 512 immediately solves my resource utilization issues.

I think that reducing the size of the instruction queues from 512 should be a reasonable fix for now, since it should not affect performance. Correct synthesis to BRAMs is a separate issue; I’m unsure why Quartus and Vivado infer these memories differently.

That’s because the instruction queues (inst_q) are implemented as Queue modules in the current Chisel VTA, while BRAMs are only reliably inferred when the storage is implemented with a SyncReadMem.
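To illustrate the difference, here is a minimal Chisel sketch (assuming chisel3 on the classpath) of a FIFO whose storage is a SyncReadMem. Note this is only a sketch: because SyncReadMem has a one-cycle read latency, a production-quality version also needs an output register or a bypass path to keep deq.bits aligned with deq.valid, whereas chisel3.util.Queue sidesteps this by using combinational-read storage, which Vivado then tends to map to LUTs/FFs instead of block RAM.

```scala
import chisel3._
import chisel3.util._

// FIFO sketch backed by SyncReadMem, the memory type synthesis tools
// can infer as block RAM (synchronous read, one port read, one write).
class BramQueue(val width: Int = 128, val entries: Int = 512) extends Module {
  val io = IO(new Bundle {
    val enq = Flipped(Decoupled(UInt(width.W)))
    val deq = Decoupled(UInt(width.W))
  })
  val mem   = SyncReadMem(entries, UInt(width.W)) // BRAM-friendly storage
  val wrPtr = RegInit(0.U(log2Ceil(entries).W))
  val rdPtr = RegInit(0.U(log2Ceil(entries).W))
  val count = RegInit(0.U(log2Ceil(entries + 1).W))

  val doEnq = io.enq.ready && io.enq.valid
  val doDeq = io.deq.ready && io.deq.valid
  when(doEnq) { mem.write(wrPtr, io.enq.bits); wrPtr := wrPtr + 1.U }
  when(doDeq) { rdPtr := rdPtr + 1.U }
  count := count + doEnq.asUInt - doDeq.asUInt

  io.enq.ready := count =/= entries.U
  // Synchronous read: data arrives one cycle after the address, which is
  // exactly the behavior that makes BRAM inference possible. A real FIFO
  // must stage/bypass this so deq.bits is valid when deq.valid is high.
  io.deq.bits  := mem.read(rdPtr)
  io.deq.valid := count =/= 0.U
}
```

Swapping the Queue-based inst_q storage for something along these lines should let Vivado infer BRAMs the same way Quartus does.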

Does your implementation of the Chisel-generated bitstream for the DE10-Nano involve storing the queues in BRAMs?

If yes, can you provide the details on how you did it?

If not, how did you manage to reduce the resource utilization to 49%?

Queues in Chisel are synthesized to BRAM by Quartus by default.

After making the suggested changes, timing analysis fails when I synthesize the Chisel code, as the clock is running at 100 MHz. Is there any way to reduce the FPGA clock frequency? If yes, how do I do it in Vivado?
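One possible way, sketched below under the assumption that the design's clock comes from the Zynq PS fabric clock FCLK0 in a block design whose PS instance has the default name processing_system7_0 (property names may differ across Vivado versions), is to lower the PL clock in Tcl:

```tcl
# Lower FCLK0, the PL fabric clock driving the design, from 100 MHz to 50 MHz.
set_property -dict [list CONFIG.PCW_FPGA0_PERIPHERAL_FREQMHZ {50}] \
  [get_bd_cells processing_system7_0]
```

The same setting should also be reachable in the GUI by re-customizing the ZYNQ7 Processing System IP under Clock Configuration, PL Fabric Clocks.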