[VTA] Question about VTA HLS design

Hello!

I have a question about vta.cc, specifically the compute module.

vta.cc / compute

  if (insn.generic.opcode == VTA_OPCODE_FINISH) {
    // Set done flag if we reach a FINISH instruction
    done = 1;
  } else if (insn.generic.opcode == VTA_OPCODE_LOAD) {
    // Initialize indices
    memop_sram_T sram_idx = insn.mem.sram_base;
    memop_dram_T dram_idx = insn.mem.dram_base;
    if (insn.mem.memory_type == VTA_MEM_ID_UOP) {
      // Perform data transfer
      memcpy(&uop_mem[sram_idx],
             (const uop_T*) &uops[dram_idx],
             insn.mem.x_size * sizeof(uop_T));
    } else if (insn.mem.memory_type == VTA_MEM_ID_ACC) {
      // Perform data transfer from DRAM
      load_2d<bus_T, ACC_MAT_AXI_RATIO, VTA_ACC_ELEM_BYTES>(
          biases,
          acc_mem,
          sram_idx,
          dram_idx,
          insn.mem.y_size,
          insn.mem.x_size,
          insn.mem.x_stride);
    }
  } else if (insn.generic.opcode == VTA_OPCODE_GEMM) {
    gemm(raw_copy, uop_mem, acc_mem, inp_mem, wgt_mem, out_mem);
  } else if (insn.generic.opcode == VTA_OPCODE_ALU) {
    alu(raw_copy, uop_mem, acc_mem, inp_mem, wgt_mem, out_mem);
  }

I want to check how many load and compute operations overlap on actual hardware. (I’m using VTA ported to the ZCU104, thanks to the updated version, pynq v0.0.1.)

However, I encountered two problems.

First, it was difficult for me to understand how the C code translates into the module as shown in the picture.

Second, I don’t know how to measure the overlapped cycles between the load module and the compute module.

Thank you for your help in advance.

Hello, jw lee, is this still ongoing?

Hi, I kind of got an answer.

I understood them by comparing the Vivado HLS source with the auto-generated Verilog code.

How can I help you?

Oh, I recently started working on VTA, and I’m currently trying to see the overlapped cycles too, just as you described above. Can you give me some advice?

  1. Where can I find the Verilog code generated from the VTA C code?
  2. Which part should I focus on to see where the cycles overlap? Thanks in advance :slight_smile:

Moreover, I’m trying to find out how VTA works inside the PYNQ board: how the code is turned into Verilog and a bitstream file, and what makes the VDLA work. I suppose you went through this much earlier, so a little help would be a great help for me!!

There are a few things to go through.

I’ll explain from the beginner’s point of view.

Since I’m outside the lab, I’ll give you a guideline in words.

I recommend reading this Makefile.

If you read this file, you will see that it first runs Vivado HLS and then generates the Vivado project.

Therefore, if you run “make” with this Makefile, it will generate the project.

You can find the generated C->Verilog files inside vta.src.

The files are not very human-readable, so it might be hard to find the modules.

I recommend using something like grep -rnI "module load" to find the Verilog file for a given module.

To find the overlapped cycles, you should look for the control states inside the Verilog code of each module (load, store, compute).

Find the control states that correspond to the actual load, store, and compute work.

Once you have found those states, you can AND those conditions and count the overlapped cycles.

Bitstream generation is done by Vivado, so I’m not sure how that part works.
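As a rough illustration of the "AND those conditions" idea, here is a minimal C++ sketch that post-processes a captured trace offline. It assumes you have already exported one sample per clock cycle from a debug capture, with one flag per module that is true while the module is in one of its working control states. The `CycleSample` and `OverlapStats` types and the `count_overlap` function are hypothetical names made up for this example; they are not part of VTA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-cycle sample taken from a debug capture. Each flag is
// true when that module's control state is one of its "working" states.
struct CycleSample {
  bool load_busy;
  bool compute_busy;
};

struct OverlapStats {
  std::size_t load_cycles = 0;
  std::size_t compute_cycles = 0;
  std::size_t overlapped_cycles = 0;
};

// AND the two busy conditions on every cycle and count the matches.
OverlapStats count_overlap(const std::vector<CycleSample>& trace) {
  OverlapStats stats;
  for (const CycleSample& s : trace) {
    if (s.load_busy) ++stats.load_cycles;
    if (s.compute_busy) ++stats.compute_cycles;
    if (s.load_busy && s.compute_busy) ++stats.overlapped_cycles;
  }
  // The overlap can never exceed either module's own busy time.
  assert(stats.overlapped_cycles <= stats.load_cycles);
  assert(stats.overlapped_cycles <= stats.compute_cycles);
  return stats;
}
```

Calling `count_overlap` on a five-cycle trace where load is busy in cycles 0 to 2 and compute is busy in cycles 1 to 3 reports two overlapped cycles. In hardware, the same AND can be wired directly to a counter or a debug probe instead.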

-jwlee

Thank you for the reply.

That advice will be my top priority!! Is it okay if I ask you other questions via this channel?

Yeah, no problem :slight_smile:

-jwlee

I tried to follow the advice you gave me, and now I have several new questions!

  1. I can’t find vta.src. Did you mean vta.cc, or vta/src? Or did I just not find the right one?

  2. I think the control signals are mostly in the fetch module. Is this right?

  3. Did you find the overlapped cycles? How did you run it?

I am a beginner so there might be some fundamental problems in the questions themselves.

Anyways, thank you in advance !!

It’s inside tvm/vta/build/hardware/xilinx/vivado/(sth)/vta.srcs

If it isn’t there, you haven’t built the Vivado project.

I think you should get your hands a bit dirtier.

You should study the Makefile more.

The control states I meant are something different.

For example, load fetches “input” and “weight”, so there will be states that represent each of them.

To understand overlapping, you should study TVM and the VTA paper carefully.

I recommend following the VTA matrix multiply tutorial first to get the concept of overlapping.
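To get a feel for the concept before touching the hardware, a toy timing model can help. The C++ sketch below is only an illustration under simplifying assumptions: it models just two stages (load and compute), assumes two on-chip buffers so that the load of tile i+1 may run while tile i is computed, and ignores the store stage and queue latencies. The function names and the cycle numbers you feed in are hypothetical; this is not how VTA itself schedules work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Serialized schedule: each tile is loaded and then computed before the
// next load starts, so the total is simply the sum of all durations.
long serialized_cycles(const std::vector<long>& load,
                       const std::vector<long>& compute) {
  assert(load.size() == compute.size());
  long total = 0;
  for (std::size_t i = 0; i < load.size(); ++i) total += load[i] + compute[i];
  return total;
}

// Double-buffered schedule (assumption: exactly two buffers).
// Load i waits for load i-1 and for compute i-2 to free its buffer.
// Compute i waits for load i and for compute i-1.
long double_buffered_cycles(const std::vector<long>& load,
                            const std::vector<long>& compute) {
  assert(load.size() == compute.size());
  const std::size_t n = load.size();
  std::vector<long> load_end(n, 0), compute_end(n, 0);
  for (std::size_t i = 0; i < n; ++i) {
    const long prev_load = (i >= 1) ? load_end[i - 1] : 0;
    const long buffer_free = (i >= 2) ? compute_end[i - 2] : 0;
    load_end[i] = std::max(prev_load, buffer_free) + load[i];
    const long prev_compute = (i >= 1) ? compute_end[i - 1] : 0;
    compute_end[i] = std::max(load_end[i], prev_compute) + compute[i];
  }
  return n == 0 ? 0 : compute_end[n - 1];
}
```

With three tiles that each take 10 cycles to load and 10 cycles to compute, the serialized model gives 60 cycles and the double-buffered model gives 40. The difference is the time hidden by overlapping load with compute.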

Thank you, I found the sources file!

And I think you are perfectly right that I should try harder by myself :slight_smile:

One thing about the overlapping: I started looking for it because, in the debug instruction dump, I could see the instruction stream and how many instructions were issued for one project.

But I wanted to see how the flow actually runs inside the PYNQ, because those instructions will be overlapped in ways that do not show up in the debugger, like the picture below.

But anyways, thank you for your kind reply !!

That debug instruction output serializes the instructions, so it won’t give you a clue about how the overlapping works.

I attached a debug core to the Vivado project and saw the overlap between load and compute.

To do that, I ported the design to the ZCU104, because the PYNQ-Z1 board lacks the resources.

If you are not familiar with Vivado, this might take some time.

I started by checking whether my code actually works on the FPGA.

Until yesterday I had not run ‘make’ on the Makefile in tvm/vta/hardware/xilinx, which means I did not have the bit file.

I just realized that if I don’t have the bit file, TVM simply takes the default one from GitHub.

While doing this, a question came up that you may know the answer to.

When I open the Vivado project, the pynq project targets xc7z020clg484-1, but according to the board it should actually be the clg400 package.

I connected the output signals that control each module (load, store, compute, gemm) to 4 LEDs and confirmed that the control signals are asserted, but I had to change the project part to xc7z020clg400 to assign the LED outputs. It works fine, except that the run never finishes.

In other words, the last instruction, FINISH, does not work, and control never returns to the host side. I had to power off the PYNQ board manually to stop it.

The only change I made that could have caused this fault was the project part, so I think that might be the cause.

Do you know why the project is set to a different part from the FPGA on the PYNQ-Z1 board?

In my case, I used xc7z020clg484-1, as defined in the project.

Changing the part to xc7z020clg400 might not be the solution.

I’m not sure about the control signals, since the modules (load, compute, store) decode whatever is in their queues.

In my opinion, you should drive the LEDs at the point where the command queues hand data to the modules.

Do you have an xc7z020clg484-1 FPGA on your PYNQ board?

Mine has an xc7z020clg400.

That’s why I was asking.

Also, thanks for your advice that I should connect the LEDs where the modules take data from the queues.

In short, I just ANDed the AXI control signals coming out of the PS that control the PL. Do you think that might have caused the problem? All the modules (load, store, compute, gemm) did work, and I could see that from the LEDs turning on. It’s just that FINISH did not work.

I’m so happy I have someone to talk to about this :slight_smile: Thank you so much for replying.

It seems like the PYNQ-Z1 comes in two versions regarding SoCs, Zynq 7000 and Zynq 7020.

It will be a matter of how you swapped xc7z020clg484-1 for xc7z020clg400.

How did you apply your core? (e.g., did you build a new project, or replace xc7z020clg484-1 with xc7z020clg400 in the existing one?)

The FINISH command is decoded by the COMPUTE module.

There is a bit field in the instruction that holds the opcode, with values as follows.

#define VTA_OPCODE_LOAD 0
#define VTA_OPCODE_STORE 1
#define VTA_OPCODE_GEMM 2 
#define VTA_OPCODE_FINISH 3
#define VTA_OPCODE_ALU 4

You’d better first check whether the FINISH opcode actually shows up!
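For checking this on a captured instruction word (for example from a debug probe or an instruction dump), a small decoder like the C++ sketch below may be handy. The opcode values are the ones quoted above. The field position is an assumption you should verify against your VTA hardware spec header: the sketch assumes the opcode sits in the lowest 3 bits of the instruction. The `decode_opcode` and `opcode_name` helpers are hypothetical names for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Opcode values as quoted in this thread.
#define VTA_OPCODE_LOAD 0
#define VTA_OPCODE_STORE 1
#define VTA_OPCODE_GEMM 2
#define VTA_OPCODE_FINISH 3
#define VTA_OPCODE_ALU 4

// Assumption: the opcode occupies the lowest 3 bits of the instruction.
// Check this against the hardware spec of the VTA version you are using.
constexpr unsigned kOpcodeBitWidth = 3;

// Extract the opcode from the least significant 64 bits of an instruction.
unsigned decode_opcode(std::uint64_t insn_low_word) {
  const std::uint64_t mask = (std::uint64_t{1} << kOpcodeBitWidth) - 1;
  return static_cast<unsigned>(insn_low_word & mask);
}

// Map an opcode value to a readable name for logging.
std::string opcode_name(unsigned opcode) {
  switch (opcode) {
    case VTA_OPCODE_LOAD: return "LOAD";
    case VTA_OPCODE_STORE: return "STORE";
    case VTA_OPCODE_GEMM: return "GEMM";
    case VTA_OPCODE_FINISH: return "FINISH";
    case VTA_OPCODE_ALU: return "ALU";
    default: return "UNKNOWN";
  }
}
```

If the last instruction the compute module dequeues never decodes to FINISH, the problem is upstream of the compute module rather than in its handling of the done flag.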

-Jake

Hi Jake and KJK,

I am currently looking at tracing the time it takes for data to reach the fetch, load, and compute modules. Can you please suggest the best approach for this?

Hi there.

It’s been quite a while since I worked on it, so I will tell you what I remember.

Since VTA is built with HLS, I looked at how HLS translates the VTA source and generates the Verilog files.

Using that information, I put debug flags on the Verilog files synthesized by HLS.

Since the Pynq-Z1 had such limited resources, I ported the design to the ZCU104 to do so.