Questions about memory latency hiding on DSP

Q1.
I read about the memory latency hiding method used in VTA. With the InjectVirtualThread and CoProcSync passes, it can generate a stmt like this:

for (ko, 0, 16) {
  // attr [iter_var(vta, , vta)] coproc_scope = 1
  vta.coproc_dep_pop(2, 1)
  produce A_buf {
    VTALoadBuffer2D(tvm_thread_context(VTATLSCommandHandle()), A, ko, 1, 1, 1, 0, 0, 0, 0, 0, 2)
  }
  produce B_buf {
    VTALoadBuffer2D(tvm_thread_context(VTATLSCommandHandle()), B, ko, 1, 16, 16, 0, 0, 0, 0, 0, 1)
  }
  vta.coproc_dep_push(1, 2)
  // attr [iter_var(vta, , vta)] coproc_scope = 2
  vta.coproc_dep_pop(1, 2)
  // attr [iter_var(vta, , vta)] coproc_uop_scope = "VTAPushGEMMOp"
  VTAUopLoopBegin(16, 1, 0, 1)
  VTAUopPush(0, 0, 0, 0, 0, 0, 0, 0)
  VTAUopLoopEnd()
  vta.coproc_dep_push(2, 1)
}
vta.coproc_dep_push(2, 3)
vta.coproc_dep_pop(2, 1)

This works well when both the data load and the computation are handled by coprocessors (e.g. DMA and GemmCore here).

However, on a DSP the main loop and the computation both run on the same core. Latency hiding then fails, because the computation in the current iteration cannot overlap with the data load for the next one.

Is there any pass that can handle this problem on DSP? Or should I implement a new pass to mutate the stmt in a different way?
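
To make the question concrete, here is a rough sketch of the loop shape I imagine such a pass would have to produce: plain double buffering (ping-pong), where the load for iteration k+1 is issued before the compute of iteration k. This assumes the DSP has some asynchronous DMA engine, which is my assumption and not something TVM or VTA guarantees. The helpers dma_start and dma_wait are hypothetical names, and a worker thread only stands in for the DMA hardware.

```python
# Sketch only: dma_start/dma_wait are hypothetical stand-ins for an
# asynchronous DMA API; they are not TVM or VTA functions.
from concurrent.futures import ThreadPoolExecutor


def double_buffered_sum(tiles):
    """Reduce all tiles while prefetching tile k+1 during compute on tile k."""
    events = []             # order in which loads were issued and computes ran
    bufs = [None, None]     # ping-pong buffers ("local memory")
    pending = [None, None]  # in-flight transfer for each buffer

    # A single worker thread plays the role of the DMA engine (assumption).
    with ThreadPoolExecutor(max_workers=1) as dma:

        def dma_start(k):
            events.append(("load_issue", k))
            # Copy tile k; the result lands in buffer k % 2 on dma_wait.
            pending[k % 2] = dma.submit(list, tiles[k])

        def dma_wait(k):
            bufs[k % 2] = pending[k % 2].result()

        total = 0
        if tiles:
            dma_start(0)              # prologue: first load has nothing to hide behind
        for k in range(len(tiles)):
            if k + 1 < len(tiles):
                dma_start(k + 1)      # prefetch into the buffer compute k-1 just released
            dma_wait(k)               # block only if the transfer has not finished
            events.append(("compute", k))
            total += sum(bufs[k % 2])
    return total, events


total, events = double_buffered_sum([[1, 2], [3, 4], [5, 6]])
```

The point is only the ordering: each load is issued one iteration ahead of the compute that consumes it, so the core and the transfer can overlap. Without any asynchronous transfer mechanism, I do not see how a single core could hide the latency at all.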

Q2.
Still on DSP. Based on the discussion of the spatial dependency between tiles in the thread "How does VTA handle tiling's spatial dependency?", it seems that we need a new pass that emits stmt using ring buffers to eliminate duplicate data transfers.
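
To show what I mean by a ring buffer, here is a small sketch for a sliding window over rows. It is a toy model, not TVM code: load_row is a hypothetical callback standing in for the data transfer, and the "compute" is just a sum over the window. Each row is transferred once and overwrites the oldest slot, instead of being re-loaded for every tile that overlaps it.

```python
def window_sums_with_ring(load_row, n_rows, window=3):
    """Sum every run of `window` consecutive rows, loading each row once.

    `load_row` is a hypothetical callback standing in for a DMA transfer.
    """
    ring = [None] * window  # ring buffer holding the most recent rows
    loads = 0               # number of transfers actually performed
    out = []
    for r in range(n_rows):
        ring[r % window] = load_row(r)  # overwrite the oldest slot
        loads += 1
        if r >= window - 1:             # window is full from here on
            out.append(sum(sum(row) for row in ring))
    return out, loads


rows = [[1], [2], [3], [4], [5]]
out, loads = window_sums_with_ring(lambda r: rows[r], len(rows))
```

With 5 rows and a window of 3 this does 5 loads, whereas re-loading each overlapping window from scratch would take 9.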

Any guidance on how to write those passes would be appreciated, since I really have no idea where to start or what factors to take into account.

Thx!!!

Halide’s store_root schedule seems to implement a ring buffer. Please refer to the link below.
http://halide-lang.org/tutorials/tutorial_lesson_08_scheduling_2.html