[VTA&TVM] Questions after investigating resnet.py tutorial

aca88 · February 27, 2019, 1:57pm

Introduction

Hello all. I have been interested in the TVM stack for a while. In particular, I find the VTA backend example to be a unique selling point of TVM.
I might be wrong (correct me if I am), but TVM is the only framework with an example of how to add an accelerator which does not have LLVM or another established compilation flow. So kudos from my side.

I looked into all TVM tutorials, but concentrated on understanding the VTA tutorials.
I thought to myself “I only need to know how to plug in to TVM and then all will be fine”.
The tutorials are REALLY helpful for getting a first understanding, but there are some things that I still don’t understand, no matter how often I run the code.

So I have compiled a list of questions. Some are specific to the VTA and some are more TVM general. But in both cases I have mostly used the resnet.py example to come to these questions.

It is somewhat lengthy and therefore I apologize for the long read.

Questions

I was wondering about the graph which is read here. The format seems odd. NNVM is able to read many graphs from different frontends.
1. How was this .json file generated and why wasn’t one of the supported frontends used?
2. Why does the graph have many “non-standard” DL nodes (all the clips, rshifts, and casts)?
  My intuition tells me that it was necessary to check (assert?) whether what is computed in the CPU+VTA architecture is the same as what is computed in the CPU-only case.
The functions defined in vta/graph.py are, according to my intuition, used to parse the graph, since nnvm cannot do it at this level.
3. clean_cast seems to transform the json nodes into nnvm.symbols. Is this basically the “frontend parser” for this type of json description?
4. _clean_cast: What is this function doing? How is it doing it (code intuition, since no comments are available)? Why was it necessary?
5. clean_conv: What is this function doing? How is it doing it (code intuition, since no comments are available)? Why was it necessary?
6. pack: I understand that the conv2ds need to be in a certain layout for the VTA to process them, so that’s fine, but why does max_pool2d call _pack_batch_channel while global_avg_pool2d calls _unpack_batch_channel? Neither operation should be calculated on the VTA accelerator, so why change their layouts? (i.e. if this is a common optimization, why isn’t it part of the normal nnvm passes?)
After the graph has been parsed and the above functions have been called, the actual nnvm/tvm stack compilation flow is called. Obviously the vta.build_config is used, which extends the tvm lowering passes with those required for communicating with the VTA runtime. In general this is clear to me, but some specifics are still not.
- In nnvm/compiler/build_module.py: build. This function is supposed to be the high level graph optimization. What bothers me is more the way the code looks. The PruneCompute and OpFusion are part of the build function, while AlterOp, SimplifyInference and FoldScaleAxis are handled by the optimize function.
  - These questions are more TVM general.
  7. Why was such a distinction made?
  8. Although optimize states “being target independent”, it uses the with target: statement. Is there any target specific information needed for the optimize function?
  9. Is there no way to add nnvm passes? (the lowering process allows user defined passes, but I don’t see anything similar here)
  10. Does “PrecomputePrune” split the graph into two partitions? (i.e. the intersection is empty)
  11. “GraphFuse” is actually not fusing operators. I get the same number of nodes in the graph before and after, but after “GraphCompile” the nodes are reduced. Is this always so? (I would have expected “GraphFindFusibleGroups” to identify what can be fused and then “GraphFuse” to actually fuse)
- In vta/top/vta_conv2d.py
  - These questions are more VTA specific.
  12. Why was the compute rule for clip defined, since TOPI already has a clip operator?
  13. In schedule_packed_conv the elementwise operations seem to be mapped to ALU opcodes of the VTA accelerator. This seems to work well for the example given, but I could construct a graph with some elementwise operation (right after the conv, like taking the abs) which would not be mappable directly to an ALU opcode.
    How was it guaranteed that only mappable elementwise operations were part of the fused operator? (my intuition is that this was given partially by the designers knowing which graph they need to parse and by some of the functions in vta/graph.py)
  14. Also concerning schedule_packed_conv, when is this function actually called? (in the call stack I can see that “GraphCompile” calls it, but I can’t really pinpoint where exactly in the internal API it is being called)
  - This question is more TVM general.
  15. What are the level parameters in the compute and schedule registration actually used for, and what is a good number?
- In vta/environment.py
  - This question is more TVM general.
  16. What are the statements tvm.register_func("tvm.info.mem.%s" % <some_string>) actually doing? How is this information being used downstream in the compilation stack?
- I think I understand the general concept of the lowering part.
  While the conv2d schedule is constructed, some pragmas are inserted. These are used by the functions inside vta/ir_pass.py, which are therefore the “VTA backend” (I know they only generate VTA runtime directives, but still). These passes are added by the vta/build_module.py functions to the lowering phases of TVM (tvm/build_config.py). Therefore, at the end of “GraphCompile” we have llvm code for those operators which map to the ARM core and VTARuntime calls for those operators which are mapped to the VTA accelerator.
17. Is this accurate? Am I missing something?
In general, I have had trouble understanding some of the execution of the code, because Python calls functions of the precompiled shared library.
18. What is the recommended setup to debug both at the Python API level and the internal C++ API level?
19. I seem to have problems accessing some internal variables of some objects when debugging. Can it be that objects which get internal variables in the C++ code don’t update the __dict__?
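
To make the pack question above concrete, this is how I picture the layout change, as a minimal pure-Python sketch. The helper names packed_index and unpacked_index and the default block sizes (batch 1, channel 16) are my own assumptions for illustration; this is not code taken from vta/graph.py.

```python
def packed_index(n, c, h, w, bfactor=1, cfactor=16):
    """Map an NCHW index to a blocked (N/b, C/c, H, W, b, c) index.

    bfactor and cfactor are illustrative block sizes, not values read
    from the VTA configuration.
    """
    return (n // bfactor, c // cfactor, h, w, n % bfactor, c % cfactor)


def unpacked_index(no, co, h, w, ni, ci, bfactor=1, cfactor=16):
    """Inverse mapping: blocked index back to a plain NCHW index."""
    return (no * bfactor + ni, co * cfactor + ci, h, w)


# Channel 37 lands in channel block 2, lane 5.
idx = packed_index(0, 37, 3, 4)
print(idx)                   # (0, 2, 3, 4, 0, 5)
print(unpacked_index(*idx))  # (0, 37, 3, 4)
```

Under this picture, packing and unpacking are pure index shuffles, which is why I would have expected them to be expressible as an ordinary graph pass.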

aca88 · February 15, 2019, 8:43am

@thierry @Ravenwater do you have any insights that might help me? I would be very grateful

Ravenwater · February 15, 2019, 1:01pm

@aca88 brilliant set of questions. These questions will be valuable to strengthen the documentation, both use and development docs.

Let’s work together through these issues, and document answers and expand examples and tutorials.

aca88 · February 27, 2019, 1:57pm

I seem to have stumbled upon another question.
This time it is regarding fused operations.
Again, in the resnet.py example the nnvm HW independent optimization fuses some of the original nodes into one operation.
More specifically, the operation fusion generates the “fuse_conv2d___rshift_scalar___clip_cast_” operation (which was 5 separate operations before).
I have set a break point in vta_conv2d.py@372 (s = tvm.create_schedule(output.op)).
Now, because this is a fused operation, output.op is actually the output of the cast operator.

My question is, how did we get from fuse_conv2d___rshift_scalar___clip_cast_ to calling the schedule of the conv2d?
In the call stack I see
build_module.py@305 graph = graph.apply("GraphCompile") which actually is a call to a C++ function GraphCompile.
By reading the C++ source I think that the schedule is first generated in GetScheduleArgs@239 Schedule sch = fschedule[idx[master_idx].source->op()]( idx[master_idx].source->attrs, outs, target);

Is this right up until now?
NOTE: if this is right, then my question #14 is answered here

Now, master_idx is actually an input parameter of GetScheduleArgs and was actually calculated in GraphCompile@115

    // Find master idx in the subgraph.
    int sub_master_idx = -1;
    for (uint32_t i = 0; i < subidx.num_nodes(); i++) {
      if (subidx[i].source->op() == idx[master].source->op()) {
        sub_master_idx = i;
        break;
      }
    }

In this case, is the graph the complete resnet graph and the subgraph the fuse_conv2d___rshift_scalar___clip_cast subgraph?
What does it mean to be a “master” node? Is it just the first node of a fused graph, or are there other criteria? (i.e. in this case, why is conv2d the master node?)
How can I force certain nodes to always be “master” nodes? As in, if I had an accelerator that (for some reason) expects pooling layers as “master” nodes, how do I force this?
Is it always “safe” to just generate the schedule of the master node? At a high level I guess so: all fused operations should be linked to some axis of the master node, but I don’t know if this intuition is correct.

aca88 · February 28, 2019, 8:37am

I have found another question; this one is very specific to the VTA:

Following the 2D convolution optimization tutorial, I understand most of what is going on except for one part.
The last output code for the convolution (I have deleted some code for clarity of my question) looks as follows:

produce res {
  for (i2.outer, 0, 2) {//Begin InputYdim_OuterLoop
    produce res_conv {
      for (ic.outer, 0, 16) { //Begin InputChannel_OuterLoop
          VTAUopLoopBegin(8, 98, 0, 9) //Begin OutputChannel_InnerLoop
          VTAUopLoopBegin(7, 14, 16, 0) //Begin InputYdim_InnerLoop
          for (dy, 0, 3) { //Begin Ykernel
            for (dx, 0, 3) {//Begin Xkernel
              for (j, 0, 14) {//Begin InputXdim_Loop
               //This next line are OutputChannel_Tensorize and InputChannel_Tensorize (both 0...15)
                VTAUopPush(0, 0, ((cthread.s*784) + j), ((cthread.s*144) + (((16*dy) + dx) + j)), ((cthread.s*72) + ((3*dy) + dx)), 0, 0, 0) 
              }//End InputXdim_Loop
            }//End Xkernel
          }//End Ykernel
          VTAUopLoopEnd() //End InputYdim_InnerLoop
          VTAUopLoopEnd() //End OutputChannel_InnerLoop
        }//End InputChannel_OuterLoop
       /*here many lines are not shown*/
    }//END produce res_conv
  }//End InputYdim_OuterLoop
}//END produce res

My questions are:

Are the loops commented with the correct semantic loop?
How are the dy, dx and InputXdim loops being mapped to the VTA?
I thought that one would define the VTAUopLoopBegin loops to be the two innermost loops of the computation on the VTA. Here there are 3 more loop levels inside the VTAUopLoops before the VTAUopPush (for GEMM) actually gets called.
- I have a feeling that what is actually going to happen is that the dy,dx&InputXdim loops are going to be unrolled in order to generate source code which consists of 3*3*14=126 VTAUopPush instructions where the indices have been replaced by the required values.
- [EDIT:Answer]: Otherwise, it would actually mean that the OutputChannel_InnerLoop and InputYdim_InnerLoop are actually computed inside the Ykernel,Xkernel&InputXdim_Loop (which is not really what the source code is telling us).
  This answer is actually correct. VTAUopLoops are part of the innermost loop in the processing. This can be seen by starting from the RunGEMM in simdriver.cc where op->iter_out and op->iter_in are innermost (but above the tensorized dimensions) loop bounds defined by VTAUopLoopBegin instructions and incorporated into the GEMM instruction in PushGEMMOp in runtime.cc.
  The reason why the VTAUopLoops instructions are outside the other (3) loops is because the parameters are constant and therefore can be taken out of the loops. This optimization was done by the conv2d schedule designers of the VTA implicitly.
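
My reading of this can be sketched in plain Python. This is a toy model, not the VTA runtime: Uop, Loop, record_kernel and expand are names I made up, the argument order (extent, dst factor, src factor, wgt factor) is my assumption based on the calls in the listing above, and cthread.s is fixed to 0.

```python
from collections import namedtuple

# Hypothetical stand-ins for the runtime's structures (illustrative names).
Uop = namedtuple("Uop", "dst src wgt")
Loop = namedtuple("Loop", "extent dst_factor src_factor wgt_factor")


def record_kernel(cthread=0):
    """Mimic the lowered code: the loop parameters are recorded once,
    then one micro-op is pushed per (dy, dx, j) iteration."""
    loops = [Loop(8, 98, 0, 9), Loop(7, 14, 16, 0)]  # the two VTAUopLoopBegin calls
    uops = []
    for dy in range(3):
        for dx in range(3):
            for j in range(14):
                uops.append(Uop(cthread * 784 + j,
                                cthread * 144 + (16 * dy + dx + j),
                                cthread * 72 + (3 * dy + dx)))
    return loops, uops


def expand(loops, uops):
    """Yield every (dst, src, wgt) index triple, with the recorded
    micro-ops iterated inside the two recorded loops."""
    outer, inner = loops
    for y in range(outer.extent):
        for x in range(inner.extent):
            for u in uops:
                yield (u.dst + y * outer.dst_factor + x * inner.dst_factor,
                       u.src + y * outer.src_factor + x * inner.src_factor,
                       u.wgt + y * outer.wgt_factor + x * inner.wgt_factor)


loops, uops = record_kernel()
accesses = list(expand(loops, uops))
print(len(uops), len(accesses))  # 126 7056
print(accesses[-1])              # (783, 143, 71)
```

In this model the last triple is exactly one below the per-thread strides 784, 144 and 72 that appear in the VTAUopPush arguments, which at least makes the interpretation self-consistent.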

aca88 · March 15, 2019, 12:20pm

I found another question, this is again VTA specific.

I was wondering about the difference in how the VTA runtime functions are injected into the schedule.
- A specific example would be the difference between “VTADepPush” and “VTAUopLoopBegin”
  - VTAUopLoopBegin is defined in the VTARuntime and is injected into the schedule in two ways (both being defined in ir_pass.py)
    - First way begin = tvm.call_extern("int32", "VTAUopLoopBegin", stmt.extent, *gemm_offsets)
    - Second way irb.emit(tvm.call_extern( "int32", "VTAUopLoopBegin", extent, dst_coeff[idx], src_coeff[idx], 0))
  - VTADepPush is defined in the VTARuntime and is exposed to TVM as part of the environment.py module
```
@tvm.register_func("tvm.intrin.rule.default.vta.coproc_dep_push")
def coproc_dep_push(op):
    return tvm.call_extern(
        "int32", "VTADepPush",
        get_env().dev.command_handle,
        op.args[0], op.args[1])
```

So although they all rely on the tvm.call_extern() function, they are printed differently when print(vta.lower()) is called.
Example:

// attr [res_conv] storage_scope = "local.acc_buffer"
// attr [data_buf] storage_scope = "local.inp_buffer"
// attr [kernel_buf] storage_scope = "local.wgt_buffer"
produce res {
  vta.coproc_dep_push(3, 2) //<=======VTADepPush!!!!!!!!
  vta.coproc_dep_push(3, 2)//<=======VTADepPush!!!!!!!!!
  for (i2.outer, 0, 2) {
    produce res_conv {
      for (cthread.s, 0, 2) {
        // attr [iter_var(vta, , vta)] coproc_scope = 2
        vta.coproc_dep_pop(3, 2)
        // attr [iter_var(vta, , vta)] coproc_uop_scope = "VTAPushGEMMOp"
        VTAUopLoopBegin(8, 98, 0, 0)
        VTAUopLoopBegin(7, 14, 0, 0)
/*Rest of output was deleted for conciseness*/

So in one case it is a call directly to the VTA runtime function while on the other hand it is a call to the coproc_dep_push function. What is the difference between the two ways of calling into the VTARuntime?

When and where are the vta.coproc_dep_push() calls replaced by their VTARuntime equivalents?

Thanks

faku · May 16, 2019, 4:15pm

Really interesting set of questions, really hope those will be answered in the future.

aca88 · May 17, 2019, 5:59am

Hey,

Thanks for the compliment. I have actually figured out some of the questions but have not managed to write the answers down here (it’s a mix of not being 100% certain and lack of time).
If you tell me which ones are of most interest to you, I could try and answer them.

thierry · May 17, 2019, 7:30am

@aca88, there are a lot of great questions here. For what it’s worth, here are a few answers:

1 The json was obtained via a manual transformation pass; this example is indeed non-standard. Relay to VTA compilation support should circumvent this problem (currently a WIP PR: https://github.com/dmlc/tvm/pull/3135)

2 These non-standard DL nodes let us map the graph onto VTA’s restricted operator support. It also lets us simplify the graph (e.g. we remove the need for division, or multiplication).
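
As a rough illustration of why shift, clip and cast nodes can stand in for division, here is a minimal pure-Python sketch. The function name requantize and the int8 bounds are assumptions chosen for the example, not values taken from the resnet graph.

```python
def requantize(acc, shift, lo=-128, hi=127):
    """Scale a wide integer accumulator down with a shift, then saturate.

    acc   : accumulator value (e.g. one conv2d output element)
    shift : bits to shift right (divides by 2**shift, rounding down)
    lo/hi : clip bounds; the int8 range here is an illustrative assumption
    """
    shifted = acc >> shift                # rshift: replaces a division
    clipped = max(lo, min(hi, shifted))   # clip: saturate to the narrow range
    return clipped                        # cast: the value now fits the narrow type


print([requantize(a, 8) for a in (1000, 40000, -40000, -1)])
# [3, 127, -128, -1]
```

Only shifts, min and max are needed here, which is the kind of operation a simple integer ALU can provide.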

3-6 these are mostly ad-hoc functions to process the massaged JSON, therefore they will be deprecated once we add Relay to VTA compilation support

7-11 these questions on build_module would be best answered by @tqchen

12 this is because of a pattern detector bug in ir_pass.py, this needs to be fixed

13 if the operator cannot map to the ALU, we don’t fuse, and the fallback is to evaluate on the CPU. Indeed it would be something to be done at the graph level (as a Relay pass for instance)

14 it gets called when the schedule is constructed: s = topi.generic.schedule_conv2d_nchw([res])

15 I believe level parameters are used to override a default compute/schedule definition
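
A toy model of that override behaviour, assuming the registration with the highest level wins. This is not the TVM/NNVM registry API: register, lookup, the default level of 10 and both clip functions are hypothetical names and values used only for illustration.

```python
# Toy registry, not the TVM/NNVM API: every name here is hypothetical.
_registry = {}  # (op_name, attr_name) -> (level, function)


def register(op_name, attr_name, func, level=10):
    """Keep only the registration with the highest level for each key."""
    key = (op_name, attr_name)
    current = _registry.get(key)
    if current is None or level > current[0]:
        _registry[key] = (level, func)


def lookup(op_name, attr_name):
    """Return the function that currently wins for this operator attribute."""
    return _registry[(op_name, attr_name)][1]


def generic_clip(x, lo, hi):
    return max(lo, min(hi, x))


def accelerator_clip(x, lo, hi):
    # Same result, written as a min followed by a max.
    return max(min(x, hi), lo)


register("clip", "compute", generic_clip, level=10)
register("clip", "compute", accelerator_clip, level=15)  # overrides level 10
register("clip", "compute", generic_clip, level=5)       # lower level: ignored

print(lookup("clip", "compute").__name__)  # accelerator_clip
```

Under this assumption, a "good number" is simply anything above the level of the definition you want to replace.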

16 this defines the scope of the scratchpad. see this example for instance: https://github.com/dmlc/tvm/blob/master/tests/python/unittest/test_pass_storage_rewrite.py#L47-L81

17 that’s correct

18, 19 @tqchen @jroesch can comment

20 yes

21 yes

22-24 @tqchen can comment on master nodes in NNVM

25 I think that’s about right.

26 One way to answer your question is to use the debug flag in your build config, which lets you print out the micro-op code. See the recent post: VTA instruction set architecture. Digging in the runtime.cc file was the right thing to do.

27 The difference here is because vta.coproc_dep_push is an intrinsic, and therefore when we lower it, the output displays the intrinsic call. At code-gen, it is replaced by the call to the runtime API. I believe we went with the intrinsic approach because it made it easier to implement the co-processor pass under src/pass/coproc_sync.cc.
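
As a rough mental model of the two paths, here is a toy sketch in plain Python. It is not TVM code: RULES, register_rule, codegen and the "cmd_handle" placeholder are made-up names, and calls are modelled as simple (name, args) tuples.

```python
# Toy model only: none of these names are TVM APIs.
RULES = {}  # intrinsic name -> function that rewrites its arguments


def register_rule(name):
    """Decorator that records a lowering rule for an intrinsic name."""
    def deco(fn):
        RULES[name] = fn
        return fn
    return deco


COMMAND_HANDLE = "cmd_handle"  # placeholder for get_env().dev.command_handle


@register_rule("vta.coproc_dep_push")
def coproc_dep_push(args):
    # The intrinsic becomes an extern runtime call with the handle prepended.
    return ("VTADepPush", (COMMAND_HANDLE,) + tuple(args))


def codegen(call):
    """Rewrite a call if a rule exists; otherwise leave it untouched."""
    name, args = call
    rule = RULES.get(name)
    return rule(args) if rule else call


# What the lowered program contains: one intrinsic and one direct extern call.
lowered = [("vta.coproc_dep_push", (3, 2)), ("VTAUopLoopBegin", (8, 98, 0, 0))]
generated = [codegen(c) for c in lowered]
print(generated[0])  # ('VTADepPush', ('cmd_handle', 3, 2))
print(generated[1])  # ('VTAUopLoopBegin', (8, 98, 0, 0))
```

The direct extern call is already in its final form when the lowered program is printed, while the intrinsic only turns into a runtime call in the later rewrite step.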

superfu · September 25, 2019, 9:30am

Hello, could you tell me what vta.coproc_dep_push() means?
It has confused me for a long time.
Thanks