[RFC] New GraphRuntime API to share parameters between modules

The background: we have an RNN language model. Sometimes we only need to feed in the inputs and update the hidden states, and execute the output-layer computation only for the last input. The output layer is usually very large and accounts for most of the time in a full forward operation.

As GraphRuntime does not provide control-flow logic today, we have to split our model into two parts and perform the control flow in C/C++ integration code. Meanwhile, we need to share parameters between the two modules to save memory.
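
For concreteness, the integration code today looks roughly like the sketch below (a sketch only; the module and input/output names "rnn_mod", "out_mod", "data", and "hidden" are made up for illustration):

  #include <vector>
  #include <tvm/runtime/module.h>
  #include <tvm/runtime/ndarray.h>
  #include <tvm/runtime/packed_func.h>

  // Hypothetical integration code: "rnn_mod" runs the recurrent step and
  // "out_mod" runs the large output layer.
  void RunSequence(tvm::runtime::Module rnn_mod, tvm::runtime::Module out_mod,
                   const std::vector<tvm::runtime::NDArray>& inputs) {
    tvm::runtime::PackedFunc rnn_set = rnn_mod.GetFunction("set_input");
    tvm::runtime::PackedFunc rnn_run = rnn_mod.GetFunction("run");
    tvm::runtime::PackedFunc rnn_get = rnn_mod.GetFunction("get_output");
    tvm::runtime::PackedFunc out_set = out_mod.GetFunction("set_input");
    tvm::runtime::PackedFunc out_run = out_mod.GetFunction("run");

    // The control flow lives here: feed every input and carry the hidden
    // state forward...
    for (const auto& x : inputs) {
      rnn_set("data", x);
      rnn_run();
      tvm::runtime::NDArray hidden = rnn_get(0);
      rnn_set("hidden", hidden);
    }
    // ...but run the expensive output layer only once, for the last input.
    tvm::runtime::NDArray last_hidden = rnn_get(0);
    out_set("hidden", last_hidden);
    out_run();
  }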

Proposed Solution:

  1. add “lazy_init_input” to the graph’s attributes:
   "attrs": {
     ... ...
     "lazy_init_input": [
       "list_str",
       [
         "p0"
       ]
     ]
    }
  2. allow empty storage entries in SetupStorage, and defer SetupOpExecs
    if there are uninitialized data entries
  3. add a “set_shared_input” function which takes an NDArray instance, so
    the actual storage can be shared
  4. add a “setup_operators” function to set up the operators once all
    lazy inputs are bound (see the usage sketch below)
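
Putting these together, usage from integration code would look roughly like the sketch below. Only “lazy_init_input”, “set_shared_input”, and “setup_operators” come from this proposal; “gmod” and “shared_param” are placeholders, and "p0" matches the attribute example above:

  #include <tvm/runtime/module.h>
  #include <tvm/runtime/ndarray.h>
  #include <tvm/runtime/packed_func.h>

  // "gmod" is an already-created graph runtime module; "shared_param" is an
  // NDArray whose storage is owned elsewhere (e.g. by another module).
  void BindAndRun(tvm::runtime::Module gmod,
                  tvm::runtime::NDArray shared_param) {
    tvm::runtime::PackedFunc set_shared_input =
        gmod.GetFunction("set_shared_input");
    tvm::runtime::PackedFunc setup_operators =
        gmod.GetFunction("setup_operators");
    tvm::runtime::PackedFunc run = gmod.GetFunction("run");

    // Bind the lazily initialized entry "p0" without copying the storage.
    set_shared_input("p0", shared_param);

    // Operator setup was deferred by SetupStorage; trigger it once all
    // lazy inputs are bound, then run as usual.
    setup_operators();
    run();
  }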

Here is the pull request: https://github.com/dmlc/tvm/pull/3489

Please help to review, thanks!

First thing: “shared_input” is a bit misleading, because if you don’t do SetupStorage at all, you need to set up the internal activations too, right?

We also have a use case internally that is similar to this. In our use case, we want to compile multiple batch versions and let the graph runtime share the storage (allocated to the max batch size). I had a patch that I didn’t get time to upstream, but it’s good to see that there are similar use cases here, so we can work out some common code. Our use case seems to be a superset of this: instead of setting individual inputs, we just set up shared storage with one call (SetupSharedStorage). Does that sound reasonable?
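
Roughly, what I have in mind is something like this (an illustrative signature only, not the actual patch):

  #include <vector>
  #include <tvm/runtime/ndarray.h>

  // Hypothetical one-call variant: hand the runtime a pre-allocated pool
  // (sized for the max batch size) in a single call. Several graph runtime
  // instances, one per batch size, would then share the same pool.
  class GraphRuntime {
   public:
    // Bind every storage entry to the corresponding NDArray in `pool`
    // instead of allocating fresh storage in SetupStorage.
    void SetupSharedStorage(const std::vector<tvm::runtime::NDArray>& pool);
    // ... existing members elided ...
  };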

Our use case is simpler: we just share the data nodes (i.e., "op"==null), so we don’t need to set up internal activations, if I understood correctly. “shared_input” might be misleading; the input does not have to be “shared”, the attribute is only an indication of “delayed initialization”, so how about "lazy_init_input"?

It makes a lot of sense to have a function like “setup_shared_storage”, which can accept a map of input names and their associated NDArray instances. Would this require extending the existing TVMArgValue? I thought it would be fine to provide “set_shared_input” for individual inputs, and leave it to the caller’s code to compose a function that calls it multiple times to set up all shared inputs, just like “set_input”.
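
For example, a map-based helper can be composed on the caller side from the per-input call (a sketch: “set_shared_input” is the function proposed in this RFC, while the helper itself is hypothetical):

  #include <string>
  #include <unordered_map>
  #include <tvm/runtime/module.h>
  #include <tvm/runtime/ndarray.h>
  #include <tvm/runtime/packed_func.h>

  // Caller-side helper: bind a whole map of shared inputs by invoking the
  // proposed per-input "set_shared_input" once per entry.
  void SetSharedInputs(
      tvm::runtime::Module gmod,
      const std::unordered_map<std::string, tvm::runtime::NDArray>& inputs) {
    tvm::runtime::PackedFunc set_shared_input =
        gmod.GetFunction("set_shared_input");
    for (const auto& kv : inputs) {
      set_shared_input(kv.first, kv.second);
    }
  }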

It’s probably better to add an extra bool flag to the “set_input” function, to indicate whether we should copy the incoming NDArray. The GraphRuntime::SetInput function would then be SetInput(int index, NDArray data_in, bool copy=true).
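
In code, the flag would make the copy-versus-alias choice explicit. A sketch of the body (not the actual TVM source; entry_id, input_nodes_, and data_entry_ mirror the existing GraphRuntime members):

  // copy defaults to true in the declaration, preserving today's behavior.
  void GraphRuntime::SetInput(int index, tvm::runtime::NDArray data_in,
                              bool copy) {
    uint32_t eid = this->entry_id(input_nodes_[index], 0);
    if (copy) {
      data_entry_[eid].CopyFrom(data_in);  // existing behavior: deep copy
    } else {
      // Alias the caller's storage. Note that operator arguments already
      // bound in SetupOpExecs would also need to be re-bound for the
      // alias to take effect.
      data_entry_[eid] = data_in;
    }
  }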

Sorry, I misunderstood the original purpose a bit; as I looked through the code, it became clearer. Actually, I recently had a PR with a similar purpose: https://github.com/dmlc/tvm/pull/3416. The goal is not only to use external input storage but also to keep the overhead low, since we have quite a few inputs to set up and the latency constraint is tight.

@yinghai I saw your change was merged. It seems to me that SetInputZeroCopy directly changes the data references of the tensors involved in the operators. Is the originally allocated storage in data_entry_ not released?

During SetupStorage, it seems that for the CPU device, NDArray::Empty not only allocates the virtual memory space but also sets all elements to 0, which causes the physical memory to actually be mapped. So even if we could release the original storage in data_entry_, a significant amount of memory may still be pre-allocated.

To avoid this situation, I think lazy_init_input will still be necessary.

I have updated the RFC proposal and the pull request; could you please help to review?

Thanks!

Yes, it’s not released. In this case you can mix SetInput with SetInputZeroCopy, though that’s hardly a common use case. Essentially, I didn’t use NDArray() instead of NDArray::Empty() just because I didn’t want to complicate the runtime. We can add the option to use NDArray(), though. If you have a pull request for that, I’ll take a look.

@yinghai I think it needs to be done in SetupStorage (using a dummy NDArray instance instead of NDArray::Empty). If I understood correctly, we don’t need to mix SetInput and SetInputZeroCopy; just adding SetSharedInput to take an NDArray input should be fine.
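
Concretely, the SetupStorage change could look roughly like this (a sketch only; IsLazyInit, ShapeOf, DTypeOf, and num_storage_entries_ are hypothetical helpers/members standing in for the runtime's real bookkeeping, while storage_pool_ and ctx_ mirror existing members):

  // Skip eager allocation for entries listed in "lazy_init_input".
  void GraphRuntime::SetupStorage() {
    for (size_t i = 0; i < num_storage_entries_; ++i) {
      if (IsLazyInit(i)) {
        // Dummy NDArray with no backing storage; SetSharedInput fills it
        // in later, after which setup_operators binds the operators.
        storage_pool_.push_back(tvm::runtime::NDArray());
      } else {
        // NDArray::Empty allocates and, per the observation above, zero-
        // fills on CPU, which maps physical pages right away.
        storage_pool_.push_back(
            tvm::runtime::NDArray::Empty(ShapeOf(i), DTypeOf(i), ctx_));
      }
    }
  }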