[RFC] New GraphRuntime API to share parameters between modules


#1

The background is that we have an RNN language model. Sometimes we only need to feed in the inputs and update the hidden states, and only execute the output-layer computation for the last input. The output layer is usually very large and is the most time-consuming part of a full forward pass.

As GraphRuntime does not provide control-flow logic yet, we have to split our model into two parts and implement the control flow in C/C++ integration code, and we need to share parameters between the two parts to save memory.

Proposed Solution:

  1. add “shared_input” to the graph’s attributes:

     "attrs": {
       ... ...
       "shared_input": [
         "list_str",
         [
           "p0"
         ]
       ]
      }

  2. allow empty storage entries in SetupStorage, and defer SetupOpExecs if there is an un-initialized data entry (with a warning message)

  3. add a “set_shared_input” function which takes an NDArray instance, so the actual storage can be shared, and then calls SetupOpExecs to set up the operators (see the usage sketch after this list).
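To make the intended usage concrete, here is a minimal sketch of how the C/C++ integration code could share a parameter between the two split modules. Note that “set_shared_input” is only the function proposed by this RFC (it does not exist upstream), and the library path, graph JSON strings, parameter name “p0” and its shape are placeholders:

    #include <string>

    #include <tvm/runtime/module.h>
    #include <tvm/runtime/ndarray.h>
    #include <tvm/runtime/packed_func.h>
    #include <tvm/runtime/registry.h>

    int main() {
      // Compiled operator library and the graph JSON of the two sub-models
      // (placeholders; in our case the recurrent part and the output layer).
      tvm::runtime::Module lib =
          tvm::runtime::Module::LoadFromFile("deploy_lib.so");
      std::string rnn_json = "...";     // graph with "shared_input": ["p0"]
      std::string output_json = "...";  // graph with "shared_input": ["p0"]

      const tvm::runtime::PackedFunc* create =
          tvm::runtime::Registry::Get("tvm.graph_runtime.create");
      // Arguments: graph JSON, module, device_type (kDLCPU = 1), device_id.
      tvm::runtime::Module rnn_mod = (*create)(rnn_json, lib, 1, 0);
      tvm::runtime::Module out_mod = (*create)(output_json, lib, 1, 0);

      // A single NDArray backs the shared parameter "p0" in both modules,
      // so the weight is stored only once.
      tvm::runtime::NDArray p0 = tvm::runtime::NDArray::Empty(
          {1024, 1024}, DLDataType{kDLFloat, 32, 1}, DLContext{kDLCPU, 0});
      rnn_mod.GetFunction("set_shared_input")("p0", p0);
      out_mod.GetFunction("set_shared_input")("p0", p0);
      return 0;
    }

The intent is that neither module allocates its own copy of “p0”; both reference the caller-owned NDArray, and each module only runs SetupOpExecs once its deferred entries have been provided.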

Here is the pull request: https://github.com/dmlc/tvm/pull/3489

Please help review, thanks!


#2

First, “shared_input” is a bit misleading, because if you don’t run SetupStorage at all, you need to set up the internal activations too, right?

We also have a similar use case internally: we want to compile multiple batch-size versions of a model and let the graph runtimes share storage (allocated for the max batch size). I had a patch for this that I didn’t get time to upstream, but it’s good to see similar use cases here so that we can work out some common code. Our use case seems to be a superset of this one: instead of setting individual inputs, we set up the shared storage with one call (SetupSharedStorage). Does that sound reasonable?


#3

Our use case is simpler: we only share the data nodes (i.e., "op"==null), so we don’t need to set up internal activations, if I understood correctly. Agreed that “shared_input” might be misleading; the entry does not have to be “shared”, it is only an indication of “delayed initialization”, so how about "delayed_init_input"?

It makes a lot of sense to have a function like “setup_shared_storage” which can accept a map from input names to their associated NDArray instances, but wouldn’t that require extending the existing TVMArgValue? I think it’s fine to provide “set_shared_input” for individual inputs and leave it to the customer’s code to call it multiple times to set up all shared inputs, just like “set_input”.

It’s probably better to add an extra bool flag to the “set_input” function to indicate whether we should copy the incoming NDArray, so the GraphRuntime::SetInput function would become SetInput(int index, NDArray data_in, bool copy = true); a rough sketch is below.
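A minimal sketch of what that could look like inside GraphRuntime, assuming its existing data_entry_ / input_nodes_ members and entry_id helper; this is only an illustration of the proposed flag, not code from the PR:

    // Sketch only, illustrating the proposed flag; member names follow the
    // current graph_runtime.cc, and `copy` would default to true in the
    // class declaration.
    void GraphRuntime::SetInput(int index, tvm::runtime::NDArray data_in, bool copy) {
      CHECK_LT(static_cast<size_t>(index), input_nodes_.size());
      uint32_t eid = this->entry_id(input_nodes_[index], 0);
      if (copy) {
        // Current behavior: copy into the pre-allocated storage entry.
        data_entry_[eid].CopyFrom(data_in);
      } else {
        // Proposed behavior: alias the caller's NDArray so the underlying
        // storage is shared with (and kept alive by) the caller.
        data_entry_[eid] = data_in;
      }
    }

With copy=false the runtime would not touch the data at SetInput time at all, which also lines up with the low-overhead goal mentioned in the next reply.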


#4

Sorry, I misunderstood the original purpose a bit; it became clearer as I looked through the code. Actually, I recently had a PR with a similar purpose: https://github.com/dmlc/tvm/pull/3416. The goal there is not only to use external input storage but also to keep the overhead low, as we have quite a few inputs to set up and the latency constraint is tight.