TVM for edge computing

In the 5G era, network latency is ultra-low. An edge server in the carrier network (e.g. AWS Wavelength, MEC servers, etc.) can be an offload target for inference, just like AI accelerators in the device.

My team is working on a framework to offload inference to external servers, and I wonder if we could implement it elegantly with TVM. I'm thinking of contributing features for that, but I'm not sure I'm on the right track. I'd like to hear comments before working on it.

Offload inference to an edge server

I'm thinking of running an RPC server on the edge server and serving inference requests from edge devices.

                        +-------------+
                        | Edge server |
            Inference   |             |
+------+    offloading  |   +------+  |
| Edge | -----------------> | rpc  |  |
|device| <----------------- |server|  |
+------+     Results    |   +------+  |
                        |             |
                        +-------------+

The device sends a model to the edge server first and then sends inference requests. However, the expected usage of the RPC server is, I think, that it runs on the device. Would this be an abuse of the RPC server?
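
As a concrete starting point, here is a minimal sketch of how the device side could look with the existing RPC client API. The hostname, port, file name, and input name/shape are placeholders, and the model is assumed to have already been compiled for the server's target and exported with `export_library`:

```python
import numpy as np
import tvm
from tvm import rpc
from tvm.contrib import graph_executor

# Connect to the RPC server running on the edge server (placeholder address).
remote = rpc.connect("edge-server.example.com", 9090)

# Push the compiled model to the server and load it there.
remote.upload("model.tar")
rlib = remote.load_module("model.tar")

# Run inference on the remote device and fetch the result back.
dev = remote.cpu(0)
module = graph_executor.GraphModule(rlib["default"](dev))
module.set_input("data", np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))
module.run()
out = module.get_output(0).numpy()
```

This already works with the RPC server as it is today, but the model has to be pushed from the device, which is what the server-side loading idea below would avoid.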

I also think it might be good if the RPC server had an option to load a model on the edge server side, like TensorFlow Serving does.

Offload inference partially

When an input model has branches, offloading some of them to the edge server may reduce the total inference time.

                      compute
                     on server
                      +---+
                  +-->|   |---+
+-----+   +---+   |   +---+   |   +------+
|input|-->|   |---+           +-->|output|
+-----+   +---+   |   +---+   |   +------+
                  +-->|   |---+
                      +---+
                      compute
                     on device

I guess we could support this with the heterogeneous runtime feature. IIUC, the current TVM runtime doesn't allow mixing RPC and normal contexts, so it looks like we would need a change for that.
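
To make the idea concrete, here is a rough sketch of how the existing heterogeneous execution support annotates sub-expressions with devices. The ops, shapes, and devices are placeholders, and the annotation API details differ between TVM versions; the point is that both annotated contexts must currently be local (e.g. the host CPU and a local GPU), whereas partial offloading would need something like an RPC context in their place:

```python
import tvm
from tvm import relay

# A tiny two-branch graph: one branch annotated for a second device,
# the other left on the default device.
x = relay.var("x", shape=(1, 64))
branch_a = relay.annotation.on_device(relay.nn.relu(x), "cuda")  # would be the "edge server"
branch_b = relay.tanh(x)                                         # stays on the device
out = relay.add(branch_a, branch_b)
mod = tvm.IRModule.from_expr(relay.Function([x], out))

# Heterogeneous build: one target per device kind (both local today).
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target={"cpu": "llvm", "cuda": "cuda"})
```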

Turn on offloading dynamically

Whether we need edge offloading depends on the situation: offloading helps when the device is busy, but it hurts when the network is unstable. I guess we could turn offloading on dynamically by adding another input to the model.

                       compute
                      on server
+-----+   +---+ X==1   +---+
|input|-->|   |------->|   |---+
+-----+   |   |        +---+   |   +------+
          |   |                +-->|output|
+-----+   |   | X==0   +---+   |   +------+
|  X  |-->|   |------->|   |---+
+-----+   +---+        +---+
                       compute
                      on device

The control input X determines whether the computation is offloaded to the edge server. I wonder whether it is possible to avoid the network transfer when X==0 in the example above.
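
A rough sketch of what this could look like in Relay, using a boolean control input and `relay.If` (ops and shapes are placeholders; `server_branch` stands for the subgraph intended for offloading):

```python
import tvm
from tvm import relay

# The control input X selects which branch runs.
x = relay.var("x", shape=(1, 64))
X = relay.var("X", shape=(), dtype="bool")

server_branch = relay.nn.relu(x)   # compute intended for the edge server
device_branch = relay.tanh(x)      # compute that stays on the device

out = relay.If(X, server_branch, device_branch)
mod = tvm.IRModule.from_expr(relay.Function([x, X], out))
```

Since Relay control flow runs on the VM executor rather than the graph executor, only the branch that is taken gets evaluated, so if the offloaded subgraph lives entirely inside the X==1 branch, the network transfer should naturally be skipped when X==0.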

@tqchen @zhiics @FrozenGene @eqy

Is it more akin to a hybrid serving framework? I could imagine building a runtime::Module that implements this feature. We could start with some prototypes and then see how things go. Also cc @haichen, who has experience building serving systems.

I'm not exactly sure what a hybrid serving framework is, but yes, you're probably right: the framework we are working on serves inference requests on a so-called hybrid (mobile-edge-cloud) architecture. I'll propose some code for further discussion.

@haichen, let us know if you have something you can share here :slight_smile:

The scenario definitely makes sense. But one thing worth discussing is whether this functionality should be implemented fully in TVM, or whether we should instead build a system that uses TVM for local computation only. I feel the latter would be more practical, and then the question becomes how the TVM runtime module can better integrate into other systems.