Below are some areas where it wasn't entirely clear to me how they'd be handled. They relate mostly to parallelism, but this particular approach to bringing your own codegen seems like it would also affect how parallelism is represented, so I'll ask here; let me know if I should be looking at a different proposal for that.
- Some single ops will need to be executed concurrently and cooperatively across multiple devices; how do we represent that? This is typical for sea-of-nodes hardware and for model parallelism in general.
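To make the question concrete, here is one possible shape for such an annotation: a single logical op carrying a device *group* rather than a single device, so all members must launch it cooperatively. This is purely an illustrative sketch; the `Op`/`device_group` names are invented, not from the proposal.

```python
from dataclasses import dataclass

@dataclass
class Op:
    """One logical op in the graph (hypothetical IR sketch)."""
    name: str
    inputs: list
    # All devices that must cooperatively execute this single op,
    # e.g. the shards of a model-parallel matmul.
    device_group: list

# A single sharded matmul that four accelerator cores run together:
matmul = Op("sharded_matmul", inputs=["lhs", "rhs"],
            device_group=["dev:0", "dev:1", "dev:2", "dev:3"])
```

The open question is whether the proposal's representation has room for a group-valued placement like this, or whether such an op must be split into per-device pieces up front.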
- Just because an op can run on a device doesn't mean it should. As an extreme case, consider a TF graph that got split into several pieces where one piece is just a lone addition, or just a relu: it doesn't make sense to transfer the inputs to the device and retrieve the output back just to do an addition. Some ops might also be worth sending back to the CPU, because the CPU can do them faster, and then back to the device, e.g. sparse lookups on hardware not suited to sparsity. I think this proposal would require cutting the graph into pieces in that case, which has the usual problems that cutting the graph entails. Determining, automatically or manually, which ops should run where to optimize performance and minimize memory usage (not just to do something that works) is going to be a big thing over time, and specifying such a placement should be a good fit for the representation.
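The "lone addition" case above can be made quantitative with a simple transfer-vs-compute cost model. A sketch, with all throughput numbers and names invented for illustration:

```python
def should_offload(op_flops, bytes_in, bytes_out,
                   host_flops_per_s=1e11,   # assumed CPU throughput
                   dev_flops_per_s=1e13,    # assumed accelerator throughput
                   pcie_bytes_per_s=1e10):  # assumed link bandwidth
    """Offload only if the compute saving outweighs the transfer round trip."""
    host_time = op_flops / host_flops_per_s
    dev_time = (op_flops / dev_flops_per_s
                + (bytes_in + bytes_out) / pcie_bytes_per_s)
    return dev_time < host_time

# A lone elementwise add over 1M floats: transfer dominates, keep it on host.
add_on_device = should_offload(op_flops=1e6, bytes_in=8e6, bytes_out=4e6)

# A large matmul: compute dominates, offloading wins despite the transfer.
matmul_on_device = should_offload(op_flops=2e12, bytes_in=2.4e7, bytes_out=8e6)
```

The point is just that placement is a cost decision, not a capability check; whatever the representation is, it should leave room for a pass (or the user) to express decisions of this kind.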
- Two devices may need to communicate with each other, and we don't want to force a large transfer to go through the host. This is again typical of sea-of-nodes hardware and comes up in other contexts, like plain model parallelism. How do we represent sending data between devices? Can two functions refer to each other's nodes?
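One way to frame the question: if each edge in the graph records its producer and consumer device, a lowering step can pick a direct device-to-device copy where the hardware supports it, instead of bouncing through host memory. A minimal sketch with invented names:

```python
# Hypothetical cross-device edges: each records where the tensor is
# produced and where it is consumed.
edges = [
    {"tensor": "activations", "src": "dev:0", "dst": "dev:1"},
    {"tensor": "logits",      "src": "dev:1", "dst": "host"},
]

def lower_transfer(edge):
    """Choose a transfer kind; 'p2p_copy' never touches host memory."""
    if edge["src"].startswith("dev") and edge["dst"].startswith("dev"):
        return "p2p_copy"
    return "host_copy"

kinds = [lower_transfer(e) for e in edges]
```

If instead each device's subgraph is an opaque function with host-mediated inputs and outputs, it's unclear how the direct `dev:0` to `dev:1` edge above would even be expressible.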
- (Following on from the last point.) How do we represent overlapping data transfers with compute in a fine-grained way? This is an important optimization.
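The kind of overlap I mean is the usual double-buffering pipeline: transfer chunk i+1 while computing on chunk i, so transfer time hides behind compute instead of serializing with it. A host-side sketch using threads (the `transfer`/`compute` callables stand in for real DMA and kernel launches):

```python
import queue
import threading

def pipeline(chunks, transfer, compute, depth=2):
    """Overlap transfers with compute via a bounded staging queue."""
    staged = queue.Queue(maxsize=depth)  # at most `depth` chunks in flight

    def producer():
        for c in chunks:
            staged.put(transfer(c))      # stage ("DMA") the next chunk
        staged.put(None)                 # sentinel: no more chunks

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while (c := staged.get()) is not None:
        results.append(compute(c))       # runs while the next transfer stages
    return results

out = pipeline(range(4), transfer=lambda c: c, compute=lambda c: c * c)
```

Representing this kind of fine-grained pipelining presumably needs either explicit async/stream ops in the graph or scheduling freedom for the backend; it's not clear which the proposal intends.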