How to compute/schedule conv2d on a specialized hardware architecture


I’m working on a specialized hardware architecture with TVM and need help with the following case: inside the convolution engine, one conv2d is split along the output channels and computed by 4 cores at the same time.
input data: broadcast to 4 internal, independent buffers (one per core)
weight: split into 4 buffers along the output channels, e.g. with 64 output channels, each core computes 16 output channels
output data: written to 4 internal, independent buffers (one per core)
The final convolution result has to be gathered from the 4 output buffers.
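
To make the partitioning concrete, here is a minimal plain-Python reference model of the behaviour described above. It is not TVM code and not the real hardware interface. It assumes a channel-first layout without a batch dimension, stride 1, and no padding; the names `conv2d`, `split_conv2d`, and the buffer lists are hypothetical and only for illustration.

```python
NUM_CORES = 4  # number of cores in the convolution engine, as described above


def conv2d(data, weight):
    """Naive conv2d (stride 1, no padding).

    data:   nested list shaped [C][H][W]
    weight: nested list shaped [O][C][KH][KW]
    returns nested list shaped [O][H-KH+1][W-KW+1]
    """
    in_c, in_h, in_w = len(data), len(data[0]), len(data[0][0])
    k_h, k_w = len(weight[0][0]), len(weight[0][0][0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    out = []
    for w_oc in weight:  # one filter per output channel
        plane = [[0] * out_w for _ in range(out_h)]
        for y in range(out_h):
            for x in range(out_w):
                acc = 0
                for c in range(in_c):
                    for ky in range(k_h):
                        for kx in range(k_w):
                            acc += data[c][y + ky][x + kx] * w_oc[c][ky][kx]
                plane[y][x] = acc
        out.append(plane)
    return out


def split_conv2d(data, weight, num_cores=NUM_CORES):
    """Model of the engine: broadcast input, split weight, gather output."""
    out_c = len(weight)
    assert out_c % num_cores == 0, "output channels must divide evenly"
    per_core = out_c // num_cores

    # Input: every core gets its own independent copy (broadcast).
    input_bufs = [[[row[:] for row in ch] for ch in data]
                  for _ in range(num_cores)]
    # Weight: each core gets a contiguous slice of the output channels.
    weight_bufs = [weight[i * per_core:(i + 1) * per_core]
                   for i in range(num_cores)]
    # Each core computes into its own output buffer.
    output_bufs = [conv2d(input_bufs[i], weight_bufs[i])
                   for i in range(num_cores)]
    # Gather: concatenate the per-core buffers along the output-channel axis.
    gathered = []
    for buf in output_bufs:
        gathered.extend(buf)
    return gathered


# Small deterministic integer tensors: 3 input channels, 6x6 image,
# 64 output channels, 3x3 kernel (so each core handles 16 output channels).
data = [[[(c * 7 + y * 3 + x) % 5 for x in range(6)]
         for y in range(6)] for c in range(3)]
weight = [[[[(o + c + ky * 2 + kx) % 3 - 1 for kx in range(3)]
            for ky in range(3)] for c in range(3)] for o in range(64)]

full = conv2d(data, weight)            # single-tensor view
gathered = split_conv2d(data, weight)  # 4-core view
```

In this model the gather step is just a concatenation along the output-channel axis, so the 4 output buffers together hold the same values as the single output tensor.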

Normally, one conv2d has one input tensor, one weight tensor, and one output tensor, and each tensor corresponds to one NDArray at runtime. How can I map one NDArray to 4 internal, independent buffers for the 4 cores?
If this is impossible at the tensor level, should I handle this case at the graph level instead?

Any comments would be appreciated.
Thank you.