Which files handle node-level parallelism?

wolverines · April 22, 2019, 3:14am

I am trying to optimize inference on a multi-core CPU. It seems to me that the thread_pool.cc and threading_backend.cc files handle parallelism within one node (operator), e.g. parallelism within a single conv operation. Which files handle node-level parallelism, for example scheduling several different conv kernel operations onto different cores/threads? Thanks!

FrozenGene · April 22, 2019, 5:41am

TVM hasn't implemented this feature yet. NNVM / Relay would need to run a data-flow analysis to find nodes that have no data dependencies on each other; those nodes could then run in parallel on a multi-core CPU. This could improve performance on models that have branches where the branched nodes are very small.
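
To make the data-flow idea concrete, here is a minimal sketch of how independent nodes could be found. It is not TVM code: `IndependentLevels` and the `inputs` adjacency format are hypothetical names chosen for illustration. It groups the nodes of a DAG into levels, where every node in a level depends only on nodes from earlier levels, so the nodes within one level could in principle be dispatched to different cores.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not a TVM API.
// inputs[n] lists the ids of the nodes that node n reads from.
// Returns groups of node ids; nodes in the same group have no data
// dependency on each other (Kahn's algorithm, processed level by level).
std::vector<std::vector<int>> IndependentLevels(
    const std::vector<std::vector<int>>& inputs) {
  const std::size_t n = inputs.size();
  std::vector<int> pending(n, 0);             // unresolved inputs per node
  std::vector<std::vector<int>> consumers(n); // reverse edges
  for (std::size_t node = 0; node < n; ++node) {
    pending[node] = static_cast<int>(inputs[node].size());
    for (int src : inputs[node]) {
      assert(src >= 0 && static_cast<std::size_t>(src) < n);
      consumers[src].push_back(static_cast<int>(node));
    }
  }

  std::vector<int> ready;
  for (std::size_t node = 0; node < n; ++node) {
    if (pending[node] == 0) ready.push_back(static_cast<int>(node));
  }

  std::vector<std::vector<int>> levels;
  while (!ready.empty()) {
    levels.push_back(ready);
    std::vector<int> next;
    for (int node : ready) {
      for (int dst : consumers[node]) {
        // A node becomes ready once all of its inputs have been produced.
        if (--pending[dst] == 0) next.push_back(dst);
      }
    }
    ready = std::move(next);
  }
  return levels;
}
```

For a graph where node 0 feeds nodes 1 and 2, and both feed node 3, the result has three levels, and nodes 1 and 2 share the middle one. Whether running a level in parallel actually helps depends on how much each operator already saturates the cores on its own.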

wolverines · April 22, 2019, 2:23pm

Thank you so much! If we want to implement this feature, where should I start? For example, how can I bind the different independent operations to different threads? I guess I should modify the build() function to do that, am I right? Thanks!

Qiu1981 · August 13, 2019, 10:30am

Hi FrozenGene,
In GoogLeNet, there are parallel branches, as shown in the following figure:

Possible op execution sequences are:
T0->T1->T2->T3->T4->T5->T6->T7
T0->T1->T6->T7->T4->T5->T2->T3
How does Relay/TVM currently determine the op execution sequence?
Thank you

zhiics · August 13, 2019, 6:06pm

The TVM runtime currently traverses the graph in topological order and executes the nodes sequentially.

https://github.com/dmlc/tvm/blob/5f9c5e43020a602427b7995afb9eedf2b695eea8/src/runtime/graph/graph_runtime.cc#L329


        storage_pool_[storage_id].CreateView(attrs_.shape[i], vtype[i]);
    const DLTensor* tmp = data_entry_[i].operator->();
    data_alignment_[i] = details::GetDataAlignment(*tmp);
  }
}


void GraphRuntime::SetupOpExecs() {
  op_execs_.resize(this->GetNumOfNodes());
  op_args_.resize(this->GetNumOfNodes());
  // setup the array and requirements.
  for (uint32_t nid = 0; nid < this->GetNumOfNodes(); ++nid) {
    const auto& inode = nodes_[nid];
    if (inode.op_type == "null") continue;
    std::vector<DLTensor> args;
    std::vector<uint32_t> input_entry_ids;
    for (const auto& e : inode.inputs) {
      uint32_t eid = this->entry_id(e);
      args.push_back(*(data_entry_[eid].operator->()));
      input_entry_ids.push_back(eid);
    }
    for (uint32_t index = 0; index < inode.param.num_outputs; ++index) {

The execution order of the “parallel” nodes is consistent with the order in which they are stored in the JSON file.

https://github.com/dmlc/tvm/blob/master/src/runtime/graph/graph_runtime.h#L344


        } else {
          LOG(FATAL) << "cannot skip graph attr " << key;
        }
        CHECK(!reader->NextArrayItem());
      }
    }
    CHECK_EQ(bitmask, 1|2|4) << "invalid format";
  }
};
// The graph attribute fields.
void Load(dmlc::JSONReader *reader) {
    reader->BeginObject();
    int bitmask = 0;
    std::string key;
    while (reader->NextObjectItem(&key)) {
      if (key == "nodes") {
        reader->Read(&nodes_);
        bitmask |= 1;
      } else if (key == "arg_nodes") {
        reader->Read(&input_nodes_);
        bitmask |= 2;
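
As a rough illustration of what "execute in stored order" means, the sketch below runs a list of nodes one after another in the order they appear in the list, skipping placeholder nodes that have nothing to execute. `SketchNode` and `RunInStoredOrder` are hypothetical names for this example and are not the actual TVM types; the real runtime keeps its callables in `op_execs_` as set up by `SetupOpExecs()` above.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a graph node; not the TVM Node struct.
struct SketchNode {
  std::string name;
  std::function<void()> exec;  // left empty for placeholder nodes such as inputs
};

// Runs every executable node in storage order, one at a time, and
// returns the node names in the order in which they were run.
std::vector<std::string> RunInStoredOrder(const std::vector<SketchNode>& nodes) {
  std::vector<std::string> order;
  for (const auto& node : nodes) {
    if (!node.exec) continue;  // nothing to run for this node
    node.exec();
    order.push_back(node.name);
  }
  return order;
}
```

With this scheme, two branches that could run concurrently are still executed back to back, and which branch goes first is decided only by where its nodes sit in the serialized node list.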

wda · October 25, 2019, 5:30am

Can we do node-level parallelism on GPU? Did you find a way to run node-level parallelism on CPU? @wolverines @FrozenGene