SOLVED [external codegen] How does the runtime determine function signatures for generated functions?

Hi, I’m working on an example of translating fused conv + bias add + relu ops (which come from conv + bn + relu after FoldScaleAxis + FoldConstant) to the DNNL backend.

I’ve modified dnnl/codegen.cc to handle fused ops, and currently I can emit code correctly:

extern "C" void dnnl_0_(float* dnnl_input0, float* dnnl_input1, float* bias, float* out) {
  float* buf_0 = (float*)std::malloc(4 * 802816);

  dnnl_fused_conv2d_bias_relu(dnnl_input0, dnnl_input1, bias, buf_0, 1, 3, 224, 224, 16, 1, 1, 1, 3, 3, 1, 1);
  std::memcpy(out, buf_0, 4 * 802816);
  std::free(buf_0);
}

extern "C" int dnnl_0_wrapper_(DLTensor* arg0,
	DLTensor* arg1,
	DLTensor* arg2,
	DLTensor* arg3) {
  dnnl_0_(static_cast<float*>(arg0->data),
  static_cast<float*>(arg1->data),
  static_cast<float*>(arg2->data),
  static_cast<float*>(arg3->data));
  return 0;
}

Here, the “bias” parameter comes from the bias add op, which originally follows the conv op but is now fused into it.

So I’m generating a new signature inside codegen to handle fused ops, but the problem is that the runtime doesn’t know about this new signature. When I try to run the generated function, I get the following error:

TVMError: Check failed: ret == 0 (-1 vs. 0) : [23:36:07] /home/masa/projects/dev/tvm/include/tvm/runtime/packed_func.h:1107:
Check failed: i < num_args (3 vs. 3) : not enough argument passed, 3 passed but request arg[3].
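
If I read this check correctly, it is simply an argument-count mismatch: the runtime packs only the arguments it derived from the partitioned Relay function (2 inputs + 1 output = 3 DLTensors), while my wrapper tries to unpack a 4th one for the bias. Roughly, it is as if the runtime did the following (a hypothetical sketch, not the actual graph runtime code; “dnnl_0” is the ExternalSymbol of the subgraph):

#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>

void CallExternalSubgraph(tvm::runtime::Module mod, DLTensor* data,
                          DLTensor* weight, DLTensor* out) {
  // Look up the external function by its symbol name.
  tvm::runtime::PackedFunc f = mod.GetFunction("dnnl_0");
  // Only 3 values are packed here, so the check in packed_func.h fires
  // as soon as dnnl_0_wrapper_ asks for arg[3].
  f(data, weight, out);
}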

I think I need to modify the runtime code (both the VM and the graph runtime). How should I go about this? Or is there another way to handle fused ops that doesn’t require a runtime change? @zhiics @comaniac

Hmm, if I disable fusion I get the same signature and the same error. Maybe the problem is not fusion itself.

extern "C" void dnnl_0_(float* dnnl_input0, float* dnnl_input1, float* dnnl_input2, float* out) {
  float* buf_0 = (float*)std::malloc(4 * 802816);
  float* buf_1 = (float*)std::malloc(4 * 802816);
  float* buf_2 = (float*)std::malloc(4 * 802816);

  dnnl_conv2d(dnnl_input0, dnnl_input1, buf_0, 1, 3, 224, 224, 16, 1, 1, 1, 3, 3, 1, 1);
  dnnl_add(buf_0, dnnl_input2, buf_1, 1, 16, 224, 224);
  dnnl_relu(buf_1, buf_2, 1, 16, 224, 224);
  std::memcpy(out, buf_2, 4 * 802816);
  std::free(buf_0);
  std::free(buf_1);
  std::free(buf_2);
}

extern "C" int dnnl_0_wrapper_(DLTensor* arg0,
	DLTensor* arg1,
	DLTensor* arg2,
	DLTensor* arg3) {
  dnnl_0_(static_cast<float*>(arg0->data),
  static_cast<float*>(arg1->data),
  static_cast<float*>(arg2->data),
  static_cast<float*>(arg3->data));
  return 0;
}

It seems to me that, to achieve what I’m trying to do, an external codegen like CodegenDNNL should be an ExprMutator rather than an ExprVisitor as in the current implementation (rough sketch after the IR dump below).

Otherwise GraphRuntimeCodegen has no way to know that each subgraph is actually fused. It blindly visits each subgraph and generates signatures as if nothing were fused. So in my case below, it generates two signatures, both of which have 2 inputs and 1 output. What I want here is a function with 3 inputs (data, weight, bias).

def @main(%data: Tensor[(1, 3, 224, 224), float32]) -> Tensor[(1, 16, 224, 224), float32] {
  %2 = fn (%dnnl_input0: Tensor[(1, 3, 224, 224), float32], %dnnl_input1: Tensor[(16, 3, 3, 3), float32], Compiler="dnnl", ExternalSymbol="dnnl_1", Primitive=1) -> Tensor[(1, 16, 224, 224), float32] {
    %0 = nn.conv2d(%dnnl_input0, %dnnl_input1, padding=[1, 1], channels=16, kernel_size=[3, 3]) /* ty=Tensor[(1, 16, 224, 224), float32] */;
    %1 = add(%0, meta[relay.Constant][1] /* ty=Tensor[(16, 1, 1), float32] */ /* ty=Tensor[(16, 1, 1), float32] */) /* ty=Tensor[(1, 16, 224, 224), float32] */;
    nn.relu(%1) /* ty=Tensor[(1, 16, 224, 224), float32] */
  };
  %3 = %2(%data, meta[relay.Constant][0] /* ty=Tensor[(16, 3, 3, 3), float32] */ /* ty=Tensor[(16, 3, 3, 3), float32] */) /* ty=Tensor[(1, 16, 224, 224), float32] */;
  %6 = fn (%dnnl_input2: Tensor[(1, 16, 224, 224), float32], %dnnl_input3: Tensor[(16, 16, 3, 3), float32], Compiler="dnnl", ExternalSymbol="dnnl_0", Primitive=1) -> Tensor[(1, 16, 224, 224), float32] {
    %4 = nn.conv2d(%dnnl_input2, %dnnl_input3, padding=[1, 1], channels=16, kernel_size=[3, 3]) /* ty=Tensor[(1, 16, 224, 224), float32] */;
    %5 = add(%4, meta[relay.Constant][3] /* ty=Tensor[(16, 1, 1), float32] */ /* ty=Tensor[(16, 1, 1), float32] */) /* ty=Tensor[(1, 16, 224, 224), float32] */;
    nn.relu(%5) /* ty=Tensor[(1, 16, 224, 224), float32] */
  };
  %6(%3, meta[relay.Constant][2] /* ty=Tensor[(16, 16, 3, 3), float32] */ /* ty=Tensor[(16, 16, 3, 3), float32] */) /* ty=Tensor[(1, 16, 224, 224), float32] */
}
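
To make the ExprMutator idea concrete, this is roughly the skeleton I have in mind (a hypothetical sketch; the class name and body are placeholders, not the real CodegenDNNL):

#include <tvm/relay/expr_functor.h>

// An ExprMutator-based codegen could rewrite the partitioned function while it
// emits C code, so that the Relay-level signature GraphRuntimeCodegen later
// sees stays in sync with the signature that was actually generated.
class FusedDNNLCodegen : public tvm::relay::ExprMutator {
 public:
  tvm::relay::Expr VisitExpr_(const tvm::relay::CallNode* call) final {
    // Emit DNNL code for this call (as the ExprVisitor version already does),
    // then return a possibly rewritten call, e.g. one that also takes the
    // bias as a proper argument instead of an embedded constant.
    return tvm::relay::ExprMutator::VisitExpr_(call);
  }
};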

Another solution may be to fix the output of partitioning. I don’t know why, but right now the bias parameter is directly embedded as a constant rather than passed as an argument. This creates an inconsistency with the signature CodegenDNNL generates.

Sorry, it was due to another bug in my annotator: I didn’t have compiler_begin on the bias parameter.
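
For reference, the fix was to make sure the annotator wraps every argument of the offloaded region, including the bias constant, in annotation.compiler_begin; otherwise partitioning embeds the bias as a constant instead of lifting it to a parameter. Something along these lines (a hypothetical helper, not my annotator verbatim; the exact constructors and namespaces differ across TVM versions):

#include <tvm/relay/attrs/annotation.h>
#include <tvm/relay/expr.h>
#include <tvm/relay/op.h>

// Wrap a single argument in annotation.compiler_begin for the given compiler.
tvm::relay::Expr CompilerBegin(const tvm::relay::Expr& arg,
                               const std::string& compiler) {
  const tvm::Op& op = tvm::Op::Get("annotation.compiler_begin");
  auto attrs = tvm::runtime::make_object<tvm::relay::CompilerAttrs>();
  attrs->compiler = compiler;
  return tvm::relay::Call(op, {arg}, tvm::Attrs(attrs), {});
}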

I’ll update the PR below to add support for fused ops in the DNNL backend.

I believe we probably don’t want to handle fusion for external codegen. Instead, we should probably let the external codegen tool handle it itself, so that Relay can just view the whole blob as a single super op. We may need to consider how we can enable/disable optimizations for functions annotated as external.

What I was trying to do here was not to ask external codegen to do fusion. Since partitioning and fusion are similar in concept, and partitioning can already do what I expect it to do, I have everything I need. I just wanted Relay to handle partitioned (or fused) subgraphs correctly (which it already can; the issue was due to my bug).

@zhiics After thinking about it a bit more, I think I understand what you are saying. For backends such as TensorRT, which have their own fusion engine in addition to kernel implementations, it makes sense to offload fusion concerns to them completely. In this case the job of Relay external codegen is to pass the subgraph as is, in a format that TensorRT understands, without doing any graph-level optimization.

But I’d imagine there are many use cases where a backend wants to rely on TVM for fusion as well. DNNL is such an example: it has support for fused operations, but it is not a compiler per se and cannot detect fusion opportunities on its own. That’s why the MXNet folks developed the “Subgraph API”, which is quite similar to the external codegen mechanism in TVM.

As demonstrated in my PR, manual annotation + partitioning can already achieve fusion-like graph transformations, so I think we can explore this use case as well.

Does this make sense? @zhiics @comaniac

Sounds good. I’ll review the PR later.
