Deformable conv2d extremely slow

Hi,

I’m currently testing a model that uses deformable convs. I see this op is supported by TVM, but I’m running into terrible performance. I’ve been using the runtime debugger to find out why the model is so slow, and discovered:

  • 95% of model execution time is due to deformable convs
  • 1 op specifically takes 72% of the time and another one 15%

Can I tune deformable convs with autotvm? What else can I do to improve model performance?

P.S. The same model without deformable convs runs in about 3% of the time of the model with them.

Is your target CPU? I think the deformable conv schedule is implemented only for CUDA. If you run it on CPU, you get the default schedule, which is a naive single-threaded for loop.


Yes, it’s CPU. Is there any way to get better performance at the moment?

No, unless you are willing to get your hands dirty :slight_smile: We need a specialization of schedule_deformable_conv2d_nchw for x86 or ARM. Adding multithreading and vectorization is not difficult.

cc @vinx13

Can I take any other op as an example for doing that?

Hmm, I think the schedules for existing ops (especially x86 conv2d) are too complicated to learn from. Maybe you can look at pooling. It’s relatively simple and it does multithreading + vectorization.

And here is the cuda deformable conv2d schedule definition. You need to replace ['cuda', 'gpu'] with "cpu".


If you are new to scheduling, I recommend starting with GEMM optimization tutorial.

You can also take a look at the schedule for x86 conv2d_transpose (https://github.com/apache/incubator-tvm/blob/master/topi/python/topi/x86/conv2d_transpose.py). It doesn’t use explicit data packing like conv2d does, so it is less complicated.
