Deformable conv2d extremely slow

Hi,

I’m currently testing a model that uses deformable convs. I see this op is supported by TVM, but I’m running into terrible performance. I’ve been using the runtime debugger to find out why the model is so slow, and discovered:

  • 95% of model execution time is due to deformable convs
  • 1 op specifically takes 72% of the time and another one 15%

Can I tune deformable convs with autotvm? What else can I do to improve model performance?

P.S. The same model without deformable convs runs in about 3% of the time of the model with them.

Is your target CPU? I think the deformable conv schedule is implemented only for CUDA. If you run it on CPU, you get the default schedule, which is a naive single-threaded for loop.


Yes, it’s CPU. Is there any way to get better performance at the moment?

No, unless you are willing to get your hands dirty :slight_smile: We need a specialization of schedule_deformable_conv2d_nchw for x86 or ARM. Adding multithreading and vectorization is not difficult.

cc @vinx13

Can I take any other op as an example for doing that?

Hmm, I think the schedules for existing ops (especially x86 conv2d) are too complicated to learn from. Maybe you can look at pooling. It’s relatively simple and it does multithreading + vectorization.

And here is the cuda deformable conv2d schedule definition. You need to replace ['cuda', 'gpu'] with "cpu".


If you are new to scheduling, I recommend starting with GEMM optimization tutorial.

You can also take a look at the schedule for x86 conv2d_transpose (https://github.com/apache/incubator-tvm/blob/master/topi/python/topi/x86/conv2d_transpose.py). It doesn’t use explicit data packing like conv2d does, so it is less complicated.
