Does TVM fully support autotuning depthwise_conv2d with dilation? Thanks a lot!
You could use the arm_cpu implementation as a reference for your own: https://github.com/dmlc/tvm/blob/master/topi/python/topi/arm_cpu/depthwise_conv2d.py#L192
Thanks a lot!
TensorFlow decomposes a dilated depthwise conv2d into three steps: 1. SpaceToBatchND, 2. DepthwiseConv2d, 3. BatchToSpaceND. The TensorFlow frontend already implements steps 1 and 3, so as long as OpenCL can run the depthwise_conv2d op, it should be possible to autotune depthwise conv2d with different dilation rates.
Or should I refer to topi/python/topi/cuda/depthwise_conv2d.py?
What you describe is a different matter. Some frameworks fuse SpaceToBatchND + depthwise + BatchToSpaceND into a single depthwise conv (dilation > 1). If your graph contains these three separate ops, the depthwise conv itself shouldn’t be a problem; what you should optimize is SpaceToBatchND / BatchToSpaceND. In my experience, though, a single depthwise conv (dilation > 1) performs better, and some converter tools do this fusion, for example the TF->CoreML converter.
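For anyone following along, the equivalence between the three-op form and a single dilated depthwise conv can be checked with a small NumPy sketch. This is only an illustration under stated assumptions: NHWC layout, VALID padding, stride 1, channel multiplier 1, and a batch ordering chosen for convenience rather than copied from TensorFlow. The helper names (`depthwise_conv2d`, `space_to_batch`, `batch_to_space`, `dilated_via_space_to_batch`) are hypothetical and are not TVM or TensorFlow APIs.

```python
import numpy as np

def depthwise_conv2d(x, w, dilation=1):
    """Direct VALID depthwise conv. x: (N, H, W, C), w: (KH, KW, C)."""
    n, h, wd, c = x.shape
    kh, kw, _ = w.shape
    oh = h - dilation * (kh - 1)
    ow = wd - dilation * (kw - 1)
    out = np.zeros((n, oh, ow, c), dtype=x.dtype)
    # Accumulate one kernel tap at a time; each channel uses its own filter.
    for i in range(kh):
        for j in range(kw):
            hs, ws = i * dilation, j * dilation
            out += x[:, hs:hs + oh, ws:ws + ow, :] * w[i, j, :]
    return out

def space_to_batch(x, d):
    """Move each of the d*d spatial phases into the batch axis (H, W divisible by d)."""
    n, h, w, c = x.shape
    x = x.reshape(n, h // d, d, w // d, d, c)
    x = x.transpose(2, 4, 0, 1, 3, 5)  # (d, d, N, H/d, W/d, C)
    return x.reshape(d * d * n, h // d, w // d, c)

def batch_to_space(y, d):
    """Inverse of space_to_batch above."""
    b, h, w, c = y.shape
    n = b // (d * d)
    y = y.reshape(d, d, n, h, w, c)
    y = y.transpose(2, 3, 0, 4, 1, 5)  # (N, H/d, d, W/d, d, C)
    return y.reshape(n, h * d, w * d, c)

def dilated_via_space_to_batch(x, w, d):
    """Dilated depthwise conv built from the three-op decomposition."""
    n, h, wd, c = x.shape
    kh, kw, _ = w.shape
    # Zero-pad at the end so H and W are multiples of the dilation rate.
    xp = np.pad(x, ((0, 0), (0, (-h) % d), (0, (-wd) % d), (0, 0)))
    y = depthwise_conv2d(space_to_batch(xp, d), w, dilation=1)
    y = batch_to_space(y, d)
    # Crop away outputs that only exist because of the padding.
    return y[:, :h - d * (kh - 1), :wd - d * (kw - 1), :]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 11, 10, 3))
w = rng.standard_normal((3, 3, 3))
same = np.allclose(depthwise_conv2d(x, w, dilation=2),
                   dilated_via_space_to_batch(x, w, 2))
print("three-op form matches dilated conv:", same)
```

The fused form reads the input once, while the three-op form adds two data rearrangements around the conv, which is one plausible reason the single dilated op tends to be faster.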
Thanks a lot! I got it!
There is another problem, related to deployment on Windows. I generated the .log by autotuning on Ubuntu with two threads. But when I used this .log to deploy on Windows, I found that inference used only one thread. In this case, should I autotune again with two threads on Windows in order to double the inference speed?
This is not related to the autotune log. You should dig into why Windows only uses one thread.
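As a possible starting point for that investigation, here is a minimal sketch that checks how many logical cores the process can see and requests a thread count through the `TVM_NUM_THREADS` environment variable, which the TVM runtime reads for its CPU thread pool. The helper name `request_tvm_threads` is hypothetical, and it is an assumption here that the variable has to be set before the runtime first creates its thread pool, so it is set before any TVM module is loaded or run.

```python
import os

def request_tvm_threads(num_threads):
    """Request a CPU thread count for the TVM runtime via TVM_NUM_THREADS.

    Assumption: this runs before the TVM runtime creates its thread pool.
    Returns the number of logical cores visible to this process.
    """
    available = os.cpu_count() or 1
    os.environ["TVM_NUM_THREADS"] = str(num_threads)
    return available

available = request_tvm_threads(2)
print("logical cores visible to Python:", available)
print("TVM_NUM_THREADS =", os.environ["TVM_NUM_THREADS"])
```

If the machine reports at least two cores and inference still runs on one thread, the cause is more likely in how the runtime was built or configured on Windows than in the tuning log.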