Large workloads in models such as mask-rcnn can be very slow with the current argsort and nms operators, as reported in this analysis report. One possible way to improve performance on NVIDIA GPUs is to use an external library such as CUB, a CUDA-specific library. I'm creating this discussion to track the issue. Suggestions are welcome. @kevinthesun
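For anyone following along, here is a minimal pure-Python sketch of what the two operators compute, to make clear why they tend to be discussed together: greedy NMS starts with a descending argsort of the scores, so a slow sort slows down NMS as well. This is only a reference illustration, not the current operator implementation and not CUB code. The function names (`argsort_desc`, `iou`, `nms`), the `(x1, y1, x2, y2)` box layout, and the default threshold are assumptions made for this example.

```python
def argsort_desc(scores):
    """Return indices that order `scores` from highest to lowest (stable)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns the indices of kept boxes."""
    # Step 1: argsort by score. This is the step a GPU sorting library
    # could take over.
    order = argsort_desc(scores)
    keep = []
    # Step 2: keep a box only if it does not overlap a higher-scoring kept box.
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this example the second box overlaps the first with an IoU of about 0.68, so it is suppressed and `kept` holds the indices of the first and third boxes. On a GPU, the sort in step 1 is the kind of work a device-wide sort such as CUB's `cub::DeviceRadixSort` could handle. Whether that helps for a given workload would need to be measured.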