[Performance Issue] Argsort and NMS slow on large workloads on CUDA cards

Large workloads in models such as Mask R-CNN can be very slow with the current argsort and NMS operators, as reported in this analysis report. One possible way to improve performance on NVIDIA cards is to use an external library such as CUB, a CUDA-specific library. Creating this discussion to track the issue. Suggestions are welcome. @kevinthesun
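
To make the CUB suggestion concrete, here is a minimal sketch of a descending argsort built on `cub::DeviceRadixSort::SortPairsDescending`. The helper names (`iota_kernel`, `argsort_descending`, `argsort_descending_host`) are hypothetical and are not existing operators in this project. The sketch assumes a single flat array of float32 scores; batched inputs would probably need a segmented sort such as `cub::DeviceSegmentedRadixSort` instead, and a real operator would reuse workspace memory rather than allocating on every call.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Fill idx[i] = i so that the sorted "values" become the argsort result.
__global__ void iota_kernel(int* idx, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) idx[i] = i;
}

// Hypothetical helper: writes into d_indices_out the indices that order
// d_scores from highest to lowest. Both pointers are device memory.
cudaError_t argsort_descending(const float* d_scores, int* d_indices_out,
                               int n, cudaStream_t stream = 0) {
  if (n <= 0) return cudaSuccess;
  float* d_keys_out = nullptr;   // sorted scores (discarded here)
  int* d_indices_in = nullptr;   // 0, 1, ..., n-1
  void* d_temp = nullptr;        // CUB scratch space
  size_t temp_bytes = 0;

  cudaError_t err = cudaMalloc(&d_keys_out, n * sizeof(float));
  if (err == cudaSuccess) err = cudaMalloc(&d_indices_in, n * sizeof(int));
  if (err == cudaSuccess) {
    iota_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_indices_in, n);
    // With a null temp pointer, CUB only reports the scratch size it needs.
    err = cub::DeviceRadixSort::SortPairsDescending(
        nullptr, temp_bytes, d_scores, d_keys_out,
        d_indices_in, d_indices_out, n, 0, 32, stream);
  }
  if (err == cudaSuccess) err = cudaMalloc(&d_temp, temp_bytes);
  if (err == cudaSuccess) {
    // Second call performs the actual key-value sort.
    err = cub::DeviceRadixSort::SortPairsDescending(
        d_temp, temp_bytes, d_scores, d_keys_out,
        d_indices_in, d_indices_out, n, 0, 32, stream);
  }
  if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

  cudaFree(d_temp);
  cudaFree(d_indices_in);
  cudaFree(d_keys_out);
  return err;
}

// Convenience wrapper for experiments: copies host scores to the device,
// runs the argsort, and copies the indices back. Error checks are omitted
// here for brevity.
std::vector<int> argsort_descending_host(const std::vector<float>& scores) {
  int n = static_cast<int>(scores.size());
  std::vector<int> indices(n);
  if (n == 0) return indices;
  float* d_scores = nullptr;
  int* d_indices = nullptr;
  cudaMalloc(&d_scores, n * sizeof(float));
  cudaMalloc(&d_indices, n * sizeof(int));
  cudaMemcpy(d_scores, scores.data(), n * sizeof(float),
             cudaMemcpyHostToDevice);
  argsort_descending(d_scores, d_indices, n);
  cudaMemcpy(indices.data(), d_indices, n * sizeof(int),
             cudaMemcpyDeviceToHost);
  cudaFree(d_indices);
  cudaFree(d_scores);
  return indices;
}
```

For example, `argsort_descending_host({0.3f, 0.9f, 0.1f, 0.7f})` should return `{1, 3, 0, 2}`. Since NMS typically starts by ordering boxes by score, a faster argsort along these lines may help both operators, though that would need to be confirmed by benchmarking.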