NMS operation takes too much time

Hi, I built a network with relay.build. Now I need to append NMS (relay.vision.non_max_suppression) to the network's output to get a new network. However, I find that the NMS operation alone takes about 30 ms, which seems far too long. Is there anything wrong with how I use NMS or how I add the NMS layer? Part of the code is as follows:

relay_mod, relay_params = relay.frontend.from_mxnet(
        mx_sym,
        shape=input_shapes,
        dtype={'data': 'float32'},
        arg_params=arg_params,
        aux_params=aux_params
    )
func = relay_mod["main"]
valid_count = relay.var('valid_count', relay.TensorType((1,), 'int32'),)
out = relay.vision.non_max_suppression(func.body, valid_count, max_output_size=100, top_k=400, iou_threshold=0.45, return_indices=False)
func = relay.Function(func.params, out, None, func.type_params, func.attrs)
target = tvm.target.create('cuda')
with autotvm.apply_history_best(log_file):
    print('Compile with relay ...')
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(
            func,
            # relay_mod,
            target,
            params=relay_params
        )
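
For reference, here is a minimal pure-Python sketch of the greedy NMS I expect to be computed, using the same top_k, iou_threshold and max_output_size values as above. It is not TVM's implementation. The (score, x1, y1, x2, y2) box layout and the helper names iou and greedy_nms are assumptions made only for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(dets, iou_threshold=0.45, top_k=400, max_output_size=100):
    """dets: list of (score, x1, y1, x2, y2) tuples (assumed layout).

    Returns the detections that survive suppression, highest score first.
    """
    # Keep only the top_k highest-scoring candidates.
    candidates = sorted(dets, key=lambda d: d[0], reverse=True)[:top_k]
    kept = []
    for det in candidates:
        if len(kept) >= max_output_size:
            break
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(det[1:], k[1:]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


# Hypothetical example: the second box overlaps the first and is suppressed.
dets = [(0.9, 0, 0, 10, 10), (0.8, 1, 1, 11, 11), (0.7, 20, 20, 30, 30)]
kept = greedy_nms(dets)
```

With top_k=400 and max_output_size=100, this loop performs at most 400 × 100 IoU comparisons per image, which is why 30 ms seems too long to me.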

What is your compilation target?