[RUNTIME] Can TVM call TensorRT as a third-party runtime engine?

Hi, all:
We used to deploy CNN models with TensorRT to achieve extreme performance, but we suffered from having to add plugin layers for the various operations used in detection networks. So I am considering deploying models through a “TVM + TensorRT” solution to gain both high performance and flexibility. Are there any examples of adding TensorRT as TVM’s third-party runtime engine?
Thanks.

1 Like

We don’t have that yet, but people are working on exactly such a feature. You can track the progress in #4482. Maybe AWS already has TensorRT running with TVM.

2 Likes

Exactly. AWS is working on a unified interface for external codegen, and TensorRT falls within its scope. One of our colleagues is working on TensorRT support for TVM using this interface, and we plan to file an RFC after #4482 and its follow-up PRs have been merged.

5 Likes

Hi! What’s the current status on this? Thanks!

TRT integration is now working, but only in the AWS forked repo. @trevor-m is working hard to upstream it. Since the external codegen infra still requires some improvements, it might take some more time.

Hi, how will TensorRT and TVM operate with each other? I’m attempting to compile a PyTorch model to an NVIDIA GPU, and it includes unsupported layer ops (DCNv2). Would TVM + TensorRT fit this use case?

I think so. TRT will only execute the ops it supports, while the other ops will be processed by TVM. This is one motivation for introducing BYOC for 3rd-party libraries.
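For illustration, here is a minimal sketch of that partitioning step, assuming the partition_for_tensorrt helper from TVM’s TensorRT integration (the tiny Relay function and the two-value return are just illustrative, and the exact signature can differ between TVM versions):

import numpy as np
import tvm
from tvm import relay
from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt

# A tiny Relay function just for illustration.
data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
weight = relay.var("weight", shape=(16, 3, 3, 3), dtype="float32")
out = relay.nn.relu(relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1)))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))
params = {"weight": tvm.nd.array(np.random.uniform(-1, 1, (16, 3, 3, 3)).astype("float32"))}

# Ops that TensorRT supports are grouped into external functions handled by
# the TensorRT codegen; anything it cannot handle (e.g. a custom op such as
# DCNv2) stays in the main function and is compiled by TVM itself.
mod, config = partition_for_tensorrt(mod, params)
print(mod)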

Hi, so we can’t use TVM along with TensorRT without neo-ai/tvm?

Thanks.

Yes, you can use TensorRT in the upstream TVM. Here is a tutorial:

https://tvm.apache.org/docs/deploy/tensorrt.html#building-tvm-with-tensorrt-support
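For anyone landing here later, a rough sketch of the compile-and-export flow from that tutorial, assuming TVM was built with the TensorRT codegen enabled (mod and params stand for any Relay module and its parameters; the option names follow the tutorial and may differ by TVM version):

import tvm
from tvm import relay
from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt

# mod, params: a Relay module and its parameters from any frontend,
# e.g. relay.frontend.from_pytorch or relay.frontend.from_onnx.
mod, config = partition_for_tensorrt(mod, params)

# Build for CUDA with the TensorRT options produced by the partitioning
# step, then export the library that is later loaded as 'compiled.so'.
with tvm.transform.PassContext(opt_level=3,
                               config={"relay.ext.tensorrt.options": config}):
    lib = relay.build(mod, target="cuda", params=params)

lib.export_library("compiled.so")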

Hey Cody,

Thanks for the reply. Do we need to perform any other steps apart from the ones mentioned in the tutorial? I am not able to integrate TVM with TensorRT even after following it. Also, is there any way to find out whether the integration is successful? I tried verifying the TensorRT version and whether the runtime is enabled, and both checks passed (I got the TensorRT version back and the runtime was reported as enabled), yet it still doesn’t work.

Thank you.

I don’t recall anything extra. When I was setting up the environment, I just made sure the TensorRT version was compatible with my CUDA version. After the install, I added TensorRT to LD_LIBRARY_PATH. If you’ve done all of this but it still doesn’t work, please provide more details so that others in this forum can offer suggestions.

Also cc @trevor-m
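In the meantime, a quick sanity check you could run from Python (a sketch; these helpers live in tvm.relay.op.contrib.tensorrt in recent TVM versions, so availability depends on which version you built):

from tvm.relay.op.contrib import tensorrt

# True only if TVM was built with the TensorRT runtime enabled and the
# TensorRT shared libraries can be found, e.g. via LD_LIBRARY_PATH.
print("TensorRT runtime enabled:", tensorrt.is_tensorrt_runtime_enabled())

# Returns the TensorRT version TVM is built against (or is targeting)
# as a tuple such as (7, 0, 0).
print("TensorRT version:", tensorrt.get_tensorrt_version())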

Hi @Hema_Sowjanya

I’m happy to help out with any issues you are having using TVM-TRT. Could you share more details of the problems you are facing?

Hi @trevor-m,

I installed the .deb package of TensorRT that is compatible with my CUDA and OS version. Then I made the changes in the configuration file (set the TRT runtime option to ON) and ran the example given in the documentation, but to my surprise one image took 5 seconds. Is there something I am missing here?

Thanks a lot @comaniac and @trevor-m for such wonderful support!!

Hi @Hema_Sowjanya Glad you were able to get it working. Only the first inference is expected to take longer, because TensorRT has to build the inference engine. After that, all subsequent inferences should be fast.

Hi @trevor-m, I ran it a couple of times but the time didn’t change. So is there something I am missing here?

Thank you :)

Were all of the runs in the same process?

I.e.

import time
import numpy as np

# warm up: the first run builds the TensorRT engine, so exclude it from timing
# (mod is the graph runtime module, x is the input array)
for i in range(10):
  mod.run(data=x)

times = []
for i in range(100):
  start = time.time()
  mod.run(data=x)
  times.append(time.time() - start)

print("Mean latency (ms):", 1000.0 * np.mean(times))

What is your model?

Hi @trevor-m, this is my code snippet:

import time
import numpy as np
import tvm
import tvm.contrib.graph_runtime

# input_shape and dtype are defined earlier for the model
ctx = tvm.gpu(0)
loaded_lib = tvm.runtime.load_module('compiled.so')
gen_module = tvm.contrib.graph_runtime.GraphModule(loaded_lib['default'](ctx))
input_data = np.random.uniform(0, 1, input_shape).astype(dtype)
gen_module.set_input("data", input_data)

s1 = time.time()
for i in range(1000):
  gen_module.run()
print(time.time() - s1)
tvm_output1 = gen_module.get_output(0)
This whole code was in one cell. I am using ResNet-50. I ran the cell several times but I didn’t see any change in inference times. Should I split the code into different cells?

Hi @Hema_Sowjanya As I mentioned, your first call to gen_module.run() after creating the graph runtime will take a long time, so you should exclude it from your timing calculations. I’ve modified your script:

# (imports and input_shape/dtype as in the snippet above)
ctx = tvm.gpu(0)
loaded_lib = tvm.runtime.load_module('compiled.so')
gen_module = tvm.contrib.graph_runtime.GraphModule(loaded_lib['default'](ctx))
input_data = np.random.uniform(0, 1, input_shape).astype(dtype)

# Create TensorRT engines
gen_module.run()

s1 = time.time()
for i in range(1000):
  gen_module.set_input("data", input_data)
  gen_module.run()
  tvm_output1 = gen_module.get_output(0)
print(time.time() - s1)

Also, the inference time for ResNet-50 is around 5 ms. Since you did 1000 runs, taking about 5 seconds is expected.
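As a side note, TVM’s built-in time_evaluator is a convenient way to get per-inference numbers once the engine has been built (a sketch based on the snippet above; gen_module and ctx are the same objects):

import numpy as np

# Warm-up run: triggers the one-time TensorRT engine build.
gen_module.run()

# Time the graph runtime's "run" function, 100 repetitions of 1 run each.
ftimer = gen_module.module.time_evaluator("run", ctx, number=1, repeat=100)
prof_res = np.array(ftimer().results) * 1000.0  # per-run times in ms
print("Mean inference time: %.2f ms (std %.2f ms)" % (np.mean(prof_res), np.std(prof_res)))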

4 Likes

Hi @trevor-m, thanks for suggesting the modifications. I tried the code and now it’s working; the inference time is 1 ms.