[RUNTIME] Can TVM call TensorRT as a third-party runtime engine?

Hi, all:
We used to deploy CNN models with TensorRT to achieve extreme performance, but we suffered from having to add plugin layers for the various operations used in detection networks. So I am considering deploying models through a “TVM + TensorRT” solution to gain both high performance and flexibility. Are there any examples of adding TensorRT as TVM’s third-party runtime engine?
Thanks.

1 Like

We don’t have that yet, but people are working on exactly such a feature. You can track the progress in #4482. Maybe AWS already has TensorRT running with TVM.

2 Likes

Exactly. AWS is working on a unified interface for external codegen, and TensorRT falls within its scope. One of our colleagues is working on TensorRT support for TVM using this interface, and we plan to file an RFC after #4482 and its follow-up PRs have been merged.

5 Likes

Hi! What’s the current status on this? Thanks!

TRT integration is now working, but only in the AWS forked repo. @trevor-m is working hard to upstream it. Since the external codegen infra still requires some improvements, it might take some more time.

Hi, how will TensorRT and TVM operate with each other? I’m attempting to compile a PyTorch model to an NVIDIA GPU, and it includes unsupported layer ops (DCNv2). Would TVM + TensorRT fit this use case?

I think so. TRT will only execute the ops it supports, while the other ops will be processed by TVM. This is one motivation for introducing BYOC for 3rd-party libraries.
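For illustration, here is a minimal sketch of that partitioning step, assuming the partition_for_tensorrt helper from TVM’s TensorRT integration (the tiny Relay function and the two-value return are just illustrative, and the exact signature can differ between TVM versions):

import numpy as np
import tvm
from tvm import relay
from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt

# A tiny Relay function just for illustration.
data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
weight = relay.var("weight", shape=(16, 3, 3, 3), dtype="float32")
out = relay.nn.relu(relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1)))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))
params = {"weight": tvm.nd.array(np.random.uniform(-1, 1, (16, 3, 3, 3)).astype("float32"))}

# Ops that TensorRT supports are grouped into external functions handled by
# the TensorRT codegen; anything it cannot handle (e.g. a custom op such as
# DCNv2) stays in the main function and is compiled by TVM itself.
mod, config = partition_for_tensorrt(mod, params)
print(mod)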

Hi, so we can’t use TVM along with TensorRT without neo-ai/tvm?

Thanks.

Yes, you can use TensorRT in the upstream TVM. Here is a tutorial:

https://tvm.apache.org/docs/deploy/tensorrt.html#building-tvm-with-tensorrt-support
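For anyone landing here later, a rough sketch of the compile-and-export flow from that tutorial, assuming TVM was built with the TensorRT codegen enabled (mod and params stand for any Relay module and its parameters; the option names follow the tutorial and may differ by TVM version):

import tvm
from tvm import relay
from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt

# mod, params: a Relay module and its parameters from any frontend,
# e.g. relay.frontend.from_pytorch or relay.frontend.from_onnx.
mod, config = partition_for_tensorrt(mod, params)

# Build for CUDA with the TensorRT options produced by the partitioning
# step, then export the library that is later loaded as 'compiled.so'.
with tvm.transform.PassContext(opt_level=3,
                               config={"relay.ext.tensorrt.options": config}):
    lib = relay.build(mod, target="cuda", params=params)

lib.export_library("compiled.so")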

Hey Cody,

Thanks for the reply. Do we need to perform any other steps apart from the ones mentioned in the tutorial? I am not able to integrate TVM with TensorRT even after following it. Also, is there any way to find out whether the integration is successful? I tried verifying the TensorRT version and whether the runtime is enabled, and both checks passed (I got the TensorRT version back and the runtime was reported as enabled), yet it still doesn’t work.

Thank you.

I don’t recall anything extra. When I was setting up the environment, I just made sure the TensorRT version was compatible with my CUDA version. After the install, I added TensorRT to LD_LIBRARY_PATH. If you’ve done all of this but it still doesn’t work, please provide more details so that others in this forum can offer suggestions.

Also cc @trevor-m
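In the meantime, a quick sanity check you could run from Python (a sketch; these helpers live in tvm.relay.op.contrib.tensorrt in recent TVM versions, so availability depends on which version you built):

from tvm.relay.op.contrib import tensorrt

# True only if TVM was built with the TensorRT runtime enabled and the
# TensorRT shared libraries can be found, e.g. via LD_LIBRARY_PATH.
print("TensorRT runtime enabled:", tensorrt.is_tensorrt_runtime_enabled())

# Returns the TensorRT version TVM is built against (or is targeting)
# as a tuple such as (7, 0, 0).
print("TensorRT version:", tensorrt.get_tensorrt_version())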

Hi @Hema_Sowjanya

I’m happy to help out with any issues you are having using TVM-TRT. Could you share more details of the problems you are facing?

Hi @trevor-m,

I installed the .deb package of TensorRT that is compatible with my CUDA and OS version. Then I made the changes in the configuration file (set the TRT runtime option to ON) and ran the example given in the documentation, but to my surprise one image took 5 seconds. Is there something I am missing here?

Thanks a lot @comaniac and @trevor-m for such wonderful support!!

Hi @Hema_Sowjanya Glad you were able to get it working. Only the first inference is expected to take longer, because TensorRT has to build the inference engine. After that, all subsequent inferences should be fast.

Hi @trevor-m, I ran it a couple of times but the time didn’t change. So is there something I am missing here?

Thank you :)

Were all of the runs in the same process?

I.e.

import time
import numpy as np

# warm up: the first run builds the TensorRT engine, so exclude it from timing
# (mod is the graph runtime module, x is the input array)
for i in range(10):
  mod.run(data=x)

times = []
for i in range(100):
  start = time.time()
  mod.run(data=x)
  times.append(time.time() - start)

print("Mean latency (ms):", 1000.0 * np.mean(times))

What is your model?

Hi @trevor-m, this is my code snippet:

import time
import numpy as np
import tvm
import tvm.contrib.graph_runtime

# input_shape and dtype are defined earlier for the model
ctx = tvm.gpu(0)
loaded_lib = tvm.runtime.load_module('compiled.so')
gen_module = tvm.contrib.graph_runtime.GraphModule(loaded_lib['default'](ctx))
input_data = np.random.uniform(0, 1, input_shape).astype(dtype)
gen_module.set_input("data", input_data)

s1 = time.time()
for i in range(1000):
  gen_module.run()
print(time.time() - s1)
tvm_output1 = gen_module.get_output(0)
This whole code was in one cell. I am using ResNet-50. I ran the cell several times but I didn’t see any change in inference times. Should I split the code into different cells?

Hi @Hema_Sowjanya As I mentioned, your first call to gen_module.run() after creating the graph runtime will take a long time, so you should exclude it from your timing calculations. I’ve modified your script:

# (imports and input_shape/dtype as in the snippet above)
ctx = tvm.gpu(0)
loaded_lib = tvm.runtime.load_module('compiled.so')
gen_module = tvm.contrib.graph_runtime.GraphModule(loaded_lib['default'](ctx))
input_data = np.random.uniform(0, 1, input_shape).astype(dtype)

# Create TensorRT engines
gen_module.run()

s1 = time.time()
for i in range(1000):
  gen_module.set_input("data", input_data)
  gen_module.run()
  tvm_output1 = gen_module.get_output(0)
print(time.time() - s1)

Also, the inference time for ResNet-50 is around 5 ms. Since you did 1000 runs, taking about 5 seconds is expected.
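As a side note, TVM’s built-in time_evaluator is a convenient way to get per-inference numbers once the engine has been built (a sketch based on the snippet above; gen_module and ctx are the same objects):

import numpy as np

# Warm-up run: triggers the one-time TensorRT engine build.
gen_module.run()

# Time the graph runtime's "run" function, 100 repetitions of 1 run each.
ftimer = gen_module.module.time_evaluator("run", ctx, number=1, repeat=100)
prof_res = np.array(ftimer().results) * 1000.0  # per-run times in ms
print("Mean inference time: %.2f ms (std %.2f ms)" % (np.mean(prof_res), np.std(prof_res)))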

4 Likes

Hi @trevor-m, thanks for suggesting the modifications. I tried the code and now it’s working; the inference time is 1 ms.