TVM & OpenCV Compatibility Issue

This is imported from https://github.com/dmlc/tvm/issues/2924

I have recently been testing GluonCV models with TVM deployment, and I ran into an unexpected performance issue.

Specifically, I am trying to run a CV model on frames from a video, using OpenCV to read each frame.

OpenCV is fast

Here’s a piece of code that reads each frame and swaps the Blue and Red channels:

import time
import cv2

cap = cv2.VideoCapture('demo_video.mp4')
for i in range(30):
    tic = time.time()

    ret, frame = cap.read()
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    time_diff = time.time() - tic
    if i > 10:
        print(int(time_diff*1000))

cap.release()

The output should be around 1~3 ms per cycle, which is as fast as expected.

TVM + OpenCV is slow

Now, let’s add some TVM code to it.

import time
import cv2

import tvm
from tvm.relay.testing.config import ctx_list
from tvm import relay
from tvm.contrib import graph_runtime

net, params = relay.testing.resnet.get_workload(
    num_layers=18, batch_size=1, image_shape=(3, 224, 224))
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(net, 'llvm', params=params)

cap = cv2.VideoCapture('demo_video.mp4')
for i in range(20):
    tic = time.time()

    ret, frame = cap.read()
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    time_diff = time.time() - tic
    if i > 10:
        print(int(time_diff*1000))

cap.release()

Note: The compiled TVM model is not even executed.

The result is about 51 ms per cycle, which is dramatically slower than the first case.

TVM + Numpy is fast

I replaced the cvtColor call with an equivalent numpy implementation:

import time
import cv2

import tvm
from tvm.relay.testing.config import ctx_list
from tvm import relay
from tvm.contrib import graph_runtime

net, params = relay.testing.resnet.get_workload(
    num_layers=18, batch_size=1, image_shape=(3, 224, 224))
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(net, 'llvm', params=params)

cap = cv2.VideoCapture('demo_video.mp4')
for i in range(20):
    tic = time.time()

    ret, frame = cap.read()
    frame[:,:,(2,1,0)] = frame[:,:,(0,1,2)]

    time_diff = time.time() - tic
    if i > 10:
        print(int(time_diff*1000))

cap.release()

And the speed returns to 3~4 ms per cycle. Being at 3~4 ms rather than 2 ms is expected, since cvtColor should be more efficient than numpy indexing. The real problem is that TVM has some implicit negative effect on performance that “boosts” cvtColor from ~2 ms up to ~50 ms.
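
As an aside, the same swap can also be written with a reversed slice; here is a small sketch, with a synthetic frame standing in for the output of cap.read():

import numpy as np

# Synthetic 720p BGR frame standing in for a frame read by OpenCV.
frame = np.zeros((720, 1280, 3), dtype=np.uint8)

# A reversed slice on the channel axis swaps B and R; it returns a view,
# so .copy() is added to materialize a contiguous RGB array.
rgb = frame[:, :, ::-1].copy()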

Environment:

OS: Ubuntu 16.04
cv2: pip install opencv-python (4.0.0.21)
tvm: master at dfe4c466
hardware: AWS C5.18x Instance
data: I believe you can reproduce it with an arbitrary input video.


Could you run perf record -ag on the script in both cases, and post the results of perf report for each? That will show exactly which functions are being called in each case. This is probably some dynamic linking issue.


This seems to be related to the FoldConstant pass, which starts an interpreter.
Importing tvm or relay alone doesn’t cause the performance issue.
Here is a reproducible case:

import time
import cv2

import tvm
from tvm import relay
import tvm.relay.testing

net, params = relay.testing.resnet.get_workload(
    num_layers=18, batch_size=1, image_shape=(3, 224, 224))
net = relay.build_module._bind_params_by_name(net, params)
net = relay.ir_pass.infer_type(net)
net = relay.ir_pass.simplify_inference(net)
net = relay.ir_pass.infer_type(net)

net = relay.ir_pass.fold_constant(net) # OpenCV becomes faster if we comment out this line

cap = cv2.VideoCapture('demo_video.mp4')
for i in range(20):
    tic = time.time()
    ret, frame = cap.read()
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    time_diff = time.time() - tic
    if i > 10:
        print(int(time_diff*1000))

cap.release()

perf report:
TVM+OpenCV https://gist.github.com/vinx13/c2a20162a0f21196d6b8f82a64107ad3
OpenCV https://gist.github.com/vinx13/72b81d74216e0afa651dd603a1fcb1f9


This is interesting. Can you confirm whether it is due to the interpreter or to the model itself? E.g., if we save the compiled model to another file and then load it again to execute it, will the performance be better?
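
For reference, here is a minimal sketch of that save/load experiment, assuming the same resnet-18 workload as the earlier scripts; the file names (deploy_lib.so, deploy_graph.json, deploy_params.bin) are only illustrative:

import tvm
from tvm import relay
from tvm.contrib import graph_runtime
import tvm.relay.testing

# Build once and persist the compiled artifacts.
net, params = relay.testing.resnet.get_workload(
    num_layers=18, batch_size=1, image_shape=(3, 224, 224))
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(net, 'llvm', params=params)

lib.export_library('deploy_lib.so')
with open('deploy_graph.json', 'w') as f:
    f.write(graph)
with open('deploy_params.bin', 'wb') as f:
    f.write(relay.save_param_dict(params))

# In a separate process/script: load the artifacts and execute them without
# ever calling relay.build (and hence without running the FoldConstant
# interpreter in that process).
loaded_lib = tvm.module.load('deploy_lib.so')
with open('deploy_graph.json') as f:
    loaded_graph = f.read()
with open('deploy_params.bin', 'rb') as f:
    loaded_params = bytearray(f.read())

module = graph_runtime.create(loaded_graph, loaded_lib, tvm.cpu(0))
module.load_params(loaded_params)

Timing the cv2 loop in the loading process should then show whether the slowdown follows the build-time interpreter or the loaded model itself.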

The issue might be related to the thread pool.

import time
import cv2

import tvm
from tvm import relay
import tvm.relay.testing

n = 16

# a = tvm.placeholder((n,))
# b = tvm.compute((n,), lambda i: a[i])
# s = tvm.create_schedule(b.op)
# s[b].parallel(b.op.axis[0]) # this will start a thread pool and cause issue
# f = tvm.build(s, [a,b], 'llvm')
# f.export_library('foo.so')
f = tvm.module.load('foo.so')

ctx = tvm.cpu(0)

a_nd = tvm.nd.empty((n,))
b_nd = tvm.nd.empty((n,))
f(a_nd,b_nd)

cap = cv2.VideoCapture('demo_video.mp4')
for i in range(20):
    tic = time.time()
    ret, frame = cap.read()
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    time_diff = time.time() - tic
    if i > 10:
        print(int(time_diff*1000))

cap.release()

In this script, if we comment out either s[b].parallel(b.op.axis[0]) or f(a_nd,b_nd), the performance is normal.

If we are on a single core, does everything work fine even when we are using parallel?

Setting the env var TVM_NUM_THREADS=1 doesn’t help.
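
For reference, a sketch of setting the variable from inside the script as well; the assumption here is that TVM reads TVM_NUM_THREADS when its thread pool is first created, so it is set before importing tvm to be safe:

import os

# Set before importing tvm so the value is visible when the runtime
# thread pool is spawned.
os.environ['TVM_NUM_THREADS'] = '1'

import tvm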

I found a suspicious issue reported in OpenCV: https://github.com/opencv/opencv/issues/6123

Maybe OpenCL is to blame?
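
One way to probe that hypothesis is to turn off OpenCV's OpenCL path (and, separately, its internal thread pool) before the timing loop. This is only a diagnostic sketch, not a confirmed fix:

import cv2

# Disable OpenCV's OpenCL (T-API) backend and check that it took effect.
cv2.ocl.setUseOpenCL(False)
print('OpenCL in use:', cv2.ocl.useOpenCL())

# Optionally also disable OpenCV's own thread pool, to separate a possible
# OpenCL conflict from a thread-pool conflict with TVM's runtime.
cv2.setNumThreads(0)

If the cvtColor timings recover with either setting, that would narrow down which OpenCV subsystem is colliding with TVM.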