Using TVM-DarkNet to detect videos: after relay.build_module.build, cv2 ops cost much more time

I’m new to TVM and am trying to use TVM-DarkNet to detect videos.

With the help of the tutorials, I successfully tuned yolov3-tiny models. The inference time reported by module.module.time_evaluator is 1.26 ms on a P40 and 0.81 ms on a V100.
The code is here: https://github.com/irvingzhang0512/tvm_tests/blob/master/darknet_tune.py
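
For context, here is a minimal sketch of that measurement, assuming module is the graph module built from the tuned library as in the linked script; the input name "data" and the 1x3x416x416 shape follow the darknet tutorial and are assumptions here.

import numpy as np
import tvm

dev = tvm.cuda(0)  # tvm.gpu(0) on the older TVM versions this thread used
data = np.random.uniform(size=(1, 3, 416, 416)).astype("float32")
module.set_input("data", data)

# time_evaluator invokes the compiled "run" function repeatedly and reports
# only the on-device execution time, excluding set_input/get_output
ftimer = module.module.time_evaluator("run", dev, number=100, repeat=3)
latencies_ms = np.array(ftimer().results) * 1000.0
print("mean %.2f ms, std %.2f ms" % (latencies_ms.mean(), latencies_ms.std()))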

Then I tried to detect videos with the tuned models and calculate the FPS.
The code can be found here: https://github.com/irvingzhang0512/tvm_tests/blob/master/darknet_evaluate.py

The above code was tested on two servers.

  • Server 1 (P40): 50 FPS, about 20 ms per frame in total (image preprocessing 6-7 ms + tvm m.set_input 2-3 ms + tvm m.run/m.get_output 2.5 ms + TVM API NMS 0.2 ms + cv2 video frame read 8-9 ms)
  • Server 2 (V100): 22 FPS, about 43 ms per frame in total (image preprocessing 32-33 ms + tvm m.set_input 2-3 ms + tvm m.run/m.get_output 1-2 ms + TVM API NMS 0.2 ms + cv2 video frame read 5-6 ms)
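
For reference, a sketch of the per-frame loop whose stages are timed above, assuming m is the graph module and the video is opened with cv2; helper names such as timed and preprocess are illustrative, not taken from the linked script (NMS and get_output are omitted for brevity):

import time
import cv2
import numpy as np

def timed(fn, *args):
    # return the function result and its wall-clock time in ms
    start = time.time()
    result = fn(*args)
    return result, (time.time() - start) * 1000.0

def preprocess(frame):
    # BGR -> RGB, resize to the network input, NCHW float32 in [0, 1]
    img = cv2.resize(frame, (416, 416))
    img = img[:, :, ::-1].transpose(2, 0, 1).astype("float32") / 255.0
    return np.expand_dims(img, 0)

cap = cv2.VideoCapture("test.mp4")
while True:
    (ok, frame), t_read = timed(cap.read)
    if not ok:
        break
    data, t_pre = timed(preprocess, frame)
    _, t_set = timed(m.set_input, "data", data)
    _, t_run = timed(m.run)
    print("read %.1f ms  preprocess %.1f ms  set_input %.1f ms  run %.1f ms"
          % (t_read, t_pre, t_set, t_run))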

Clearly, image preprocessing (cv2.resize, numpy ops) and reading video frames (cv2.read()) cost too much time, so I tried evaluating cv2 alone. (Code can be found here: https://github.com/irvingzhang0512/tvm_tests/blob/master/cv2_only_evaluate.py)

  • Server 1: image preprocessing costs about 3 ms, reading a video frame about 2 ms.
  • Server 2: image preprocessing costs about 3 ms, reading a video frame about 2-3 ms.

Finally, I found that if relay.build_module.build is commented out, image preprocessing and reading video frames take less time. (Code can be found here: https://github.com/irvingzhang0512/tvm_tests/blob/master/darknet_comment_test.py)

  • Server 1: image preprocessing 6-7 ms / video frame read 6-7 ms (with the build) vs. image preprocessing 3-4 ms / video frame read 2 ms (build commented out)
  • Server 2: image preprocessing 32 ms / video frame read 5-6 ms (with the build) vs. image preprocessing 2-3 ms / video frame read 2-3 ms (build commented out)
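
A sketch of that comparison, assuming mod, params, and target come from the darknet frontend as in the linked scripts; flipping BUILD_TVM reproduces the "commented out" baseline (relay.build_config was the pre-0.7 API; newer TVM uses tvm.transform.PassContext):

import time
import cv2
from tvm import relay

BUILD_TVM = True  # set to False for the "build commented out" case
if BUILD_TVM:
    # building the module (including its constant folding) starts TVM's
    # thread pool, which pins threads to cores
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build_module.build(mod, target, params=params)

cap = cv2.VideoCapture("test.mp4")
cap.read()  # warm up

start = time.time()
ok, frame = cap.read()
print("cv2 read:   %.1f ms" % ((time.time() - start) * 1000.0))

start = time.time()
cv2.resize(frame, (416, 416))
print("cv2.resize: %.1f ms" % ((time.time() - start) * 1000.0))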

My questions are:

  1. Why do cv2.resize and cv2.read cost much more time after relay.build_module.build? What should I do about it?
  2. TVM m.set_input takes about 2 ms, which is more than the module.module.time_evaluator inference time. Is that normal?
  3. Are there better ways to detect videos with the TVM Python API (other than calling m.set_input for every frame)?

This might be related to the TVM & OpenCV Compatibility Issue thread.


So there is no solution to this issue for now. Maybe I should use Pillow instead of OpenCV. Thanks.

It would be good if someone could investigate whether this is an issue with TVM.

Instead of cv2, I used imageio to read videos and PIL.Image to resize images, and I still get the same problem.

With relay.build_module.build removed: reading a video frame costs 2 ms and resizing an image costs 2 ms. With relay.build_module.build kept: reading a video frame costs 7 ms and resizing an image costs 1.5 ms.

Maybe you could try one quick workaround: rebuild TVM with set(USE_OPENMP gnu) in config.cmake.

It works. :smile: Thanks!

You can also try setting TVM_BIND_THREADS, as mentioned in Auto-tuning too slow; they might be the same issue.

:smile: It works too.

Can you explain how you set TVM_BIND_THREADS so that we can look into the issue?

Server 1: I haven’t rebuilt TVM. Before running python darknet_evaluate.py, I set the environment variables export TVM_NUM_THREADS=8 and export TVM_BIND_THREADS=8, and I now get 80-90 FPS.

Server 2: I rebuilt TVM with set(USE_OPENMP gnu), and I now get 90-100 FPS.
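
For the environment-variable route, the same thing can be done from Python, assuming the variables are read when the TVM runtime starts its thread pool, so they must be set before the first tvm import:

import os
os.environ["TVM_NUM_THREADS"] = "8"
os.environ["TVM_BIND_THREADS"] = "8"

import tvm  # imported after the environment is configured, on purpose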


I think the reason is exclude_worker0. We set it to true by default, which binds the master thread to core 0. When the OpenCV thread comes in, I can see core 0 at 100% for a very long time.

So I think the better way is to set exclude_worker0 to false by default.

  // if excluding worker 0 and using master to run task 0
#ifndef _LIBCPP_SGX_CONFIG
  bool exclude_worker0_{true};
#else
  bool exclude_worker0_{false};
#endif

Change it to:

bool exclude_worker0_{false};

cc @yidawang

We set exclude_worker0_ to true by default because in most cases the master does not have much to do. In your case, where there is heavy work on the master, you can manually set it to false. BTW, I think we should have a way to specify this value via an environment variable.

1 Like

I have used VTune to analyze the difference between binding task 0 to the master thread and not doing so.

  1. Binding task 0 to the master: thread transitions occupy much time. (VTune screenshot; the yellow regions are the thread transitions.)

  2. Not binding task 0 to the master. (VTune screenshot.)

  3. OMP. (VTune screenshot.)

So yes, we should add an environment variable for this. For example, we could add TVM_EXCLUDER_WORKER0 to handle it: if users pass TVM_EXCLUDER_WORKER0=0, we change the value to false. How about this solution?

@yidawang


A common use case is auto-tuning. If we quantize a model (which starts the thread pool during constant folding) and then extract tasks and tune them, tuning will be very slow because only thread 0 is used instead of every thread. So it is necessary to let users know about this.
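
A sketch of that scenario, assuming mod, params, and target come from a frontend import; exact signatures vary across TVM versions, so treat this as illustrative:

from tvm import autotvm, relay

# quantization runs constant folding, which already starts the thread pool
with relay.quantize.qconfig(global_scale=8.0):
    qmod = relay.quantize.quantize(mod, params=params)

# with exclude_worker0_ == true, the measurements during the tuning that
# follows were observed to run on a single pinned core
tasks = autotvm.task.extract_from_program(qmod["main"], params=params, target=target)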

@vinx13 @yidawang would you mind if I open a PR to read the TVM_EXCLUDER_WORKER0 environment variable? I think it is common to hit this problem when using OpenCV or during auto-tuning, as @vinx13 said. We should have a way to solve it.


Sure, that would be helpful. But for the auto-tuning case this still seems a little painful, as users are likely to forget to set it.

@vinx13 I agree with you. There are at least two cases where we need it:

  1. OpenCV + TVM, which is common when we use TVM to deploy CNN models in production
  2. Auto-tuning

I prefer changing the default value of exclude_worker0_, i.e. bool exclude_worker0_{false};, and then exposing a TVM_EXCLUDER_WORKER0 environment variable; when users pass 1, we change it to true. I think that is more reasonable.

How about this?

cc @yidawang

Am I missing something? Why would exclude_worker0_ lead to “only thread 0 is used instead of every thread”?

I am fine with adding the environment variable. Please feel free to submit the PR.