[VTA] Issue with TSIM based simulation of resnet18 inference on VTA

I think there is an issue with the current TSIM backend for resnet-18 computation on VTA. My experiments are consistently reproducible on both Linux and macOS, while the same attempts with the FSIM backend are successful.

On both Linux and macOS, it crashes with a segmentation fault. To reproduce the error, configure vta/config/vta_config.json to use the tsim backend, and run deploy_vision_on_vta.py with python3.
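For reference, a quick way to confirm which backend is picked up from vta/config/vta_config.json (a minimal sketch; it assumes the vta Python package from this repository is importable):

# vta.get_env() reads vta/config/vta_config.json and exposes the configured target
import vta
env = vta.get_env()
print(env.TARGET)  # expect "tsim" for cycle-accurate simulation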


I see the same error even after pulling the latest code as of Oct 7, 2019!

It happens when execution reaches:

timer = m.module.time_evaluator("run", ctx, number=num, repeat=rep)
timer()

The error is:

Stack trace:
  [bt] (0) /home/sun/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2b64150) [0x7f8b90894150]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f8ba416ef20]
  [bt] (2) /tmp/tmpzoqyk19b/graphlib.o.so(+0x1f27e) [0x7f8b6b6cf27e]
  [bt] (3) /home/sun/File/TVM/Projects/tvm/build/libtvm.so(tvm::runtime::ThreadPool::Launch(int (*)(int, TVMParallelGroupEnv*, void*), void*, int, int)+0xfee) [0x7f8b63b3539e]
  [bt] (4) /home/sun/File/TVM/Projects/tvm/build/libtvm.so(TVMBackendParallelLaunch+0x63) [0x7f8b63b32e93]
  [bt] (5) /tmp/tmpzoqyk19b/graphlib.o.so(+0x1ed2b) [0x7f8b6b6ced2b]
  [bt] (6) /tmp/tmpzoqyk19b/graphlib.o.so(fused_nn_conv2d_add_nn_relu+0x3c3) [0x7f8b6b6ce8e3]
  [bt] (7) /home/sun/File/TVM/Projects/tvm/build/libtvm.so(+0xbd8210) [0x7f8b63b20210]
  [bt] (8) /home/sun/File/TVM/Projects/tvm/build/libtvm.so(+0xc2bfe7) [0x7f8b63b73fe7]
Stack trace:
  [bt] (0) /home/sun/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2b64150) [0x7f8b90894150]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f8ba416ef20]
  [bt] (2) /tmp/tmpzoqyk19b/graphlib.o.so(+0x1f27e) [0x7f8b6b6cf27e]
  [bt] (3) /home/sun/File/TVM/Projects/tvm/build/libtvm.so(tvm::runtime::ThreadPool::RunWorker(int)+0x157) [0x7f8b63b33947]
  [bt] (4) /home/sun/File/TVM/Projects/tvm/build/libtvm.so(std::thread::_State_impl<std::thread::_Invoker<std::tuple<tvm::runtime::threading::ThreadGroup::Impl::Impl(int, std::function<void (int)>, bool)::{lambda()#1}> > >::_M_run()+0x31) [0x7f8b63b36521]
  [bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7f8b8d73566f]
  [bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f8ba3f176db]
  [bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f8ba425188f]
terminate called after throwing an instance of 'dmlc::Error'
  what():  [15:18:42] /home/sun/File/TVM/Projects/tvm/src/runtime/workspace_pool.cc:116: Check failed: allocated_.size() == 1 (3 vs. 1) : 
Stack trace:
  [bt] (0) /home/sun/File/TVM/Projects/tvm/build/libtvm.so(tvm::runtime::WorkspacePool::Pool::Release(DLContext, tvm::runtime::DeviceAPI*)+0x7d7) [0x7f8b63b3b527]
  [bt] (1) /home/sun/File/TVM/Projects/tvm/build/libtvm.so(tvm::runtime::WorkspacePool::~WorkspacePool()+0x37) [0x7f8b63b39937]
  [bt] (2) /lib/x86_64-linux-gnu/libc.so.6(__call_tls_dtors+0x3f) [0x7f8ba41738af]
  [bt] (3) /lib/x86_64-linux-gnu/libc.so.6(+0x43117) [0x7f8ba4173117]
  [bt] (4) /lib/x86_64-linux-gnu/libc.so.6(+0x4313a) [0x7f8ba417313a]
  [bt] (5) /home/sun/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2b64188) [0x7f8b90894188]
  [bt] (6) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f8ba416ef20]
  [bt] (7) /tmp/tmpzoqyk19b/graphlib.o.so(+0x1f27e) [0x7f8b6b6cf27e]
  [bt] (8) /home/sun/File/TVM/Projects/tvm/build/libtvm.so(tvm::runtime::ThreadPool::Launch(int (*)(int, TVMParallelGroupEnv*, void*), void*, int, int)+0xfee) [0x7f8b63b3539e]


[1]    3914 abort (core dumped)  python3 -m pdb vta/tutorials/frontend/deploy_vision_on_vta.py

Do you have a solution or any suggestions?

We are still working on end-to-end support for VTA/TSIM. At the moment, only unittests are working.

With “TARGET” set to “tsim” in vta_config.json, I tried the demos in vta/tutorials and vta/tests/integration (including test_vta_insn.py). The error is:

File "/krzhang/tvm/tvm0809/tvm-0809-base/vta/python/vta/testing/simulator.py", line 43, in _load_all
m = tvm.module.load(lib[0], "vta-tsim")
IndexError: list index out of range

It seems that “libvta_hw.so” is missing.
Am I using a version that is too old?
I'd appreciate any help!

You have to first build the hardware library by going to tvm/vta/hardware/chisel and running make; that should create libvta_hw.so.

Hope it helps!

Has there been any progress on this? Should we try to fix this ourselves, or is there active work going on now to fix it?

I’m working on a fix to this. Please stay tuned.

We are still interested in a fix here. If you let us know how far you have gotten, we can try to take it over from there.

Thanks for your attention. I think that, for now, you can successfully evaluate the test_benchmark_topi_conv2d.py script with the TSIM backend. The script covers most of the workloads in resnet18. Therefore, I think the hardware implementation (Chisel VTA), along with TSIM-based simulation, should be fine for now.

As for the problem in evaluating the deploy_vision_on_vta.py script, the reported error seems to be related to the integration with Relay and the runtime.

Note that you might need to duplicate the lines containing pynq_1x16_i8w8a32_15_15_18_17 in the file ~/.tvm/tophub/vta_v0.06.log, replacing pynq_1x16_i8w8a32_15_15_18_17 with tsim_1x16_i8w8a32_15_15_18_17 in the duplicated lines, in order to load the pre-tuned schedule parameters correctly.
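A small sketch of that edit (it assumes the log path and device keys quoted above; editing the file by hand works just as well):

import os

log_path = os.path.expanduser("~/.tvm/tophub/vta_v0.06.log")
with open(log_path) as f:
    lines = f.readlines()
# append a tsim copy of every pynq entry so the tuned schedules can be found
with open(log_path, "a") as f:
    for line in lines:
        if "pynq_1x16_i8w8a32_15_15_18_17" in line:
            f.write(line.replace("pynq_1x16_i8w8a32_15_15_18_17",
                                 "tsim_1x16_i8w8a32_15_15_18_17"))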

Thanks! I'm able to get that to run through now.

Hi @thierry, since @stevenmburns can also evaluate test_benchmark_topi_conv2d.py successfully with the TSIM backend, I think there is no hardware issue in Chisel VTA blocking end-to-end inference.

As we are heading towards enabling end-to-end inference with the deploy_vision_on_vta.py script, I observed that the segmentation fault actually takes place in the 1st layer of resnet18. I also observed that the 1st layer of resnet18 doesn't actually run on VTA, since it comes before the “nn.max_pool2d” layer.

The stack trace looks like the following (some of the functions are in the generated code):

#0  0x00007fffba0a0cd6 in tvm::runtime::ThreadPool::RunWorker(int) (this=0x2032da8, worker_id=1)
    at /home/liangfu/workspace/tvm_upstream/src/runtime/thread_pool.cc:365
#1  0x00007fffba0a04f9 in tvm::runtime::ThreadPool::ThreadPool()::{lambda(int)#1}::operator()(int) const
    (__closure=0x22f0518, worker_id=1) at /home/liangfu/workspace/tvm_upstream/src/runtime/thread_pool.cc:291
...
#3 in __TVMBackendParallelLaunch
#4 in fused_nn_conv2d_add_nn_relu_compute_
#5 in fused_nn_conv2d_add_nn_relu
...

Do you have any suggestions for making this actually work?

We (Intel Strategic CAD Labs) would like to get the end-to-end flow working as well. A few observations and questions from our end:

  1. The end-to-end flow works with target “sim” but not with “tsim”. The Verilator simulation resets and performs zero or one clock ticks out of reset before the crash occurs. I get four separate core dumps before the threading code produces a stack trace. When I run in gdb, I see the first core dump is in the code generated by the runtime (libgraph, I think). There are no debugging symbols to see exactly what happened. (Perhaps there is a way to get more debug visibility here.) Why would sim work and tsim not work before the tsim simulator starts doing anything real? Any thoughts?
  2. Does the end-to-end flow work for you on the DE10-Nano, or do you get a similar issue with a runtime core dump? We have a DE10-Nano up and running, but I don't yet know the results of running the end-to-end flow. Is it expected to work? If it does work, what is different about this environment compared to the tsim/Verilator setup that could cause the difference?

Thanks for your help.

@stevenmburns Thanks for the comments. Unfortunately, I don't have a fix either.

  1. My previous post was a diagnosis toward a fix that brings TSIM support to the end-to-end workflow.
  2. It doesn't work on my DE10-Nano either. The error takes place in the 1st layer of resnet18 inference, which is designed to run on the CPU instead, if I understand correctly.

@vegaluis @thierry do you have any suggestions?

The crash is caused by the use of virtual memory in the TSIM driver: the runtime is trying to access a virtual memory address. I brought a PR to fix the issue (see PR #4527).


After making the changes you suggested in the TSIM driver, I was able to run the program without any segmentation fault. But I am facing a couple of issues:

  • In the PR, you mentioned that, thanks to the multi-threading support, it would take around 5 minutes to perform the cycle-accurate simulation. However, it took around 3-4 hours to run (I ran 8 threads on an 8-core processor).
  • At the end, the resulting prediction shown as output turns out to be wrong (see figure).

Is there any way to fix this?

With a simple fix in PR #4574, you can get the correct results. Hooray!


Some hints to speed up simulation with TSIM:

  • Use a single timing measurement for TSIM-based simulation:
num = 1 # number of times we run module for a single measurement 
rep = 1 # number of measurements (we derive std dev from this)   

or simply run the module with

m.run()

instead of running the module in the time_evaluator (see the sketch after this list).

  • Build Chisel VTA with the default config, which means
make DEBUG=0 USE_TRACE=0 lib
  • Don't use debug_runtime in place of graph_runtime, as it takes a lot more cycles in TSIM.
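Putting the first hint together, a minimal sketch (it assumes m is the graph runtime module built in deploy_vision_on_vta.py, image is the preprocessed input tensor, and that the input is named "data"):

# single pass through the network instead of repeated timing via time_evaluator
m.set_input("data", image)
m.run()
tvm_output = m.get_output(0)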

Hi @liangfu,

I got a similar result as @BharathKinnal: the assert(cat_detected) check failed.

With the fix in PR #4574, I got a compilation error since aluBits is not defined in the TensorAlu class.

After adding the following code at line 111, the compilation went through, but the simulation still failed on assert(cat_detected).

  val aluBits = p(CoreKey).accBits

Did you see the same issue on your end?

Best regards.

Hi @kevinyuan, thanks for reporting. Sorry for the mistake of leaving aluBits undefined in the module. Can you bring a PR to fix that? However, I do get a successful detection of the cat with the simple fix. Let's find out why you cannot get the cat detected with the TSIM backend.

Hi @liangfu,

PR #4624 created.