TVM on Windows - Tips(?) and Feedback

Now that I’ve spent some time with TVM on Windows, I wanted to describe what I got working and how, in case it can help others…and maybe others can tell me what I did wrong.

Compiling
This went pretty smoothly if you are already familiar with CMake. LLVM was a monster to build but went without a hitch. The CMake GUI worked out of the box; just don’t forget to set LLVM_DIR to the cmake directory inside your LLVM build, e.g. build\llvm\winx64\lib\cmake\llvm

If you use GraphRuntime, including runtime/graph_runtime.h leads to linker errors. MSVC does not export symbols by default, but if you set WINDOWS_EXPORT_ALL_SYMBOLS on the tvm_runtime DLL target, CMake will export them for you. That fixed my linker issues.

Python
First thing I did was set my PATH to include the TVM release directory that contained the compiled tvm dlls (build\tvm\winx64\Release).
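
The PATH step can also be done from inside a script, before TVM is imported. This is only a minimal sketch: prepend_to_path is a hypothetical helper (not part of TVM), and the directory is the build path mentioned above, which you would adjust to your own tree.

```python
import os


def prepend_to_path(directory, env=None):
    """Put `directory` at the front of PATH so DLLs in it are found first."""
    env = os.environ if env is None else env
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


# Assumed location of the compiled TVM DLLs; adjust to your own build tree.
TVM_DLL_DIR = r"build\tvm\winx64\Release"
prepend_to_path(TVM_DLL_DIR)

# `import tvm` would go here, after PATH has been updated.
```

Setting PATH system-wide, as described above, has the same effect and does not need any code.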
I have Python 3.7.x installed, and following the directions to run “python setup.py install --user” went without a hitch.

Autotuning
This appears to be broken on Windows out of the box. You need a real Linux box if you plan on doing anything with CUDA without fuss. CPU autotuning seemed to work fine out of the box under WSL.

The RPC tracker and server would not work with INADDR_ANY (0.0.0.0); I had to specify the local computer’s IP address explicitly. I had to do this on Linux also.
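
If you don’t want to hard-code the address, something like the sketch below can look it up and print the tracker and server commands. The port 9190 and the key my-gpu are placeholder assumptions, and the probe address is only used to pick a route (no packet is sent).

```python
import socket


def local_ip(probe_host="8.8.8.8", probe_port=80):
    """Return the IP of the interface the OS would use for outbound traffic."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only selects a route.
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No route available (e.g. offline machine): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


host = local_ip()
port = 9190  # assumed port; any free port works

# Placeholder key "my-gpu"; use whatever device key your tuning script expects.
print("python -m tvm.exec.rpc_tracker --host={} --port={}".format(host, port))
print("python -m tvm.exec.rpc_server --tracker={}:{} --key=my-gpu".format(host, port))
```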

I was able to get autotuning to work on Windows, but it took a lot of debugging and editing, mostly of things I was unfamiliar with in Python.

For instance, the thread.start() call here would deadlock: it never reached the next line (the thread.join) and never ran the thread entry point. I did verify that Python threads work by writing some test code, but that specific point deadlocked.

I literally went line by line, hacking up Python code until nvcc kicked in and it started working.

XGBoost would not work at all (some crash about a pickle), so I used gridsearch instead.

Compiling Model:
Relay would get stuck in a tight, infinite loop somewhere in thread_pool.cc (constantly hitting thread yields). Setting TVM_NUM_THREADS=1 always fixed this.

My workflow with Windows
I honestly gave TVM on Windows a hard try, as I need to at least compile the modules for Windows and CUDA. The only reliable way I found is to autotune on a real Linux machine, where the TVM docs pretty much worked as prescribed. I then copy the tuned [network].log file to my Windows machine and set TVM_NUM_THREADS=1 (important!). Then I run the exact same script from the Linux autotune, but comment out the “tune_tasks” line. That skips the problematic Windows issues and leaves me with an optimized model for Windows.
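
Roughly, the Windows half of that workflow looks like the sketch below. It assumes a script shaped like the autotvm tuning tutorial; compile_with_log, RUN_TUNING and tune_fn are hypothetical names, and a flag stands in for commenting the line out. Setting the variable in the shell before launching Python is the safer route; doing it in the script is shown only for convenience.

```python
import os

# Set before TVM is imported so it is in place when the runtime thread pool starts.
os.environ["TVM_NUM_THREADS"] = "1"

RUN_TUNING = False  # True on the Linux machine, False on Windows


def compile_with_log(mod, params, target, log_file, tune_fn=None):
    """Optionally tune, then build using the best records found in log_file."""
    # Imported lazily so the flag logic above can be read and run without TVM.
    from tvm import autotvm, relay

    if RUN_TUNING and tune_fn is not None:
        tune_fn()  # e.g. the tutorial's tune_tasks(tasks, **tuning_option)

    # Apply the tuned schedules copied over from the Linux machine.
    with autotvm.apply_history_best(log_file):
        with relay.build_config(opt_level=3):
            return relay.build_module.build(mod, target=target, params=params)
```

The same script can then be kept identical on both machines, with only RUN_TUNING flipped.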

If there is a better way, I’d really love to hear it. Hopefully this can help someone else.

Thank you for this project. It’s absolutely amazing!! So far I haven’t given it a model where I didn’t get at least a 3x perf increase!

Thank you @jmorrill for the notes, and useful insights. Would you be interested in adding your lessons on Windows into the TVM’s website docs? That would give you an opportunity to contribute to the TVM project and give your lessons more visibility.

Thanks for sharing your insights! It would be great if you could contribute a tutorial on how to deploy on Windows. cc @yidawang @haichen on the spinning threadpool issue on Windows

Thanks @jmorrill! Would you be able to share some more details about the model you’re testing with? I have been using TVM on Windows for a while now, and have been successfully compiling and running models with TVM_NUM_THREADS > 1. I’d be curious to repro and debug the issue.

@jonso Here is an MXNet ResNet-50 model that Relay gets stuck in the thread pool with on Windows.
http://insightface.ai/files/models/retinaface_r50_v1.zip

Would be very curious if you can reproduce!

A side note: if you try to load it with the nnvm API, it gets stuck in a loop in Python code in the from_mxnet part. I gave up on that quickly, as I believe Relay is where all the new work is.

Thanks for sharing! I used the .log from autotuning on Ubuntu to generate a .dll on 64-bit Windows, and it deploys successfully with C++ on Windows.

But I ran into problems when I tried 32-bit Windows. Does TVM support 32-bit Windows for deployment?
  File "D:\tvm_sep\python\tvm\relay\build_module.py", line 207, in build
    graph_json, mod, params = bld_mod.build(func, target, target_host, params)

  File "D:\tvm_sep\python\tvm\relay\build_module.py", line 108, in build
    self._build(func, target, target_host)

  File "D:\tvm_sep\python\tvm\_ffi\_ctypes\function.py", line 209, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)) != 0:

OSError: exception: access violation writing 0xBF000000

Thanks a lot!

@jmorrill Thanks for sharing!

Can you please elaborate a bit on the thread yielding issue you encountered? I am a bit confused, as the runtime thread pool shouldn’t play a role in the compilation.

I can’t test at the moment, but I believe it’s on the line:

tasks = autotvm.task.extract_from_program(mod["main"], target=target,
                                          params=params, ops=(relay.op.nn.conv2d,))

Now that you had me re-read it, I probably didn’t need that line for compilation, as it looks like it’s related to the autotune. Apologies if that is the case!

OK. It makes sense now. Autotuning would definitely need the runtime thread pool. However, it seems this issue didn’t appear when running the final model inference, correct? That is, when you run inference after compilation, you are able to use multiple threads.

I’ve been through a lot of configurations and tests in Windows over the last few weeks, so I hope I’m not confusing anything here:

I believe I did a CPU inference of a MobileNet-based model without issue early on, autotuned using the nnvm API.

After that, I mostly focused on CUDA autotuning and inference, but had trouble loading the resnet-50 with nnvm, so I used Relay instead.

I’ll do a few quick tests tomorrow and maybe over the weekend so I can give more solid answers :).

@jmorrill Thank you so much. Your post saved lots of time for me. I managed to compile TVM on Windows Server 2019 with the following command:

cmake .. -G"Visual Studio 16 2019" -A x64 ^
-DLLVM_DIR="C:\Users\Administrator\Desktop\llvm-9.0.0.src\build\lib\cmake\llvm" ^
-DCMAKE_BUILD_TYPE=Release -DCMAKE_CONFIGURATION_TYPES="Release" ^
-DCMAKE_WINDOWS_EXPORT_ALL_SYMBOLS=TRUE

Two more details:

  • When compiling a model, I got the error RuntimeError: Can not find cl.exe,please run this in Vistual Studio Command Prompt. This can be solved by using the shell “x64 Native Tools Command Prompt for VS 2019”.
  • I also got the error “lld-linker.exe not found”. Still working on this one.
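
Both errors come down to a tool not being on PATH, so a quick check before compiling a model can save a failed build. A minimal sketch; missing_tools is a hypothetical helper, and lld-link is assumed to be the linker executable being looked for (that is the name LLD ships its MSVC-style linker under).

```python
import shutil


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# cl is the MSVC compiler; lld-link is LLVM's MSVC-compatible linker.
missing = missing_tools(["cl", "lld-link"])
if missing:
    print("Not on PATH: " + ", ".join(missing))
```

On Windows, shutil.which also tries the extensions in PATHEXT, so "cl" matches cl.exe.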

I’ve since pulled the latest and the threadpool is no longer spinning, but some of the autotvm code that creates a new Python thread (check_remote in measure_methods.py) seems to deadlock, possibly on the GIL.

The fix was TVM_NUM_THREADS=1, but I found out yesterday that enabling OpenMP does the trick also.

I’m still doing some experimenting on Windows with autotuning. I can get the xgb tuner to work, but because of the lack of a real fork() on Windows and the lack of truly parallel threads in Python (I’m still new to the language), it’s a bit slow.
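
For anyone else new to this part of Python: without fork(), multiprocessing on Windows has to start a fresh interpreter for each worker ("spawn") and pickle whatever it sends over. The sketch below just shows those two facts with the standard library. Whether this is what autotvm trips over is an assumption on my part, not something verified against the TVM code.

```python
import multiprocessing
import pickle

# "fork" is absent on Windows; "spawn" is the only start method there.
print(multiprocessing.get_all_start_methods())

# With "spawn", work handed to a child process is pickled first.
# Objects that cannot be looked up by name, such as a lambda, fail to pickle.
try:
    pickle.dumps(lambda x: x * x)
    lambda_ok = True
except (pickle.PicklingError, AttributeError):
    lambda_ok = False
print("lambda picklable:", lambda_ok)
```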

Update: I initially compiled LLVM from the tarball http://releases.llvm.org/9.0.0/llvm-9.0.0.src.tar.xz, and it did not include LLD (lld-link.exe). So instead I had to check out the git repository:

git clone https://github.com/llvm/llvm-project.git -b llvmorg-9.0.0 --recursive
cd llvm-project\llvm
mkdir build
cd build
cmake -G"Visual Studio 16 2019" -A x64 -DLLVM_ENABLE_PROJECTS=lld ..

(Note that git clone will take a while, since LLVM is a big project.)
Then I compiled the generated solution LLVM.sln, and this time lld-link.exe was produced.

And don’t forget to add llvm-project\llvm\build\Release\bin to the system PATH.

Example script:

import numpy as np

from tvm import relay
from tvm.relay import testing
import tvm
from tvm.contrib import graph_runtime

batch_size = 1
num_class = 1000
image_shape = (3, 224, 224)
data_shape = (batch_size,) + image_shape
out_shape = (batch_size, num_class)

mod, params = relay.testing.resnet.get_workload(
    num_layers=18, batch_size=batch_size, image_shape=image_shape)

opt_level = 3
target = tvm.target.create('llvm')
with relay.build_config(opt_level=opt_level):
    graph, lib, params = relay.build_module.build(
            mod, target, params=params)

lib.export_library('./deploy.dll')
with open('./deploy_graph.json', 'w') as f:
    f.write(graph)
with open('./deploy_param.params', 'wb') as f:
    f.write(relay.save_param_dict(params))

# run on the CPU (the random input is created below, just before inference)
ctx = tvm.cpu()

# load the module back.
with open('./deploy_graph.json', 'r') as f:
    loaded_json = f.read()
loaded_lib = tvm.module.load('./deploy.dll')
with open('./deploy_param.params', 'rb') as f:
    loaded_params = bytearray(f.read())
input_data = tvm.nd.array(np.random.uniform(size=data_shape).astype("float32"))

module = graph_runtime.create(loaded_json, loaded_lib, ctx)
module.load_params(loaded_params)
module.run(data=input_data)
out_deploy = module.get_output(0).asnumpy()

# Print first 10 elements of output
print(out_deploy.flatten()[0:10])