Slow graph_runtime.create for LLVM target - Culprit LazyInitJIT

Hi,

@kevinthesun and I are witnessing very slow graph_runtime creation for some test cases (>10 min). The behavior can be reproduced with this tutorial - https://docs.tvm.ai/tutorials/frontend/deploy_ssd_gluoncv.html

After profiling, we found the culprit was

m = graph_runtime.create(graph, lib, ctx)

With deeper profiling, we found that the execution path goes through llvm_module.cc and calls LazyInitJIT.

Almost all the time is spent in this function. Note that this function is called during graph_runtime.create after we have already compiled the model. I am confused as to why this function is needed here. But more importantly, why does it take so much time?
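For context, a minimal sketch of how one might isolate the slow step by timing the build and the runtime creation separately (func and params are assumed to come from a standard Relay pipeline such as create_workload; pinpointing LazyInitJIT itself on top of this needs a native profiler like perf, since the time is spent inside the C++ FFI call):

import time
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

t0 = time.time()
graph, lib, params = relay.build(func, target="llvm", params=params)
t1 = time.time()
m = graph_runtime.create(graph, lib, tvm.cpu())  # the slow step in our case
t2 = time.time()

print("relay.build:          %.1f s" % (t1 - t0))
print("graph_runtime.create: %.1f s" % (t2 - t1))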

Has anybody else experienced or solved this issue?

LazyInitJIT initializes the LLVM JIT to create the LLVM execution engine. This could be due to some crazy LLVM optimization kicking in at JIT compilation. One way to get around this is to switch to AOT mode: save the library to an .so file and load it back.

I see. I will try changing the optimization level in this function to see if it affects total time.

Also, is there a tutorial on completely switching to AOT mode? Can I use this - https://docs.tvm.ai/tutorials/relay_quick_start.html#save-and-load-compiled-module

Yes, you just need to save the lib via export_library and load it back again.
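For anyone landing here later, a minimal sketch of that AOT round trip (file names are just examples; func and params are assumed to come from the usual Relay pipeline):

from tvm import relay
from tvm.contrib import graph_runtime
import tvm

# After relay.build, save the compiled artifacts to disk.
graph, lib, params = relay.build(func, target="llvm", params=params)
lib.export_library("deploy_lib.so")           # runs the (slow) codegen once
with open("deploy_graph.json", "w") as f:
    f.write(graph)
with open("deploy_params.bin", "wb") as f:
    f.write(relay.save_param_dict(params))

# Later (or in another process), load everything back without recompiling.
loaded_lib = tvm.module.load("deploy_lib.so")
loaded_graph = open("deploy_graph.json").read()
m = graph_runtime.create(loaded_graph, loaded_lib, tvm.cpu())
m.load_params(open("deploy_params.bin", "rb").read())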

Thanks @tqchen. I realized that our use case requires us to compile every time, so AOT will suffer the same fate. We checked by dumping the .so files; the slowdown just shifted to .so creation.

However, I might have been able to pinpoint at least a portion of the problem. I can reproduce the issue with just max_pool2d. The test case is

import tvm
from tvm import relay
from tvm.relay.testing import create_workload
from tvm.contrib import graph_runtime
import time



# Build the Relay function for the given target and create a graph runtime.
def compile_graph(simple_net, t, config=None):
    target = tvm.target.create(t)
    func = relay.Function(relay.ir_pass.free_vars(simple_net), simple_net)
    func, params = create_workload(func)
    if config is None:
        config = {"opt_level" : 3}
    with relay.build_config(**config):
        artifacts = relay.build(func, target=target, params=params)
    if t == 'llvm':
        ctx = tvm.cpu()
        graph, libs, params = artifacts
        module = graph_runtime.create(graph, libs, ctx)  # slow step for float16

def test_pool(batch_size, in_channel, height, width, kh, kw, stride, padding,
        dtype):
    data_shape = (batch_size, height, width, in_channel)
    data = relay.var("data", shape=data_shape, dtype=dtype)
    simple_net = relay.nn.max_pool2d(data=data,
                                pool_size=(kh, kw),
                                strides=(stride, stride),
                                padding=padding,
                                layout='NHWC')
    compile_graph(simple_net, 'llvm')


start = time.time()
test_pool(batch_size=1, in_channel=288, height=35, width=35, kh=3, kw=3,
          stride=2, padding=(0, 0, 0, 0), dtype="float32")
end = time.time()
print("Float32 took", str(end - start), "seconds")



start = time.time()
test_pool(batch_size=1, in_channel=288, height=35, width=35, kh=3, kw=3,
          stride=2, padding=(0, 0, 0, 0), dtype="float16")
end = time.time()
print("Float16 took", str(end - start), "seconds")

The above test case on my machine takes
Float32 --> 2 seconds
Float16 --> more than 10 min

If I use the original compute as the default schedule for pool2d, then at least for my test case, the compilation time drops to mere milliseconds.

Vectorization might be the main reason here.
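One way to check that hypothesis is to dump the generated LLVM IR for the compiled module and look for vector types and fp16 conversion calls (a sketch; libs is assumed to be the module returned by relay.build in the repro above):

# Inspect the LLVM IR emitted for the compiled module; vectorized loops show
# up as LLVM vector types (e.g. <8 x float>), and the fp16 path additionally
# goes through half<->float conversions.
graph, libs, params = relay.build(func, target="llvm", params=params)
print(libs.get_source("ll"))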