Malloc() occurring during run() may affect performance

I was profiling the graph runtime for memory allocation and noticed that malloc() was being called inside tvm.contrib.graph_runtime.run(). The distribution of allocation sizes during run() is as follows (a sketch of a trace parser that produces such a histogram appears after the list):

  • 0 - 1 KB: 366 times
  • 1 - 5 KB: 20 times
  • 15 - 20 KB: 18 times
  • 600 - 700 KB: 1 time
  • 800 - 900 KB: 1 time
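
For reference, here is a rough sketch of how such a histogram can be derived from a log-malloc-simple trace. The assumed line format "+ malloc <size> <address>" and the file name trace.log are illustrative, not the library's documented output; adjust the parsing to the actual trace format, and note that the numbers above count only allocations made while run() was executing:

    import re
    from collections import Counter

    # Hypothetical trace parser: assumes each allocation is logged as
    # "+ malloc <size> <address>". Adjust to log-malloc-simple's real format.
    ALLOC_RE = re.compile(r'^\+\s*malloc\s+(\d+)\b')

    # Bucket boundaries in KB, mirroring the histogram above.
    BUCKETS = [(0, 1), (1, 5), (15, 20), (600, 700), (800, 900)]

    counts = Counter()
    with open('trace.log') as trace:
        for line in trace:
            match = ALLOC_RE.match(line)
            if not match:
                continue
            kb = int(match.group(1)) / 1024.0
            for lo, hi in BUCKETS:
                if lo <= kb < hi:
                    counts['%d - %d KB' % (lo, hi)] += 1
                    break

    for bucket, n in sorted(counts.items()):
        print('%s: %d times' % (bucket, n))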

Here is the call stack of the largest memory allocation, which allocates 844 KB:

    /home/ubuntu/tvm/build/libtvm.so(tvm::runtime::CPUDeviceAPI::AllocDataSpace(DLContext, unsigned long, unsigned long, DLDataType)+0x43) [0x7f2325fd3add]
    /home/ubuntu/tvm/build/libtvm.so(tvm::runtime::WorkspacePool::Pool::Alloc(DLContext, tvm::runtime::DeviceAPI*, unsigned long)+0x296) [0x7f2325ff27b8]
    /home/ubuntu/tvm/build/libtvm.so(tvm::runtime::WorkspacePool::AllocWorkspace(DLContext, unsigned long)+0xf9) [0x7f2325ff228f]
    /home/ubuntu/tvm/build/libtvm.so(tvm::runtime::CPUDeviceAPI::AllocWorkspace(DLContext, unsigned long, DLDataType)+0x32) [0x7f2325fd34b4]
    /home/ubuntu/tvm/build/libtvm.so(TVMBackendAllocWorkspace+0x6d) [0x7f2325fd0099]
    ./model/compiled.so(+0x50e6) [0x7f2380f4b0e6]
    /home/ubuntu/tvm/build/libtvm.so(+0x1e3d2f9) [0x7f2325fe12f9]
    /home/ubuntu/tvm/build/libtvm.so(+0x1e3d6b0) [0x7f2325fe16b0]
    /home/ubuntu/tvm/build/libtvm.so(std::function<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)>::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x5a) [0x7f23255571f6]
    /home/ubuntu/tvm/build/libtvm.so(tvm::runtime::PackedFunc::CallPacked(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x30) [0x7f23256802a6]
    /home/ubuntu/tvm/build/libtvm.so(+0x1e82019) [0x7f2326026019]
    /home/ubuntu/tvm/build/libtvm.so(+0x1e846bb) [0x7f23260286bb]
    /home/ubuntu/tvm/build/libtvm.so(std::function<void ()>::operator()() const+0x32) [0x7f23254df52c]
    /home/ubuntu/tvm/build/libtvm.so(tvm::runtime::GraphRuntime::Run()+0x79) [0x7f23260238ff]
    /home/ubuntu/tvm/build/libtvm.so(+0x1e82ce0) [0x7f2326026ce0]
    /home/ubuntu/tvm/build/libtvm.so(+0x1e84ef8) [0x7f2326028ef8]
    /home/ubuntu/tvm/build/libtvm.so(std::function<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)>::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x5a) [0x7f23255571f6]
    /home/ubuntu/tvm/build/libtvm.so(tvm::runtime::PackedFunc::CallPacked(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x30) [0x7f23256802a6]
    /home/ubuntu/tvm/build/libtvm.so(TVMFuncCall+0x95) [0x7f2325fd0225]
    /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f2329cabdae]
    /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f2329cab71f]
    /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2b4) [0x7f2329ebf524]
    /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x11b93) [0x7f2329ebfb93]
    python3(_PyObject_FastCallKeywords+0x19c) [0x5a730c]
    python3() [0x503073]
    python3(_PyEval_EvalFrameDefault+0x449) [0x506859]
    python3() [0x504c28]

Note that the compiled model lib (compiled.so) makes a call to TVMBackendAllocWorkspace(), which in turn allocates memory via malloc(). Although malloc() is a library function rather than a system call, it may issue one (e.g. brk or mmap) under the hood, which can increase inference latency.
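
For context, the stack above goes through tvm::runtime::WorkspacePool, which hands out cached buffers and only falls through to the device allocator, and hence malloc(), on a miss. The following is a simplified Python analogue of that behavior, purely illustrative and not TVM's actual implementation:

    class WorkspacePool(object):
        """Illustrative only: reuse freed buffers instead of re-allocating."""

        def __init__(self):
            self._free = []  # buffers returned by free_workspace, kept for reuse

        def alloc_workspace(self, nbytes):
            # Serve the request from the pool when a large-enough buffer exists.
            for i, buf in enumerate(self._free):
                if len(buf) >= nbytes:
                    return self._free.pop(i)
            # Miss: fall back to a real allocation (malloc in the C++ runtime).
            return bytearray(nbytes)

        def free_workspace(self, buf):
            # Keep the buffer for the next request instead of releasing it.
            self._free.append(buf)

    pool = WorkspacePool()
    first = pool.alloc_workspace(844 * 1024)   # first run: real allocation
    pool.free_workspace(first)
    again = pool.alloc_workspace(844 * 1024)   # later runs: served from the pool
    assert first is again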

The object dump of compiled.so confirms the call. Following the x86-64 calling convention, the indirect call below passes device_type = 1 (%edi, kDLCPU), a size of 0x99a2c bytes (%edx, about 629 KB, matching the 600 - 700 KB bucket above), and a float32 dtype hint (%ecx = 2, %r8d = 32) to TVMBackendAllocWorkspace():

    2125:       48 8b 05 94 3e 23 00    mov    0x233e94(%rip),%rax        # 235fc0 <__TVMBackendAllocWorkspace@@Base-0x88>
    212c:       bf 01 00 00 00          mov    $0x1,%edi
    2131:       ba 2c 9a 09 00          mov    $0x99a2c,%edx
    2136:       b9 02 00 00 00          mov    $0x2,%ecx
    213b:       41 b8 20 00 00 00       mov    $0x20,%r8d
    2141:       44 89 f6                mov    %r14d,%esi
    2144:       ff 10                   callq  *(%rax)

We may want to pre-allocate all buffers to avoid malloc() at inference time.
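
Until such pre-allocation is available, one workaround is a warm-up run that triggers the workspace allocations before any latency-sensitive measurement. A minimal sketch, reusing the GraphModule m from the reproduction script below:

    import time

    m.run()  # warm-up: the workspace buffers are allocated here

    n_iters = 100
    start = time.perf_counter()
    for _ in range(n_iters):
        m.run()
    elapsed = (time.perf_counter() - start) / n_iters
    print('mean latency: %.3f ms' % (elapsed * 1e3))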

cc @kevinthesun @yzhliu

How to reproduce: run the following program under the memory-tracing library log-malloc-simple (https://github.com/qgears/log-malloc-simple):

    # Run under the log-malloc-simple preload library, e.g.
    # (adjust the path to wherever the library was built):
    #   LD_PRELOAD=./log-malloc-simple.so python3 repro.py
    import numpy as np
    import PIL.Image
    import tvm
    import tvm.contrib.graph_runtime
    import nnvm.compiler

    input_shape = (1, 3, 224, 224)

    # Load and preprocess the input image into NCHW float32.
    with open('./frog.jpg', 'rb') as f:
        image = PIL.Image.open(f).convert('RGB')
        image = np.asarray(image.resize((input_shape[-2], input_shape[-1])), dtype=np.float32)
        image = np.rollaxis(image, axis=2, start=0)[np.newaxis, :]

    # Load the compiled graph, module, and parameters.
    with open('./model/compiled_model.json', 'r') as f:
        graph = f.read()
    lib = tvm.module.load('./model/compiled.so')
    with open('./model/compiled.params', 'rb') as f:
        params = nnvm.compiler.load_param_dict(f.read())
    ctx = tvm.cpu(0)
    m = tvm.contrib.graph_runtime.create(graph, lib, ctx)

    m.set_input(**params)
    m.set_input('data', tvm.nd.array(image, ctx=ctx))
    m.run()
    out = m.get_output(0, tvm.nd.empty(shape=(1, 1000))).asnumpy()


The malloc() is only called once, during the first run, and the workspace is then cached when possible. That is why a standard benchmark needs to skip the first iteration (as we do in the time evaluator).
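
For completeness, a benchmark along those lines can be written with the time evaluator; this assumes the GraphModule m and ctx from the reproduction script above:

    # Per the note above, the time evaluator issues an untimed warm-up call
    # before the timed loop, so the one-off malloc() cost of the first run
    # does not skew the reported mean.
    ftimer = m.module.time_evaluator('run', ctx, number=100)
    print('mean inference time: %.3f ms' % (ftimer().mean * 1e3))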