Deploy NNVM module using C++ on GPU using OpenCL target


Hi, I am following the C++ deployment instructions to deploy a compiled NNVM graph on my laptop GPU. When I set device_type to kDLOpenCL, I get a segfault after reading the input from a binary file into the DLTensor, as in the same example. Here is the setup:

Development PC : x86_64 with NVIDIA 920M
TVM runtime is compiled with OpenCL and CUDA enabled.
NNVM graph is built with target = 'opencl', target_host = 'llvm'
  int dtype_code = kDLFloat;
  int dtype_bits = 32;
  int dtype_lanes = 1;
  int device_type = kDLOpenCL;
  int device_id = 0;

The source line that causes the segfault is the read into x->data, which ends with: <char*>(x->data), 3 * 224 * 224 * 4);

On the same PC, I compiled the graph for the CPU (target='llvm', target_host='llvm') and I am able to deploy the exported module using C++ with device_type = kDLCPU. The segfault occurs only when deploying on the GPU. Below is the log:

[15:01:55] src/runtime/opencl/ Multiple OpenCL platforms matched, use the first one ...
[15:01:55] src/runtime/opencl/ Initialize OpenCL platform 'NVIDIA CUDA'
[New Thread 0x7ffff3923700 (LWP 25361)]
[New Thread 0x7ffff3122700 (LWP 25362)]
[New Thread 0x7ffff2921700 (LWP 25363)]
[New Thread 0x7ffff2120700 (LWP 25364)]
[New Thread 0x7ffff191f700 (LWP 25365)]
[New Thread 0x7ffff111e700 (LWP 25366)]
[New Thread 0x7ffff091d700 (LWP 25367)]
[15:01:55] src/runtime/opencl/ opencl(0)='GeForce 920M' cl_device_id=0x6801a0

Thread 1 "ssd_nnvm_demo" received signal SIGSEGV, Segmentation fault.
0x000000000040acf5 in tvm::runtime::Module::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) ()

Is there a different way to read the data from a binary file into an input DLTensor that is allocated on the GPU?



You should first read your data into a CPU tensor, and then copy the CPU tensor to the GPU tensor with TVMArrayCopyFromTo(…).
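The host-side half of that advice can be sketched without TVM at all. The helper below is a minimal sketch, and read_floats is a hypothetical name that does not come from TVM: it reads the raw float32 file into ordinary host memory and reports a short or missing file, so nothing ever writes through a device pointer.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not a TVM function): reads exactly `count` float32
// values from a raw binary file into host memory.
// Returns false if the file cannot be opened or is shorter than expected.
bool read_floats(const std::string& path, std::size_t count,
                 std::vector<float>* out) {
  std::ifstream fin(path, std::ios::binary);
  if (!fin) return false;
  out->assign(count, 0.0f);
  const std::size_t nbytes = count * sizeof(float);
  fin.read(reinterpret_cast<char*>(out->data()),
           static_cast<std::streamsize>(nbytes));
  // gcount() is the number of bytes actually read by the last read().
  return static_cast<std::size_t>(fin.gcount()) == nbytes;
}
```

Once the data is in a host vector, say host, it can be handed to the GPU tensor with a copy call such as TVMArrayCopyFromBytes(x, host.data(), host.size() * sizeof(float)), which is the pattern used later in this thread.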


@masahi Thanks a lot. Got it working!

In the C++ OpenCL API, how can a GPU "DLTensor *x" receive input data from the CPU on a Mali GPU?

Is there an example somewhere that shows how to create a GPU tensor, and how to set the stream in this case?


@masahi Hi, I am facing the same issue. Is there sample code showing how to transfer tensors between host and device?


Sure, I extracted a sample from my code and put it here.


@masahi Thanks for the sample.
In your sample code, is "tvm_input" the CPU byte array that gets copied to "x" (the GPU array)?
That is, is the argument order TVMArrayCopyFromBytes(destination, source, size)?

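for (int i = 0; i < n_samples; ++i) {
    // copy one input sample from host memory into the GPU tensor x
    TVMArrayCopyFromBytes(x, &tvm_input[i * in_size], in_size * sizeof(float));
    set_input(input_name.c_str(), x);
    run();  // assumed: the graph is executed here, before the output is fetched
    get_output(0, y);
    // copy the GPU output tensor y back into host memory
    TVMArrayCopyToBytes(y, &tvm_output[i * out_size], out_size * sizeof(float));
}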


Yes, the source is on the CPU and x is on the GPU. In my code, tvm_input should contain the input data, coming from your input image for example.


Instead of tvm_input, I created a DLTensor 'z' with device type kDLCPU to store the input image.
I copied the byte array as shown below.

But I couldn't find the copied bytes in the destination (x->data).


If you already have your data in a DLTensor, you should use TVMArrayCopyFromTo.


OK, now I tried with TVMArrayCopyFromTo:
TVMArrayCopyFromTo(z, x, nullptr);
The same issue happens: I couldn't find the bytes copied to x->data.
I think x->data should be the same as z->data (the image data). Please correct me if I am wrong.


If x is on the GPU, you should NEVER touch x->data. You either get a segfault or complete junk.
If you want to access the contents of x, copy x to the CPU first.


@masahi Thanks, got it working. I copied back from GPU to CPU (x->data to k->data) and validated the data.
After executing run(), I was able to get the output on the CPU in two ways:

  1. Allocate the output tensor "y" with device type CPU (1), then call get_output(0, y). y->data contains the output. (I think TVM internally copies the output from the device to the host?)
  2. Allocate the output tensor "y" with device type GPU (4), call get_output(0, y), then copy the bytes from GPU to CPU into out_vector[] (similar to your sample code).

Which of the two is the right way to extract the output?


The answer is 2.

See my sample code. The output y is a GPU tensor. I copy y to tvm_output, which is on the CPU.


Hi @masahi, I am still getting a segfault even when I use your sample code. After allocating the memory I memset it to 0:

DLTensor* x = nullptr;
DLTensor* y = nullptr;
const int in_ndim = 4;
const int out_ndim = in_ndim;
const int num_slice = 1;
const int num_class = 4;
const int shrink_size[] = { 256, 256 };
const int64_t in_shape[] = { num_slice, 1, shrink_size[0], shrink_size[1] };
const int64_t out_shape[] = { num_slice, num_class, shrink_size[0], shrink_size[1] };
TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &x);
TVMArrayAlloc(out_shape, out_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &y);

memset(x->data, 0, 4265265);

This happens for CUDA and OpenCL; for llvm it works fine.


you can’t use memset on GPU memory.
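One way to get the same effect, sketched under the assumption that a host-to-device copy such as TVMArrayCopyFromBytes is available as in the sample loop above: build a zero-filled buffer in host memory and copy it into x, rather than writing through x->data. The helper names tensor_nbytes and make_zero_host_buffer below are hypothetical and not part of TVM; they only compute the byte size from the shape and dtype. For the input shape in the post, { 1, 1, 256, 256 } with 32-bit floats, that size is 262144 bytes, not the 4265265 passed to memset.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of bytes occupied by a dense tensor with the
// given shape and element type (bits per lane times lanes, rounded up to bytes).
std::size_t tensor_nbytes(const int64_t* shape, int ndim,
                          int dtype_bits, int dtype_lanes) {
  std::size_t n = 1;
  for (int i = 0; i < ndim; ++i) n *= static_cast<std::size_t>(shape[i]);
  return n * static_cast<std::size_t>((dtype_bits * dtype_lanes + 7) / 8);
}

// Hypothetical helper: a zero-filled host buffer of exactly the tensor's size.
// This lives in ordinary CPU memory, so filling it is always safe.
std::vector<char> make_zero_host_buffer(const int64_t* shape, int ndim,
                                        int dtype_bits, int dtype_lanes) {
  return std::vector<char>(
      tensor_nbytes(shape, ndim, dtype_bits, dtype_lanes), 0);
}
```

With zeros = make_zero_host_buffer(in_shape, in_ndim, dtype_bits, dtype_lanes), the device tensor could then be cleared with TVMArrayCopyFromBytes(x, zeros.data(), zeros.size()).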


Hi @masahi, I still cannot get it working; it still throws a memory error.

void FR_TVM_Deploy::forward(float* imgData)
{
    int in_size = (1 * 64 * 64 * 3 * 4);  // input size in bytes

    constexpr int dtype_code = kDLFloat;
    constexpr int dtype_bits = 32;
    constexpr int dtype_lanes = 1;
    constexpr int device_type = kDLCPU;
    constexpr int device_id = 0;
    constexpr int in_ndim = 4;
    const int64_t in_shape[in_ndim] = {1, 64, 64, 3};
    // Allocate memory for the input DLTensor
    TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &input);
    TVMArrayCopyFromBytes(input, imgData, in_size);
    // Get the graph runtime module
    tvm::runtime::Module* mod = (tvm::runtime::Module*)handle;
    // get the function from the module (set input data)
    tvm::runtime::PackedFunc set_input = mod->GetFunction("set_input");
    set_input("input", input);
    // get the function from the module (run it)
    tvm::runtime::PackedFunc run = mod->GetFunction("run");

    int out_ndim = 2;
    int64_t out_shape[2] = {1, 256};
    TVMArrayAlloc(out_shape, out_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &output);
    // get the function from the module (get output data)
    tvm::runtime::PackedFunc get_output = mod->GetFunction("get_output");
    get_output(0, output);
    size_t out_size = out_shape[0] * out_shape[1];
    std::vector<float> tvm_output(out_size, 0);
    TVMArrayCopyToBytes(output, &tvm_output[out_size], out_size);
}


When I print the tvm_output vector I get all 0's, meaning the output comes back as 0; in the llvm case I get the correct output. Here I am printing the vector tvm_output in a loop. Is there any other way to check the output?
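Reading the posted snippet as it stands, three things stand out, though they may not be the whole story. The run function is fetched with GetFunction but never invoked. The destination passed to TVMArrayCopyToBytes is &tvm_output[out_size], which is one element past the end of the vector. And the size passed is an element count, whereas the sample loop earlier in this thread passes out_size * sizeof(float). The sketch below isolates the last two points. Note that copy_to_bytes is a hypothetical memcpy-based stand-in that only mirrors the (source, destination, size in bytes) argument order, so that the example runs without TVM; it is not the TVM function itself.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a device-to-host copy. It mirrors the argument
// order (source, destination, size in bytes) of TVMArrayCopyToBytes as used
// in this thread, but it is plain memcpy so the example runs without TVM.
void copy_to_bytes(const float* src, void* dst, std::size_t nbytes) {
  std::memcpy(dst, src, nbytes);
}

// Copies `out_size` floats from `result` into a freshly sized host vector.
std::vector<float> fetch_output(const float* result, std::size_t out_size) {
  std::vector<float> host(out_size, 0.0f);
  // The destination is the start of the vector (host.data()), not
  // &host[out_size], and the length is given in bytes, not in elements.
  copy_to_bytes(result, host.data(), out_size * sizeof(float));
  return host;
}
```

In the real code the corresponding call would presumably look like TVMArrayCopyToBytes(output, tvm_output.data(), out_size * sizeof(float)), placed after run() and get_output(0, output). Checking the integer return value of the TVMArray* calls is another way to see whether a copy actually succeeded.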
