Deploying an NNVM module from C++ on a GPU with the OpenCL target

Hi, I am following the C++ deployment instructions to deploy a compiled NNVM graph on my laptop GPU. When I set device_type to kDLOpenCL, I get a segfault after reading the input from a binary file into the DLTensor, as in that example. Here is the setup:

Compilation
-----------------
Development PC : x86_64 with NVIDIA 920M
TVM runtime is compiled with OpenCL and CUDA enabled.
NNVM graph is built with target = 'opencl', target_host = 'llvm'
---------------------------------------------------------------------------------
Deployment
----------------
  int dtype_code = kDLFloat;
  int dtype_bits = 32;
  int dtype_lanes = 1;
  int device_type = kDLOpenCL;
  int device_id = 0;

The line that causes the segfault is:
data_fin.read(static_cast<char*>(x->data), 3 * 224 * 224 * 4);

On the same PC, I compiled the graph for the CPU (target='llvm', target_host='llvm') and I am able to deploy the exported module using C++ with device_type = kDLCPU. The segfault occurs only when deploying on the GPU. Below is the log:

[15:01:55] src/runtime/opencl/opencl_device_api.cc:231: Multiple OpenCL platforms matched, use the first one ...
[15:01:55] src/runtime/opencl/opencl_device_api.cc:234: Initialize OpenCL platform 'NVIDIA CUDA'
[New Thread 0x7ffff3923700 (LWP 25361)]
[New Thread 0x7ffff3122700 (LWP 25362)]
[New Thread 0x7ffff2921700 (LWP 25363)]
[New Thread 0x7ffff2120700 (LWP 25364)]
[New Thread 0x7ffff191f700 (LWP 25365)]
[New Thread 0x7ffff111e700 (LWP 25366)]
[New Thread 0x7ffff091d700 (LWP 25367)]
[15:01:55] src/runtime/opencl/opencl_device_api.cc:259: opencl(0)='GeForce 920M' cl_device_id=0x6801a0

Thread 1 "ssd_nnvm_demo" received signal SIGSEGV, Segmentation fault.
0x000000000040acf5 in tvm::runtime::Module::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) ()

Is there a different way to read data from a binary file into an input DLTensor that is allocated on the GPU?

Thanks.

You should first read your data into a CPU tensor, and then copy the CPU tensor to the GPU tensor with TVMArrayCopyFromTo(...).
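
Roughly like this (a minimal sketch; the 1x3x224x224 float32 shape matches the setup above, while the helper name and file name are placeholders):

#include <dlpack/dlpack.h>
#include <tvm/runtime/c_runtime_api.h>
#include <fstream>

void load_input_to_gpu() {  // illustrative helper
  const int64_t in_shape[4] = {1, 3, 224, 224};
  DLTensor* cpu_x = nullptr;
  DLTensor* gpu_x = nullptr;

  // Staging tensor in host memory, input tensor on the OpenCL device.
  TVMArrayAlloc(in_shape, 4, kDLFloat, 32, 1, kDLCPU, 0, &cpu_x);
  TVMArrayAlloc(in_shape, 4, kDLFloat, 32, 1, kDLOpenCL, 0, &gpu_x);

  // Reading into the CPU tensor is safe: cpu_x->data is an ordinary host pointer.
  std::ifstream data_fin("input.bin", std::ios::binary);  // placeholder file name
  data_fin.read(static_cast<char*>(cpu_x->data), 3 * 224 * 224 * 4);

  // Copy host -> device; nullptr selects the default stream.
  TVMArrayCopyFromTo(cpu_x, gpu_x, nullptr);
}

gpu_x can then be passed to set_input as usual.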

@masahi Thanks a lot. Got it working!

Is there an example somewhere that shows how to create a GPU tensor, and how to set the stream in this case?

@masahi Hi, I am facing the same issue. Is there sample code showing how to transfer tensors between host and device?

Sure, I extracted a sample from my code and put it here.


@masahi Thanks for the sample.
In your sample code, is “tvm_input” the CPU byte array that gets copied to “x” (the GPU array)?
That is, is the signature TVMArrayCopyFromBytes(destination, source, size)?

for (int i = 0; i < n_samples; ++i) {
    TVMArrayCopyFromBytes(x, &tvm_input[i * in_size], in_size * sizeof(float));
    set_input(input_name.c_str(), x);
    run();
    get_output(0, y);
    TVMArrayCopyToBytes(y, &tvm_output[i * out_size], out_size * sizeof(float));
}

Yes, the source is on the CPU and x is on the GPU. In my code, tvm_input should contain the input data coming from your input image, for example.
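
For illustration, tvm_input is just a flat host buffer. A sketch of filling it from a decoded image (the HWC uint8 layout and 1/255 normalization are assumptions about your preprocessing, not part of my sample):

#include <cstdint>
#include <vector>

const int n_samples = 1;            // as in the loop above
const int in_size = 3 * 224 * 224;  // elements per sample (assumed shape)
std::vector<float> tvm_input(n_samples * in_size);

// `pixels` is assumed to hold a decoded 224x224 RGB image in HWC uint8 order.
void fill_sample(const uint8_t* pixels, int sample) {
  float* dst = &tvm_input[sample * in_size];
  for (int c = 0; c < 3; ++c)
    for (int i = 0; i < 224 * 224; ++i)
      dst[c * 224 * 224 + i] = pixels[i * 3 + c] / 255.0f;  // HWC -> CHW + normalize
}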

@masahi
Instead of tvm_input, I created a DLTensor ‘z’ with device type = kDLCPU for storing the input image, and copied the byte array as shown below.

https://gist.github.com/rajh619/74538a7b3a7e1b89a2ae89db5ab24054

But I couldn't find the copied bytes in the destination (x->data).

If you already have your data in a DLTensor, you should use TVMArrayCopyFromTo.

@masahi
OK, now I tried with TVMArrayCopyFromTo:
TVMArrayCopyFromTo(z, x, nullptr);
The same issue happens; I couldn't find the bytes copied to x->data.
I think x->data should be the same as z->data (the image data). Please correct me if I am wrong.

If x is on the GPU, you should NEVER touch x->data directly. You will either get a segfault or complete junk.
If you want to access x->data, copy x to the CPU first.
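
For example (a sketch of that round trip; x is the GPU tensor from your gist, the helper is illustrative):

#include <cstdio>
#include <dlpack/dlpack.h>
#include <tvm/runtime/c_runtime_api.h>

void inspect_gpu_tensor(DLTensor* x) {  // illustrative helper
  // Stage the device tensor through a CPU tensor of the same shape and dtype.
  DLTensor* x_cpu = nullptr;
  TVMArrayAlloc(x->shape, x->ndim, x->dtype.code, x->dtype.bits, x->dtype.lanes,
                kDLCPU, 0, &x_cpu);
  TVMArrayCopyFromTo(x, x_cpu, nullptr);  // device -> host

  // x_cpu->data is ordinary host memory, so it is safe to dereference.
  printf("first element: %f\n", static_cast<float*>(x_cpu->data)[0]);
  TVMArrayFree(x_cpu);
}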

@masahi Thanks, got it working. I copied back from GPU to CPU (x->data to k->data) and validated the data.
After executing run(), I was able to get the output to the CPU in two ways:

  1. Allocate the output tensor “y” with device type CPU (1), then call get_output(0, y); y->data contains the output. (I think TVM internally copies the output from the device to the CPU host?)
  2. Allocate the output tensor “y” with device type GPU (4), call get_output(0, y), then copy the bytes from the GPU to a CPU out_vector[] (similar to your sample code).

Out of the two, which is the right way to extract the output?

The answer is 2.

See my sample code. The output y is a GPU tensor. I copy y to tvm_output, which is on the CPU.
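
Spelled out (a condensed sketch of that pattern; the 1x1000 output shape is an assumption for illustration):

#include <dlpack/dlpack.h>
#include <tvm/runtime/c_runtime_api.h>
#include <tvm/runtime/packed_func.h>
#include <vector>

// get_output is the PackedFunc fetched from the graph runtime module.
std::vector<float> fetch_output(tvm::runtime::PackedFunc get_output) {
  // y lives on the device (kDLOpenCL, device type 4); get_output fills it there.
  const int64_t out_shape[2] = {1, 1000};  // assumed output shape
  DLTensor* y = nullptr;
  TVMArrayAlloc(out_shape, 2, kDLFloat, 32, 1, kDLOpenCL, 0, &y);
  get_output(0, y);

  // Copy device -> host; note the size argument is in bytes.
  const size_t out_size = 1 * 1000;
  std::vector<float> tvm_output(out_size);
  TVMArrayCopyToBytes(y, tvm_output.data(), out_size * sizeof(float));
  TVMArrayFree(y);
  return tvm_output;
}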


Hi @masahi, I am still getting a segfault even when I use your sample code. Also, after allocating memory I am doing a memset to 0:

DLTensor* x = nullptr;
DLTensor* y = nullptr;
const int in_ndim = 4;
const int out_ndim = in_ndim;
const int num_slice = 1;
const int num_class = 4;
const int shrink_size[] = { 256, 256 };
const int64_t in_shape[] = { num_slice, 1, shrink_size[0], shrink_size[1] };
const int64_t out_shape[] = { num_slice, num_class, shrink_size[0], shrink_size[1] };
TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &x);
TVMArrayAlloc(out_shape, out_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &y);

memset(x->data, 0, 4265265);

This happens for CUDA and OpenCL; for llvm it works fine.

You can't use memset on GPU memory.
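
memset writes through a raw host pointer, but on CUDA/OpenCL x->data is a device handle. A sketch of zeroing the tensor through the host instead (byte count derived from the 1x1x256x256 float32 shape above; the helper is illustrative):

#include <dlpack/dlpack.h>
#include <tvm/runtime/c_runtime_api.h>
#include <vector>

void zero_device_tensor(DLTensor* x) {  // illustrative helper
  // 1 * 1 * 256 * 256 float32 elements, as allocated above.
  const size_t n_elems = 1 * 1 * 256 * 256;

  // Zero a host buffer, then copy it into device memory.
  std::vector<float> zeros(n_elems, 0.0f);
  TVMArrayCopyFromBytes(x, zeros.data(), n_elems * sizeof(float));
}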

Hi @masahi, it is still not working; it still throws a memory error.

void FR_TVM_Deploy::forward(float* imgData)
{
  int in_size = (1 * 64 * 64 * 3 * 4);

  constexpr int dtype_code = kDLFloat;
  constexpr int dtype_bits = 32;
  constexpr int dtype_lanes = 1;
  constexpr int device_type = kDLCPU;
  constexpr int device_id = 0;
  constexpr int in_ndim = 4;
  const int64_t in_shape[in_ndim] = {1, 64, 64, 3};
  // Allocate memory for the input DLTensor
  TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &input);
  TVMArrayCopyFromBytes(input, imgData, in_size);
  // Get the graph runtime module from the handle
  tvm::runtime::Module* mod = (tvm::runtime::Module*)handle;
  // Get the function from the module (set input data)
  tvm::runtime::PackedFunc set_input = mod->GetFunction("set_input");
  set_input("input", input);
  // Get the function from the module (run it)
  tvm::runtime::PackedFunc run = mod->GetFunction("run");
  run();

  int out_ndim = 2;
  int64_t out_shape[2] = {1, 256};
  TVMArrayAlloc(out_shape, out_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &output);
  // Get the function from the module (get output data)
  tvm::runtime::PackedFunc get_output = mod->GetFunction("get_output");
  get_output(0, output);
  size_t out_size = out_shape[0] * out_shape[1];
  std::vector<float> tvm_output(out_size, 0);
  TVMArrayCopyToBytes(output, &tvm_output[out_size], out_size);

  TVMArrayFree(input);
  TVMArrayFree(output);
}

When I print the tvm_output vector, I get all 0's, i.e. the output is all zeros; in the llvm case I get correct output. Here I am printing the vector tvm_output in a loop. Is there any other way to check the output?