Vulkan Deploy issue with Latest Code

dayanandasiet · August 8, 2018, 1:30pm

I am using the script below to test the CPU, OpenCL, and Vulkan flavors on an Android phone. The CPU and OpenCL versions produce correct results, but the Vulkan version returns wrong results. I see the same behavior with the Darknet and Keras frontends.

import mxnet as mx
import nnvm
import tvm
import numpy as np

######################################################################
# Download Resnet18 model from Gluon Model Zoo
# ---------------------------------------------
# In this section, we download a pretrained imagenet model and classify an image.
from mxnet.gluon.model_zoo.vision import get_model
from mxnet.gluon.utils import download
from PIL import Image
from matplotlib import pyplot as plt
block = get_model('resnet18_v1', pretrained=True)
img_name = 'cat.jpg'
synset_url = ''.join(['https://gist.githubusercontent.com/zhreshold/',
                  '4d0b62f3d01426887599d4f7ede23ee5/raw/',
                  '596b27d23537e5a1b5751d2b0481ef172f58b539/',
                  'imagenet1000_clsid_to_human.txt'])
synset_name = 'synset.txt'
download('https://github.com/dmlc/mxnet.js/blob/master/data/cat.png?raw=true', img_name)
download(synset_url, synset_name)
with open(synset_name) as f:
    synset = eval(f.read())
image = Image.open(img_name).resize((224, 224))


def transform_image(image):
    image = np.array(image) - np.array([123., 117., 104.])
    image /= np.array([58.395, 57.12, 57.375])
    image = image.transpose((2, 0, 1))
    image = image[np.newaxis, :]
    return image

x = transform_image(image)
print('x', x.shape)

######################################################################
# Set Remote connection with Android Phone to run Model on ARM(CPU/OpenCL/Vulkan)
# ---------------------------------
import os
from tvm.contrib import graph_runtime, ndk, util
from tvm import rpc

exec_flavor = 'vulkan'

# Set to be address of tvm proxy.
tracker_host = os.environ["TVM_TRACKER_HOST"]
tracker_port = int(os.environ["TVM_TRACKER_PORT"])
key = "android"

# Change target configuration.
# Run `adb shell cat /proc/cpuinfo` to find the arch.
arch = "arm64"

# connect to the proxy
tracker = rpc.connect_tracker(tracker_host, tracker_port)
remote = tracker.request(key, priority=0,
                     session_timeout=60)

if exec_flavor == 'cpu':
    # Mobile CPU
    target = 'llvm -target=%s-linux-android' % arch
    target_host = None
    ctx = remote.cpu(0)
elif exec_flavor == 'opencl':
    # Mobile GPU
    target = 'opencl'
    target_host = 'llvm -target=%s-linux-android' % arch
    ctx = remote.cl(0)
elif exec_flavor == 'vulkan':
    # Mobile GPU
    target = 'vulkan'
    target_host = 'llvm -target=%s-linux-android' % arch
    ctx = remote.vulkan(0)

######################################################################
# Compile the Graph
# -----------------
# Now we would like to port the Gluon model to a portable computational graph.
# It's as easy as several lines.
# We support MXNet static graph(symbol) and HybridBlock in mxnet.gluon
shape_dict = {'data': x.shape}
#with nnvm.compiler.build_config(opt_level=3):
sym, params = nnvm.frontend.from_mxnet(block)
sym = nnvm.sym.softmax(sym)

######################################################################
# now compile the graph
import nnvm.compiler
graph, lib, params = nnvm.compiler.build(sym, target, shape_dict, params=params, target_host=target_host)

######################################################################
# Execute the portable graph on TVM
# ---------------------------------
temp = util.tempdir()
path_dso = temp.relpath('dev_lib.so')
lib.export_library(path_dso, ndk.create_shared)
remote.upload(path_dso)
rlib = remote.load_module('dev_lib.so')

rmodule = graph_runtime.create(graph, rlib, ctx)
# set inputs
dtype = 'float32'  # not defined in the posted script; float32 is assumed here
rmodule.set_input('data', tvm.nd.array(x.astype(dtype), ctx=ctx))
rmodule.set_input(**params)
rmodule.run()
# get outputs
out = rmodule.get_output(0, tvm.nd.empty((1000,), dtype=dtype, ctx=ctx))
top1 = np.argmax(out.asnumpy())
print('TVM prediction ARM top-1:', top1, synset[top1])

This problem does not appear in operator verification; I only observe it with commercially trained models.

dayanandasiet · August 9, 2018, 7:49am

@yzhliu @eqy

Have there been any recent changes to the Vulkan copy or handle code?

eqy · August 9, 2018, 6:42pm

Can you reproduce this with models that are in NNVM, e.g., ResNet-18?

dayanandasiet · August 9, 2018, 8:26pm

I extended from_mxnet.py into the script shown in the issue description to test the Vulkan target on a Nexus 6P Android phone. The same script works fine on the CPU target, but on the Vulkan target the result is inconsistent with the CPU output. I also ran the same script on a Mate 9 Android phone with the OpenCL and CPU targets, and in that case the outputs match. I did not observe this issue when I pushed my first commit on PR 1488.
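
To make "inconsistent with the CPU output" concrete, a comparison along the following lines may help. This is only a minimal sketch: `compare_outputs` and the tolerance `atol` are hypothetical names and values that are not part of the script above, and both inputs are assumed to be flat, non-empty score vectors of equal length.

```python
import math


def compare_outputs(reference, candidate, atol=1e-4):
    """Compare two flat score vectors, e.g. CPU output vs. Vulkan output.

    `atol` is an assumed tolerance, not a value taken from the thread.
    """
    ref = [float(v) for v in reference]
    cand = [float(v) for v in candidate]
    if len(ref) != len(cand):
        raise ValueError("length mismatch: %d vs %d" % (len(ref), len(cand)))

    # NaNs in the candidate are reported separately because they make
    # ordinary comparisons unreliable.
    has_nan = any(math.isnan(v) for v in cand)
    max_abs_diff = max(abs(r - c) for r, c in zip(ref, cand))

    # Index of the highest score on each side (the top-1 class).
    top1_ref = max(range(len(ref)), key=ref.__getitem__)
    top1_cand = max(range(len(cand)), key=cand.__getitem__)

    return {
        "max_abs_diff": max_abs_diff,
        "top1_match": top1_ref == top1_cand,
        "has_nan": has_nan,
        "within_tol": (not has_nan) and max_abs_diff <= atol,
    }
```

Assuming the script is run once with `exec_flavor = 'cpu'` and once with `exec_flavor = 'vulkan'`, and the two results are saved as `cpu_out` and `vulkan_out` (hypothetical names), the call would look like `compare_outputs(cpu_out.asnumpy().ravel(), vulkan_out.asnumpy().ravel())`.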

dayanandasiet · August 14, 2018, 9:11am

@eqy

After debugging the model layer by layer, I found that the Vulkan target output becomes wrong after 5-6 convolution layers. Occasionally the output is correct, but most runs fail. I am using a Nexus 6P to test the Vulkan target. Any suggestions or directions for investigating this further?
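
The layer-wise check described above could be automated roughly as follows. This is a sketch under assumptions: it presumes the per-layer outputs of both targets have already been dumped as `(name, flat values)` pairs in execution order (how they are dumped is not shown here), and `first_divergent_layer` and `atol` are hypothetical names and values.

```python
import math


def first_divergent_layer(reference_layers, candidate_layers, atol=1e-4):
    """Return (layer_name, worst_abs_diff) for the first layer whose
    candidate output differs from the reference by more than `atol`,
    or None if all compared layers agree.

    Both arguments are lists of (name, flat_values) in execution order.
    """
    for (ref_name, ref_vals), (cand_name, cand_vals) in zip(
            reference_layers, candidate_layers):
        if ref_name != cand_name:
            raise ValueError("layer order mismatch: %s vs %s"
                             % (ref_name, cand_name))
        diffs = [abs(float(r) - float(c))
                 for r, c in zip(ref_vals, cand_vals)]
        # Treat any NaN as an unbounded difference so it is never missed.
        if any(math.isnan(d) for d in diffs):
            worst = float("inf")
        else:
            worst = max(diffs)
        if worst > atol:
            return ref_name, worst
    return None
```

Because the failure is intermittent, running this over several repeated inferences and recording which layer is reported each time may show whether the divergence always starts at the same layer.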

eqy · August 14, 2018, 6:58pm

Thanks for looking into this; we currently have a few Vulkan phones (Pixel 2) that we will use to try to reproduce this issue.

headupinclouds · January 6, 2019, 11:09pm

I’m seeing the same thing on a Samsung Galaxy S7 using the from_mxnet.py tutorial script running the pre-trained resnet 18 + cat example. The CPU build works fine. I will try to troubleshoot this, but I’m curious if there are any updates on Vulkan + Android. The same example runs fine on a local Ubuntu 18.04 host.

tvm commit: https://github.com/dmlc/tvm/commit/0f053c82a747b4dcdf49570ec87c17e0067b7439

Relates to: "Huawei P20 Pro Vulkan output is not proper"

[EDIT: Working example added to reproduce the issue in github repo here]

tqchen · December 30, 2018, 10:07pm

Just opened https://github.com/dmlc/tvm/issues/2355 to track this