No speed increase when converting model to TVM

cyrusbehr · September 9, 2019, 12:41am

I am trying to convert an MXNet model to TVM in order to improve the inference speed. The conversion succeeds, but I do not see the speed improvements advertised on this page.

I have followed the tutorial here, but I will go through the steps I took.

I first downloaded the Insightface model LResNet100E-IR,ArcFace@ms1m-refine-v2 which can be found here. Note that I am using the same model from the TVM benchmark.

Next, I use the following Python script to convert the model into TVM-compatible files.
Note that when I run the command llc --version, I get the following output (which is why I set the target to skylake):

LLVM (http://llvm.org/):
  LLVM version 6.0.0
  
  Optimized build.
  Default target: x86_64-pc-linux-gnu
  Host CPU: skylake

Python conversion script

from tvm.contrib import graph_runtime
import mxnet as mx
from mxnet import ndarray as nd
import nnvm.compiler
import nnvm.testing
import tvm

prefix,epoch = "/home/nchafni/Cyrus/models/faceDetection/Insightface/model-r100-ii/model",0
sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
opt_level = 3

shape_dict = {'data': (1, 3, 112, 112)}
target = tvm.target.create("llvm -mcpu=skylake")
#target = tvm.target.intel_graphics()
nnvm_sym, nnvm_params = nnvm.frontend.from_mxnet(sym, arg_params, aux_params)
with nnvm.compiler.build_config(opt_level=opt_level):
   graph, lib, params = nnvm.compiler.build(nnvm_sym, target, shape_dict, params=nnvm_params)
lib.export_library("./deploy_lib.so")
print('lib export succeefully')
with open("./deploy_graph.json", "w") as fo:
   fo.write(graph.json())
with open("./deploy_param.params", "wb") as fo:
   fo.write(nnvm.compiler.save_param_dict(params))

When I run the script, I get the following warning messages:

Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 3, 112, 112, 'float32'), (64, 3, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 64, 112, 112, 'float32'), (64, 64, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 64, 112, 112, 'float32'), (64, 64, 3, 3, 'float32'), (2, 2), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 64, 112, 112, 'float32'), (64, 64, 1, 1, 'float32'), (2, 2), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 64, 56, 56, 'float32'), (128, 64, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 128, 56, 56, 'float32'), (128, 128, 3, 3, 'float32'), (2, 2), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 128, 28, 28, 'float32'), (256, 128, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 256, 28, 28, 'float32'), (256, 256, 3, 3, 'float32'), (2, 2), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 256, 14, 14, 'float32'), (512, 256, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -mcpu=skylake, workload=('conv2d', (1, 512, 14, 14, 'float32'), (512, 512, 3, 3, 'float32'), (2, 2), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=llvm -mcpu=skylake, workload=('dense', (1, 25088, 'float32'), (512, 25088, 'float32'), (512, 'float32'), 0). A fallback configuration is used, which may bring great performance regression.
lib export succeefully

but it ultimately exports the models successfully.
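To see at a glance which operators fell back to default schedules, the warning text can be summarized. This is only a sketch: it assumes the warnings look exactly like the lines pasted above, and the names `WORKLOAD_RE` and `missing_workloads` are made up for this example rather than being part of TVM.

```python
import re
from collections import Counter

# Matches the target and workload tuple inside a fallback warning line.
WORKLOAD_RE = re.compile(
    r"Cannot find config for target=(?P<target>.+?), "
    r"workload=(?P<workload>\(.*\))\. A fallback"
)


def missing_workloads(log_text):
    """Count fallback warnings per (target, operator name)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = WORKLOAD_RE.search(line)
        if match is None:
            continue
        # The operator name is the first quoted element of the workload tuple.
        op_name = match.group("workload").split("'")[1]
        counts[(match.group("target"), op_name)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Cannot find config for target=llvm -mcpu=skylake, "
        "workload=('conv2d', (1, 3, 112, 112, 'float32'), "
        "(64, 3, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'NCHW', 'float32'). "
        "A fallback configuration is used, which may bring great performance regression."
    )
    for (target, op_name), n in missing_workloads(sample).items():
        print(target, op_name, n)
```

Every operator that shows up here was compiled without a tuned configuration, which is what the warning means by a fallback configuration.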

Next, I import the converted models and deploy_lib.so into my C++ project, using the following code. The majority of the code is taken from the example on this page.

#include <chrono>
#include <cstring>  // memcpy
#include <iostream>
#include <fstream>
#include "opencv2/opencv.hpp"
#include "tvm/runtime/module.h"
#include "tvm/runtime/registry.h"
#include "tvm/runtime/packed_func.h"

typedef std::chrono::high_resolution_clock Clock;

class FR_MFN_Deploy{
private:
    void * handle;

public:
    FR_MFN_Deploy(std::string modelFolder)
    {
        tvm::runtime::Module mod_syslib = tvm::runtime::Module::LoadFromFile("/home/nchafni/Cyrus/tvm_test/lib/deploy_lib.so");
        //load graph
        std::string modelPath = modelFolder + "/deploy_graph.json";
        std::ifstream json_in(modelPath);
        std::string json_data((std::istreambuf_iterator<char>(json_in)), std::istreambuf_iterator<char>());
        json_in.close();
        int device_type = kDLCPU;
        int device_id = 0;
        // get global function module for graph runtime
        tvm::runtime::Module mod = (*tvm::runtime::Registry::Get("tvm.graph_runtime.create"))(json_data, mod_syslib, device_type, device_id);
        this->handle = new tvm::runtime::Module(mod);
        //load param
        std::ifstream params_in(modelFolder + "/deploy_param.params", std::ios::binary);
        std::string params_data((std::istreambuf_iterator<char>(params_in)), std::istreambuf_iterator<char>());
        params_in.close();
        TVMByteArray params_arr;
        params_arr.data = params_data.c_str();
        params_arr.size = params_data.length();
        tvm::runtime::PackedFunc load_params = mod.GetFunction("load_params");
        load_params(params_arr);
    }


    cv::Mat forward(cv::Mat inputImageAligned)
    {
        //mobilefacnet preprocess has been written in graph.
        cv::Mat tensor = cv::dnn::blobFromImage(inputImageAligned,1.0,cv::Size(112,112),cv::Scalar(0,0,0),true);
        //convert uint8 to float32 and convert to RGB via opencv dnn function
        DLTensor* input;
        constexpr int dtype_code = kDLFloat;
        constexpr int dtype_bits = 32;
        constexpr int dtype_lanes = 1;
        constexpr int device_type = kDLCPU;
        constexpr int device_id = 0;
        constexpr int in_ndim = 4;
        const int64_t in_shape[in_ndim] = {1, 3, 112, 112};
        TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &input);//
        TVMArrayCopyFromBytes(input,tensor.data,112*3*112*4);
        tvm::runtime::Module* mod = (tvm::runtime::Module*)handle;
        tvm::runtime::PackedFunc set_input = mod->GetFunction("set_input");
        set_input("data", input);
        tvm::runtime::PackedFunc run = mod->GetFunction("run");
        run();
        tvm::runtime::PackedFunc get_output = mod->GetFunction("get_output");
        tvm::runtime::NDArray res = get_output(0);
        cv::Mat vector(512,1,CV_32F);
        memcpy(vector.data,res->data,512*4);
        cv::Mat _l2;
        cv::multiply(vector,vector,_l2);
        float l2 =  cv::sqrt(cv::sum(_l2).val[0]);
        vector = vector / l2;
        TVMArrayFree(input);
        return vector;
    }

};

inline float CosineDistance(const cv::Mat &v1,const cv::Mat &v2){
    return static_cast<float>(v1.dot(v2));
}


cv::Mat getTemplate(const std::string& imagePath, FR_MFN_Deploy& deploy) {
    cv::Mat data = cv::imread(imagePath);
    auto time_1 = Clock::now();
    cv::Mat out = deploy.forward(data);
    auto time_2 = Clock::now();
    std::cout << std::to_string(std::chrono::duration_cast<std::chrono::milliseconds>(time_2 - time_1).count()) << std::endl;
    return out;
}


int main() {
    std::cout << "Loading the model" << std::endl;
    FR_MFN_Deploy deploy("../models");
    std::cout << "Loaded model" << std::endl;

    // Different People
//    std::vector<std::string> imagePaths = {
//            "../images/chip17.jpg",
//            "../images/chip18.jpg",
//            "../images/chip19.jpg",
//            "../images/chip20.jpg",
//            "../images/chip21.jpg",
//            "../images/chip22.jpg",
//            "../images/chip23.jpg",
//    };

// Same person
    std::vector<std::string> imagePaths = {
            "../images/chip1.jpg",
            "../images/chip2.jpg",
            "../images/chip3.jpg",
            "../images/chip4.jpg",
            "../images/chip5.jpg",
            "../images/chip6.jpg",
            "../images/chip7.jpg",
            "../images/chip8.jpg",
            "../images/chip9.jpg",
            "../images/chip10.jpg",
            "../images/chip11.jpg",
            "../images/chip12.jpg",
            "../images/chip13.jpg",
            "../images/chip14.jpg",
            "../images/chip15.jpg",
            "../images/chip16.jpg",
    };

    std::vector<cv::Mat> res;
    std::vector<float> scoresVec;

    for (const auto& path: imagePaths) {
        res.emplace_back(getTemplate(path, deploy));
    }

    for (size_t i = 0; i < res.size(); i++) {
        for (size_t k = i + 1; k < res.size(); k++) {
            auto score = CosineDistance(res[i],res[k]);
            if (score < 0) {
                score = 0;
            }
            scoresVec.emplace_back(score);
        }
    }

    double total = 0;
    for (int i = 0; i < scoresVec.size(); ++i) {
        total +=  scoresVec[i];
        std::cout << scoresVec[i] << std::endl;
    }

    std::cout << "Total score: " << total << "\n";

    return 0;
}

Note that the images I am providing are pre-aligned and cropped to 112x112.

On average, the inference takes 360ms, which is roughly the same time it takes to perform inference using MXNET (C++, MKLDNN). I was expecting to see a significant decrease in inference time.
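One thing worth ruling out is the measurement itself: the C++ loop above times each forward() call, including the blobFromImage preprocessing and the very first (cold) call. A minimal sketch of a timing helper with warm-up runs is below. It is written in Python for brevity, `benchmark` is a hypothetical helper rather than a TVM function, and whether it changes the 360ms figure is unknown.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=20):
    """Return (mean_ms, stdev_ms) for fn(), excluding the warm-up calls."""
    if runs < 2:
        raise ValueError("need at least 2 timed runs to compute a std dev")
    for _ in range(warmup):
        fn()  # untimed: lets caches, thread pools and lazy init settle
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    # Stand-in workload; replace with a call that runs only the model.
    mean_ms, stdev_ms = benchmark(lambda: sum(range(100000)))
    print("mean %.3f ms (std dev %.3f ms)" % (mean_ms, stdev_ms))
```

Timing only the model call, after a few untimed runs, makes the number directly comparable to the MXNet measurement taken the same way.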

Could the issue be related to the warnings during the conversion? I followed the conversion tutorial exactly, and it did not mention needing to fine-tune the model or anything.

Here is the output of cat /proc/cpuinfo, to show what hardware I am running the benchmark on:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 158
model name	: Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz
stepping	: 9
microcode	: 0xb4
cpu MHz		: 1407.008
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips	: 5424.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

kevinthesun · September 9, 2019, 1:45am

Have you tried autotuning your model?

cyrusbehr · September 9, 2019, 4:01am

I have not tried autotuning it.
How can I modify the tuning script to work with the 100-layer ResNet, which can be found here?

Thank you for the help

comaniac · September 9, 2019, 4:18am

Two suggestions:

1. Use the target "llvm -mcpu=skylake-avx512".
2. Following @kevinthesun’s suggestion, you could modify the line mod, params, data_shape, out_shape = get_network(model_name, batch_size) in the script to create tuning tasks for your model. You may want to change n_trial=len(task.config_space) to a smaller value first, so that you get a result sooner and can see whether tuning is effective.

cyrusbehr · September 9, 2019, 11:32pm

Do you have example code for loading and tuning a ResNet? The sample code uses a pretrained network, but I’m not sure how to load my own network from a path. Is there an example anywhere? (Sorry, I am a beginner.)

That is, I have my model.params and model.json (100-layer ResNet) and would like to load and tune them.

In the sample code they are using mod, params = relay.testing.resnet.get_workload(num_layers=n_layer, batch_size=batch_size, dtype=dtype)
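For a checkpoint on disk, one possible approach is to go through the Relay MXNet frontend instead of relay.testing. The sketch below rests on assumptions: it needs a TVM build whose relay.frontend.from_mxnet returns a (module, params) pair (older builds may return a function instead of a module), the input is assumed to be named 'data' as in the conversion script above, and `make_shape_dict` and `load_relay_module` are hypothetical helper names.

```python
def make_shape_dict(batch_size, input_name="data", chw=(3, 112, 112)):
    """Build the input-name -> shape mapping that the frontend expects."""
    return {input_name: (batch_size,) + tuple(chw)}


def load_relay_module(prefix, epoch, batch_size=1):
    """Load <prefix>-symbol.json / <prefix>-<epoch>.params and convert to Relay.

    Assumption: relay.frontend.from_mxnet returns (mod, params) in the
    installed TVM version.
    """
    # Local imports so make_shape_dict stays usable without MXNet/TVM installed.
    import mxnet as mx
    from tvm import relay

    sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
    shape_dict = make_shape_dict(batch_size)
    mod, params = relay.frontend.from_mxnet(
        sym, shape=shape_dict, arg_params=arg_params, aux_params=aux_params)
    return mod, params, shape_dict["data"]


if __name__ == "__main__":
    print(make_shape_dict(1))
```

If this works on your build, the returned mod and params would take the place of the relay.testing.resnet.get_workload result in the sample code.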

cyrusbehr · September 9, 2019, 11:57pm

Additionally, for the 100-layer ResNet linked above, do you know what "main" should be set to in the following lines?
net = mod["main"]
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params, ops=(relay.op.nn.conv2d,))
tune_graph(mod["main"], data_shape, log_file, graph_opt_sch_file)

cyrusbehr · September 10, 2019, 12:00am

I am using the following function to load the symbol and params. I am not sure if this is correct. Once again, I apologize for my limited experience.

def generate_graph(graph_fn, params_fn):
    input_shape = (1, 3, 112, 112)
    output_shape = (1, 512)
    # Derive the TVM target
    prefix,epoch = "/home/nchafni/Cyrus/models/faceDetection/Insightface/model-r100-ii/model",0
    sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
    opt_level = 3

    shape_dict = {'data': (1, 3, 112, 112)}
    target = tvm.target.create("llvm -mcpu=skylake")
    #target = tvm.target.intel_graphics()
    nnvm_sym, nnvm_params = nnvm.frontend.from_mxnet(sym, arg_params, aux_params)
    with nnvm.compiler.build_config(opt_level=opt_level):
       graph, lib, params = nnvm.compiler.build(nnvm_sym, target, shape_dict, params=nnvm_params)

    return nnvm_sym, nnvm_params, input_shape, output_shape

comaniac · September 10, 2019, 12:00am

I do agree that the tutorial should be refined and I’ve filed an issue for it.

Meanwhile, here is a code snippet you may refer to for creating a tunable Relay module:

w = relay.var("w", shape=(512, 512), dtype='float32')
x = relay.var("x", shape=(512, 512), dtype='float32')
net = relay.nn.dense(x, w)
module = relay.Module.from_expr(net)
module = relay.transform.InferType()(module)
params = {}  # This dense example doesn't have parameters.
tasks = autotvm.task.extract_from_program(module['main'], target='llvm -mcpu=skylake-avx512',
                                          params=params, ops=(relay.op.nn.dense, ))
tune_kernels(tasks, **tuning_options)

comaniac · September 10, 2019, 12:06am

LGTM. Could you launch AutoTVM before building the module (before the with statement) and see if it works?

cyrusbehr · September 11, 2019, 1:21am

Even after using the following script to autotune the model, I still see no improvement in performance. Any suggestions?

import os
import numpy as np

import nnvm.testing
import nnvm.compiler
import tvm
import mxnet as mx
from tvm import autotvm
import tvm.relay as relay
from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
import tvm.contrib.graph_runtime as runtime


def get_network(name, batch_size):
    prefix, epoch = "/home/models/faceDetection/model", 0
    sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
    opt_level = 3
    shape_dict = {'data': (1, 3, 112, 112)}
    nnvm_sym, nnvm_params = nnvm.frontend.from_mxnet(sym, arg_params, aux_params)
    input_shape = (batch_size, 3, 112, 112)
    output_shape = (batch_size, 512)
    return nnvm_sym, nnvm_params, input_shape, output_shape

target = "llvm -mcpu=skylake"

batch_size = 1
dtype = "float32"
model_name = "resnet-18"
log_file = "%s.log" % model_name

num_threads = 1
os.environ["TVM_NUM_THREADS"] = str(num_threads)

tuning_option = {
    'log_filename': log_file,
    'tuner': 'random',
    'early_stopping': None,

    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(number=10, repeat=1,
                                   min_repeat_ms=1000),
    ),
}

# You can skip the implementation of this function for this tutorial.
def tune_kernels(tasks,
                 measure_option,
                 tuner='gridsearch',
                 early_stopping=None,
                 log_filename='tuning.log'):

    for i, tsk in enumerate(tasks):
        prefix = "[Task %2d/%2d] " % (i+1, len(tasks))

        # converting conv2d tasks to conv2d_NCHWc tasks
        op_name = tsk.workload[0]
        if op_name == 'conv2d':
            func_create = 'topi_x86_conv2d_NCHWc'
        elif op_name == 'depthwise_conv2d_nchw':
            func_create = 'topi_x86_depthwise_conv2d_NCHWc_from_nchw'
        else:
            raise ValueError("Tuning {} is not supported on x86".format(op_name))

        task = autotvm.task.create(func_create, args=tsk.args,
                                   target=target, template_key='direct')
        task.workload = tsk.workload

        # create tuner
        if tuner == 'xgb' or tuner == 'xgb-rank':
            tuner_obj = XGBTuner(task, loss_type='rank')
        elif tuner == 'ga':
            tuner_obj = GATuner(task, pop_size=50)
        elif tuner == 'random':
            tuner_obj = RandomTuner(task)
        elif tuner == 'gridsearch':
            tuner_obj = GridSearchTuner(task)
        else:
            raise ValueError("Invalid tuner: " + tuner)

        # do tuning
        n_trial = 50  # was: len(task.config_space)

        tuner_obj.tune(n_trial=n_trial,
                       early_stopping=early_stopping,
                       measure_option=measure_option,
                       callbacks=[
                           autotvm.callback.progress_bar(n_trial, prefix=prefix),
                           autotvm.callback.log_to_file(log_filename)])


########################################################################
# Finally, we launch tuning jobs and evaluate the end-to-end performance.

def tune_and_evaluate(tuning_opt):
    # extract workloads from nnvm graph
    print("Extract tasks...")
    net, params, data_shape, out_shape = get_network(model_name, batch_size)
    tasks = autotvm.task.extract_from_graph(net, target=target,
                                            shape={'data': data_shape}, dtype=dtype,
                                            symbols=(nnvm.sym.conv2d,))

    # run tuning tasks
    print("Tuning...")
    tune_kernels(tasks, **tuning_opt)

    # compile kernels with history best records
    with autotvm.apply_history_best(log_file):
        print("Compile...")
        with nnvm.compiler.build_config(opt_level=3):
            graph, lib, params = nnvm.compiler.build(
                net, target=target, shape={'data': data_shape}, params=params, dtype=dtype)

        # upload parameters to device
        ctx = tvm.cpu()
        data_tvm = tvm.nd.array((np.random.uniform(size=data_shape)).astype(dtype))
        module = runtime.create(graph, lib, ctx)
        module.set_input('data', data_tvm)
        module.set_input(**params)

        # evaluate
        print("Evaluate inference time cost...")
        ftimer = module.module.time_evaluator("run", ctx, number=100, repeat=3)
        prof_res = np.array(ftimer().results) * 1000  # convert to millisecond
        print("Mean inference time (std dev): %.2f ms (%.2f ms)" %
              (np.mean(prof_res), np.std(prof_res)))
		
        lib.export_library("./deploy_tuned_lib.so")
        print('lib export succeefully')
        with open("./deploy_tuned_graph.json", "w") as fo:
            fo.write(graph.json())
        with open("./deploy_tuned_param.params", "wb") as fo:
            fo.write(nnvm.compiler.save_param_dict(params))

# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.

tune_and_evaluate(tuning_option)

######################################################################
# Sample Output
# -------------
# The tuning needs to compile many programs and extract feature from them.
# So a high performance CPU is recommended.
# One sample output is listed below.
#
# .. code-block:: bash
#
#    Extract tasks...
#    Tuning...
#    [Task  1/12]  Current/Best:  598.05/2497.63 GFLOPS | Progress: (252/252) | 1357.95 s Done.
#    [Task  2/12]  Current/Best:  522.63/2279.24 GFLOPS | Progress: (784/784) | 3989.60 s Done.
#    [Task  3/12]  Current/Best:  447.33/1927.69 GFLOPS | Progress: (784/784) | 3869.14 s Done.
#    [Task  4/12]  Current/Best:  481.11/1912.34 GFLOPS | Progress: (672/672) | 3274.25 s Done.
#    [Task  5/12]  Current/Best:  414.09/1598.45 GFLOPS | Progress: (672/672) | 2720.78 s Done.
#    [Task  6/12]  Current/Best:  508.96/2273.20 GFLOPS | Progress: (768/768) | 3718.75 s Done.
#    [Task  7/12]  Current/Best:  469.14/1955.79 GFLOPS | Progress: (576/576) | 2665.67 s Done.
#    [Task  8/12]  Current/Best:  230.91/1658.97 GFLOPS | Progress: (576/576) | 2435.01 s Done.
#    [Task  9/12]  Current/Best:  487.75/2295.19 GFLOPS | Progress: (648/648) | 3009.95 s Done.
#    [Task 10/12]  Current/Best:  182.33/1734.45 GFLOPS | Progress: (360/360) | 1755.06 s Done.
#    [Task 11/12]  Current/Best:  372.18/1745.15 GFLOPS | Progress: (360/360) | 1684.50 s Done.
#    [Task 12/12]  Current/Best:  215.34/2271.11 GFLOPS | Progress: (400/400) | 2128.74 s Done.
#    Compile...
#    Evaluate inference time cost...
#    Mean inference time (std dev): 3.16 ms (0.03 ms)

comaniac · September 11, 2019, 8:07pm

What performance do you expect to achieve, or what’s the reference performance of your model?
Do you have a tuning log like the sample output?
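If the console output was saved, the per-task progress lines can be condensed into a short table. The following is a sketch that assumes the lines follow the layout of the sample output quoted in the script above ("[Task 1/12] Current/Best: ... GFLOPS | Progress: (252/252) | ... s"); `PROGRESS_RE` and `summarize_progress` are illustrative names, not AutoTVM APIs.

```python
import re

# Layout assumed from the sample output in the tuning script.
PROGRESS_RE = re.compile(
    r"\[Task\s+(?P<task>\d+)/\s*(?P<total>\d+)\]\s+"
    r"Current/Best:\s+(?P<current>[\d.]+)/\s*(?P<best>[\d.]+)\s+GFLOPS\s+\|\s+"
    r"Progress:\s+\((?P<done>\d+)/(?P<planned>\d+)\)\s+\|\s+(?P<seconds>[\d.]+)\s+s"
)


def summarize_progress(console_text):
    """Return one dict per progress line: task index, best GFLOPS, trials, time."""
    rows = []
    for line in console_text.splitlines():
        match = PROGRESS_RE.search(line)
        if match is None:
            continue
        rows.append({
            "task": int(match.group("task")),
            "best_gflops": float(match.group("best")),
            "trials": int(match.group("done")),
            "seconds": float(match.group("seconds")),
        })
    return rows


if __name__ == "__main__":
    sample = ("[Task  1/12]  Current/Best:  598.05/2497.63 GFLOPS | "
              "Progress: (252/252) | 1357.95 s Done.")
    for row in summarize_progress(sample):
        print(row)
```

A task whose trial count is far below the size of its search space, or whose best GFLOPS stays low, may simply not have been tuned long enough to beat the fallback schedule.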

lufi1 · September 13, 2019, 7:59pm

Hi Cyrus,

Did you manage to solve your issue?

Hi Comaniac, I followed similar steps to Cyrus but do not see any performance increase.

To answer one of your questions, I would expect to achieve a result similar to the one in this paper: https://arxiv.org/pdf/1905.00641.pdf (4.8. Inference Efficiency, Table 5).

With the same CPU as in the paper, I can only get 89 ms for a frame in HD (1080x720), whereas the paper achieves 50.3 ms for a frame in Full HD (which is more challenging and should take much longer).

Thanks
Lu

kevinthesun · September 14, 2019, 6:28am

If avx512 is available, it should be “llvm -mcpu=skylake-avx512”.
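Whether AVX-512 is available can be read from the flags line of /proc/cpuinfo; on Linux, the foundation instructions show up as the avx512f flag. The flags line posted at the top of this thread lists avx2 but no avx512 entries. The sketch below only chooses between the two -mcpu values mentioned in this thread, and `cpu_flags` and `suggest_mcpu` are hypothetical helper names, not part of TVM or LLVM.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def suggest_mcpu(flags):
    """Pick between the two -mcpu values discussed in this thread."""
    return "skylake-avx512" if "avx512f" in flags else "skylake"


if __name__ == "__main__":
    # Linux only: /proc/cpuinfo does not exist on other platforms.
    with open("/proc/cpuinfo") as f:
        flags = cpu_flags(f.read())
    print("llvm -mcpu=" + suggest_mcpu(flags))
```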