Model performance is much slower on Android device using Java API, compared to the RPC time evaluator

I’m investigating a performance issue, so I want to open this thread to record my investigation results as well as to discuss them with the TVM experts here.

During auto-tuning, I observe excellent performance on my Android device, e.g. 8ms inference latency once it stabilizes (the first several runs are a little slow).

But when I use the Java API to load and run the model following the Android Deploy example, performance becomes much worse, e.g. 50ms inference latency. Even when I repeatedly run inference with the same input, as below, the latency is still not ideal (about 25-30ms):

    for (int j = 0; j < 1000; j++) {
        runFunc.invoke();
    }

Currently I have 2 questions:

  1. In the current Android deploy application (MainActivity.java), the following resources are created and released in every ModelRunAsyncTask. Can they be cached somewhere and reused across image frames? (See the sketch after these two questions for what I have in mind.)
        NDArray inputNdArray = NDArray.empty(new long[]{1, IMG_CHANNEL, MODEL_INPUT_SIZE, MODEL_INPUT_SIZE}, new TVMType("float32"));
        NDArray outputNdArray = NDArray.empty(new long[]{1, 9}, new TVMType("float32"));
        Function setInputFunc = graphRuntimeModule.getFunction("set_input");
        Function runFunc = graphRuntimeModule.getFunction("run");
        Function getOutputFunc = graphRuntimeModule.getFunction("get_output");
  2. Apart from this Java API, is there another way for me to run the compiled model on my Android device? (e.g. I’m currently reading the code to figure out why RPCGetTimeEvaluator, used at the last step of auto-tuning, is so fast, and how it interacts with the compiled model. I really need to reproduce the measured inference latency of 8ms, or something close to it, in my Android application.)
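For question 1, this is roughly what I have in mind. It is only a sketch: `CachedModelRunner` is a hypothetical class name, and it assumes the same TVM Java imports, the `IMG_CHANNEL`/`MODEL_INPUT_SIZE` constants, and the input name already used in MainActivity.

    // Hypothetical helper that creates the NDArrays and packed functions once
    // and reuses them for every frame, instead of per ModelRunAsyncTask.
    class CachedModelRunner {
        private final NDArray inputNdArray;
        private final NDArray outputNdArray;
        private final Function setInputFunc;
        private final Function runFunc;
        private final Function getOutputFunc;
        private final String inputName;  // the model's input name, as used by set_input in MainActivity

        CachedModelRunner(Module graphRuntimeModule, String inputName) {
            this.inputName = inputName;
            inputNdArray = NDArray.empty(
                    new long[]{1, IMG_CHANNEL, MODEL_INPUT_SIZE, MODEL_INPUT_SIZE},
                    new TVMType("float32"));
            outputNdArray = NDArray.empty(new long[]{1, 9}, new TVMType("float32"));
            setInputFunc = graphRuntimeModule.getFunction("set_input");
            runFunc = graphRuntimeModule.getFunction("run");
            getOutputFunc = graphRuntimeModule.getFunction("get_output");
        }

        // Called for every incoming image frame; no allocations or function lookups here.
        float[] run(float[] frame) {
            inputNdArray.copyFrom(frame);
            setInputFunc.pushArg(inputName).pushArg(inputNdArray).invoke();
            runFunc.invoke();
            getOutputFunc.pushArg(0).pushArg(outputNdArray).invoke();  // output index 0
            return outputNdArray.asFloatArray();
        }
    }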

This is indeed strange, cc @yzhliu

It would be great if you could look into it. The time evaluator is just a loop in C++ that repeatedly runs the function and measures the time. It does skip the first iteration, though, since that one is usually the slowest.

So I would expect that if you do the same for runFunc in Java, the speed should be the same. If you are running on OpenCL, you need to do the following (a sketch follows the list):

  • Run once and call ctx.sync() so the first run finishes
  • Run your code repeatedly
  • Call ctx.sync() again and measure the time
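A minimal Java sketch of those steps, assuming `ctx` is the TVMContext the graph runtime was created with and `runFunc` is the "run" function from MainActivity; it only approximates what the C++ time evaluator does:

    runFunc.invoke();   // warm-up run; usually the slowest iteration
    ctx.sync();         // make sure the first run has actually finished

    final int repeat = 100;
    long start = System.nanoTime();
    for (int j = 0; j < repeat; j++) {
        runFunc.invoke();
    }
    ctx.sync();         // wait for all queued OpenCL work before stopping the clock
    double avgMs = (System.nanoTime() - start) / 1e6 / repeat;
    System.out.println("average latency: " + avgMs + " ms");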

One possibility is that the priority of the background AsyncTask is lower than what we use in RPC. Performance regressions were an issue during the development of the RPC app (e.g., #1433), so we basically abuse the fact that Android runs the thread responsible for on-screen computation at the highest priority: the RPC app forks a “workhorse” activity to run the function. That is also why the app occasionally spawns a new activity when one crashes; since the function call runs as “UI” code, this keeps a bad call from taking down the whole app.
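If you want to rule this out quickly, one hypothetical experiment (not what the RPC app itself does) is to raise the background thread’s priority around the inference call; AsyncTask worker threads run at background priority by default:

    // Hypothetical priority experiment inside doInBackground():
    // temporarily raise the calling thread's priority around inference,
    // then restore it afterwards.
    int oldPriority = android.os.Process.getThreadPriority(android.os.Process.myTid());
    android.os.Process.setThreadPriority(android.os.Process.THREAD_PRIORITY_URGENT_DISPLAY);
    try {
        runFunc.invoke();
    } finally {
        android.os.Process.setThreadPriority(oldPriority);
    }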

The performance difference between running inference as an async background task and via plain RPC matches my experience.