[RFC] Benchmark Performance Log format

Workload record storage proposal

We propose a column-based data storage format that unifies how TVM benchmark results are tracked. For now, the only metric we suggest tracking is execution time, with a focus on reproducibility. We foresee the need for a standard record format to pinpoint the most crucial target areas for future development in TVM.

Use cases and motivation

  • Easier cross-reference to other frameworks/runs
  • Experiment reproduction
  • Building nightly perf regression
  • Standardization will separate logging and visualization efforts

Key points to consider

  • Maintain reproducibility of experiments based solely on an output record
  • Granularity of metrics – e.g. per-layer vs end-to-end
  • config/misc json objects – enforce prescriptive keys or keep it flexible for now?
  • Extensibility of this schema towards desired features in the future

Proposed format

| header | examples | category | notes/justification |
|---|---|---|---|
| timestamp | 1572282699.6 | metadata | Indicates when this record is logged |
| schema_version | 0.1 | metadata | Ensures reproducibility as we iterate on this schema |
| additional_metadata | {build_start:, host_env:, …} | metadata | Any data that will help identify/reproduce the build |
| workload | ai.onnx.resnet-18 | workload | (1) |
| workload_args | {"input_shape": (3, 224, 224), "args_shape": [list_of_shape], "data_layout": "NCHW"} | workload | |
| engine | tvm / onnx-runtime | compiler | |
| engine_version | 0.5 / 996cf30e8d54b4bf58f0c9950475f47bba7e2c7e | compiler | Include either the version or the commit SHA, or both in the format "0.5:996cf3…" |
| engine_config | {"llvm": "llvm-8", "nvcc": 10.1} | compiler | |
| target | llvm -mcpu=skylake-avx512 | compiler | Compilation target, used for a single platform |
| env_name | aws.a1.large / rpi3b / iphone11pro | hardware | Readable cloud instance type, device name, or phone model |
| hw_model | Intel Xeon E5-2686 v4 / bcm2837 / apple-a13 | hardware | Hardware type or SoC type |
| runtime_config | {"num_cpu_threads": 3, "cudnn": "cudnn-8", "cuda_driver": "480.10.1", "os": "linux"} | hardware | Backend runtime arguments |
| execution_config | {"number": 1, "repeat": 10, "min_repeat_ms": 0} | statistics | Workload execution parameters |
| execution_elapsed_ms_mean | float32 | statistics | |
| execution_elapsed_ms_std | float32 | statistics | |
  1. Namespacing: Use strings of the form "domain.name", where domain and name may be any valid string, except that the domains 'topi' and 'relay' are reserved to refer only to TOPI and Relay operator names respectively.
    For the topi and relay namespaces, the operator name may be followed by an optional ":<SHA>" corresponding to a particular git SHA of the TVM repository.
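To make the column layout concrete, here is a sketch of one record expressed as a Python dict. All values are mock data, and the `workload_domain` helper is hypothetical (illustrating the namespacing rule above, not part of the proposal):

```python
# A mock record following the proposed schema; every value is illustrative.
record = {
    "timestamp": 1572282699.6,
    "schema_version": "0.1",
    "additional_metadata": {"build_start": 1572282600, "host_env": {"os": "linux"}},
    "workload": "ai.onnx.resnet-18",
    "workload_args": {"input_shape": (3, 224, 224), "data_layout": "NCHW"},
    "engine": "tvm",
    "engine_version": "0.5:996cf30e8d54b4bf58f0c9950475f47bba7e2c7e",
    "engine_config": {"llvm": "llvm-8", "nvcc": "10.1"},
    "target": "llvm -mcpu=skylake-avx512",
    "env_name": "aws.a1.large",
    "hw_model": "Intel Xeon E5-2686 v4",
    "runtime_config": {"num_cpu_threads": 3, "os": "linux"},
    "execution_config": {"number": 1, "repeat": 10, "min_repeat_ms": 0},
    "execution_elapsed_ms_mean": 243.0,
    "execution_elapsed_ms_std": 10.5,
}

# The domains reserved by the namespacing rule in note (1).
RESERVED_DOMAINS = {"topi", "relay"}

def workload_domain(workload: str) -> str:
    """Extract the namespace domain from a 'domain.name' workload string."""
    return workload.split(".", 1)[0]
```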

Possible future schema directions

  • Add more metrics to execution statistics, with statistics collection components enabled by some config object

Thank you @anwang for the proposal!

One question I have: do we want to couple the engine field with the “accelerator”/library that backs it? For instance, onnx-runtime has a handful of accelerators (https://github.com/microsoft/onnxruntime#high-performance: MLAS, CUDA, MKL-ML, MKL-DNN, nGraph, TensorRT, OpenVINO, Nuphar/TVM, DirectML, ACL) - do we want to add an additional library/accelerator field to the column based schema or merge it with the engine, e.g. onnxruntime-tensorrt, tensorflow-XLA etc.?

I think the accelerator library configs you mentioned can be part of engine_config.

Would be great to get broad comments from people who run benchmarks cc @ajtulloch @zhiics @MarisaKirisame @ramana-arm — everyone is welcome to comment.

In order to reproduce the results, we may also need to record the schedule configs used by each layer in case the result was tuned by AutoTVM.

Do we also need to save the latest commit id for each execution so that we know which commits are causing perf regressions? Also, what follow-up actions would be taken if there is a perf regression? Will someone receive an email?

@broune do you have any comments?

We are also very interested in a more expanded benchmarking (and testing) story. What infrastructure would produce, ingest, track and display information from this format?

The top three other metrics would be code size, compile time, and device memory usage.

Thanks for great discussions. I think it makes sense to first formalize the standard log format, then we can work on separate modules that produce and display the information.

code_size is a good metric to have. The only problem with compile_time is that it is highly platform-dependent and also depends on whether AutoTVM is involved.

device_memory_usage is nice, but it is a bit hard to get for every type of backend device, it is also going to be relatively stable, unless a very different algorithm was used.

We could use a misc field for additional information (perhaps rename additional_metadata?), e.g. compile time and the AutoTVM settings @comaniac mentioned.

The commit id that @zhiics mentioned is covered by engine_version.

If this is run in a Docker container, which seems like a desirable thing, then it would be good to record the tag and hash of the Docker container (or is this included in “host_env”?).

Device memory usage can vary substantially according to optimizations like fusion, rematerialization, ordering of ops, simplification of ops (did that training statistics op get optimized out or not?), exact way that parallelization is done etc. This is critical for training and I’d think also important for inference, especially on small devices. If it is stable, then whatever system displays this data can simply summarize as “no change to device memory usage”. For static graphs this can be calculated from Relay statically by the compiler. For dynamic graphs on hard-to-inspect systems I agree it might be missing, though that’s OK I think - better to be able to notice e.g. a memory-ballooning Relay change from the cases that do have this metric than not being able to notice this.

I’ll expect that there will be many more metrics than these over time, e.g. compiler memory usage is also important (but not in my top 4, so I didn’t mention it) and we’ll want a hash of the Relay graph and produced artifacts (omitting timestamps if we really need those in the produced artifacts) to track down non-determinism of the compiler or model definition and to ensure that a comparison really was comparing the same thing. For data-dependent models (dynamic shapes are a special case of that), we’d also need to know what the input was.

I can’t edit the original post, but here are the amendments and clarifications based on everyone’s comments:

  1. additional_metadata is renamed to additional_info.

  2. engine_config will contain an optional “accelerator” field specifying the name of the library e.g.
    engine_config = {"llvm": "llvm-8", "nvcc": 10.1, "accelerator": "MLAS"}

  3. workload_args.schedule_configs is an array of each layer’s schedule config (for AutoTVM only). @comaniac Do you have any suggestions for what this should look like?

  4. code_size and device_memory_usage and compile_time are optionally specified in additional_info.

  5. @zhiics We’ll leave it up to contributors to monitor the perf regression on commits they contributed for now, and postpone the discussion of notification.

  6. docker_tag and docker_hash will be required fields in additional_info.host_env. All experiments will be run in a docker container.

  7. I suggest encoding records in CSV format for now. To be concrete, here is a single record:
    1572282699.6, 0.1, {build_start: 1572282699, commit_id: 996cf30e8d54b4bf0c9950475f47bba7e2fc, host_env: {docker_tag: "nightly_minimal_perf", docker_hash: 996cf30e8d54b4bf0c9950475f47bba7e2c7e}}, ai.onnx.resnet-18, {"input_shape": (3, 224, 224), "args_shape": [list_of_shape], "data_layout": "NCHW"}, tvm, 0.5, {"llvm": "llvm-8", "nvcc": 10.1}, llvm -mcpu=skylake-avx512, aws.a1.large, Intel Xeon E-2124 Coffee Lake BX80684E2124, {"num_cpu_threads": 3, "cudnn": "cudnn-8", "cuda_driver": "480.10.1", "os": "linux"}, {"number": 1, "repeat": 10, "min_repeat_ms": 0}, 243, 10.5
    I can DM more mocked records to anyone who's interested. Down the line we'll move to Parquet to be more compressible.
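As a sketch of how such a CSV row could be produced unambiguously, the dict-valued columns can be JSON-encoded before writing. The column order and helper name here are assumptions for illustration, not part of the proposal:

```python
import csv
import io
import json

# Assumed column order, matching the proposed header order above.
COLUMNS = [
    "timestamp", "schema_version", "additional_info", "workload",
    "workload_args", "engine", "engine_version", "engine_config", "target",
    "env_name", "hw_model", "runtime_config", "execution_config",
    "execution_elapsed_ms_mean", "execution_elapsed_ms_std",
]

def record_to_csv_row(record: dict) -> str:
    """Serialize one record to a CSV line; dict-valued fields become JSON."""
    buf = io.StringIO()
    csv.writer(buf).writerow(
        json.dumps(record.get(col)) if isinstance(record.get(col), dict)
        else record.get(col)
        for col in COLUMNS
    )
    return buf.getvalue().strip()
```

JSON-encoding the nested fields lets a standard CSV parser round-trip them, since the writer quotes any field containing quotes or commas.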

@broune We’ll put these additional metrics (including device_memory_usage, code_size, and compile_time) optionally in additional_info as they crop up, and continue iterating on this schema accordingly.

For the schedule config, it would be a map structure depending on a schedule template. For example, here is one config for a conv2d with Winograd algorithm on CUDA:

{"t": "winograd", 
 "e": [["tile_b", "sp", [-1, 1, 1, 1]],
       ["tile_y", "sp", [-1, 1, 16, 2]],
       ["tile_x", "sp", [-1, 1, 16, 1]],
       ["tile_rc", "sp", [-1, 32]],
       ["auto_unroll_max_step", "ot", 1500],
       ["unroll_explicit", "ot", 0]]}

Note that the config entry count and names (such as tile_b) may vary across schedule templates. Here is another example of a conv2d with Winograd algorithm on ARM CPU:

{"t": "winograd", 
 "e": [["tile_p", "sp", [-1, 1]],
       ["tile_k", "sp", [-1, 16]],
       ["tile_c", "sp", [-1, 2]],
       ["ann_reduce", "an", "none"],
       ["ann_spatial", "an", "unroll"]]}
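If workload_args.schedule_configs is an array of such per-layer maps, embedding them in a record might look like the sketch below. All values are mock data; "t" and "e" follow the compact encoding in the examples above (template name and tuning-knob entries):

```python
# Mock per-layer AutoTVM schedule configs, in model layer order.
schedule_configs = [
    {"t": "winograd",
     "e": [["tile_b", "sp", [-1, 1, 1, 1]],
           ["tile_y", "sp", [-1, 1, 16, 2]]]},
    {"t": "winograd",
     "e": [["tile_p", "sp", [-1, 1]],
           ["tile_k", "sp", [-1, 16]]]},
]

# Embedding the array under workload_args, as suggested in the amendments.
workload_args = {
    "input_shape": (3, 224, 224),
    "schedule_configs": schedule_configs,
}
```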

Looks good. I’d also add that under additional_info we will want to track validation accuracy, in order to navigate performance/accuracy tradeoffs. We’ll also need to think about standardizing how accuracy is measured down the road.

@comaniac for reproducibility’s sake, I imagine we’ll also want to track compilation parameters like the optimization level. There will also be various graph-level transformations that affect performance, e.g. altering the data layout, applying quantization, etc.

Therefore, one of the complexities of a usable log format is separating the data contained in the log from the metadata required to reproduce the run (compilation “strategy”, per-layer scheduling params, etc.).

Makes sense. Then I would suggest changing schedule_configs to build_configs. It could be a map with the entries you mentioned plus the schedule config for each layer. For example:

build_configs = {
  "opt_level": 3,
  "layer_schedules": [...]  # the original schedule_configs array
}
We should further decide the info we really need for this field.

For the log complexity, I am actually fine with both. Even if we keep everything in a single log file, we can have a simple parser to display just the parts we need, like the AWS SDK does.

Thanks for everyone’s discussion. Please share your comments and see if you agree with @anwang’s latest update. Then we can move it to an official RFC on GitHub.

I think the logging format specification will be an evolving process but it would be a good starting point for the community to share a common benchmarking infrastructure.

More clarifications:

  1. validation_accuracy in additional_info is an optional field.

  2. A top-level build_configs header, instead of workload_args.schedule_configs, will contain compilation parameters such as opt_level and an optional layer_schedules. For now, layer_schedules will use the schedule-template structure proposed by @comaniac, and I’m leaving it optional to support both simple and complex log format strategies.
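Putting these clarifications together, the amended fields might look like the sketch below. All values are mock data, and the exact nesting is still open for discussion:

```python
# Top-level build_configs header with compilation parameters (mock values).
build_configs = {
    "opt_level": 3,
    "layer_schedules": [  # optional: AutoTVM schedule config per layer
        {"t": "winograd", "e": [["tile_p", "sp", [-1, 1]]]},
    ],
}

# additional_info with the optional and required fields discussed above.
additional_info = {
    "validation_accuracy": 0.761,  # optional field, mock value
    "host_env": {
        "docker_tag": "nightly_minimal_perf",                    # required
        "docker_hash": "996cf30e8d54b4bf0c9950475f47bba7e2c7e",  # required
    },
}
```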

Thanks for the suggestions, everyone. Any last comments before we move the discussion to github?

Sorry just noticed this - I’ll try and catch up with this tomorrow.

Hi everyone, based on some initial data aggregation runs and feedback, I’ve consolidated log format amendments to https://github.com/apache/incubator-tvm/issues/4304. We’ll move it to the tvm official docs shortly.