Efforts on benchmarking for TVM

tldr: We are starting to look at expanding our support for benchmarking TVM (and similar systems, for comparison) in neo-ai/tvm. If other people are working on benchmarking (including setting up an internal benchmarking capability, just as we already have for SageMaker Neo, and thinking about how to make it open source), then we'd love to get in touch.

I see that there's already been some discussion about benchmarking on the mailing list and informally, including defining a structured format for recording experiment results in a clear, reproducible way, and a concern about the load that benchmarking may impose on the Jenkins infrastructure.

To give a little more detail, what we are looking to have is something where it's very easy to add benchmarks and very easy to get an easily interpretable before/after report for any given change to TVM, over a limited set of short-running benchmarks (say 10-20, ideally under a minute each, to manage load and time), covering metrics such as run time, code size, compile time and memory usage. This gets trickier than it may sound once you account for concerns like system load (for non-local runs), consistently keeping noise in the results very low (ideally below 0.5% at all times), supporting devices beyond CPUs remotely, and easily being able to turn a benchmark into a test with real (not only random) data.

The initial thought is to iterate on this quickly in the neo-ai/tvm repo, so we can easily experiment without concerns about destabilizing CI/CD upstream, and without imposing load on upstream compute resources from our own benchmarking in the non-local mode. We would prefer this to be upstreamed and definitely want a shared solution. The initial plan is to prototype while getting in touch with other people who are interested and understanding what people are already doing, if anything - hence this email (I already know of some people who have internal benchmark suites, as we do).
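To make the run-time metric and the noise target a bit more concrete, here is a minimal sketch of what a single measurement could look like using TVM's time_evaluator. It assumes a module `lib` already compiled with relay.build and the graph_runtime API of that era (graph_executor in newer releases); the 0.5% threshold is just the target mentioned above, and the function name is hypothetical.

  import numpy as np
  import tvm
  from tvm.contrib import graph_runtime  # `graph_executor` in newer TVM releases

  def measure_run_time(lib, input_name, input_shape, dtype="float32",
                       number=10, repeat=30, noise_target=0.005):
      """Time `lib` on random input and flag noisy measurements.

      `lib` is assumed to be the factory module returned by relay.build.
      """
      dev = tvm.cpu(0)
      module = graph_runtime.GraphModule(lib["default"](dev))
      module.set_input(input_name, np.random.uniform(size=input_shape).astype(dtype))
      timer = module.module.time_evaluator("run", dev, number=number, repeat=repeat)
      results = np.array(timer().results)  # one averaged measurement per repeat, in seconds
      mean, std = results.mean(), results.std()
      if std > noise_target * mean:
          print("warning: noise %.2f%% exceeds the 0.5%% target" % (100.0 * std / mean))
      return {"mean_ms": 1e3 * mean, "std_ms": 1e3 * std}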

Also, does anyone have an opinion on using ck (Collective Knowledge) together with TVM for things like benchmarking? The initial plan here is not to use it, but ck seems to have some good goals in this area and maybe it should be used. There is ck-tvm already, though it seems not to have been updated for a while.


Thanks for bringing up the proposal. Given that benchmarking is something that a lot of folks in the community care about, it would be great if we could have an RFC discussion about the topic and coordinate the effort :slight_smile: - or turn this thread into an RFC.

Here are some of my initial thoughts.

As part of the initial goal, we could aim for nightly benchmarks (e.g. improving on https://github.com/apache/incubator-tvm/tree/master/tests/python/nightly/), which would not pose as much load on the CI infrastructure and could be used as a health metric for the project overall.

Here are things that I think should be discussed and developed; please feel free to add more:

  • Helper utils for logging and validating the benchmark log format (see the sketch after this list)
  • A set of visualizations that can help display the health metrics
  • Standard ways of setting up the tracker and pushing this information to the Jenkins worker.
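As an illustration of the first item, here is a hypothetical sketch of a log-format validator; the JSON-lines layout and field names below are assumptions made for the example, not a format anyone has agreed on.

  import json

  # Hypothetical record fields; illustration only.
  REQUIRED_FIELDS = ("workload", "target", "tvm_commit", "metric", "value", "timestamp")

  def validate_log(path):
      """Return a list of human-readable problems found in a JSON-lines benchmark log."""
      errors = []
      with open(path) as f:
          for lineno, line in enumerate(f, start=1):
              if not line.strip():
                  continue
              try:
                  record = json.loads(line)
              except ValueError:
                  errors.append("line %d: not valid JSON" % lineno)
                  continue
              missing = [k for k in REQUIRED_FIELDS if k not in record]
              if missing:
                  errors.append("line %d: missing fields %s" % (lineno, ", ".join(missing)))
              elif not isinstance(record["value"], (int, float)):
                  errors.append("line %d: 'value' should be numeric" % lineno)
      return errors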

I definitely agree that nightly benchmarks are important. Another key use case that we're aiming to iterate on with a simple prototype is this kind of interaction (note that these parameters could typically all be inferred):

  compare_benchmarks local/path/or/url/to/tvm/repo HEAD~5 HEAD resnet50 maskrcnn bert
  <wait a while, ideally less than 10 minutes>
  <get report comparing HEAD~5 to HEAD on those benchmarks across metrics>

The key is that this one-step way of getting a performance report on a change is super easy, barely an inconvenience, so there's a low barrier to use and it automates something that takes up time today (or isn't done at all). It doesn't require a pull request or even publishing to GitHub. An initial prototype can run locally, which matches how I understand benchmarks are run today, and it can work without setting up any infrastructure. (There might also be a path towards a similar interaction for running tests down the line.)
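For concreteness, here is a rough sketch of the orchestration such a compare_benchmarks command could do locally: check each revision out into its own git worktree, run the same workloads against both, and print a before/after comparison. The build_and_benchmark step is deliberately a stub; building TVM at each checkout and collecting the metrics is where the real work would be, and all names here are hypothetical.

  import os
  import subprocess
  import tempfile

  def checkout_worktree(repo, revision):
      # Check the revision out into a fresh detached worktree of `repo`.
      path = os.path.join(tempfile.mkdtemp(prefix="tvm-bench-"), revision.replace("~", "_"))
      subprocess.check_call(["git", "-C", repo, "worktree", "add", "--detach", path, revision])
      return path

  def build_and_benchmark(checkout_path, workloads):
      # Stub: build TVM at `checkout_path`, run each workload and return
      # {workload: run_time_ms}. Deliberately left unimplemented in this sketch.
      raise NotImplementedError

  def compare_benchmarks(repo, base, head, workloads):
      results = {rev: build_and_benchmark(checkout_worktree(repo, rev), workloads)
                 for rev in (base, head)}
      for workload in workloads:
          before, after = results[base][workload], results[head][workload]
          print("%-12s %8.2f ms -> %8.2f ms (%+.1f%%)"
                % (workload, before, after, 100.0 * (after - before) / before))

  # e.g. compare_benchmarks("local/path/to/tvm", "HEAD~5", "HEAD",
  #                         ["resnet50", "maskrcnn", "bert"])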

One more general concern is that down the line it would be good to be able to run non-TVM benchmarks, so that it’s as easy to compare TVM with other systems as with itself.

How is auto-tuning going to factor into this? Is the intention to only use TopHub or to include an auto-tuning step as part of the benchmarking? I mention this because one of the most direct ways to improve performance is to write new auto-tuning schedules, but actually performing that tuning takes a lot of time and resources.

tldr: I think we should aim at invalidating the premise that auto-tuning is always required, but in the meantime we'll have to deal with it. This issue is no different from what happens today when people run e.g. a single benchmark in an ad hoc way to figure out whether their change actually makes things faster.

Auto-tuning is one of the things that we do with benchmarks, so a benchmarking system definitely needs to integrate with it. I wasn't planning to include that in v1 of an initial prototype, though, focusing on other aspects first.

If we accept the premise of a situation where tuning a model takes a day or longer, and we want numbers for at least 10 models, then that may be too slow or too much load even for a once-a-day run, but it could be acceptable for a weekly or monthly mode (there might also be signals for whether re-tuning a model is necessary, though relying on them reduces reproducibility). It's definitely too slow, and too much load, for just quickly getting numbers for a single change. So yes, this premise leads to leaning on e.g. TopHub, with an option to provide a custom tuning file. It helps that we normally only auto-tune conv and dense, but if a Relay change alters the conv or dense ops in a model (which should be fairly rare, but it can happen), then the benchmark comparison would correctly (!) show a large slow-down. The benchmark runner should clearly warn about this to immediately explain the regression, but it is still valuable information. We in fact see regressions for this kind of reason when making a release of Neo, and then have to deal with them, delaying the release, so getting these signals sooner is better.
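A minimal sketch of the "TopHub by default, custom tuning file when provided" option, assuming a loaded Relay module `mod` with `params`: relay.build picks up TopHub's pre-tuned schedules automatically for known targets, and autotvm.apply_history_best overrides that with records from a model-specific tuning run. The function name is hypothetical.

  from contextlib import nullcontext

  import tvm
  from tvm import autotvm, relay

  def build_with_optional_tuning(mod, params, target, tuning_log=None):
      # Prefer records from a model-specific tuning run when a log is supplied;
      # otherwise relay.build falls back to TopHub's pre-tuned schedules for
      # known targets.
      tuning_ctx = autotvm.apply_history_best(tuning_log) if tuning_log else nullcontext()
      with tuning_ctx:
          with tvm.transform.PassContext(opt_level=3):
              return relay.build(mod, target=target, params=params)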

The premise above is that tuning is very slow (e.g. 1+ days) and that measurements without tuning are so bad as to be irrelevant (e.g. 10x slower). This premise causes problems not just for benchmarking, but also for end users of TVM who don't want to wait that long to compile (and this will become more common with training). So quite apart from benchmarking concerns, I've been intending to make a proposal around having the option of manual cost models, better tuning algorithms and manageable search spaces (if we only tune conv and dense anyway, this is more manageable) - I'm also happy with, and would actually prefer, ML models for that as long as they work well enough (and there can be combinations of the two). Then long-running tuning would gain a second use: conveniently identifying problems with the shorter-running heuristic, while retaining the first use case of (at that point optionally) fully optimizing for a specific model. Though this is getting off-topic for this thread.
