tldr: We are starting to look at expanding support for benchmarking TVM (and comparing it against similar systems) in neo-ai/tvm. If other people are working on benchmarking, including setting up an internal benchmarking capability (as we already have for SageMaker Neo) and thinking about how to open-source it, we'd love to get in touch.
I see that there has already been some discussion about benchmarking on the mailing list and informally, including defining a structured format for recording experiment results in a clear, reproducible way, and a concern about the load that benchmarking may impose on the Jenkins infrastructure.
To give a little more detail, what we are looking for is something where it is very easy to add benchmarks and very easy to get an easily interpretable before/after benchmark report for any given change to TVM, over a limited set of short-running benchmarks (to manage load and time, say 10-20 benchmarks, ideally under 1 minute each) covering metrics such as run time, code size, compile time and memory usage. This is trickier than it may sound because of concerns like system load (for non-local runs), consistently keeping noise in the results very low (ideally under 0.5% at all times), supporting devices beyond CPUs remotely, and making it easy to turn a benchmark into a test with real (not only random) data.

The initial thought is to iterate on this quickly in the neo-ai/tvm repo, so we can experiment freely without worrying about destabilizing CI/CD upstream and without imposing load from our own benchmarking on upstream compute resources in the non-local mode. We would prefer this to be upstreamed eventually and definitely want a shared solution. The initial plan is to prototype while getting in touch with other people who are interested and learning what, if anything, people are already doing; hence this email (I already know some people who, like us, have internal benchmark suites).
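To make the before/after report idea concrete, here is a minimal sketch of how a runtime comparison with a simple noise estimate might look. The names `measure` and `compare` are hypothetical, not part of any existing harness, and a real setup would also need CPU pinning, frequency locking, and remote-device support to actually hit the noise target:

```python
import statistics
import time


def measure(fn, warmup=3, repeats=20):
    """Time fn repeatedly; return (median_seconds, noise_ratio).

    noise_ratio is the interquartile range divided by the median, a
    crude proxy for run-to-run variation; the goal stated above would
    be to keep it well under 0.5%.
    """
    for _ in range(warmup):
        fn()  # warm caches / JIT before measuring
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    med = statistics.median(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)
    return med, (q3 - q1) / med


def compare(before_fn, after_fn):
    """Relative change in median runtime; negative means 'after' is faster."""
    before, _ = measure(before_fn)
    after, _ = measure(after_fn)
    return (after - before) / before
```

In a real harness the two callables would be the same benchmark compiled before and after a TVM change, and the report would include the noise ratio so that deltas smaller than the noise can be flagged as inconclusive.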
Also, does anyone have an opinion on using ck together with TVM for things like benchmarking? The initial plan here is not to use it, but ck seems to have some good goals in this area and maybe it should be used. There is already ck-tvm, though it seems not to have been updated for a while.