In 2017, the Software Development Team at King’s College London performed a benchmarking experiment to compare the warmup time and peak performance of modern programming language Virtual Machines (VMs). The experiment was intended to be the most rigorous to date. Our results were both surprising and disappointing. Not only did few modern VMs reach a steady state of peak performance when running well-known benchmarks, but some even slowed down over time.
This talk focuses not on the results of our experiment, but on our experiences of developing the “Krun” benchmarking system and the statistical analyses we used to process our data. The talk will discuss the difficulties we encountered in eliminating confounding variables and will show you how to present performance results in the absence of steady states.
Whilst Krun enabled us to collect robust and accurate results for our experiment, it is overkill for many purposes. Ideally we’d like to build a cut-down version of Krun, but this raises the question: which of Krun’s features make the most difference to benchmarking quality?