In Search of Accurate Benchmarking
In 2017, the Software Development Team at King’s College London performed a benchmarking experiment to compare the warmup time and peak performance of modern programming language Virtual Machines (VMs). The experiment was intended to be the most rigorous to date. Our results were both surprising and disappointing. Not only did few modern VMs achieve a steady state of peak performance when running well-known benchmarks, but some even slowed down over time.
This talk focuses not on the results of our experiment, but on our experiences of developing the “Krun” benchmarking system and the statistical analyses we used to process our data. The talk will discuss the difficulties we encountered in eliminating confounding variables and will show you how to present performance results in the absence of steady states.
Whilst Krun enabled us to collect robust and accurate results for our experiment, it tends towards being overkill. Ideally we’d like to build a cut-down version of Krun, but this raises the question: which of Krun’s features make the most difference to benchmarking quality?
| Slides (benchwork.pdf) | 2.56MiB | 
Wed 18 Jul (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
11:00 - 12:30

| Time | Duration | Title | Track | Presenters | Attachment |
| --- | --- | --- | --- | --- | --- |
| 11:00 | 10m | Opening Remarks | BenchWork | | |
| 11:10 | 30m | Real World Benchmarks for JavaScript | BenchWork | | File attached |
| 11:40 | 20m | In Search of Accurate Benchmarking | BenchWork | Edd Barrett (King's College London), Sarah Mount (King's College London), Laurence Tratt (King's College London) | File attached |
| 12:00 | 30m | AndroZoo: Lessons Learnt After 2 Years of Running a Large Android App Collection | BenchWork | Kevin Allix (University of Luxembourg) | |


