Project Lantern is an ongoing effort to reduce the run time of Lighthouse and improve audit quality by modeling page activity and simulating browser execution. This document details the accuracy of these models and captures the expected natural variability.
All of the following accuracy stats were collected on a set of 300 URLs sampled from the Alexa top 1000, the HTTPArchive dataset, and miscellaneous ad landing pages. For each URL, the median of 9 runs in one environment was compared to the median of 9 runs in a second environment.
Stats were collected using the trace-evaluation scripts. Each table cell contains Spearman's rho (rank correlation) and MAPE (mean absolute percentage error) for the respective metric; for example, `.811 : 23.1%` means rho = .811 and MAPE = 23.1%. A sketch of how these statistics can be computed appears after the tables.
| Comparison | FCP | FMP | TTI |
| -- | -- | -- | -- |
| Lantern predicting Default LH | .811 : 23.1% | .811 : 23.6% | .869 : 42.5% |
| Lantern predicting LH on WPT | .785 : 28.3% | .761 : 33.7% | .854 : 45.4% |
| Comparison | FCP | FMP | TTI |
| -- | -- | -- | -- |
| Unthrottled LH predicting Default LH | .738 : 27.1% | .694 : 33.8% | .743 : 62.0% |
| Unthrottled LH predicting WPT | .691 : 33.8% | .635 : 33.7% | .712 : 66.4% |
| Default LH predicting WPT | .855 : 22.3% | .813 : 27.0% | .889 : 32.3% |
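For concreteness, here is a minimal TypeScript sketch of how the two statistics in each table cell can be computed from per-URL metric values. The array names and the usage example are hypothetical and are not part of the trace-evaluation scripts.

```ts
/** Convert values to ranks (average ranks for ties), 1-based. */
function ranks(values: number[]): number[] {
  const indexed = values.map((v, i) => ({v, i}));
  indexed.sort((a, b) => a.v - b.v);
  const out = new Array<number>(values.length);
  let i = 0;
  while (i < indexed.length) {
    let j = i;
    while (j + 1 < indexed.length && indexed[j + 1].v === indexed[i].v) j++;
    const avgRank = (i + j) / 2 + 1; // average rank over the tie group
    for (let k = i; k <= j; k++) out[indexed[k].i] = avgRank;
    i = j + 1;
  }
  return out;
}

/** Pearson correlation of two equal-length samples. */
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (x[i] - mx) * (y[i] - my);
    dx += (x[i] - mx) ** 2;
    dy += (y[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

/** Spearman's rho is the Pearson correlation of the ranks. */
function spearman(x: number[], y: number[]): number {
  return pearson(ranks(x), ranks(y));
}

/** Mean absolute percentage error of predictions vs. observations. */
function mape(predicted: number[], observed: number[]): number {
  let sum = 0;
  for (let i = 0; i < predicted.length; i++) {
    sum += Math.abs((predicted[i] - observed[i]) / observed[i]);
  }
  return (sum / predicted.length) * 100;
}

// Usage (hypothetical): per-URL median FCP in ms from two environments.
// const lanternFcp = [...]; const defaultLhFcp = [...];
// console.log(spearman(lanternFcp, defaultLhFcp), mape(lanternFcp, defaultLhFcp));
```

Spearman's rho captures how well one environment preserves the metric ordering of the other, while MAPE captures how far off the absolute values are; the two can disagree, which is why both appear in each cell.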
Comparing the "Lantern predicting LH on WPT" row against the "Default LH predicting WPT" row, we conclude that Lantern's MAPE is ~6-13 percentage points higher than that of DevTools throttling (e.g. 28.3% vs. 22.3% for FCP). When evaluating rank performance, Lantern achieves Spearman correlations within ~.04-.07 of DevTools throttling.
The reference stats demonstrate a high degree of variability in the user-centric metrics and strengthen the position that every load is just a single observation drawn from a distribution: to understand the entire experience, multiple draws must be taken, i.e. multiple runs are needed to achieve sufficiently small error bounds on the median load experience.
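To illustrate this point, here is a minimal TypeScript sketch that uses a bootstrap confidence interval for the median over a hypothetical load-time distribution; the distribution, run counts, and function names are illustrative assumptions, not Lantern's actual methodology.

```ts
/** Median of a sample. */
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

/** 95% bootstrap confidence interval for the median of a set of runs. */
function medianCI(runs: number[], resamples = 10000): [number, number] {
  const medians: number[] = [];
  for (let r = 0; r < resamples; r++) {
    const resample = runs.map(() => runs[Math.floor(Math.random() * runs.length)]);
    medians.push(median(resample));
  }
  medians.sort((a, b) => a - b);
  return [medians[Math.floor(resamples * 0.025)], medians[Math.floor(resamples * 0.975)]];
}

// Simulate draws from a noisy load-time distribution (hypothetical:
// ~3000ms center with triangular noise) and compare CI widths by run count.
const draw = () => 3000 + 800 * (Math.random() + Math.random() - 1);
for (const n of [3, 9, 27]) {
  const runs = Array.from({length: n}, draw);
  const [lo, hi] = medianCI(runs);
  console.log(`n=${n}: median=${median(runs).toFixed(0)}ms, 95% CI width=${(hi - lo).toFixed(0)}ms`);
}
```

Running the sketch will generally show the interval narrowing as the run count grows, which is the statistical motivation for comparing medians of multiple runs rather than single loads.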
The current sizes of the confidence intervals for DevTools-throttled performance scores are as follows.