Table of contents
- 2. Refresher: Core Web Vitals metrics and thresholds
- 3. Criteria for the Core Web Vitals metric thresholds
- High-quality user experience
- Achievable by existing web content
- Final thoughts on criteria
- 4. Choice of percentile
- 5. Largest Contentful Paint
- Quality of experience
- 6. First Input Delay
- Quality of experience
- 7. Cumulative Layout Shift
- Quality of experience
3. Criteria for the Core Web Vitals metric thresholds
When defining thresholds for the Core Web Vitals metrics, Google’s developers first had to establish the criteria that each threshold should satisfy. The criteria used to evaluate the 2020 Core Web Vitals metric thresholds are explained below, and the following sections describe in more detail how these criteria were applied to determine the thresholds for each metric. Google’s engineers expect to refine these criteria and thresholds in the coming years to further improve their ability to measure great user experiences on the web.
High-quality user experience
The primary aim of Google’s developers is to optimize for the user and their quality of experience. Given this, they need to ensure that pages that meet the Core Web Vitals “good” thresholds deliver a high-quality user experience.
Google’s engineers looked at research on human perception and human-computer interaction (HCI) to determine thresholds consistent with a high-quality user experience. Although this research is often summarized with a single fixed threshold, they found that the underlying studies are usually expressed as a range of values. For example, the amount of time users typically wait before losing focus is often cited as one second, even though the underlying research describes a range spanning from hundreds of milliseconds to several seconds. The fact that perception thresholds vary by user and context is backed up by aggregated and anonymized Chrome analytics data, which indicates that there is no single length of time users will wait for a web page to load before abandoning it; rather, the data shows a smooth and continuous distribution.
Where applicable user experience research is available for a given metric and there is reasonable agreement on the range of values in the literature, Google’s developers use that range as an input to guide the threshold selection process. Where applicable user experience research is not available, as with a new metric like Cumulative Layout Shift, they instead evaluate real-world pages that exceed various candidate thresholds for the metric to find a threshold that still yields a good user experience.
Achievable by existing web content
Furthermore, in order for site owners to succeed in optimizing their pages to meet the “good” thresholds, these thresholds must be achievable for existing web content. For example, while zero milliseconds would be an ideal LCP “good” threshold, producing instant loading experiences, it is not achievable in most cases due to network and device processing latencies. Zero milliseconds is therefore not a reasonable LCP “good” threshold for Core Web Vitals.
When assessing candidate Core Web Vitals “good” thresholds, data from the Chrome User Experience Report (CrUX) is used to verify that those thresholds are achievable. To confirm that a threshold is achievable, at least 10% of origins must currently meet it. In addition, to ensure that well-optimized sites are not misclassified due to variability in field data, it is verified that well-optimized content consistently passes the “good” threshold.
The “poor” threshold, conversely, is determined by identifying a level of performance that only a minority of origins fail to meet. Where no research applicable to defining a “poor” threshold is available, by default the worst-performing 10-30% of origins are classified as “poor”.
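Taken together, these rules amount to a simple calculation over per-origin field data. The sketch below assumes each origin is summarized by its 75th-percentile metric value (as described later in this document); the 10% bar comes from the text, while the function names and sample data are illustrative.

```python
def share_of_good_origins(origin_p75_values, candidate_threshold):
    """Fraction of origins whose 75th-percentile metric value meets a
    candidate "good" threshold (for LCP/FID/CLS, lower is better)."""
    good = sum(1 for value in origin_p75_values if value <= candidate_threshold)
    return good / len(origin_p75_values)

def is_achievable(origin_p75_values, candidate_threshold, min_share=0.10):
    """A candidate "good" threshold passes the achievability check if at
    least ~10% of origins already meet it."""
    return share_of_good_origins(origin_p75_values, candidate_threshold) >= min_share

# Illustrative per-origin 75th-percentile LCP values, in seconds.
origins = [1.8, 2.2, 2.9, 3.4, 4.1, 5.0, 6.3, 7.2, 8.8, 12.0]
print(is_achievable(origins, 2.5))  # True: 2 of 10 origins meet it
print(is_achievable(origins, 0.5))  # False: no origin meets it
```

The same share calculation, applied from the other end of the distribution, supports the “poor” classification: a candidate “poor” threshold is acceptable when roughly 10-30% of origins fall beyond it.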
Final thoughts on criteria
When determining candidate thresholds, Google’s engineers found that the criteria were sometimes in conflict. For example, there can be a trade-off between a threshold being consistently achievable and it consistently ensuring good user experiences. Furthermore, since human perception research typically yields a range of values, and user behavior metrics show gradual changes in behavior rather than sharp cut-offs, they found that there is frequently no single “correct” threshold for a metric.
4. Choice of percentile
As previously noted, Google’s developers use the 75th percentile value of all visits to classify the overall performance of a page or site. Two criteria were used to settle on the 75th percentile. First, the percentile should ensure that the majority of visits to a page or site experienced the target level of performance. Second, the value at the selected percentile should not be overly affected by outliers.
These goals are somewhat in conflict. A higher percentile is usually the safer choice for meeting the first goal. However, as the percentile rises, so does the likelihood of the resulting value being influenced by outliers. It is not desirable for a site’s classification to be determined by a few visits that happen to occur over flaky network connections and thus produce excessively large LCP samples. For example, if the 95th percentile were used to evaluate a site with 100 visits, it would take only 5 outlier samples for the 95th percentile value to be affected by the outliers.
Given that these objectives are somewhat at odds, it was concluded that the 75th percentile strikes an acceptable balance. At the 75th percentile, it is known that the majority of visits to the site (3 of 4) experienced the target level of performance or better, and the value is less likely to be affected by outliers. To return to the example, for a site with 100 visits, 25 of those visits would need to report large outlier samples for the value at the 75th percentile to be affected. While 25 of 100 samples being outliers is possible, it is far less likely than in the 95th percentile case.
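The outlier sensitivity described above can be demonstrated numerically. This sketch uses Python's `statistics.quantiles` (with its default interpolation) on a hypothetical distribution of 100 visits; the specific sample values are made up for illustration.

```python
import statistics

# Hypothetical site with 100 visits: 95 typical LCP samples (~2 s) and
# 5 outliers from flaky network connections (30 s).
visits = [2.0] * 95 + [30.0] * 5

cut_points = statistics.quantiles(visits, n=100)  # 99 percentile cut points
p75 = cut_points[74]  # 75th percentile
p95 = cut_points[94]  # 95th percentile

print(p75)  # 2.0  -> unaffected: 25 outliers would be needed to move it
print(p95)  # 28.6 -> dragged far out by just 5 outlier visits
```

Just 5% of samples being outliers is enough to distort the 95th percentile, while the 75th percentile still reports the experience most visitors actually had.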
5. Largest Contentful Paint
Quality of experience
One second is frequently cited as the length of time a user will wait before losing focus on a task. On closer examination of the relevant research, it was found that one second is an approximation used to describe a range of values, from a few hundred milliseconds to several seconds.
Card et al. and Miller are the two most commonly cited sources for the 1-second threshold. Card, citing Newell’s Unified Theories of Cognition, defines a 1-second “immediate response” threshold. Immediate responses, according to Newell, are “responses that must be made to some stimulus within very approximately one second (that is, roughly from 0.3sec to 3sec).” This follows Newell’s discussion of “real-time constraints on cognition,” which notes that “interactions with the environment which evoke cognitive considerations take place on the order of seconds,” ranging from approximately 0.5 to 2-3 seconds. Miller, the other frequently cited source for the 1-second threshold, observes that “tasks which people can and will undertake with machine communications would fundamentally change their nature if response delays exceed two seconds, with some conceivable extension of another second or so.”
Miller’s and Card’s research defines the length of time a user will wait before losing focus as a range, from roughly 0.3 to 3 seconds, suggesting that the LCP “good” threshold should fall within this range. In addition, given that the existing First Contentful Paint “good” threshold is 1 second, and that Largest Contentful Paint typically occurs after First Contentful Paint, the range of candidate LCP thresholds was further constrained to between 1 and 3 seconds. To choose the threshold in this range that best satisfies the criteria, the achievability of these candidate thresholds was examined.
Using CrUX data, Google’s developers determined the percentage of web origins that would meet each candidate LCP “good” threshold.
% of CrUX origins classified as “good” (for candidate LCP thresholds)
While less than 10% of origins meet the 1-second threshold, all of the other thresholds from 1.5 to 3 seconds satisfy the requirement that at least 10% of origins meet the “good” threshold, and thus remain valid candidates.
In addition, to confirm that the chosen threshold is consistently achievable for well-optimized sites, LCP performance was analyzed for top-performing sites across the web to determine which thresholds these sites consistently achieve at the 75th percentile. It was found that the 1.5- and 2-second thresholds are not consistently achievable, while 2.5 seconds is.
To identify a “poor” threshold for LCP, Google’s developers used CrUX data to identify a threshold met by most origins:
% of CrUX origins classified as “poor” (for candidate LCP thresholds)
For a 4-second threshold, roughly 26% of phone origins and 21% of desktop origins would be classified as poor. This falls within the target range of 10-30%, so Google’s developers concluded that 4 seconds is an acceptable “poor” threshold.
As a result, it was concluded that 2.5 seconds is a reasonable “good” threshold for Largest Contentful Paint, and 4 seconds a reasonable “poor” threshold.
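Taken together, the two thresholds partition LCP values into three bands; the band between “good” and “poor” is commonly labeled “needs improvement”. A minimal sketch of the resulting classification (the function name and labels are illustrative):

```python
def rate_lcp(lcp_seconds):
    """Classify a site's 75th-percentile LCP against the 2020 thresholds:
    "good" at 2.5 s or less, "poor" above 4 s, "needs improvement" between."""
    if lcp_seconds <= 2.5:
        return "good"
    if lcp_seconds <= 4.0:
        return "needs improvement"
    return "poor"

print(rate_lcp(1.9))  # good
print(rate_lcp(3.0))  # needs improvement
print(rate_lcp(4.5))  # poor
```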
6. First Input Delay
Quality of experience
Research suggests that delays in visual feedback of up to 100 milliseconds are perceived as being caused by an associated source, such as a user’s input. As a result, a 100ms First Input Delay “good” threshold is reasonable as a minimum bar: if the delay in processing input exceeds 100ms, there is no chance for the other processing and rendering steps to complete in time.
Michotte’s research into the perception of causality presents participants with experiments of the following form: “Object A starts moving towards B. It comes to a halt when it comes into contact with B, after which B begins to move away from A.” Michotte varies the interval between the moment Object A stops and the moment Object B begins to move. Michotte found that for delays of up to about 100 milliseconds, participants have the impression that Object A causes the motion of Object B. For delays between roughly 100 and 200 milliseconds, the perception of causality is mixed, and for delays of more than 200 milliseconds, the motion of Object B is no longer perceived as being caused by Object A.
Miller, too, establishes a response threshold for control activation: “the indication of action given, normally, by the movement of a key, switch, or other control member that signals it has been physically activated. This response should be perceived as part of the mechanical action induced by the operator,” with a delay between depressing a key and the visual feedback of no more than 0.1 to 0.2 seconds.
In Towards the Temporally Perfect Virtual Button, Kaaresoja et al. studied the perception of simultaneity between touching a virtual button on a touchscreen and the subsequent visual feedback indicating that the button was touched, for various delays. When the delay between button press and visual feedback was 85 milliseconds or less, participants reported that the visual feedback appeared simultaneously with the button press 75 percent of the time. Participants also reported consistently high perceived quality of the button press for delays of 100ms or less, with perceived quality dropping off for delays of 100ms to 150ms and reaching very low values for delays of 300ms.
Based on the above, it was concluded that a value of around 100ms is an appropriate First Input Delay “good” threshold for Web Vitals. In addition, given the very low quality ratings users reported for delays of 300ms or more, 300ms presents itself as a reasonable “poor” threshold.
Using CrUX data, Google’s developers found that the vast majority of origins meet the 100ms FID “good” threshold at the 75th percentile:
% of CrUX origins classified as “good” for FID 100ms threshold
In addition, it was found that the top sites on the web are consistently able to meet this threshold at the 75th percentile (and often meet it at the 95th percentile).
Given the above, Google’s developers concluded that 100 milliseconds is a reasonable “good” threshold for FID.
7. Cumulative Layout Shift
Quality of experience
Cumulative Layout Shift (CLS) is a new metric that quantifies how much a page’s visible content shifts around. Because CLS is so new, there is little research that can directly inform the metric’s thresholds. To arrive at a threshold aligned with user expectations, real-world pages with varying amounts of layout shift were evaluated to establish the maximum amount of shift perceived as acceptable before it causes significant disruption when consuming a page’s content. In this internal testing, shifts of 0.15 and above were consistently rated as disruptive, while shifts of 0.1 and below were noticeable but not excessively disruptive.
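As background on what these numbers represent: each individual layout shift is scored as the product of its impact fraction (the share of the viewport affected by the shift) and its distance fraction (how far the shifted content moves, relative to the viewport), and in the 2020 definition CLS is the sum of these scores over the page’s lifetime. A minimal sketch under those definitions (function names are illustrative):

```python
def layout_shift_score(impact_fraction, distance_fraction):
    """Score of a single layout shift: impact fraction (share of the
    viewport affected) times distance fraction (shift distance relative
    to the viewport's largest dimension), both in [0, 1]."""
    return impact_fraction * distance_fraction

def cumulative_layout_shift(shifts):
    """CLS, as defined in 2020: the sum of all individual shift scores
    observed over the page's lifetime."""
    return sum(layout_shift_score(i, d) for i, d in shifts)

# An element covering half the viewport shifts down by 25% of the
# viewport height; the union of its before/after positions covers
# 75% of the viewport, so impact = 0.75 and distance = 0.25.
print(cumulative_layout_shift([(0.75, 0.25)]))  # 0.1875
```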
Based on CrUX data, Google’s developers could see that roughly half of all origins have a CLS of 0.05 or less.
% of CrUX origins classified as “good” (for candidate CLS thresholds)
While the CrUX data suggests that 0.05 might be a reasonable CLS “good” threshold, it was recognized that in some use cases disruptive layout shifts currently cannot be avoided. For example, for third-party embedded content such as social media embeds, the height of the embed is often not known until it has finished loading, which can lead to a layout shift greater than 0.05. It was therefore concluded that, while many origins meet the 0.05 threshold, the slightly less strict CLS threshold of 0.1 strikes a better balance between experience quality and achievability. It is hoped that the web community will find ways to eliminate layout shifts caused by third-party embeds, allowing a stricter CLS “good” threshold of 0.05 or 0 to be used in a future iteration of Core Web Vitals.
Additionally, CrUX data was used to select a “poor” CLS threshold, again by identifying a threshold met by most origins:
% of CrUX origins classified as “poor” (for candidate CLS thresholds)
For a 0.25 threshold, roughly 20% of phone origins and 18% of desktop origins would be classified as “poor”. This falls within the target range of 10-30%, so it was concluded that 0.25 is an acceptable “poor” threshold.