Description
What data can and should we pull from the HTTP Archive?
Dump of messages from Discord from @anniesullie:
Hi everybody! I'm Annie, I work on Core Web Vitals and I've done a lot of analysis using HTTP Archive to find ways to speed up the web. I think there is huge potential to understand real-world metaframework usage and optimize the patterns that developers are using.
The big thing I wanted to point out is that the dashboard which shows CWV for HTTP Archive sites is just a very small part of HTTP Archive. HTTP Archive is a BigQuery dataset from a monthly crawl of over 15 million sites. The crawl runs Lighthouse, Wappalyzer technology detection, and WebPageTest. It provides a massive amount of fine-grained data on performance. For example, the Lighthouse JSON has all the User Timing data: https://developer.chrome.com/docs/lighthouse/performance/user-timings#how_lighthouse_reports_user_timing_data
And all the performance insights: https://developer.chrome.com/docs/performance/insights
You can see the docs on querying it at https://har.fyi/
So you could imagine taking the runtime metrics from https://github.com/e18e/framework-tracker/blob/main/initial-comparison-list.md#-runtime-performance and adding user timings or Lighthouse audits for them. That would give you a monthly dataset with lab runs of all those metrics for every site that uses them and has a decent amount of traffic. We've used this approach in the past to find consent management providers that slow down INP, and third-party embeds that block bfcache. I think it would be really helpful to find the biggest problems and start burning them down.
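As a rough illustration of the user timing idea, here is a minimal Python sketch that pulls marks and measures out of one Lighthouse JSON report. It assumes the report follows the usual Lighthouse layout, where the `user-timings` audit lists its entries under `details.items` with `name`, `timingType`, `startTime` and `duration` fields. The function names and the `framework:` prefix in the usage note are hypothetical, not something from HTTP Archive or the framework tracker.

```python
from typing import Any


def extract_user_timings(report: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the User Timing entries recorded in one Lighthouse report.

    Assumes the entries live under audits["user-timings"]["details"]["items"].
    Returns an empty list when the audit or its details are missing.
    """
    audit = report.get("audits", {}).get("user-timings", {})
    items = (audit.get("details") or {}).get("items", [])
    return [
        {
            "name": item.get("name"),
            "type": item.get("timingType"),      # "Mark" or "Measure"
            "start_ms": item.get("startTime"),
            "duration_ms": item.get("duration"),  # absent for marks
        }
        for item in items
    ]


def timings_with_prefix(report: dict[str, Any], prefix: str) -> list[dict[str, Any]]:
    """Keep only the entries whose name starts with a (hypothetical) prefix."""
    return [
        timing
        for timing in extract_user_timings(report)
        if (timing["name"] or "").startswith(prefix)
    ]
```

If a metaframework emitted measures with an agreed prefix, something like `timings_with_prefix(report, "framework:")` would pick them out of each crawled page's report.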
Yeah, I think combining either Lighthouse audits or user timing data with your metrics would give you some really great real-world data about how metaframeworks are used. For filtering your query, the best thing to do is to make sure that HTTP Archive's Wappalyzer fork (https://github.com/HTTPArchive/wappalyzer) correctly reports the metaframeworks and their versions. Then you can filter for those using the technologies column in HTTP Archive: https://har.fyi/reference/tables/pages/#technologies
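A filter on the technologies column might look roughly like the sketch below. It only builds the query text and its parameters in Python and does not call BigQuery. The column names (`date`, `client`, `is_root_page`, `page`, and `technologies` as a repeated record with `technology` and `info` fields) are my reading of the pages table reference linked above, so please check them against the docs before running anything. The helper name and the `Next.js` example value are assumptions for illustration.

```python
from datetime import date

# BigQuery Standard SQL with named parameters (@crawl_date, @client, @technology).
# Column names are assumed from the pages table reference; verify before use.
TECHNOLOGY_QUERY = """
SELECT
  page,
  t.technology,
  t.info
FROM `httparchive.crawl.pages`,
  UNNEST(technologies) AS t
WHERE date = @crawl_date
  AND client = @client
  AND is_root_page
  AND t.technology = @technology
"""


def technology_query_params(
    technology: str, crawl_date: str, client: str = "mobile"
) -> dict[str, str]:
    """Validate the inputs and return the named parameters for TECHNOLOGY_QUERY."""
    date.fromisoformat(crawl_date)  # raises ValueError unless YYYY-MM-DD
    if client not in ("mobile", "desktop"):
        raise ValueError(f"unexpected client: {client!r}")
    if not technology:
        raise ValueError("technology must not be empty")
    return {"crawl_date": crawl_date, "client": client, "technology": technology}
```

For example, `technology_query_params("Next.js", "2025-01-01")` would give the parameters for one month's mobile crawl. Restricting the query to a single crawl date should also keep the amount of data scanned down, since the table covers every monthly crawl.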
I haven't made many examples of how I use it public, because my queries usually end with a list of "worst offenders", and sharing that data privately rather than publicly yields better results 🙂
But I'll go through and find some examples. A non-performance-related one I documented was my talk at BlinkOn in the spring, where I used HTTP Archive to find out why some APIs are adopted so quickly: https://www.youtube.com/watch?v=DdnFlyx0dU0&list=PL9ioqAuyl6UIpdsXtTngdETVXhOMxZHq2&index=2
Slides have speaker notes and links to Google Colabs where you can see the queries: https://docs.google.com/presentation/d/1dCBiu-GmGMAxw-X0Cgj7yP6PO6-eIAog6rs4qU_Dtmk/edit
Also, I found this Colab I made public, which strings together a series of HTTP Archive queries. The topic is a bit esoteric (does INP correlate with TBT better than FID?), but the queries are pretty simple and it has examples of both Core Web Vitals and Lighthouse:
https://colab.research.google.com/drive/12lJmAABgyVjaUbmWvrbzj9BkkTxw6ay2#scrollTo=Kh58l3GlvRzx
I queried for the end-user values we see for those sites for the INP and FID metrics, and then for Lighthouse's TBT metric, which is supposed to help you improve the real-world ones. Then I queried HTTP Archive for the Lighthouse audit around meta viewport tags, because if you don't have one on a mobile site your users will see an automatic 300ms tap delay. Sites that fail that audit generally have low TBT since they're super old, but high INP/FID due to the tap delay. So you can see how I was able to run queries and put those numbers together into charts and graphs, and how you can actually see the tap delay on the scatter plot.
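To make the comparison concrete, here is a small Python sketch of the grouping step, written independently of the Colab. It assumes you already have one row per site with the viewport audit score (1 for pass), the lab TBT, and the field INP. The row keys `viewport_score`, `tbt_ms` and `inp_ms` are hypothetical names, not HTTP Archive columns.

```python
from statistics import median
from typing import Any


def summarize_by_viewport_audit(rows: list[dict[str, Any]]) -> dict[str, dict[str, float]]:
    """Split sites by whether they pass the viewport audit and report
    the site count plus median TBT and INP for each group."""
    groups: dict[str, list[dict[str, Any]]] = {"pass": [], "fail": []}
    for row in rows:
        key = "pass" if row["viewport_score"] == 1 else "fail"
        groups[key].append(row)

    summary: dict[str, dict[str, float]] = {}
    for key, members in groups.items():
        if not members:
            continue  # skip empty groups so median() is never called on no data
        summary[key] = {
            "sites": len(members),
            "median_tbt_ms": median(r["tbt_ms"] for r in members),
            "median_inp_ms": median(r["inp_ms"] for r in members),
        }
    return summary
```

If the tap delay effect holds in your data, the "fail" group would show a lower median TBT but a higher median INP than the "pass" group, which is the same pattern that shows up on the scatter plot.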