
Discussion: HTTP Archive #46

@AlexanderKaran

Description

@AlexanderKaran

What data can and should we pull from the HTTP Archive?

Dump of messages from Discord from @anniesullie:


Hi everybody! I'm Annie, I work on Core Web Vitals and I've done a lot of analysis using HTTP Archive to find ways to speed up the web. I think there is huge potential to understand real-world metaframework usage and optimize the patterns that developers are using.

The big thing I wanted to point out is that the dashboard which shows CWV for HTTP Archive sites is just a very small part of HTTP Archive. HTTP Archive is a BigQuery dataset from a monthly crawl of over 15 million sites. The crawl runs Lighthouse, Wappalyzer technology detection, and WebPageTest. It provides a massive amount of fine-grained data on performance. For example, the Lighthouse JSON has all the User Timing data: https://developer.chrome.com/docs/lighthouse/performance/user-timings#how_lighthouse_reports_user_timing_data

And all the performance insights: https://developer.chrome.com/docs/performance/insights

You can see the docs on querying it at https://har.fyi/

So you could imagine taking the runtime metrics from https://github.com/e18e/framework-tracker/blob/main/initial-comparison-list.md#-runtime-performance, adding user timings or Lighthouse audits for them, and getting a monthly dataset with lab runs of all those metrics for every reasonably high-traffic site that uses them. We've used this approach in the past to find consent management providers that slow down INP, and third-party embeds that block bfcache. I think it would be really helpful for finding the biggest problems and starting to burn them down.
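A rough sketch of what pulling User Timing data out of the crawl could look like, as a query-building helper you might run from a Colab. This assumes the `httparchive.crawl.pages` schema documented at har.fyi (a JSON-typed `lighthouse` column holding the full Lighthouse report) and the Lighthouse report's `audits."user-timings".details.items` path; treat both as assumptions to verify before running.

```python
# Hedged sketch: extract User Timing entries from the Lighthouse JSON stored
# in HTTP Archive's `httparchive.crawl.pages` table. Column names, the JSON
# path, and the date/client values are assumptions based on har.fyi and the
# Lighthouse report format, not verified against the live schema.

def user_timings_query(date: str, client: str = "mobile") -> str:
    """Build a BigQuery SQL string selecting each page's User Timing audit items."""
    return f"""
    SELECT
      page,
      JSON_QUERY(lighthouse, '$.audits."user-timings".details.items') AS user_timings
    FROM `httparchive.crawl.pages`
    WHERE date = '{date}'
      AND client = '{client}'
    """

sql = user_timings_query("2024-06-01")
```

The double-quoted `"user-timings"` in the JSONPath is needed because the audit key contains a hyphen.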


Yeah, I think combining either lighthouse audits or user timing data with your metrics would give you some really great real-world data about how metaframeworks are used. For filtering your query, the best thing to do is to make sure that HTTP Archive's wappalyzer fork (https://github.com/HTTPArchive/wappalyzer) correctly reports the metaframeworks and their versions. Then you can filter for those using the technologies column in HTTP Archive: https://har.fyi/reference/tables/pages/#technologies
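The filtering step described above could be sketched like this, again as a Colab-style query builder. The `technologies` column shape (an array of structs with a `technology` field) follows the har.fyi reference linked above; the framework name and date are placeholder values, not data from this thread.

```python
# Sketch of filtering HTTP Archive pages by a Wappalyzer-detected technology,
# using the `technologies` column documented at har.fyi. The technology name
# 'Next.js' and the date below are illustrative placeholders.

def technology_filter_query(technology: str, date: str, client: str = "mobile") -> str:
    """Build a BigQuery SQL string selecting pages where Wappalyzer detected `technology`."""
    return f"""
    SELECT page, rank
    FROM `httparchive.crawl.pages`
    WHERE date = '{date}'
      AND client = '{client}'
      AND EXISTS (
        SELECT 1 FROM UNNEST(technologies) AS t
        WHERE t.technology = '{technology}'
      )
    """

sql = technology_filter_query("Next.js", "2024-06-01")
```

This is also why getting the metaframework detections right in HTTP Archive's Wappalyzer fork matters: the `EXISTS ... UNNEST(technologies)` filter is only as good as the detection rules feeding it.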
I haven't made many examples of how I use it public, because my queries usually end with a list of "worst offenders", and sharing that data privately rather than publicly yields better results 🙂

But I'll go through and find some examples. A non-performance-related one I documented was my talk at BlinkOn this spring, where I used HTTP Archive to find out why some APIs are adopted so quickly: https://www.youtube.com/watch?v=DdnFlyx0dU0&list=PL9ioqAuyl6UIpdsXtTngdETVXhOMxZHq2&index=2

Slides have speaker notes and links to Google Colabs where you can see the queries: https://docs.google.com/presentation/d/1dCBiu-GmGMAxw-X0Cgj7yP6PO6-eIAog6rs4qU_Dtmk/edit


Also, I found this colab I made public which strings together a series of HTTP Archive queries. The topic is a bit esoteric (does INP correlate with TBT better than FID does?), but the queries are pretty simple, and it has examples of both Core Web Vitals and Lighthouse:

https://colab.research.google.com/drive/12lJmAABgyVjaUbmWvrbzj9BkkTxw6ay2#scrollTo=Kh58l3GlvRzx

I queried for the end-user values we see for sites on the INP and FID metrics, and then for Lighthouse's TBT metric, which is supposed to help you improve the real-world ones. Then I queried HTTP Archive for the Lighthouse audit around meta viewport tags, because if you don't have one on a mobile site, your users will see an automatic 300ms tap delay. Sites that fail that audit generally have low TBT (since they're super old) but high INP/FID due to the tap delay. So you can see how I was able to run queries, put those numbers together into charts and graphs, and actually see the tap delay on the scatter plot.
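Once per-site numbers like these are pulled into a DataFrame, the "does lab metric X track field metric Y" question is a one-line correlation. This is an illustrative sketch of that last step, not Annie's actual queries; the column names are assumptions.

```python
# Illustrative sketch: compare how well a lab metric (e.g. Lighthouse TBT)
# tracks a field metric (e.g. INP or FID) across sites, once both have been
# queried into a DataFrame. Column names 'tbt' and 'inp' are hypothetical.
import pandas as pd

def lab_vs_field_correlation(df: pd.DataFrame, lab: str, field: str) -> float:
    """Spearman rank correlation between a lab metric column and a field metric column."""
    return df[lab].corr(df[field], method="spearman")

# Tiny made-up example: three sites where the lab metric ranks sites the same
# way the field metric does, so the rank correlation is exactly 1.0.
sites = pd.DataFrame({"tbt": [100, 200, 300], "inp": [150, 250, 350]})
```

Spearman (rank) correlation is a reasonable default here because metrics like TBT and INP are heavily skewed, so raw Pearson correlation would be dominated by outlier sites.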

