HTTP Archive datasets pipeline

This repository handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves them to the httparchive datasets in BigQuery.

Pipelines

The pipelines run in the Dataform service in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. The code in the main branch is used for each triggered pipeline run.

Crawl results

Tag: crawl_complete

  • httparchive.crawl.pages
  • httparchive.crawl.parsed_css
  • httparchive.crawl.requests
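
Each of these outputs is produced by a Dataform action tagged crawl_complete, so the whole group can be run together. As a minimal sketch, a tagged action in Dataform SQLX looks like this (the table and source names are placeholders; the real definitions live in definitions/output/):

    config {
      type: "table",
      schema: "crawl",
      tags: ["crawl_complete"]
    }

    SELECT *
    FROM ${ref("example_source")} -- placeholder source declaration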

Core Web Vitals Technology Report

Tag: crux_ready

  • httparchive.core_web_vitals.technologies

Blink Features Report

Tag: crawl_complete

  • httparchive.blink_features.features
  • httparchive.blink_features.usage

Schedules

  1. crawl-complete PubSub subscription

    Tags: ["crawl_complete"]

  2. bq-poller-crux-ready Scheduler

    Tags: ["crux_ready"]

Triggering workflows

To unify the workflow triggering mechanism, we use a Cloud Run function that can be invoked in a number of ways (e.g. by listening to PubSub messages), performs intermediate checks, and triggers the appropriate Dataform workflow execution configuration.
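
As a rough illustration, such a function could use the @google-cloud/dataform Node.js client to compile the main branch and then invoke only the actions carrying the requested tags. This is a simplified sketch with placeholder project, location and repository names; the actual implementation lives in infra/dataform-trigger/ and may differ:

    const { DataformClient } = require('@google-cloud/dataform')

    const client = new DataformClient()
    // Placeholder resource name, not the production repository.
    const REPO = 'projects/example-project/locations/us-central1/repositories/dataform'

    // Compile the main branch, then run only the actions carrying the given tags.
    async function triggerWorkflow (tags) {
      const [compilation] = await client.createCompilationResult({
        parent: REPO,
        compilationResult: { gitCommitish: 'main' }
      })
      const [invocation] = await client.createWorkflowInvocation({
        parent: REPO,
        workflowInvocation: {
          compilationResult: compilation.name,
          invocationConfig: { includedTags: tags }
        }
      })
      return invocation.name
    }

Under this sketch, the crawl-complete PubSub subscription would result in a call like triggerWorkflow(['crawl_complete']).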

Contributing

Dataform development

  1. Create a new dev workspace in Dataform.
  2. Make adjustments to the Dataform configuration files and manually run a workflow to verify them.
  3. Push your changes to a dev branch and open a PR with a link to the BigQuery artifacts generated in the test workflow.

Workspace hints

  1. In workflow_settings.yaml set environment: dev to process sampled data (see the snippet after this list).
  2. For development and testing, you can modify variables in includes/constants.js, but note that these are programmatically generated.
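
For example, the dev switch might look like this in workflow_settings.yaml (the exact key layout is assumed here; Dataform typically nests custom variables under vars):

    vars:
      environment: dev # process sampled data instead of the full crawl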

Repository Structure

  • definitions/ - Contains the core Dataform SQL definitions and declarations
    • output/ - Contains the main pipeline transformation logic
    • declarations/ - Contains declarations for referenced tables and views, and definitions for other resources
  • includes/ - Contains shared JavaScript utilities and constants
  • infra/ - Infrastructure code and deployment configurations
    • dataform-trigger/ - Cloud Run function for workflow automation
    • tf/ - Terraform configurations
    • bigquery-export/ - BigQuery export configurations
  • docs/ - Additional documentation

Development Setup

  1. Install dependencies:

    npm install
  2. Available Scripts:

    • npm run format - Format code using Standard.js, fix Markdown issues, and format Terraform files
    • npm run lint - Run linting checks on JavaScript, Markdown files, and compile Dataform configs

Code Quality

This repository uses:

  • Standard.js for JavaScript code style
  • Markdownlint for Markdown file formatting
  • Dataform's built-in compiler for SQL validation
