This repository contains the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves them to the httparchive dataset in BigQuery.

The pipelines run in the Dataform service on Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. Each triggered pipeline run uses the code in the main branch.

Tag: crawl_complete
- httparchive.crawl.pages
- httparchive.crawl.parsed_css
- httparchive.crawl.requests

Tag: crux_ready
- httparchive.core_web_vitals.technologies

Consumers:

Tag: crawl_complete
- httparchive.blink_features.features
- httparchive.blink_features.usage

Consumers:
- chromestatus.com - example

- crawl-complete PubSub subscription
  - Tags: ["crawl_complete"]
- bq-poller-crux-ready Scheduler
  - Tags: ["crux_ready"]

To unify the workflow triggering mechanism, we use a Cloud Run function that can be invoked in a number of ways (e.g. by listening to PubSub messages), performs intermediate checks, and triggers the relevant Dataform workflow execution configuration.
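
The dispatch step described above can be sketched as follows. This is a minimal illustration, not the function deployed from this repository: the `TRIGGERS` table, the `type` field, and the helper names `decodePubSubData` and `resolveTags` are hypothetical. Only the tag values come from the trigger list above. The call that actually starts the Dataform workflow is deliberately left out.

```javascript
// Hypothetical trigger table. The keys and the `type` field are illustrative;
// only the tag values are taken from the trigger list in this README.
const TRIGGERS = new Map([
  ['crawl_complete', { type: 'event', tags: ['crawl_complete'] }],
  ['crux_ready', { type: 'poller', tags: ['crux_ready'] }]
])

// Pub/Sub push requests carry the payload base64-encoded in message.data.
function decodePubSubData (body) {
  const data = body && body.message && body.message.data
  return data ? Buffer.from(data, 'base64').toString('utf8') : ''
}

// Resolve a trigger name to the tags that should be executed.
// `checkPassed` stands in for whatever intermediate check the trigger needs
// (for example, a poller confirming that the upstream data has landed).
function resolveTags (triggerName, checkPassed = true) {
  const trigger = TRIGGERS.get(triggerName)
  if (!trigger) throw new Error(`Unknown trigger: ${triggerName}`)
  return checkPassed ? trigger.tags : []
}
```

In this sketch an HTTP handler would decode the incoming message, call `resolveTags`, and start a workflow only when the returned list is non-empty, so every invocation path shares the same checks.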

- Create a new dev workspace in Dataform.
- Make adjustments to the Dataform configuration files and manually run a workflow to verify.
- Push all your changes to a dev branch and open a PR with a link to the BigQuery artifacts generated in the test workflow.

- In workflow_settings.yaml, set environment: dev to process sampled data.
- For development and testing, you can modify variables in includes/constants.js, but note that these are programmatically generated.
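
To illustrate how an environment setting can switch a query to sampled data, here is a minimal sketch. The helper name `samplingClause`, the default sample rate, and the use of BigQuery's `TABLESAMPLE SYSTEM` clause are assumptions made for the example. The repository's own sampling logic may work differently.

```javascript
// Hypothetical helper, not taken from includes/constants.js.
// Returns a BigQuery TABLESAMPLE clause in dev and an empty string otherwise.
function samplingClause (environment, samplePercent = 1) {
  if (environment !== 'dev') return ''
  return `TABLESAMPLE SYSTEM (${samplePercent} PERCENT)`
}

// Example: append the clause to a table reference when building a query.
function buildQuery (table, environment) {
  return `SELECT * FROM ${table} ${samplingClause(environment)}`.trim()
}
```

With `environment` set to `dev`, `buildQuery` reads a sample of the table. Any other value leaves the query unsampled.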

- definitions/ - Contains the core Dataform SQL definitions and declarations
  - output/ - Contains the main pipeline transformation logic
  - declarations/ - Contains declarations of referenced tables/views and other resource definitions
- includes/ - Contains shared JavaScript utilities and constants
- infra/ - Infrastructure code and deployment configurations
  - dataform-trigger/ - Cloud Run function for workflow automation
  - tf/ - Terraform configurations
  - bigquery-export/ - BigQuery export configurations
- docs/ - Additional documentation

- Install dependencies: npm install
- Available Scripts:
  - npm run format - Format code using Standard.js, fix Markdown issues, and format Terraform files
  - npm run lint - Run linting checks on JavaScript and Markdown files, and compile Dataform configs

This repository uses:
- Standard.js for JavaScript code style
- Markdownlint for Markdown file formatting
- Dataform's built-in compiler for SQL validation