Starred repositories
Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
Create a file tree with the raw data from a zip file in a usable format
Distribute and run LLMs with a single file.
DuckDB is an analytical in-process SQL database management system
Scientific articles using or citing Common Crawl data
A modern and functional monospaced typeface with a focus on legibility.
⚡️ A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
Fast and easy library for reading and writing WARC (Web Archive) files
The repository contains Google's robots.txt parser and matcher as a C++ library (compliant with C++11).
Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.
Notes for software engineers getting up to speed on new AI developments. Serves as a datastore for https://latent.space writing, and product brainstorming, but has cleaned up canonical references und…
A Python framework for performing information retrieval experiments, building on http://terrier.org/
You know, an awesome list of search engines.
Process Common Crawl data with Python and Spark
Java implementation of the Internet Research Lab Web Crawler (IRLbot) as presented by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov in their paper "IRLbot: Scaling to 6 Billion …
Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
Add website scraping abilities to Datasette
Ansible playbook for deploying a Storm cluster
Initial public release of code, data, and model weights for FourCastNet
Index Common Crawl archives in tabular format
Draw pretty maps from OpenStreetMap data! Built with osmnx + matplotlib + shapely
Run various language identification tools on WET (Web archive extracted text) packages
Builds a tantivy index from Common Crawl warc.wet files
Java library to check for multiple regexps with a single deterministic automaton. Really just a wrapper around dk.brics.automaton.
Run a high-fidelity browser-based web archiving crawler in a single Docker container
Core Python Web Archiving Toolkit for replay and recording of web archives