sebastian-nagel

Starred repositories

Common web archive utility code.

Java 54 72 Updated Mar 11, 2025

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).

Python 1,876 212 Updated Mar 11, 2025

Create a file tree with the raw data from a zip file in usable format

Python 1 1 Updated Oct 16, 2024

Distribute and run LLMs with a single file.

C++ 21,925 1,151 Updated Mar 11, 2025

Simple and flexible tool for managing secrets

Go 17,941 913 Updated Mar 10, 2025

DuckDB is an analytical in-process SQL database management system

C++ 27,274 2,144 Updated Mar 11, 2025

Scientific articles using or citing Common Crawl data

Jupyter Notebook 19 3 Updated Feb 28, 2025

A modern and functional monospaced typeface with a focus on legibility.

120 3 Updated May 16, 2023

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

Python 528 26 Updated Jul 1, 2024

Fast and easy library for reading and writing WARC (Web Archive) files

C# 2 Updated May 15, 2024

The repository contains Google's robots.txt parser and matcher as a C++ library (compliant to C++11).

C++ 3,410 239 Updated Aug 2, 2024

A library to mutate parquet files

Java 19 5 Updated May 9, 2023

Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.

Python 7,668 1,130 Updated May 22, 2024

notes for software engineers getting up to speed on new AI developments. Serves as datastore for https://latent.space writing, and product brainstorming, but has cleaned up canonical references und…

HTML 5,566 462 Updated Feb 20, 2025

A Python framework for performing information retrieval experiments, building on http://terrier.org/

Python 438 65 Updated Mar 6, 2025

You know, an awesome list of search engines.

21 1 Updated Feb 7, 2025

Process Common Crawl data with Python and Spark

Python 422 88 Updated Feb 11, 2025

Java implementation of the Internet Research Lab Web Crawler (IRLbot) as presented by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov in their paper "IRLbot: Scaling to 6 Billion …

Java 16 1 Updated May 25, 2017

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).

Cython 1,226 72 Updated Feb 22, 2025

Add website scraping abilities to Datasette

Python 62 1 Updated Mar 4, 2023

Ansible playbook for deploying a Storm cluster

7 1 Updated Dec 7, 2023

Initial public release of code, data, and model weights for FourCastNet

Python 566 138 Updated Oct 2, 2023

Index Common Crawl archives in tabular format

Java 113 9 Updated Mar 10, 2025

Draw pretty maps from OpenStreetMap data! Built with osmnx + matplotlib + shapely

Jupyter Notebook 11,559 547 Updated Mar 4, 2025

Run various language identification tools on WET (Web archive extracted text) packages

Shell 1 Updated Jul 1, 2021

Builds a tantivy index from Common Crawl warc.wet files

Rust 11 1 Updated Dec 22, 2024

Java library to check for multiple regexp with a single deterministic automaton. Just a wrapper around dk.brics.automaton really.

Java 71 29 Updated Mar 29, 2017

Run a high-fidelity browser-based web archiving crawler in a single Docker container

TypeScript 719 97 Updated Mar 5, 2025

Core Python Web Archiving Toolkit for replay and recording of web archives

JavaScript 1,467 226 Updated Mar 11, 2025