Sebastian Nagel sebastian-nagel

118 followers · 3 following

@commoncrawl
Konstanz, Germany
https://de.linkedin.com/pub/sebastian-nagel/35/320/8b4
https://orcid.org/0000-0002-3944-224X

Starred repositories

iipc / webarchive-commons

Common web archive utility code.

Java 54 72 Updated Mar 11, 2025

john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).

Python 1,876 212 Updated Mar 11, 2025

patrikaxelsson / zip2gz

Create a file tree with the raw data from a zip file in usable format

Python 1 1 Updated Oct 16, 2024

Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.

C++ 21,925 1,151 Updated Mar 11, 2025

getsops / sops

Simple and flexible tool for managing secrets

Go 17,941 913 Updated Mar 10, 2025

duckdb / duckdb

DuckDB is an analytical in-process SQL database management system

C++ 27,274 2,144 Updated Mar 11, 2025

commoncrawl / cc-citations

Scientific articles using or citing Common Crawl data

Jupyter Notebook 19 3 Updated Feb 28, 2025

vaughantype / wumpus-mono

A modern and functional monospaced typeface with a focus on legibility.

120 3 Updated May 16, 2023

AmenRa / ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

Python 528 26 Updated Jul 1, 2024

acidus99 / Warc.Net

Fast and easy library for reading and writing WARC (Web Archive) files

C# 2 Updated May 15, 2024

google / robotstxt

The repository contains Google's robots.txt parser and matcher as a C++ library (compliant to C++11).

C++ 3,410 239 Updated Aug 2, 2024

Factual / parquet-rewriter

A library to mutate parquet files

Java 19 5 Updated May 9, 2023

google / diff-match-patch

Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.

Python 7,668 1,130 Updated May 22, 2024

swyxio / ai-notes

notes for software engineers getting up to speed on new AI developments. Serves as datastore for https://latent.space writing, and product brainstorming, but has cleaned up canonical references und…

HTML 5,566 462 Updated Feb 20, 2025

terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/

Python 438 65 Updated Mar 6, 2025

davidshq / awesome-search-engines

You know, an awesome list of search engines.

21 1 Updated Feb 7, 2025

commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark

Python 422 88 Updated Feb 11, 2025

RovoMe / JIRLbot

Java implementation of the Internet Research Lab Web Crawler (IRLbot) as presented by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov in their paper "IRLbot: Scaling to 6 Billion …

Java 16 1 Updated May 25, 2017

rushter / selectolax

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).

Cython 1,226 72 Updated Feb 22, 2025

cldellow / datasette-scraper

Add website scraping abilities to Datasette

Python 62 1 Updated Mar 4, 2023

DigitalPebble / ansible-storm

Ansible playbook for deploying a Storm cluster

7 1 Updated Dec 7, 2023

NVlabs / FourCastNet

Initial public release of code, data, and model weights for FourCastNet

Python 566 138 Updated Oct 2, 2023

commoncrawl / cc-index-table

Index Common Crawl archives in tabular format

Java 113 9 Updated Mar 10, 2025

marceloprates / prettymaps

Draw pretty maps from OpenStreetMap data! Built with osmnx + matplotlib + shapely

Jupyter Notebook 11,559 547 Updated Mar 4, 2025

spyysalo / langid-wet

Run various language identification tools on WET (Web archive extracted text) packages

Shell 1 Updated Jul 1, 2021

sebastian-nagel / WebArchiveWithParquetAvro

Forked from xw0078/WebArchiveWithParquetAvro

Scala 1 Updated Jan 29, 2020

ahcm / tantivy_warc_indexer

builds a tantivy index from common crawl warc.wet files

Rust 11 1 Updated Dec 22, 2024

fulmicoton / multiregexp

Java library to check for multiple regexp with a single deterministic automaton. Just a wrapper around dk.brics.automaton really.

Java 71 29 Updated Mar 29, 2017

webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container

TypeScript 719 97 Updated Mar 5, 2025

webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives

JavaScript 1,467 226 Updated Mar 11, 2025

Starred topics

spider

web-crawler

Crawler