Activity

feat(CCIndexWarcExport): increase number of retries fetching a WARC r…

sebastian-nagel pushed 1 commit to main • 08d441c…294d699 • 25 days ago

build: update dependency versions

Force push
sebastian-nagel force pushed to crawler-commons-dev • 2244bfd…096360b • 25 days ago

Merge pull request #34 from commoncrawl/eot-archive-converter

Pull request merge
sebastian-nagel pushed 3 commits to main • fcbed8b…08d441c • on Nov 19, 2024

Add unit tests for EOT CDX-to-Parquet converter

sebastian-nagel created eot-archive-converter • 16741c0 • on Oct 29, 2024

Merge pull request #33 from commoncrawl/github-workflow

Pull request merge
sebastian-nagel pushed 4 commits to main • 8023ead…fcbed8b • on Oct 22, 2024

build: update unit test JVM args for Java 17 and 21

Force push
sebastian-nagel force pushed to github-workflow • cf4bcde…faf6241 • on Oct 3, 2024

fix(javadoc): add missing param and return documentation

sebastian-nagel pushed 2 commits to main • e302756…8023ead • on Oct 3, 2024

build: disable Java 21 in Github workflow because not supported by Sp…

sebastian-nagel pushed 1 commit to github-workflow • 23cdb02…cf4bcde • on Oct 3, 2024

build: add Github workflow to verify proper compilation and build

sebastian-nagel created github-workflow • 23cdb02 • on Oct 3, 2024

Deleted branch

sebastian-nagel deleted 8-system-exit-stop-session • on Oct 2, 2024

Deleted branch

sebastian-nagel deleted 25-normalize • on Oct 2, 2024

filter out null lines/entries

jt55401 pushed 7 commits to news-and-wat-wet-compatibilities • fc800a6…100e186 • on Sep 10, 2024

Roll back to commons-cli 1.2 to be compatible with Hadoop 3.3.4

Force push
sebastian-nagel force pushed to crawler-commons-dev • 6f535e6…2244bfd • on Jul 16, 2024

Upgrade dependencies

sebastian-nagel pushed 2 commits to main • 21fd714…e302756 • on Jul 16, 2024

Add CDX-to-Parquet converter prototype for the end-of-term archive

sebastian-nagel created eot-archive • 8e0b776 • on May 31, 2024

Modified convert_url_index.sh to prefer user classes over those inclu…

jt55401 created news-and-wat-wet-compatibilities • fc800a6 • on May 31, 2024

documentation

wumpus pushed 1 commit to main • 8d6bfbd…21fd714 • on Sep 19, 2023

2023 Sept/Oct crawl, 700k homepages

wumpus pushed 1 commit to main • 34bbb63…8d6bfbd • on Sep 19, 2023

random sample to create an extracted warc

wumpus pushed 1 commit to main • 62a8ab7…34bbb63 • on Aug 27, 2023

Roll back to commons-cli 1.2 to be compatible with Hadoop 3.3.4

Force push
sebastian-nagel force pushed to crawler-commons-dev • 4bda3e0…6f535e6 • on Aug 27, 2023

Upgrade dependencies

sebastian-nagel pushed 1 commit to main • 88b7d53…62a8ab7 • on Aug 27, 2023

Fix ordering of function parameters in example SQL query

sebastian-nagel pushed 1 commit to main • df856df…88b7d53 • on Aug 27, 2023

Bump guava from 31.1-jre to 32.0.0-jre

dependabot[bot] created dependabot/maven/com.google.guava-guava-32.0.0-jre • 0fe2418 • on Jun 14, 2023

Bump spark-core_2.12 from 3.3.2 to 3.4.0

Add link to Spark documentation to the README, add Trino to Presto

sebastian-nagel pushed 1 commit to main • f0add54…df856df • on Apr 6, 2023

Roll back to commons-cli 1.2 to be compatible with Hadoop 3.3.4

sebastian-nagel pushed 1 commit to crawler-commons-dev • 768f01f…4bda3e0 • on Apr 4, 2023

Use crawler-commons development version

Force push
sebastian-nagel force pushed to crawler-commons-dev • d02de39…768f01f • on Apr 4, 2023

IndexTable: replace deprecated APIs (gson, commons-cli)

sebastian-nagel pushed 4 commits to main • 2dee94f…f0add54 • on Apr 4, 2023

IndexTable: replace deprecated APIs (gson, commons-cli)

sebastian-nagel created 25-normalize • 63f644d • on Apr 2, 2023

Use crawler-commons development version

Force push
sebastian-nagel force pushed to crawler-commons-dev • 71556ec…d02de39 • on Apr 1, 2023