Common Crawl Audio

Tools for downloading and preprocessing the Common Crawl Audio dataset.

Overview

This is a Python package for collecting and processing audio data from Common Crawl. The collected data is published on Hugging Face.

Dataset Statistics

Based on Whisper-AT tagging results (estimated from a 1/10,000 subset of the data):

Content Type	Ratio
Speech	63.9 %
Narration, monologue	9.5 %
Music	7.8 %
Male speech, man speaking	5.8 %
Others	13.0 %

License Notice

This repository is licensed under the Apache License 2.0, but this license applies only to the tools and code in this repository, not to the data downloaded using these tools. The downloaded data remains under the original licenses of their respective sources. Please be sure to check and comply with the licensing terms and conditions of each data source before using the downloaded data.

Requirements

uv (Python package manager)
Sufficient disk space (approximately 2 TB for Japanese audio only)

Setup

uv sync

Usage

1. Data Download

This tool downloads audio files using a pre-collected list of audio URLs hosted on Hugging Face. The list was created by crawling RSS feeds and extracting the audio URLs they contain.

The tool reads this list from Hugging Face and downloads the actual audio files from their original sources. The downloaded audio is saved in Lhotse Shar format.

cd src/ccaudio/ccaudio_downloader
uv run scrapy crawl ccaudio_spider -s SHAR_OUTPUT_DIR=/path/to/shar/dir/

Parameters:

SHAR_OUTPUT_DIR: Directory path to save downloaded audio in shar format
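
To check what the download has written so far, a minimal sketch like the one below can inventory the shards in SHAR_OUTPUT_DIR. It assumes the usual Lhotse Shar file naming (for example cuts.000000.jsonl.gz and recording.000000.tar); the helper name list_shards is hypothetical and is not part of this repository.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed Lhotse Shar naming: "<field>.<6-digit shard index>.<extension>",
# e.g. cuts.000000.jsonl.gz or recording.000000.tar.
SHARD_NAME = re.compile(r"^(?P<field>[A-Za-z_]+)\.(?P<index>\d{6})\.(?P<ext>.+)$")


def list_shards(shar_dir):
    """Map each field name to the sorted shard indices found in shar_dir."""
    shards = defaultdict(list)
    for path in Path(shar_dir).iterdir():
        match = SHARD_NAME.match(path.name)
        if match and path.is_file():
            shards[match["field"]].append(int(match["index"]))
    return {field: sorted(indices) for field, indices in shards.items()}


if __name__ == "__main__":
    # Replace with the directory passed as SHAR_OUTPUT_DIR.
    for field, indices in list_shards("/path/to/shar/dir/").items():
        print(f"{field}: {len(indices)} shard(s)")
```

Files that do not match the pattern are ignored, so stray logs or temporary files in the directory do not affect the count.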

Note: This code is configured to download only items where the language column is ja, ja_JP, ja-jp, or ja-JP. The estimated download time with Japanese filtering is approximately 2-3 days. To change this filtering, edit the LANGUAGE_ITEMS setting in settings.py:

# Dataset settings
DATASET_NAME = "llm-jp/cc-audio-2025-18-rss"

# Set LANGUAGE_ITEMS=[] if you don't want to filter by language
LANGUAGE_ITEMS = ["ja", "ja_JP", "ja-jp", "ja-JP"]
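
The effect of LANGUAGE_ITEMS can be sketched as follows. This is an illustration, not the spider's actual code: it assumes the filter is an exact, case-sensitive match on the language column (which would explain why the default list spells out several variants), and the url column name and the keep_row helper are hypothetical.

```python
def keep_row(row, language_items):
    """Return True if the row passes the language filter."""
    if not language_items:
        # An empty list disables filtering, as described in settings.py.
        return True
    # Assumption: exact, case-sensitive match against the language column.
    return row.get("language") in language_items


LANGUAGE_ITEMS = ["ja", "ja_JP", "ja-jp", "ja-JP"]

# Hypothetical rows; only the "language" column is named in this README.
rows = [
    {"url": "https://example.com/a.mp3", "language": "ja"},
    {"url": "https://example.com/b.mp3", "language": "en"},
    {"url": "https://example.com/c.mp3", "language": "JA"},
]

kept = [row["url"] for row in rows if keep_row(row, LANGUAGE_ITEMS)]
print(kept)
```

Under this assumption a tag such as JA would be skipped, so any additional spelling you want to keep has to be added to the list explicitly.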

2. Data Preprocessing

Process the downloaded data and convert it to a more usable format. The preprocessing includes:

Resampling
Denoising with demucs

uv run src/ccaudio/preprocess.py \
  --shar_dir /path/to/shar/dir \
  --output_dir /path/to/output/dir \
  --sr 16000

Parameters:

--shar_dir: Directory containing the downloaded shar files
--output_dir: Directory to save preprocessed audio in shar format
--sr: Target sampling rate in Hz for resampling (16000 in the example above)
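
To get a feel for what resampling to 16000 Hz does to the data size, the arithmetic can be sketched as below. This only computes the rational resampling factor and the resulting sample count; it does not resample audio, and it is not taken from preprocess.py. The helper name resample_plan is hypothetical, and a real resampler may round the output length differently.

```python
from fractions import Fraction


def resample_plan(source_sr, target_sr, num_samples):
    """Return (up, down, out_samples) for a rational resampling factor."""
    ratio = Fraction(target_sr, source_sr)  # reduced to lowest terms
    up, down = ratio.numerator, ratio.denominator
    out_samples = num_samples * up // down  # floor; real resamplers may differ
    return up, down, out_samples


if __name__ == "__main__":
    # Ten seconds of 44.1 kHz audio resampled to 16 kHz.
    up, down, out_samples = resample_plan(44100, 16000, 441000)
    print(f"upsample by {up}, downsample by {down}, {out_samples} samples out")
```

For 44.1 kHz input the factor reduces to 160/441, so the preprocessed audio holds roughly 36 % as many samples as the original.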

3. Using the Downloaded Data

See src/ccaudio/load_shar_sample.py for a reference example of loading the shar data.

uv run src/ccaudio/load_shar_sample.py --shar_dir /path/to/shar/dir/
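
For reading the data from your own code, a minimal sketch using Lhotse's CutSet.from_shar might look like the following. It is not the contents of load_shar_sample.py; the helper names summarize_cut and print_first_cuts are hypothetical, and it assumes the lhotse package is installed in the environment.

```python
def summarize_cut(cut_id, duration_s, sampling_rate):
    """Format a one-line description of a cut."""
    return f"{cut_id}: {duration_s:.2f} s @ {sampling_rate} Hz"


def print_first_cuts(shar_dir, limit=3):
    """Print a summary of the first few cuts stored in a shar directory."""
    # Imported here so the formatting helper stays usable without lhotse.
    from lhotse import CutSet

    # from_shar reads the shards lazily, so nothing is loaded up front.
    cuts = CutSet.from_shar(in_dir=shar_dir)
    for i, cut in enumerate(cuts):
        if i >= limit:
            break
        print(summarize_cut(cut.id, cut.duration, cut.sampling_rate))
        # cut.load_audio() would return the waveform as a NumPy array.


if __name__ == "__main__":
    print_first_cuts("/path/to/shar/dir/")
```

Iterating lazily keeps memory use low, which matters for a dataset of this size.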

Citation

If you use this dataset or these tools in your research, please cite:

@inproceedings{ccaudio2025,
  author    = {淺井 航平 and 杉浦 一瑳 and 中田 亘 and 栗田 修平 and 高道 慎之介 and 小川 哲司 and 東中 竜一郎},
  title     = {Common Crawlを用いた大規模音声音響データセットの構築},
  booktitle = {日本音響学会2025年秋季研究発表会},
  month     = {Sep.},
  year      = {2025}
}
