Tools for downloading and preprocessing the Common Crawl Audio dataset.
This is a Python package for collecting and processing audio data from Common Crawl. The collected data is published on Hugging Face.
Based on Whisper-AT tagging results (estimated from a 1/10,000 subset):
| Content Type | Ratio |
|---|---|
| Speech | 63.9 % |
| Narration, monologue | 9.5 % |
| Music | 7.8 % |
| Male speech, man speaking | 5.8 % |
| Others | 13.0 % |
This repository is licensed under the Apache License 2.0, but this license applies only to the tools and code in this repository, not to the data downloaded using these tools. The downloaded data remains under the original licenses of their respective sources. Please be sure to check and comply with the licensing terms and conditions of each data source before using the downloaded data.
- uv (Python package manager)
- Sufficient disk space (approximately 2 TB for Japanese audio only)
```bash
uv sync
```
This tool downloads audio files using a pre-collected list of audio URLs hosted on Hugging Face. The list was created by crawling RSS feeds and extracting audio URLs from them.
The tool reads this list from Hugging Face and downloads the actual audio files from their original sources. The downloaded audio data is saved in Lhotse Shar format.
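To inspect the pre-collected URL list itself before running the downloader, you can load it with the `datasets` library. This is a minimal sketch, assuming the default split is named `train`; apart from the `language` column used for filtering, the exact column layout should be checked against the dataset card:

```python
# Minimal sketch: peek at the pre-collected URL list on Hugging Face.
# The split name ("train") is an assumption; `language` is the column
# the downloader filters on.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("llm-jp/cc-audio-2025-18-rss", split="train")

print(ds)                       # available columns and number of rows
print(Counter(ds["language"]))  # distribution of language tags
```

The Scrapy spider below consumes this same list and performs the actual downloads: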
```bash
cd src/ccaudio/ccaudio_downloader
uv run scrapy crawl ccaudio_spider -s SHAR_OUTPUT_DIR=/path/to/shar/dir/
```
Parameters:
- `SHAR_OUTPUT_DIR`: Directory path to save downloaded audio in shar format
Note: This code is configured to download only items whose `language` column is `ja`, `ja_JP`, `ja-jp`, or `ja-JP`. With this Japanese-only filter, the download is estimated to take approximately 2-3 days. To change the filtering, edit the `LANGUAGE_ITEMS` setting in settings.py:
```python
# Dataset settings
DATASET_NAME = "llm-jp/cc-audio-2025-18-rss"

# Set LANGUAGE_ITEMS = [] if you don't want to filter by language
LANGUAGE_ITEMS = ["ja", "ja_JP", "ja-jp", "ja-JP"]
```
Process the downloaded data and convert it to a more usable format. The preprocessing includes:
- Resampling
- Denoising with demucs
```bash
uv run src/ccaudio/preprocess.py \
  --shar_dir /path/to/shar/dir \
  --output_dir /path/to/output/dir \
  --sr 16000
```
Parameters:
- `--shar_dir`: Directory containing the downloaded shar files
- `--output_dir`: Directory to save preprocessed audio in shar format
- `--sr`: Target sampling rate in Hz (e.g., 16000)
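The core of the resampling step can be sketched with the Lhotse Shar API. This is a minimal illustration, not the actual preprocess.py implementation; it omits the demucs denoising stage and assumes the shar files can be opened directly with `CutSet.from_shar`:

```python
# Minimal sketch of the resampling step using Lhotse's Shar I/O.
# Not the repository's preprocess.py: the demucs denoising stage is
# omitted and the shar field layout is assumed.
from lhotse import CutSet

shar_dir = "/path/to/shar/dir"       # downloaded shar files
output_dir = "/path/to/output/dir"   # destination for the resampled shars

# Lazily read the cuts stored in shar format.
cuts = CutSet.from_shar(in_dir=shar_dir)

# Resample every recording to 16 kHz.
cuts = cuts.resample(16000)

# Write the result back out as a new shar dataset, storing audio as FLAC.
cuts.to_shar(output_dir, fields={"recording": "flac"}, shard_size=1000)
```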
See `load_shar_sample.py` for reference:

```bash
uv run src/ccaudio/load_shar_sample.py --shar_dir /path/to/shar/dir/
```
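To load the data directly in your own code instead of the sample script, a minimal Lhotse sketch looks like this (any metadata fields beyond the recording itself are assumptions; inspect the cuts in your shar files to see what is attached):

```python
# Minimal sketch: iterate over a shar dataset with Lhotse and read audio.
from lhotse import CutSet

cuts = CutSet.from_shar(in_dir="/path/to/shar/dir/")

for cut in cuts:
    audio = cut.load_audio()  # numpy array of shape (channels, num_samples)
    print(cut.id, cut.duration, cut.sampling_rate, audio.shape)
    break  # remove to iterate over the whole set
```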
If you use this dataset or tools in your research, please cite:
```bibtex
@inproceedings{ccaudio2025,
  author    = {淺井 航平 and 杉浦 一瑳 and 中田 亘 and 栗田 修平 and 高道 慎之介 and 小川 哲司 and 東中 竜一郎},
  title     = {Common Crawlを用いた大規模音声音響データセットの構築},
  booktitle = {日本音響学会2025年秋季研究発表会},
  month     = {Sep.},
  year      = {2025}
}
```

(In English: "Construction of a Large-Scale Speech and Audio Dataset Using Common Crawl," Autumn Meeting of the Acoustical Society of Japan, September 2025.)