llm-jp/ccaudio

Common Crawl Audio

Tools for downloading and preprocessing the Common Crawl Audio dataset.

Overview

This is a Python package for collecting and processing audio data from Common Crawl. The collected data is published on Hugging Face.

Dataset Statistics

Content-type breakdown based on Whisper-AT tagging results, estimated from a 1/10,000 subset of the data:

Content Type                 Ratio
Speech                       63.9 %
Narration, monologue          9.5 %
Music                         7.8 %
Male speech, man speaking     5.8 %
Others                       13.0 %

License Notice

This repository is licensed under the Apache License 2.0, but this license applies only to the tools and code in this repository, not to the data downloaded using these tools. The downloaded data remains under the original licenses of their respective sources. Please be sure to check and comply with the licensing terms and conditions of each data source before using the downloaded data.

Requirements

  • uv (Python package manager)
  • Sufficient disk space (approximately 2 TB for Japanese audio only)

Setup

uv sync

Usage

1. Data Download

This tool downloads audio files from a pre-collected list of audio URLs hosted on Hugging Face. The list was built by crawling RSS feeds and extracting the audio URLs they reference.

The tool reads the list from Hugging Face, fetches each audio file from its original source, and saves the downloaded audio in Lhotse Shar format.

cd src/ccaudio/ccaudio_downloader
uv run scrapy crawl ccaudio_spider -s SHAR_OUTPUT_DIR=/path/to/shar/dir/

Parameters:

  • SHAR_OUTPUT_DIR: Directory path to save downloaded audio in shar format

Note: By default, only items whose language column is ja, ja_JP, ja-jp, or ja-JP are downloaded. With this Japanese-only filter, the download is estimated to take approximately 2-3 days. To change this filtering, edit the LANGUAGE_ITEMS setting in settings.py:

# Dataset settings
DATASET_NAME = "llm-jp/cc-audio-2025-18-rss"

# Set LANGUAGE_ITEMS=[] if you don't want to filter by language
LANGUAGE_ITEMS = ["ja", "ja_JP", "ja-jp", "ja-JP"]
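
The effect of LANGUAGE_ITEMS can be pictured with a minimal sketch. This is not the spider's actual code: the helper name keep_row, the dict-shaped row, and the exact, case-sensitive comparison are assumptions made for illustration. The only details taken from the settings above are the language column name and the rule that an empty list disables filtering.

```python
# Illustrative sketch only; keep_row is a hypothetical helper, not part of ccaudio.
LANGUAGE_ITEMS = ["ja", "ja_JP", "ja-jp", "ja-JP"]


def keep_row(row: dict, language_items: list) -> bool:
    """Return True if a URL-list row passes the language filter.

    An empty language_items list means "do not filter", mirroring the
    comment in settings.py. Matching is assumed to be exact and
    case-sensitive, which would explain why the default list spells out
    both "ja-jp" and "ja-JP".
    """
    if not language_items:
        return True
    return row.get("language") in language_items


rows = [{"language": "ja"}, {"language": "en"}, {"language": "ja-JP"}]
kept = [r for r in rows if keep_row(r, LANGUAGE_ITEMS)]
```

Under these assumptions, adding another tag such as "en" to the list widens the download, at the cost of more disk space and download time than the Japanese-only estimates given above.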

2. Data Preprocessing

Preprocess the downloaded data and write the result back out in shar format. The preprocessing includes:

  • Resampling
  • Denoising with demucs

uv run src/ccaudio/preprocess.py \
  --shar_dir /path/to/shar/dir \
  --output_dir /path/to/output/dir \
  --sr 16000

Parameters:

  • --shar_dir: Directory containing the downloaded shar files
  • --output_dir: Directory to save preprocessed audio in shar format
  • --sr: Target sampling rate in Hz for resampling (16000 in the example above)
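
To make the resampling step concrete, here is a small sketch of the arithmetic behind converting audio to the 16000 Hz target. It does not reproduce what preprocess.py does internally; the function names resample_factors and resampled_length are hypothetical. It only shows the reduced up/down ratio a rational-factor resampler would work with, and roughly how many samples come out.

```python
from fractions import Fraction


def resample_factors(orig_sr: int, target_sr: int = 16000) -> tuple:
    """Return (up, down) such that target_sr / orig_sr == up / down in lowest terms."""
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sampling rates must be positive")
    ratio = Fraction(target_sr, orig_sr)  # Fraction reduces automatically
    return ratio.numerator, ratio.denominator


def resampled_length(num_samples: int, orig_sr: int, target_sr: int = 16000) -> int:
    """Approximate output length: ceil(num_samples * up / down), in integer math."""
    up, down = resample_factors(orig_sr, target_sr)
    return (num_samples * up + down - 1) // down


# One second of 44.1 kHz audio becomes 16000 samples at 16 kHz.
factors_44k = resample_factors(44100)         # (160, 441)
one_second = resampled_length(44100, 44100)   # 16000
```

The same ratio gives a rough idea of how preprocessed output size relates to the source for uncompressed audio, although the actual on-disk size depends on the storage format used for the shar files.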

3. Using the Downloaded Data

See src/ccaudio/load_shar_sample.py for a reference loader. Run it against a shar directory as follows:

uv run src/ccaudio/load_shar_sample.py --shar_dir /path/to/shar/dir/
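
For orientation, the sketch below shows one way to read a shar directory directly with lhotse. It is not a copy of load_shar_sample.py, whose contents are not shown here. The helper names load_cuts and summarize are hypothetical, and the sketch assumes the directory follows the standard Lhotse Shar layout so that CutSet.from_shar can discover the shards.

```python
from typing import Iterable, Tuple


def load_cuts(shar_dir: str):
    """Open a Lhotse Shar directory lazily (requires the lhotse package)."""
    from lhotse import CutSet  # imported here so the helpers below work without lhotse

    return CutSet.from_shar(in_dir=shar_dir)


def summarize(cuts: Iterable, limit: int = 100) -> Tuple[int, float]:
    """Count up to `limit` cuts and sum their durations in seconds."""
    count = 0
    total_seconds = 0.0
    for cut in cuts:
        if count >= limit:
            break
        total_seconds += cut.duration
        count += 1
    return count, total_seconds


# Example usage (not executed here because it needs downloaded data):
#   cuts = load_cuts("/path/to/shar/dir/")
#   print(summarize(cuts, limit=10))
#   for cut in cuts:
#       audio = cut.load_audio()  # NumPy array of samples
#       break
```

Because the cut set is read lazily, iterating over the first few cuts does not load the whole dataset into memory, which matters at the roughly 2 TB scale mentioned in the requirements.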

Citation

If you use this dataset or these tools in your research, please cite:

@inproceedings{ccaudio2025,
  author    = {淺井 航平 and 杉浦 一瑳 and 中田 亘 and 栗田 修平 and 高道 慎之介 and 小川 哲司 and 東中 竜一郎},
  title     = {Common Crawlを用いた大規模音声音響データセットの構築},
  booktitle = {日本音響学会2025年秋季研究発表会},
  month     = {Sep.},
  year      = {2025}
}
