Hi guys,
I'm wondering whether there is any general way to perform resharding for streaming datasets.
I'm facing some scenarios:

1. A single jsonl file is too large (e.g. >40 GB), which is not ideal for data parallelism.
2. When loaded in streaming mode, some datasets contain very few data shards. For example, HuggingfaceFW/sample-10BT contains only 14 data files/shards, so the data cannot be evenly distributed across all GPUs.
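To make the second scenario concrete, here is a small standalone sketch (illustrative only, using round-robin shard-to-rank assignment) of how 14 shards spread across 8 data-parallel ranks:

```python
# Illustrative: distributing 14 shards across 8 data-parallel ranks.
# With round-robin assignment, some ranks get 2 shards and others only 1,
# so the ranks see unequal amounts of data.
num_shards, world_size = 14, 8
per_rank = [len(range(rank, num_shards, world_size)) for rank in range(world_size)]
print(per_rank)  # [2, 2, 2, 2, 2, 2, 1, 1]
```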
I wrote a loading script for jsonl data, but it is neither elegant nor general:
```python
import glob
import os

import orjson
import datasets
from itertools import islice

_HOMEPAGE = "https://huggingface.co/datasets/m-a-p/Matrix"


class MatrixDataset(datasets.GeneratorBasedBuilder):
    """Custom dataset for JSON files with filtering capabilities."""

    VERSION = datasets.Version("1.0.0")

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({
                "id": datasets.Value("string"),
                "text": datasets.Value("string"),
            }),
            homepage=_HOMEPAGE,
        )

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        import random

        data_files = glob.glob("*/*.jsonl")
        data_shards = []
        for filepath in data_files:
            # max size of each shard is 1GB (ceiling division via negative
            # floor division: -(-a // b))
            num_shards = -(-os.path.getsize(filepath) // 1024**3)
            for i in range(num_shards):
                data_shards.append((filepath, i, num_shards))
        # shuffle shards deterministically so the order is reproducible
        random.Random(42).shuffle(data_shards)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "data_shards": data_shards,
                },
            ),
        ]

    def _generate_examples(self, data_shards):
        for file, split, num_shards in data_shards:
            with open(file, "r") as f:
                # each shard reads every num_shards-th line, starting at its
                # own offset (round-robin over lines)
                for i, line in islice(enumerate(f), split, None, num_shards):
                    data = orjson.loads(line)
                    if 'id' not in data:
                        data['id'] = f"{file}_{i}"
                    if 'content' in data and 'text' not in data:
                        data['text'] = data.pop('content')
                    if data['text'] is not None:
                        yield data["id"], data
```
I'm wondering if you could suggest any better approaches, @lhoestq.