Hi guys,
I'm wondering whether there is any general way to perform resharding for streaming datasets.
I'm facing some scenarios:

1. A single jsonl file is too large (e.g. >40 GB), which is not ideal for data parallelism.
2. When loaded in streaming mode, some datasets contain very few data shards. For example, HuggingfaceFW/sample-10BT contains only 14 data files/shards, so the data cannot be evenly distributed across all GPUs.
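To make the second scenario concrete, here is a small standalone sketch (illustrative only, using round-robin shard-to-rank assignment) of how 14 shards spread across 8 data-parallel ranks:

```python
# Illustrative: distributing 14 shards across 8 data-parallel ranks.
# With round-robin assignment, some ranks get 2 shards and others only 1,
# so the ranks see unequal amounts of data.
num_shards, world_size = 14, 8
per_rank = [len(range(rank, num_shards, world_size)) for rank in range(world_size)]
print(per_rank)  # [2, 2, 2, 2, 2, 2, 1, 1]
```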
I wrote a loading script for jsonl data, but it is neither elegant nor general:
```python
import glob
import os

import orjson
import datasets
from itertools import islice

_HOMEPAGE = "https://huggingface.co/datasets/m-a-p/Matrix"


class MatrixDataset(datasets.GeneratorBasedBuilder):
    """Custom dataset for JSON files with filtering capabilities."""

    VERSION = datasets.Version("1.0.0")

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({
                "id": datasets.Value("string"),
                "text": datasets.Value("string"),
            }),
            homepage=_HOMEPAGE,
        )

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        import random

        data_files = glob.glob("*/*.jsonl")
        data_shards = []
        for filepath in data_files:
            # max size of each shard is 1GB (ceiling division via negative
            # floor division: -(-a // b))
            num_shards = -(-os.path.getsize(filepath) // 1024**3)
            for i in range(num_shards):
                data_shards.append((filepath, i, num_shards))
        # shuffle shards deterministically so the order is reproducible
        random.Random(42).shuffle(data_shards)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "data_shards": data_shards,
                },
            ),
        ]

    def _generate_examples(self, data_shards):
        for file, split, num_shards in data_shards:
            with open(file, "r") as f:
                # each shard reads every num_shards-th line, starting at its
                # own offset (round-robin over lines)
                for i, line in islice(enumerate(f), split, None, num_shards):
                    data = orjson.loads(line)
                    if 'id' not in data:
                        data['id'] = f"{file}_{i}"
                    if 'content' in data and 'text' not in data:
                        data['text'] = data.pop('content')
                    if data['text'] is not None:
                        yield data["id"], data
```
I'm wondering if you could suggest any better approaches, @lhoestq.