Description
After generating a synthetic dataset with DataDesigner, sharing it on the Hugging Face Hub requires manual conversion steps:
```python
from datasets import Dataset

results = designer.create(...)
df = results.load_dataset()
dataset = Dataset.from_pandas(df)
dataset.push_to_hub("username/dataset-name")
```

It might be nice to have a `push_to_hub()` method on `DatasetCreationResults`, i.e. something like:
```python
results = designer.create(...)
results.push_to_hub("username/my-synthetic-dataset")
```

This would load directly from the parquet files (memory-efficient for large datasets) and handle the upload.
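For very large datasets, an alternative worth considering is uploading the parquet shards as-is via `huggingface_hub`, which avoids materializing the dataset at all. A rough sketch, assuming `results.artifact_storage.final_dataset_path` is a directory of `*.parquet` files as in the implementation suggestion below (the helper name is hypothetical; the `HfApi` calls are standard):

```python
# Sketch: push existing parquet shards without loading them into a Dataset.
# Assumes the folder contains the *.parquet shards written by DataDesigner.
from huggingface_hub import HfApi

def push_parquet_folder(
    folder_path: str,
    repo_id: str,
    private: bool = False,
    token: str | None = None,
) -> str:
    api = HfApi(token=token)
    # Create the dataset repo if it doesn't exist yet.
    api.create_repo(repo_id, repo_type="dataset", private=private, exist_ok=True)
    # Upload only the parquet shards; the Hub's dataset viewer picks up
    # parquet files in dataset repos automatically.
    api.upload_folder(
        folder_path=folder_path,
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns="*.parquet",
    )
    return f"https://huggingface.co/datasets/{repo_id}"
```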
Additional context
- Dependencies already present: datasets>=4.0.0, huggingface-hub>=0.34.4
- Similar tools like distilabel have this integration: https://huggingface.co/docs/hub/datasets-distilabel
- Reference: https://huggingface.co/docs/hub/datasets-libraries
Rough suggestion for an implementation:
```python
# In src/data_designer/interface/results.py
def push_to_hub(
    self,
    repo_id: str,
    *,
    private: bool = False,
    token: str | None = None,
    commit_message: str | None = None,
) -> str:
    """Push the generated dataset to the Hugging Face Hub.

    Args:
        repo_id: Repository ID (e.g., "username/dataset-name")
        private: Whether the dataset should be private
        token: Hugging Face token (uses cached token if not provided)
        commit_message: Custom commit message

    Returns:
        URL of the pushed dataset
    """
    from datasets import Dataset

    # Load directly from parquet - memory efficient for large datasets
    parquet_path = str(self.artifact_storage.final_dataset_path / "*.parquet")
    dataset = Dataset.from_parquet(parquet_path)
    commit_info = dataset.push_to_hub(
        repo_id,
        private=private,
        token=token,
        commit_message=commit_message,
    )
    # push_to_hub returns a CommitInfo, not a str; surface the repo URL
    return str(commit_info.repo_url)
```
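Usage would then mirror the proposed example above (the repo name and commit message here are placeholders):

```python
results = designer.create(...)
url = results.push_to_hub(
    "username/my-synthetic-dataset",
    private=True,
    commit_message="Add synthetic dataset generated with DataDesigner",
)
print(url)  # e.g. https://huggingface.co/datasets/username/my-synthetic-dataset
```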
Happy to open a PR for this if it seems interesting!