Add push_to_hub method to upload datasets to Hugging Face Hub #139

@davanstrien

Description

After generating a synthetic dataset with DataDesigner, sharing it on Hugging Face Hub requires manual conversion steps:

from datasets import Dataset

results = designer.create(...)
df = results.load_dataset()
dataset = Dataset.from_pandas(df)
dataset.push_to_hub("username/dataset-name")

It would be nice to have a push_to_hub() method on DatasetCreationResults, i.e. something like:

results = designer.create(...)
results.push_to_hub("username/my-synthetic-dataset")

This would load directly from the parquet files (memory efficient for large datasets) and handle the upload.

Additional context

Rough suggestion for an implementation:

# In src/data_designer/interface/results.py

  def push_to_hub(
      self,
      repo_id: str,
      *,
      private: bool = False,
      token: str | None = None,
      commit_message: str | None = None,
  ) -> str:
      """Push the generated dataset to Hugging Face Hub.

      Args:
          repo_id: Repository ID (e.g., "username/dataset-name")
          private: Whether the dataset should be private
          token: Hugging Face token (uses cached token if not provided)
          commit_message: Custom commit message

      Returns:
          URL of the pushed dataset
      """
      from datasets import Dataset

      # Load directly from parquet - memory efficient for large datasets
      parquet_path = str(self.artifact_storage.final_dataset_path / "*.parquet")
      dataset = Dataset.from_parquet(parquet_path)

      return dataset.push_to_hub(
          repo_id,
          private=private,
          token=token,
          commit_message=commit_message,
      )
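
One thing the sketch above leaves open is what happens when the output directory contains no parquet files. A minimal guard could look like the following, assuming final_dataset_path is a pathlib.Path pointing at a directory of parquet shards. The helper name list_parquet_shards is hypothetical and not part of DataDesigner:

```python
from pathlib import Path


def list_parquet_shards(dataset_dir: Path) -> list[str]:
    """Return the sorted parquet file paths directly under dataset_dir.

    Hypothetical helper for illustration; raises early instead of
    handing an empty glob to the loader.
    """
    shards = sorted(str(p) for p in dataset_dir.glob("*.parquet"))
    if not shards:
        raise FileNotFoundError(f"No parquet files found in {dataset_dir}")
    return shards
```

The returned list could then be passed to Dataset.from_parquet, which accepts either a single path or a list of paths, in place of the glob string.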

Happy to open a PR for this if it seems interesting!

Labels: enhancement
