feat: data gen pipeline interface modifications #2173

Open
wants to merge 19 commits into base: master

Conversation

hesamsheikh
Collaborator

Description

Fixes #1511. Refactors the data generation pipeline for better flexibility. Key changes include:

  • CoTDataGenerator, SelfInstructPipeline, SelfImprovingCoTPipeline, and EvolInstructPipeline inherit from BaseDataGenPipeline for a unified interface.
  • Added parameters for better pipeline control:
    batch_size: controls processing chunk size
    max_workers: manages parallel processing resources
    save_intermediate: enables checkpoint saving during processing
    results_key: specifies JSON output structure
  • Added support for multiple input formats (file paths, JSONL strings, lists of prompts or texts)
  • Standardized result saving when an output path is specified.
  • Improved JSON parsing and error handling, and added logging for better debugging and monitoring.

Documentation, tests, and examples are updated to account for this new interface.
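
As a rough illustration of the "multiple input formats" item above, the sketch below normalizes a file path, a JSONL string, or a list into a plain list of records. The function name `load_input_data` and its exact behavior are assumptions made for this example; they are not taken from the PR's code.

```python
import json
import os
from typing import Any, List, Union

# Hypothetical alias: a file path, a raw JSONL string, or a ready-made list.
InputData = Union[str, List[Any]]


def load_input_data(input_data: InputData) -> List[Any]:
    """Normalize the supported input formats into a list (illustrative only)."""
    # A list of prompts, texts, or records is passed through as a copy.
    if isinstance(input_data, list):
        return list(input_data)
    if not isinstance(input_data, str):
        raise TypeError(f"Unsupported input type: {type(input_data).__name__}")

    # A string naming an existing file is read from disk;
    # any other string is treated as raw JSONL text.
    if os.path.isfile(input_data):
        with open(input_data, encoding="utf-8") as f:
            text = f.read()
    else:
        text = input_data

    records: List[Any] = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON on line {line_no}: {exc}") from exc
    return records
```

With this sketch, `load_input_data(["a", "b"])` returns the list unchanged, while a two-line JSONL string yields one parsed object per non-empty line.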

Checklist

Go over all the following points, and put an x in all the boxes that apply.

  • I have read the CONTRIBUTION guide (required)
  • I have linked this PR to an issue using the Development section on the right sidebar or by adding Fixes #issue-number in the PR description (required)
  • I have checked if any dependencies need to be added or updated in pyproject.toml and uv lock
  • I have updated the tests accordingly (required for a bug fix or a new feature)
  • I have updated the documentation if needed:
  • I have added examples if this is a new feature

If you are unsure about any of these, don't hesitate to ask. We are here to help!

…rom BaseDataGenPipeline

- changed CoTDataGenerator to extend BaseDataGenPipeline, added input_data and output_path params
- changed import_qa_data method to support different input formats
- SelfInstructPipeline inherits from BaseDataGenPipeline for better data handling and output management.
- seed loading accepts both file paths and direct lists of tasks.
- functionality to save generated results if an output path is specified.
…and result saving

- Updated BaseDataGenPipeline to include a results_key parameter for JSON output.
- Refactored SelfImprovingCoTPipeline and EvolInstructPipeline to inherit from BaseDataGenPipeline, enabling consistent output management.
- Added support for various input formats in EvolInstructPipeline and Source2SynthDataGenPipeline, allowing file paths, JSONL strings, and lists of prompts or texts.
- Implemented result saving functionality across pipelines when an output path is specified.
…tions

- Added batch_size, max_workers, and save_intermediate parameters to BaseDataGenPipeline and its subclasses for improved performance and flexibility.
- Refactored CoTDataGenerator, SelfInstructPipeline, and EvolInstructPipeline to utilize new parameters, allowing for better resource management during data generation.
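
To make the roles of batch_size, max_workers, save_intermediate, and results_key concrete, here is a minimal sketch of how such parameters could interact. The helper `process_in_batches`, its defaults, and the checkpoint layout are assumptions for illustration only; the PR's actual implementation may differ.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Optional


def process_in_batches(
    items: List[Any],
    worker: Callable[[Any], Any],
    batch_size: int = 2,
    max_workers: int = 2,
    save_intermediate: bool = False,
    output_path: Optional[str] = None,
    results_key: str = "results",  # assumed default, not confirmed by the PR
) -> List[Any]:
    """Process items chunk by chunk with a bounded thread pool (a sketch)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    results: List[Any] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(items), batch_size):
            batch = items[start:start + batch_size]
            # Executor.map yields results in input order.
            results.extend(pool.map(worker, batch))
            # Optionally checkpoint everything produced so far.
            if save_intermediate and output_path is not None:
                with open(output_path, "w", encoding="utf-8") as f:
                    json.dump({results_key: results}, f, ensure_ascii=False, indent=2)
    return results
```

In this sketch, batch_size bounds how much work is submitted between checkpoints and max_workers bounds concurrency within a batch, so a failure loses at most one batch of results when save_intermediate is enabled.
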
@hesamsheikh added the labels enhancement (New feature or request), Refactor, call for contribution, and P0 (Task with high level priority) on Apr 13, 2025
@hesamsheikh added this to the Sprint 22 milestone on Apr 13, 2025
@Copilot Copilot AI left a comment

Copilot reviewed 20 out of 23 changed files in this pull request and generated 2 comments.

Files not reviewed (3)
  • docs/camel.datagen.rst: Language not supported
  • examples/datagen/evol_instruct/results.json: Language not supported
  • examples/datagen/self_instruct/data_output.json: Language not supported
Comments suppressed due to low confidence (1)

camel/datagen/cot_datagen.py:162

  • The variable 'chat_agent' is undefined in the constructor. Likely, the check should refer to 'generator_agent' or another appropriately defined parameter.
if chat_agent is not None:

Collaborator

@zjrwtx zjrwtx left a comment

Thanks for your great contribution, @hesamsheikh. I left some comments.

self.save_results(results)

return results

Collaborator

Can we use @abstractmethod?

Collaborator Author

Turned the BaseDataGenPipeline generate method into an abstract method.
The execute method should remain non-abstract, since it provides a useful default behavior: it calls generate and saves the results.
The other methods are utility functions, so they are better left non-abstract.
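
The design described in this reply can be sketched as follows: an abstract generate, plus a concrete execute that calls it and then saves. The class names `BasePipelineSketch` and `EchoPipeline`, the constructor signature, and the JSON layout are hypothetical; only the method names generate, execute, and save_results come from this thread.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class BasePipelineSketch(ABC):
    """Illustrative stand-in for a base pipeline, not the PR's actual class."""

    def __init__(
        self,
        output_path: Optional[str] = None,
        results_key: str = "results",  # assumed default
    ) -> None:
        self.output_path = output_path
        self.results_key = results_key

    @abstractmethod
    def generate(self) -> List[Dict[str, Any]]:
        """Subclasses must implement the actual data generation."""

    def save_results(self, results: List[Dict[str, Any]]) -> None:
        # Utility method: does nothing unless an output path was given.
        if self.output_path is None:
            return
        with open(self.output_path, "w", encoding="utf-8") as f:
            json.dump({self.results_key: results}, f, ensure_ascii=False, indent=2)

    def execute(self) -> List[Dict[str, Any]]:
        # Concrete default behavior: generate, save, then return.
        results = self.generate()
        self.save_results(results)
        return results


class EchoPipeline(BasePipelineSketch):
    """Toy subclass that wraps each prompt in a record."""

    def __init__(self, prompts: List[str], **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self.prompts = prompts

    def generate(self) -> List[Dict[str, Any]]:
        return [{"prompt": p} for p in self.prompts]
```

Because generate is abstract, instantiating the base class directly raises a TypeError, while any subclass inherits the generate-then-save behavior of execute without overriding it.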

@zjrwtx
Collaborator

zjrwtx commented Apr 13, 2025

It seems that we may also need to modify the related cookbooks.

@hesamsheikh
Collaborator Author

it seems that we may also need to modify the related cookbooks

Most of the cookbooks target specific camel-ai versions. Updating them is a good idea; if you think it's necessary to modify them as part of this PR, let me know.

@MuggleJinx MuggleJinx self-requested a review April 13, 2025 12:57
@hesamsheikh
Collaborator Author

Fixed the pre-commit errors, formatting issues, and variable type incompatibilities.

Labels
enhancement New feature or request P0 Task with high level priority Refactor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] Polish Interface of data generation pipleline
3 participants