feat: data gen pipeline interface modifications #2173
base: master
…rom BaseDataGenPipeline
- Changed CoTDataGenerator to extend BaseDataGenPipeline; added input_data and output_path params.
- Changed the import_qa_data method to support different input formats.
- SelfInstructPipeline inherits from BaseDataGenPipeline for better data handling and output management.
- Seed loading accepts both file paths and direct lists of tasks.
- Added functionality to save generated results if an output path is specified.
…and result saving
- Updated BaseDataGenPipeline to include a results_key parameter for JSON output.
- Refactored SelfImprovingCoTPipeline and EvolInstructPipeline to inherit from BaseDataGenPipeline, enabling consistent output management.
- Added support for various input formats in EvolInstructPipeline and Source2SynthDataGenPipeline, allowing file paths, JSONL strings, and lists of prompts or texts.
- Implemented result saving functionality across pipelines when an output path is specified.
…tions
- Added batch_size, max_workers, and save_intermediate parameters to BaseDataGenPipeline and its subclasses for improved performance and flexibility.
- Refactored CoTDataGenerator, SelfInstructPipeline, and EvolInstructPipeline to utilize the new parameters, allowing for better resource management during data generation.
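The commit messages above mention accepting file paths, JSONL strings, and lists of prompts as input. The PR's actual loading code is not shown here, so the following is only a minimal sketch of how such normalization might look. The function name `load_inputs` and the `"prompt"` key are hypothetical and chosen purely for illustration.

```python
import json
from pathlib import Path
from typing import List, Union


def load_inputs(source: Union[str, List[str]]) -> List[str]:
    """Normalize a list, a JSONL file path, or a JSONL string to a list.

    Hypothetical helper; not part of the camel API.
    """
    # Case 1: the caller already passed a list of prompts.
    if isinstance(source, list):
        return list(source)

    # Case 2: the string names an existing file; read its contents.
    try:
        is_file = Path(source).is_file()
    except (OSError, ValueError):
        # Very long or unusual strings cannot be paths; treat as raw JSONL.
        is_file = False
    text = Path(source).read_text(encoding="utf-8") if is_file else source

    # Case 3 (and the file contents from case 2): parse one JSON value per line.
    items: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed record shape: {"prompt": "..."}; anything else is stringified.
        if isinstance(record, dict):
            items.append(str(record["prompt"]))
        else:
            items.append(str(record))
    return items
```

Checking for a list first keeps the common in-memory case cheap, and falling back to raw JSONL only when the string is not an existing file avoids guessing based on the string's contents.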
Copilot reviewed 20 out of 23 changed files in this pull request and generated 2 comments.
Files not reviewed (3)
- docs/camel.datagen.rst: Language not supported
- examples/datagen/evol_instruct/results.json: Language not supported
- examples/datagen/self_instruct/data_output.json: Language not supported
Comments suppressed due to low confidence (1)
camel/datagen/cot_datagen.py:162
- The variable 'chat_agent' is undefined in the constructor. Likely, the check should refer to 'generator_agent' or another appropriately defined parameter.
if chat_agent is not None:
Thanks for your great contribution, @hesamsheikh. I left some comments.
camel/datagen/base.py
Outdated
    self.save_results(results)

    return results
Can we use `@abstractmethod`?
Turned the `generate` method of `BaseDataGenPipeline` into an `abstractmethod`.
The `execute` method should remain non-abstract, since it provides a useful default behavior: it calls `generate` and saves the results.
The other methods are utility functions, so they are better left non-abstract.
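The split described in this reply (abstract `generate`, concrete `execute`) can be sketched as follows. This is an illustrative stand-in, not the real `BaseDataGenPipeline`: the class names `SketchBasePipeline` and `EchoPipeline` are hypothetical, and the constructor signature and JSON layout are assumptions based only on the `output_path` and `results_key` parameters mentioned in this PR.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, List, Optional


class SketchBasePipeline(ABC):
    """Hypothetical stand-in for BaseDataGenPipeline."""

    def __init__(
        self,
        output_path: Optional[str] = None,
        results_key: str = "results",
    ) -> None:
        self.output_path = output_path
        self.results_key = results_key

    @abstractmethod
    def generate(self) -> List[Dict[str, Any]]:
        """Each concrete pipeline must supply its own generation logic."""

    def save_results(self, results: List[Dict[str, Any]]) -> None:
        # Utility method: a no-op unless an output path was given.
        if self.output_path is None:
            return
        payload = {self.results_key: results}  # assumed JSON layout
        Path(self.output_path).write_text(
            json.dumps(payload, indent=2), encoding="utf-8"
        )

    def execute(self) -> List[Dict[str, Any]]:
        # Default behavior shared by all subclasses: generate, then save.
        results = self.generate()
        self.save_results(results)
        return results


class EchoPipeline(SketchBasePipeline):
    """Toy subclass used only to exercise the base class."""

    def generate(self) -> List[Dict[str, Any]]:
        return [{"text": "hello"}]
```

With this layout, instantiating the base class directly raises `TypeError`, while a subclass only has to implement `generate` to inherit saving and the `execute` entry point.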
It seems that we may also need to modify the related cookbooks.
Most of the cookbooks are for specific camel-ai versions. It's a good idea to update them. If you think it's necessary to modify them for this PR, let me know.
…ai/camel into datagen-pipeline-refactor
Fixed the pre-commit errors, formatting issues, and variable type incompatibilities.
Description
Fixes #1511. Refactors the data generation pipeline for better flexibility. Key changes include:
- `batch_size`: controls processing chunk size
- `max_workers`: manages parallel processing resources
- `save_intermediate`: enables checkpoint saving during processing
- `results_key`: specifies JSON output structure
Documents, tests, and examples are modified to account for this new interface.
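To make the four parameters concrete, here is a minimal sketch of how they could interact in a batched run. It does not reproduce the PR's implementation: the function `run_in_batches`, its signature, and the checkpoint file layout are all hypothetical, and a thread pool is assumed as the parallelism mechanism.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Any, Callable, List, Optional


def run_in_batches(
    items: List[Any],
    worker: Callable[[Any], Any],
    batch_size: int = 2,
    max_workers: int = 2,
    save_intermediate: bool = False,
    output_path: Optional[str] = None,
    results_key: str = "results",
) -> List[Any]:
    """Process items chunk by chunk; hypothetical helper for illustration."""

    def _save(results: List[Any]) -> None:
        # results_key names the top-level field of the JSON output.
        Path(output_path).write_text(
            json.dumps({results_key: results}, indent=2), encoding="utf-8"
        )

    results: List[Any] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(items), batch_size):
            batch = items[start:start + batch_size]
            # pool.map returns results in input order.
            results.extend(pool.map(worker, batch))
            # Checkpoint after every batch when requested.
            if save_intermediate and output_path is not None:
                _save(results)
    if output_path is not None:
        _save(results)
    return results
```

For example, `run_in_batches(["a", "b", "c"], str.upper, batch_size=2)` processes two chunks and returns the results in input order. Results must be JSON-serializable whenever `output_path` is set.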
Checklist
Go over all the following points, and put an `x` in all the boxes that apply.
- `Fixes #issue-number` in the PR description (required)
- `pyproject.toml` and `uv lock`
If you are unsure about any of these, don't hesitate to ask. We are here to help!