feat: add allow_resize for 1:N and N:1 generation patterns#286
feat: add allow_resize for 1:N and N:1 generation patterns#286andreatgretel merged 17 commits intomainfrom
Conversation
3c9fa49 to
8ba264c
Compare
0cedd56 to
8c9a33e
Compare
packages/data-designer-config/src/data_designer/config/column_configs.py
Outdated
Show resolved
Hide resolved
Adds support for generators that produce a different number of records than the input (expansion or retraction). This addresses GitHub issue #265. Changes: - Add `allow_resize` parameter to `update_records()` in DatasetBatchManager - Add `allow_resize` field to CustomColumnConfig - Add validation requiring FULL_COLUMN strategy when allow_resize=True - Track and report actual_num_records in metadata (may differ from target) - Add logging when batch size changes - Add example_allow_resize.py demonstrating the feature - Add comprehensive tests
- Merge update_records and replace_buffer into a single replace_buffer method with allow_resize parameter on DatasetBatchManager - Move allow_resize field from CustomColumnConfig to SingleColumnConfig so plugins inherit it without needing a mixin - Align example and logging with final CustomColumn API - Parametrize resize tests and extract shared stub in test_columns
d4a6100 to
d84d799
Compare
- Add expand->retract->expand chaining test (single batch) - Add multi-batch resize test verifying combined parquet output - Update example to chain expand/retract/expand with preview+build - Use 💥/✂️ emojis for resize logging (expand/retract)
d84d799 to
1a16906
Compare
Greptile SummaryThis PR adds
|
| Filename | Overview |
|---|---|
| docs/concepts/custom_columns.md | Adds allow_resize documentation with FULL_COLUMN and CELL_BY_CELL examples. The FULL_COLUMN example uses wrong parameter name params instead of generator_params. |
| packages/data-designer-config/src/data_designer/config/base.py | Adds allow_resize: bool = False to SingleColumnConfig base class. Clean, minimal change that enables all subclasses to inherit the field. |
| packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py | Adds cell-by-cell resize support: generate() returns `dict |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py | Core resize orchestration: adds _cell_resize_mode / _cell_resize_results for cell-by-cell buffer collection, resize logging, and flattening in _finalize_fan_out. Uses getattr for MultiColumnConfig compatibility in full-column path. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py | Consolidates update_records and old replace_buffer into single replace_buffer(records, *, allow_resize=False) method. Adds _actual_num_records tracking and metadata field. Clean API simplification. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/processor_runner.py | Updates run_pre_batch to use replace_buffer(..., allow_resize=True) since processors can change record count. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_column_wise_builder.py | Thorough integration tests: parametrized preview and multi-batch build tests for cell filter, cell expand, cell drop-all, full-column expand, full-column chain, and mixed cell+full chain. Validates row counts and ordering. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[ColumnWiseDatasetBuilder._run_batch] --> B{Generation Strategy?}
B -->|FULL_COLUMN| C[_run_full_column_generator]
B -->|CELL_BY_CELL| D[_run_cell_by_cell_generator]
C --> C1[generator.generate DataFrame]
C1 --> C2{allow_resize?}
C2 -->|Yes| C3[replace_buffer allow_resize=True\nupdates _num_records_list]
C2 -->|No| C4[replace_buffer allow_resize=False\nstrict 1:1 check]
D --> D1[_setup_fan_out]
D1 --> D2{allow_resize?}
D2 -->|Yes| D3[Init _cell_resize_results\n_cell_resize_mode = True]
D2 -->|No| D4[_cell_resize_mode = False]
D3 --> E[Fan out workers threads/async]
D4 --> E
E --> F[_worker_result_callback]
F --> F1{_cell_resize_mode?}
F1 -->|Yes| F2[Store dict/list in\n_cell_resize_results index]
F1 -->|No| F3[update_record in buffer]
E --> G[_finalize_fan_out]
G --> G1{_cell_resize_mode?}
G1 -->|Yes| G2[Flatten results\nskip _records_to_drop\nreplace_buffer allow_resize=True]
G1 -->|No| G3{_records_to_drop?}
G3 -->|Yes| G4[Cleanup images\ndrop_records]
G3 -->|No| G5[No-op]
Last reviewed commit: 401ef32
- Config: allow allow_resize with CELL_BY_CELL; relax validator - Custom generator: accept dict | list[dict] when cell_by_cell + allow_resize; validate per row via _validate_cell_output - Builder: collect results by index when cell allow_resize, flatten and replace_buffer; add _log_resize_if_changed and _column_display_name - Docs: ALL_CAPS for strategies, simplify allow_resize table text - Tests: parametrized preview and multibatch; factories with n param; _RESIZE_SPECS with inline factory calls; ids ordered like specs
packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py
Outdated
Show resolved
Hide resolved
- Rename specs: full_x3, cell_x2, cell_plus_full_chain; add cell_filter_odd, cell_drop_all to _RESIZE_SPECS - Stubs before specs: _resize_full_keep_first, _resize_cell_expand, _resize_cell_filter_odd, _resize_cell_drop_all; drop cell factories - Remove FULL/CELL constants; use GenerationStrategy.* in _RESIZE_SPECS - Preview/multibatch parametrize: _preview and _multibatch ids; two full_x3 multibatch cases (5_2, 4_2) first - Handle all-batches-skipped in multibatch test (empty df when path missing) - test_custom: add test_cell_by_cell_allow_resize_return_list_single (1:1 via list)
- Remove validate_allow_resize_requires_full_column from CustomColumnConfig - Rename StubColumnConfigWithoutEmoji to StubColumnConfig in test_columns - Pass allow_resize=False in _write_processed_batch replace_buffer call
| def required_columns(self) -> list[str]: | ||
| return [] | ||
|
|
||
| class StubColumnConfigWithoutEmoji(SingleColumnConfig): |
There was a problem hiding this comment.
Moving this since the test for allow_resize uses the exact same stub
| ) | ||
| ``` | ||
|
|
||
| **CELL_BY_CELL:** With `allow_resize=True`, your function may return a single row (`dict`) or multiple rows (`list[dict]`). Return `[]` to drop that input row. |
There was a problem hiding this comment.
Wasn't doing that at first, but I think it's useful, especially since functions with LLM calls will typically use the cell-by-cell strategy.
| raise IndexError(f"🛑 Index {index} is out of bounds for buffer of size {len(self._buffer)}.") | ||
| self._buffer[index] = record | ||
|
|
||
| def update_records(self, records: list[dict]) -> None: |
There was a problem hiding this comment.
The replace_buffer function (introduced in #294) does the same as update_records but allow for the number of records to change. Here we're unifying the two. I've picked replace_buffer as the one to stay, but that can change.
| if new_count == 0: | ||
| logger.warning(f"⚠️ Column '{column_name}' reduced batch to 0 records. This batch will be skipped.") | ||
| else: | ||
| emoji = "💥" if new_count > original_count else "✂️" |
There was a problem hiding this comment.
The logging with these emojis will show up for every batch, not sure if that's needed
There was a problem hiding this comment.
Example will be removed later as usual, just for reference
There was a problem hiding this comment.
NB this example uses CustomColumns but plugins are also affected by this. I've added a comment about it on docs. I've also implemented a plugin locally to test it, seems to work fine (lots of LoCs so not adding to this PR)
| return [] | ||
|
|
||
|
|
||
| _RESIZE_SPECS: dict[str, list[tuple[str, object, GenerationStrategy]]] = { |
There was a problem hiding this comment.
I'm testing a bunch of edge cases, using this to parametrize them. Looks a bit weird, not sure how to improve.
Resolve conflict in column_wise_builder.py: keep allow_resize branch logic and add main's _cleanup_dropped_record_images before drop_records.
- Keep allow_resize setup in _setup_fan_out (store _current_column_display_name) - Keep allow_resize flatten in _finalize_fan_out using _cell_resize_mode - Integrate main's _cleanup_dropped_record_images in drop path
packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py
Outdated
Show resolved
Hide resolved
packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py
Show resolved
Hide resolved
packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py
Outdated
Show resolved
Hide resolved
packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py
Show resolved
Hide resolved
|
Thanks @nabinchha for comments, fixed everything! |
- Replace getattr with direct attribute access where config is always SingleColumnConfig (custom.py, cell-by-cell path in builder) - Keep getattr in _run_full_column_generator which also handles multi-column configs without allow_resize - Restructure allow_resize validation branching in CustomColumnGenerator - Fix error message wording: "key" -> "column"
packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py
Outdated
Show resolved
Hide resolved
packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py
Show resolved
Hide resolved
packages/data-designer-engine/tests/engine/dataset_builders/test_column_wise_builder.py
Outdated
Show resolved
Hide resolved
Additional Comments (1)
When Consider either skipping the warning in Prompt To Fix With AIThis is a comment left during a code review.
Path: packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py
Line: 192:201
Comment:
**Duplicate warning log for undeclared columns**
When `_validate_output` detects undeclared columns on a non-DataFrame result (line 192), it logs the warning (lines 193-196) and then delegates to `_validate_cell_output` (line 201). However, `_validate_cell_output` will log the same undeclared columns warning again (lines 148-152). This causes the user to see the same warning message twice for a single validation issue.
Consider either skipping the warning in `_validate_output` when delegating, or having `_validate_cell_output` accept an optional flag to suppress the log:
```suggestion
result = self._validate_cell_output(result, keys_before)
```
How can I resolve this? If you propose a fix, please make it concise. |
- Remove tool_alias log from _setup_fan_out (callers already log it) - Fix docstring: CELL_BY_CELL -> FULL_COLUMN in resize test factory
packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py
Outdated
Show resolved
Hide resolved
packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py
Show resolved
Hide resolved
Inline the strip instead of delegating to _validate_cell_output, which would log the same warning a second time.
The pd import is under TYPE_CHECKING, so runtime calls need lazy.pd.
| required_columns=["topic"], | ||
| side_effect_columns=["variation_id"], | ||
| ) | ||
| def expand_topics(df: pd.DataFrame, params: None, models: dict) -> pd.DataFrame: |
There was a problem hiding this comment.
Invalid parameter name in doc example
The @custom_column_generator decorator validates parameter names at decoration time and requires the second parameter to be named generator_params, not params. This example will raise TypeError: param 2 must be 'generator_params' if a user copies and runs it (confirmed by the test at test_custom.py:427).
| def expand_topics(df: pd.DataFrame, params: None, models: dict) -> pd.DataFrame: | |
| def expand_topics(df: pd.DataFrame, generator_params: None, models: dict) -> pd.DataFrame: |
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/concepts/custom_columns.md
Line: 107
Comment:
**Invalid parameter name in doc example**
The `@custom_column_generator` decorator validates parameter names at decoration time and requires the second parameter to be named `generator_params`, not `params`. This example will raise `TypeError: param 2 must be 'generator_params'` if a user copies and runs it (confirmed by the test at `test_custom.py:427`).
```suggestion
def expand_topics(df: pd.DataFrame, generator_params: None, models: dict) -> pd.DataFrame:
```
How can I resolve this? If you propose a fix, please make it concise.
Summary
Adds
allow_resizefor custom (and plugin) column configs so generators can change record count: 1:N expansion and N:1 retraction. Supports both full_column (DataFrame in/out) and cell_by_cell (returndict | list[dict]per row). Addresses #265.Changes
Added
allow_resizeon SingleColumnConfig (defaultFalse), overridable by custom columns and pluginsTrue, buffer length may differ and_num_records_listis updated for batch bookkeeping;actual_num_recordsin dataset metadatadict | list[dict]whenallow_resize=True; builder flattens and replaces bufferChanged
update_recordsremoved; all call sites usereplace_buffer(..., allow_resize=False)orreplace_buffer(..., allow_resize=True)as appropriate. ProcessorRunner usesallow_resize=Trueallow_resizefrom base; validatorvalidate_allow_resize_requires_full_columnremovedgetattr(config, "allow_resize", False)for MultiColumnConfig compatibilityFixed
fprefix on error message in custom generator (8285b48)Description updated with AI