Skip to content

feat: add allow_resize for 1:N and N:1 generation patterns#286

Merged
andreatgretel merged 17 commits intomainfrom
andreatgretel/feat/allow-resize
Feb 18, 2026
Merged

feat: add allow_resize for 1:N and N:1 generation patterns#286
andreatgretel merged 17 commits intomainfrom
andreatgretel/feat/allow-resize

Conversation

@andreatgretel
Copy link
Contributor

@andreatgretel andreatgretel commented Feb 3, 2026

Summary

Adds allow_resize for custom (and plugin) column configs so generators can change record count: 1:N expansion and N:1 retraction. Supports both full_column (DataFrame in/out) and cell_by_cell (return dict | list[dict] per row). Addresses #265.

Changes

Added

  • allow_resize on SingleColumnConfig (default False), overridable by custom columns and plugins
  • replace_buffer(records, *, allow_resize=False): when True, buffer length may differ and _num_records_list is updated for batch bookkeeping; actual_num_records in dataset metadata
  • Resize logging (💥 expand / ✂️ retract) in column_wise_builder.py
  • Cell-by-cell resize: custom generator accepts dict | list[dict] when allow_resize=True; builder flattens and replaces buffer
  • Docs and example_allow_resize.py: chained expand → retract → expand, preview + multi-batch build
  • Tests: config (StubColumnConfig, allow_resize inheritance), custom (dict/list/empty/invalid return), batch manager (replace_buffer + allow_resize, actual_num_records), column_wise_builder (parametrized resize specs, multi-batch)

Changed

  • Single buffer API: update_records removed; all call sites use replace_buffer(..., allow_resize=False) or replace_buffer(..., allow_resize=True) as appropriate. ProcessorRunner uses allow_resize=True
  • CustomColumnConfig inherits allow_resize from base; validator validate_allow_resize_requires_full_column removed
  • Builder uses getattr(config, "allow_resize", False) for MultiColumnConfig compatibility

Fixed

  • Missing f prefix on error message in custom generator (8285b48)

Description updated with AI

@andreatgretel andreatgretel force-pushed the andreatgretel/feat/custom-column branch 3 times, most recently from 3c9fa49 to 8ba264c Compare February 3, 2026 20:05
@andreatgretel andreatgretel force-pushed the andreatgretel/feat/allow-resize branch 2 times, most recently from 0cedd56 to 8c9a33e Compare February 3, 2026 22:33
@andreatgretel andreatgretel changed the base branch from andreatgretel/feat/custom-column to main February 3, 2026 22:35
Adds support for generators that produce a different number of records
than the input (expansion or retraction). This addresses GitHub issue #265.

Changes:
- Add `allow_resize` parameter to `update_records()` in DatasetBatchManager
- Add `allow_resize` field to CustomColumnConfig
- Add validation requiring FULL_COLUMN strategy when allow_resize=True
- Track and report actual_num_records in metadata (may differ from target)
- Add logging when batch size changes
- Add example_allow_resize.py demonstrating the feature
- Add comprehensive tests
- Merge update_records and replace_buffer into a single replace_buffer
  method with allow_resize parameter on DatasetBatchManager
- Move allow_resize field from CustomColumnConfig to SingleColumnConfig
  so plugins inherit it without needing a mixin
- Align example and logging with final CustomColumn API
- Parametrize resize tests and extract shared stub in test_columns
@andreatgretel andreatgretel force-pushed the andreatgretel/feat/allow-resize branch from d4a6100 to d84d799 Compare February 12, 2026 15:26
- Add expand->retract->expand chaining test (single batch)
- Add multi-batch resize test verifying combined parquet output
- Update example to chain expand/retract/expand with preview+build
- Use 💥/✂️ emojis for resize logging (expand/retract)
@andreatgretel andreatgretel force-pushed the andreatgretel/feat/allow-resize branch from d84d799 to 1a16906 Compare February 12, 2026 16:53
@andreatgretel andreatgretel marked this pull request as ready for review February 12, 2026 17:07
@andreatgretel andreatgretel requested a review from a team as a code owner February 12, 2026 17:07
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 12, 2026

Greptile Summary

This PR adds allow_resize support to enable 1:N (expansion) and N:1 (retraction) generation patterns in custom and plugin column generators. The core change adds allow_resize: bool = False to SingleColumnConfig, consolidates the buffer update API into a single replace_buffer(records, *, allow_resize=False) method on DatasetBatchManager, and wires resize awareness through both FULL_COLUMN and CELL_BY_CELL generation paths in ColumnWiseDatasetBuilder.

  • Config layer: allow_resize field added to SingleColumnConfig base class, inherited by all column types including CustomColumnConfig and plugins
  • Custom generator: CELL_BY_CELL generators with allow_resize=True can now return dict | list[dict] (single row, multiple rows, or empty list to drop); new _validate_cell_output validates each expanded row independently
  • Builder orchestration: _setup_fan_out / _finalize_fan_out collect cell results into _cell_resize_results list, then flatten and replace buffer; _run_full_column_generator passes allow_resize through to replace_buffer
  • Batch manager: update_records removed; replace_buffer now handles both strict 1:1 and resize modes; _actual_num_records tracks real output count in metadata
  • Docs: Documentation example for FULL_COLUMN resize uses incorrect parameter name params instead of the required generator_params — will raise TypeError if copied verbatim
  • Tests: Comprehensive coverage across unit and integration levels — parametrized preview + multi-batch tests for cell filter, cell expand, cell drop-all, full-column expand, full-column chain, and mixed cell+full chains

Confidence Score: 4/5

  • This PR is safe to merge after fixing the documentation parameter name error; core logic is well-tested.
  • The implementation is architecturally sound with a clean API consolidation (replace_buffer with allow_resize flag), proper separation between resize and non-resize code paths, and thorough test coverage including multi-batch integration tests. One documentation bug (wrong parameter name in example) would cause user-facing errors. The core engine and config changes are well-structured and the test matrix covers key scenarios (expand, retract, drop-all, chained operations).
  • docs/concepts/custom_columns.md has a parameter naming error that will cause runtime errors if users copy the example.

Important Files Changed

Filename Overview
docs/concepts/custom_columns.md Adds allow_resize documentation with FULL_COLUMN and CELL_BY_CELL examples. The FULL_COLUMN example uses wrong parameter name params instead of generator_params.
packages/data-designer-config/src/data_designer/config/base.py Adds allow_resize: bool = False to SingleColumnConfig base class. Clean, minimal change that enables all subclasses to inherit the field.
packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py Adds cell-by-cell resize support: generate() returns `dict
packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py Core resize orchestration: adds _cell_resize_mode / _cell_resize_results for cell-by-cell buffer collection, resize logging, and flattening in _finalize_fan_out. Uses getattr for MultiColumnConfig compatibility in full-column path.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py Consolidates update_records and old replace_buffer into single replace_buffer(records, *, allow_resize=False) method. Adds _actual_num_records tracking and metadata field. Clean API simplification.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/processor_runner.py Updates run_pre_batch to use replace_buffer(..., allow_resize=True) since processors can change record count.
packages/data-designer-engine/tests/engine/dataset_builders/test_column_wise_builder.py Thorough integration tests: parametrized preview and multi-batch build tests for cell filter, cell expand, cell drop-all, full-column expand, full-column chain, and mixed cell+full chain. Validates row counts and ordering.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[ColumnWiseDatasetBuilder._run_batch] --> B{Generation Strategy?}
    B -->|FULL_COLUMN| C[_run_full_column_generator]
    B -->|CELL_BY_CELL| D[_run_cell_by_cell_generator]

    C --> C1[generator.generate DataFrame]
    C1 --> C2{allow_resize?}
    C2 -->|Yes| C3[replace_buffer allow_resize=True\nupdates _num_records_list]
    C2 -->|No| C4[replace_buffer allow_resize=False\nstrict 1:1 check]

    D --> D1[_setup_fan_out]
    D1 --> D2{allow_resize?}
    D2 -->|Yes| D3[Init _cell_resize_results\n_cell_resize_mode = True]
    D2 -->|No| D4[_cell_resize_mode = False]
    D3 --> E[Fan out workers threads/async]
    D4 --> E

    E --> F[_worker_result_callback]
    F --> F1{_cell_resize_mode?}
    F1 -->|Yes| F2[Store dict/list in\n_cell_resize_results index]
    F1 -->|No| F3[update_record in buffer]

    E --> G[_finalize_fan_out]
    G --> G1{_cell_resize_mode?}
    G1 -->|Yes| G2[Flatten results\nskip _records_to_drop\nreplace_buffer allow_resize=True]
    G1 -->|No| G3{_records_to_drop?}
    G3 -->|Yes| G4[Cleanup images\ndrop_records]
    G3 -->|No| G5[No-op]
Loading

Last reviewed commit: 401ef32

- Config: allow allow_resize with CELL_BY_CELL; relax validator
- Custom generator: accept dict | list[dict] when cell_by_cell + allow_resize;
  validate per row via _validate_cell_output
- Builder: collect results by index when cell allow_resize, flatten and
  replace_buffer; add _log_resize_if_changed and _column_display_name
- Docs: ALL_CAPS for strategies, simplify allow_resize table text
- Tests: parametrized preview and multibatch; factories with n param;
  _RESIZE_SPECS with inline factory calls; ids ordered like specs
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

12 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

- Rename specs: full_x3, cell_x2, cell_plus_full_chain; add cell_filter_odd,
  cell_drop_all to _RESIZE_SPECS
- Stubs before specs: _resize_full_keep_first, _resize_cell_expand,
  _resize_cell_filter_odd, _resize_cell_drop_all; drop cell factories
- Remove FULL/CELL constants; use GenerationStrategy.* in _RESIZE_SPECS
- Preview/multibatch parametrize: _preview and _multibatch ids; two full_x3
  multibatch cases (5_2, 4_2) first
- Handle all-batches-skipped in multibatch test (empty df when path missing)
- test_custom: add test_cell_by_cell_allow_resize_return_list_single (1:1 via list)
- Remove validate_allow_resize_requires_full_column from CustomColumnConfig
- Rename StubColumnConfigWithoutEmoji to StubColumnConfig in test_columns
- Pass allow_resize=False in _write_processed_batch replace_buffer call
def required_columns(self) -> list[str]:
return []

class StubColumnConfigWithoutEmoji(SingleColumnConfig):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this since the test for allow_resize uses the exact same stub

)
```

**CELL_BY_CELL:** With `allow_resize=True`, your function may return a single row (`dict`) or multiple rows (`list[dict]`). Return `[]` to drop that input row.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't doing that at first, but I think it's useful, especially since functions with LLM calls will typically use the cell-by-cell strategy.

raise IndexError(f"🛑 Index {index} is out of bounds for buffer of size {len(self._buffer)}.")
self._buffer[index] = record

def update_records(self, records: list[dict]) -> None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The replace_buffer function (introduced in #294) does the same as update_records but allow for the number of records to change. Here we're unifying the two. I've picked replace_buffer as the one to stay, but that can change.

if new_count == 0:
logger.warning(f"⚠️ Column '{column_name}' reduced batch to 0 records. This batch will be skipped.")
else:
emoji = "💥" if new_count > original_count else "✂️"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logging with these emojis will show up for every batch, not sure if that's needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example will be removed later as usual, just for reference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB this example uses CustomColumns but plugins are also affected by this. I've added a comment about it on docs. I've also implemented a plugin locally to test it, seems to work fine (lots of LoCs so not adding to this PR)

return []


_RESIZE_SPECS: dict[str, list[tuple[str, object, GenerationStrategy]]] = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm testing a bunch of edge cases, using this to parametrize them. Looks a bit weird, not sure how to improve.

Resolve conflict in column_wise_builder.py: keep allow_resize branch
logic and add main's _cleanup_dropped_record_images before drop_records.
- Keep allow_resize setup in _setup_fan_out (store _current_column_display_name)
- Keep allow_resize flatten in _finalize_fan_out using _cell_resize_mode
- Integrate main's _cleanup_dropped_record_images in drop path
@andreatgretel
Copy link
Contributor Author

Thanks @nabinchha for comments, fixed everything!

- Replace getattr with direct attribute access where config is always
  SingleColumnConfig (custom.py, cell-by-cell path in builder)
- Keep getattr in _run_full_column_generator which also handles
  multi-column configs without allow_resize
- Restructure allow_resize validation branching in CustomColumnGenerator
- Fix error message wording: "key" -> "column"
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

12 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 18, 2026

Additional Comments (1)

packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py
Duplicate warning log for undeclared columns

When _validate_output detects undeclared columns on a non-DataFrame result (line 192), it logs the warning (lines 193-196) and then delegates to _validate_cell_output (line 201). However, _validate_cell_output will log the same undeclared columns warning again (lines 148-152). This causes the user to see the same warning message twice for a single validation issue.

Consider either skipping the warning in _validate_output when delegating, or having _validate_cell_output accept an optional flag to suppress the log:

                result = self._validate_cell_output(result, keys_before)
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py
Line: 192:201

Comment:
**Duplicate warning log for undeclared columns**

When `_validate_output` detects undeclared columns on a non-DataFrame result (line 192), it logs the warning (lines 193-196) and then delegates to `_validate_cell_output` (line 201). However, `_validate_cell_output` will log the same undeclared columns warning again (lines 148-152). This causes the user to see the same warning message twice for a single validation issue.

Consider either skipping the warning in `_validate_output` when delegating, or having `_validate_cell_output` accept an optional flag to suppress the log:

```suggestion
                result = self._validate_cell_output(result, keys_before)
```

How can I resolve this? If you propose a fix, please make it concise.

- Remove tool_alias log from _setup_fan_out (callers already log it)
- Fix docstring: CELL_BY_CELL -> FULL_COLUMN in resize test factory
Copy link
Contributor

@nabinchha nabinchha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

12 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

andreatgretel and others added 2 commits February 18, 2026 18:30
Inline the strip instead of delegating to _validate_cell_output,
which would log the same warning a second time.
nabinchha
nabinchha previously approved these changes Feb 18, 2026
The pd import is under TYPE_CHECKING, so runtime calls need lazy.pd.
@andreatgretel andreatgretel merged commit 70dc488 into main Feb 18, 2026
47 checks passed
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

12 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

required_columns=["topic"],
side_effect_columns=["variation_id"],
)
def expand_topics(df: pd.DataFrame, params: None, models: dict) -> pd.DataFrame:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invalid parameter name in doc example
The @custom_column_generator decorator validates parameter names at decoration time and requires the second parameter to be named generator_params, not params. This example will raise TypeError: param 2 must be 'generator_params' if a user copies and runs it (confirmed by the test at test_custom.py:427).

Suggested change
def expand_topics(df: pd.DataFrame, params: None, models: dict) -> pd.DataFrame:
def expand_topics(df: pd.DataFrame, generator_params: None, models: dict) -> pd.DataFrame:
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/concepts/custom_columns.md
Line: 107

Comment:
**Invalid parameter name in doc example**
The `@custom_column_generator` decorator validates parameter names at decoration time and requires the second parameter to be named `generator_params`, not `params`. This example will raise `TypeError: param 2 must be 'generator_params'` if a user copies and runs it (confirmed by the test at `test_custom.py:427`).

```suggestion
def expand_topics(df: pd.DataFrame, generator_params: None, models: dict) -> pd.DataFrame:
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments