feat: add allow_resize for 1:N and N:1 generation patterns by andreatgretel · Pull Request #286 · NVIDIA-NeMo/DataDesigner

andreatgretel · 2026-02-03T01:06:45Z

Summary

Adds allow_resize for custom (and plugin) column configs so generators can change record count: 1:N expansion and N:1 retraction. Supports both full_column (DataFrame in/out) and cell_by_cell (return dict | list[dict] per row). Addresses #265.

Changes

Added

allow_resize on SingleColumnConfig (default False), overridable by custom columns and plugins
replace_buffer(records, *, allow_resize=False): when True, buffer length may differ and _num_records_list is updated for batch bookkeeping; actual_num_records in dataset metadata
Resize logging (💥 expand / ✂️ retract) in column_wise_builder.py
Cell-by-cell resize: custom generator accepts dict | list[dict] when allow_resize=True; builder flattens and replaces buffer
Docs and example_allow_resize.py: chained expand → retract → expand, preview + multi-batch build
Tests: config (StubColumnConfig, allow_resize inheritance), custom (dict/list/empty/invalid return), batch manager (replace_buffer + allow_resize, actual_num_records), column_wise_builder (parametrized resize specs, multi-batch)

Changed

Single buffer API: update_records removed; all call sites use replace_buffer(..., allow_resize=False) or replace_buffer(..., allow_resize=True) as appropriate. ProcessorRunner uses allow_resize=True
CustomColumnConfig inherits allow_resize from base; validator validate_allow_resize_requires_full_column removed
Builder uses getattr(config, "allow_resize", False) for MultiColumnConfig compatibility

Fixed

Missing f prefix on error message in custom generator (8285b48)

Description updated with AI

packages/data-designer-config/src/data_designer/config/column_configs.py

Adds support for generators that produce a different number of records than the input (expansion or retraction). This addresses GitHub issue #265. Changes: - Add `allow_resize` parameter to `update_records()` in DatasetBatchManager - Add `allow_resize` field to CustomColumnConfig - Add validation requiring FULL_COLUMN strategy when allow_resize=True - Track and report actual_num_records in metadata (may differ from target) - Add logging when batch size changes - Add example_allow_resize.py demonstrating the feature - Add comprehensive tests

- Merge update_records and replace_buffer into a single replace_buffer method with allow_resize parameter on DatasetBatchManager - Move allow_resize field from CustomColumnConfig to SingleColumnConfig so plugins inherit it without needing a mixin - Align example and logging with final CustomColumn API - Parametrize resize tests and extract shared stub in test_columns

- Add expand->retract->expand chaining test (single batch) - Add multi-batch resize test verifying combined parquet output - Update example to chain expand/retract/expand with preview+build - Use 💥/✂️ emojis for resize logging (expand/retract)

greptile-apps · 2026-02-12T17:09:53Z

Greptile Summary

This PR adds allow_resize support to enable 1:N (expansion) and N:1 (retraction) generation patterns in custom and plugin column generators. The core change adds allow_resize: bool = False to SingleColumnConfig, consolidates the buffer update API into a single replace_buffer(records, *, allow_resize=False) method on DatasetBatchManager, and wires resize awareness through both FULL_COLUMN and CELL_BY_CELL generation paths in ColumnWiseDatasetBuilder.

Config layer: allow_resize field added to SingleColumnConfig base class, inherited by all column types including CustomColumnConfig and plugins
Custom generator: CELL_BY_CELL generators with allow_resize=True can now return dict | list[dict] (single row, multiple rows, or empty list to drop); new _validate_cell_output validates each expanded row independently
Builder orchestration: _setup_fan_out / _finalize_fan_out collect cell results into _cell_resize_results list, then flatten and replace buffer; _run_full_column_generator passes allow_resize through to replace_buffer
Batch manager: update_records removed; replace_buffer now handles both strict 1:1 and resize modes; _actual_num_records tracks real output count in metadata
Docs: Documentation example for FULL_COLUMN resize uses incorrect parameter name params instead of the required generator_params — will raise TypeError if copied verbatim
Tests: Comprehensive coverage across unit and integration levels — parametrized preview + multi-batch tests for cell filter, cell expand, cell drop-all, full-column expand, full-column chain, and mixed cell+full chains

Confidence Score: 4/5

This PR is safe to merge after fixing the documentation parameter name error; core logic is well-tested.
The implementation is architecturally sound with a clean API consolidation (replace_buffer with allow_resize flag), proper separation between resize and non-resize code paths, and thorough test coverage including multi-batch integration tests. One documentation bug (wrong parameter name in example) would cause user-facing errors. The core engine and config changes are well-structured and the test matrix covers key scenarios (expand, retract, drop-all, chained operations).
docs/concepts/custom_columns.md has a parameter naming error that will cause runtime errors if users copy the example.

Important Files Changed

Filename	Overview
docs/concepts/custom_columns.md	Adds `allow_resize` documentation with FULL_COLUMN and CELL_BY_CELL examples. The FULL_COLUMN example uses wrong parameter name `params` instead of `generator_params`.
packages/data-designer-config/src/data_designer/config/base.py	Adds `allow_resize: bool = False` to `SingleColumnConfig` base class. Clean, minimal change that enables all subclasses to inherit the field.
packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py	Adds cell-by-cell resize support: `generate()` returns `dict
packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py	Core resize orchestration: adds `_cell_resize_mode` / `_cell_resize_results` for cell-by-cell buffer collection, resize logging, and flattening in `_finalize_fan_out`. Uses `getattr` for `MultiColumnConfig` compatibility in full-column path.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py	Consolidates `update_records` and old `replace_buffer` into single `replace_buffer(records, *, allow_resize=False)` method. Adds `_actual_num_records` tracking and metadata field. Clean API simplification.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/processor_runner.py	Updates `run_pre_batch` to use `replace_buffer(..., allow_resize=True)` since processors can change record count.
packages/data-designer-engine/tests/engine/dataset_builders/test_column_wise_builder.py	Thorough integration tests: parametrized preview and multi-batch build tests for cell filter, cell expand, cell drop-all, full-column expand, full-column chain, and mixed cell+full chain. Validates row counts and ordering.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[ColumnWiseDatasetBuilder._run_batch] --> B{Generation Strategy?}
    B -->|FULL_COLUMN| C[_run_full_column_generator]
    B -->|CELL_BY_CELL| D[_run_cell_by_cell_generator]

    C --> C1[generator.generate DataFrame]
    C1 --> C2{allow_resize?}
    C2 -->|Yes| C3[replace_buffer allow_resize=True\nupdates _num_records_list]
    C2 -->|No| C4[replace_buffer allow_resize=False\nstrict 1:1 check]

    D --> D1[_setup_fan_out]
    D1 --> D2{allow_resize?}
    D2 -->|Yes| D3[Init _cell_resize_results\n_cell_resize_mode = True]
    D2 -->|No| D4[_cell_resize_mode = False]
    D3 --> E[Fan out workers threads/async]
    D4 --> E

    E --> F[_worker_result_callback]
    F --> F1{_cell_resize_mode?}
    F1 -->|Yes| F2[Store dict/list in\n_cell_resize_results index]
    F1 -->|No| F3[update_record in buffer]

    E --> G[_finalize_fan_out]
    G --> G1{_cell_resize_mode?}
    G1 -->|Yes| G2[Flatten results\nskip _records_to_drop\nreplace_buffer allow_resize=True]
    G1 -->|No| G3{_records_to_drop?}
    G3 -->|Yes| G4[Cleanup images\ndrop_records]
    G3 -->|No| G5[No-op]

_{Last reviewed commit: 401ef32}

- Config: allow allow_resize with CELL_BY_CELL; relax validator - Custom generator: accept dict | list[dict] when cell_by_cell + allow_resize; validate per row via _validate_cell_output - Builder: collect results by index when cell allow_resize, flatten and replace_buffer; add _log_resize_if_changed and _column_display_name - Docs: ALL_CAPS for strategies, simplify allow_resize table text - Tests: parametrized preview and multibatch; factories with n param; _RESIZE_SPECS with inline factory calls; ids ordered like specs

greptile-apps

_{12 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py

- Rename specs: full_x3, cell_x2, cell_plus_full_chain; add cell_filter_odd, cell_drop_all to _RESIZE_SPECS - Stubs before specs: _resize_full_keep_first, _resize_cell_expand, _resize_cell_filter_odd, _resize_cell_drop_all; drop cell factories - Remove FULL/CELL constants; use GenerationStrategy.* in _RESIZE_SPECS - Preview/multibatch parametrize: _preview and _multibatch ids; two full_x3 multibatch cases (5_2, 4_2) first - Handle all-batches-skipped in multibatch test (empty df when path missing) - test_custom: add test_cell_by_cell_allow_resize_return_list_single (1:1 via list)

- Remove validate_allow_resize_requires_full_column from CustomColumnConfig - Rename StubColumnConfigWithoutEmoji to StubColumnConfig in test_columns - Pass allow_resize=False in _write_processed_batch replace_buffer call

andreatgretel · 2026-02-12T21:00:36Z

packages/data-designer-config/tests/config/test_columns.py

+    def required_columns(self) -> list[str]:
+        return []

-    class StubColumnConfigWithoutEmoji(SingleColumnConfig):


Moving this since the test for allow_resize uses the exact same stub

andreatgretel · 2026-02-12T21:02:41Z

docs/concepts/custom_columns.md

+)
+```
+
+**CELL_BY_CELL:** With `allow_resize=True`, your function may return a single row (`dict`) or multiple rows (`list[dict]`). Return `[]` to drop that input row.


Wasn't doing that at first, but I think it's useful, especially since functions with LLM calls will typically use the cell-by-cell strategy.

andreatgretel · 2026-02-12T21:05:01Z

...ata-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py

            raise IndexError(f"🛑 Index {index} is out of bounds for buffer of size {len(self._buffer)}.")
        self._buffer[index] = record

-    def update_records(self, records: list[dict]) -> None:


The replace_buffer function (introduced in #294) does the same as update_records but allow for the number of records to change. Here we're unifying the two. I've picked replace_buffer as the one to stay, but that can change.

andreatgretel · 2026-02-12T21:05:31Z

packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py

+        if new_count == 0:
+            logger.warning(f"⚠️ Column '{column_name}' reduced batch to 0 records. This batch will be skipped.")
+        else:
+            emoji = "💥" if new_count > original_count else "✂️"


The logging with these emojis will show up for every batch, not sure if that's needed

andreatgretel · 2026-02-12T21:05:55Z

example_allow_resize.py

Example will be removed later as usual, just for reference

NB this example uses CustomColumns but plugins are also affected by this. I've added a comment about it on docs. I've also implemented a plugin locally to test it, seems to work fine (lots of LoCs so not adding to this PR)

andreatgretel · 2026-02-12T21:08:00Z

packages/data-designer-engine/tests/engine/dataset_builders/test_column_wise_builder.py

+    return []
+
+
+_RESIZE_SPECS: dict[str, list[tuple[str, object, GenerationStrategy]]] = {


I'm testing a bunch of edge cases, using this to parametrize them. Looks a bit weird, not sure how to improve.

…ugins

Resolve conflict in column_wise_builder.py: keep allow_resize branch logic and add main's _cleanup_dropped_record_images before drop_records.

- Keep allow_resize setup in _setup_fan_out (store _current_column_display_name) - Keep allow_resize flatten in _finalize_fan_out using _cell_resize_mode - Integrate main's _cleanup_dropped_record_images in drop path

packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py

packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py

andreatgretel · 2026-02-18T20:05:18Z

Thanks @nabinchha for comments, fixed everything!

- Replace getattr with direct attribute access where config is always SingleColumnConfig (custom.py, cell-by-cell path in builder) - Keep getattr in _run_full_column_generator which also handles multi-column configs without allow_resize - Restructure allow_resize validation branching in CustomColumnGenerator - Fix error message wording: "key" -> "column"

greptile-apps

_{12 files reviewed, 4 comments}

_{Edit Code Review Agent Settings | Greptile}

packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py

packages/data-designer-engine/tests/engine/dataset_builders/test_column_wise_builder.py

greptile-apps · 2026-02-18T20:17:44Z

Additional Comments (1)

packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py
Duplicate warning log for undeclared columns

When _validate_output detects undeclared columns on a non-DataFrame result (line 192), it logs the warning (lines 193-196) and then delegates to _validate_cell_output (line 201). However, _validate_cell_output will log the same undeclared columns warning again (lines 148-152). This causes the user to see the same warning message twice for a single validation issue.

Consider either skipping the warning in _validate_output when delegating, or having _validate_cell_output accept an optional flag to suppress the log:

                result = self._validate_cell_output(result, keys_before)

Prompt To Fix With AI

This is a comment left during a code review.
Path: packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py
Line: 192:201

Comment:
**Duplicate warning log for undeclared columns**

When `_validate_output` detects undeclared columns on a non-DataFrame result (line 192), it logs the warning (lines 193-196) and then delegates to `_validate_cell_output` (line 201). However, `_validate_cell_output` will log the same undeclared columns warning again (lines 148-152). This causes the user to see the same warning message twice for a single validation issue.

Consider either skipping the warning in `_validate_output` when delegating, or having `_validate_cell_output` accept an optional flag to suppress the log:

```suggestion
                result = self._validate_cell_output(result, keys_before)
```

How can I resolve this? If you propose a fix, please make it concise.

- Remove tool_alias log from _setup_fan_out (callers already log it) - Fix docstring: CELL_BY_CELL -> FULL_COLUMN in resize test factory

nabinchha

🚢

greptile-apps

_{12 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py

packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py

Inline the strip instead of delegating to _validate_cell_output, which would log the same warning a second time.

The pd import is under TYPE_CHECKING, so runtime calls need lazy.pd.

greptile-apps

_{12 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-18T21:40:43Z

docs/concepts/custom_columns.md

+    required_columns=["topic"],
+    side_effect_columns=["variation_id"],
+)
+def expand_topics(df: pd.DataFrame, params: None, models: dict) -> pd.DataFrame:


Invalid parameter name in doc example
The @custom_column_generator decorator validates parameter names at decoration time and requires the second parameter to be named generator_params, not params. This example will raise TypeError: param 2 must be 'generator_params' if a user copies and runs it (confirmed by the test at test_custom.py:427).

Suggested change

def expand_topics(df: pd.DataFrame, params: None, models: dict) -> pd.DataFrame:

def expand_topics(df: pd.DataFrame, generator_params: None, models: dict) -> pd.DataFrame:

Prompt To Fix With AI

This is a comment left during a code review. Path: docs/concepts/custom_columns.md Line: 107 Comment: **Invalid parameter name in doc example** The `@custom_column_generator` decorator validates parameter names at decoration time and requires the second parameter to be named `generator_params`, not `params`. This example will raise `TypeError: param 2 must be 'generator_params'` if a user copies and runs it (confirmed by the test at `test_custom.py:427`). ```suggestion def expand_topics(df: pd.DataFrame, generator_params: None, models: dict) -> pd.DataFrame: ``` How can I resolve this? If you propose a fix, please make it concise.

andreatgretel force-pushed the andreatgretel/feat/custom-column branch 3 times, most recently from 3c9fa49 to 8ba264c Compare February 3, 2026 20:05

andreatgretel force-pushed the andreatgretel/feat/allow-resize branch 2 times, most recently from 0cedd56 to 8c9a33e Compare February 3, 2026 22:33

andreatgretel changed the base branch from andreatgretel/feat/custom-column to main February 3, 2026 22:35

andreatgretel mentioned this pull request Feb 3, 2026

Relax strict 1:1 verification in DatasetBatchManager.update_records (Support 1:N generation) #265

Closed

johnnygreco reviewed Feb 5, 2026

View reviewed changes

packages/data-designer-config/src/data_designer/config/column_configs.py Outdated Show resolved Hide resolved

andreatgretel mentioned this pull request Feb 11, 2026

refactor: callback-based processor design #294

Merged

andreatgretel added 3 commits February 12, 2026 11:42

docs: add allow_resize to custom columns documentation

7b31cea

andreatgretel force-pushed the andreatgretel/feat/allow-resize branch from d4a6100 to d84d799 Compare February 12, 2026 15:26

andreatgretel force-pushed the andreatgretel/feat/allow-resize branch from d84d799 to 1a16906 Compare February 12, 2026 16:53

andreatgretel marked this pull request as ready for review February 12, 2026 17:07

andreatgretel requested a review from a team as a code owner February 12, 2026 17:07

greptile-apps bot reviewed Feb 12, 2026

View reviewed changes

packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py Outdated Show resolved Hide resolved

andreatgretel added 3 commits February 12, 2026 15:16

tidy allow_resize: drop validator, shared stub, explicit flag

ab8ce11

- Remove validate_allow_resize_requires_full_column from CustomColumnConfig - Rename StubColumnConfigWithoutEmoji to StubColumnConfig in test_columns - Pass allow_resize=False in _write_processed_batch replace_buffer call

fix: add missing f prefix to error message in custom.py

8285b48

andreatgretel commented Feb 12, 2026

View reviewed changes

docs(plugins): add section on setting allow_resize=True for resize pl…

4394e76

…ugins

andreatgretel added 2 commits February 12, 2026 19:03

Merge origin/main into andreatgretel/feat/allow-resize

a163673

Resolve conflict in column_wise_builder.py: keep allow_resize branch logic and add main's _cleanup_dropped_record_images before drop_records.

Merge origin/main: resolve column_wise_builder refactor

ede1bd9

- Keep allow_resize setup in _setup_fan_out (store _current_column_display_name) - Keep allow_resize flatten in _finalize_fan_out using _cell_resize_mode - Integrate main's _cleanup_dropped_record_images in drop path

nabinchha previously approved these changes Feb 18, 2026

View reviewed changes

andreatgretel dismissed nabinchha’s stale review via 3c565ef February 18, 2026 20:11

Merge branch 'main' into andreatgretel/feat/allow-resize

f2b4f2e

greptile-apps bot reviewed Feb 18, 2026

View reviewed changes

fix: remove duplicate tool_alias log, fix test docstring

1e82937

- Remove tool_alias log from _setup_fan_out (callers already log it) - Fix docstring: CELL_BY_CELL -> FULL_COLUMN in resize test factory

nabinchha reviewed Feb 18, 2026

View reviewed changes

greptile-apps bot reviewed Feb 18, 2026

View reviewed changes

packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py Outdated Show resolved Hide resolved

packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py Show resolved Hide resolved

andreatgretel and others added 2 commits February 18, 2026 18:30

fix: avoid duplicate undeclared-column warning in _validate_output

fd2675f

Inline the strip instead of delegating to _validate_cell_output, which would log the same warning a second time.

Merge branch 'main' into andreatgretel/feat/allow-resize

c471a02

nabinchha previously approved these changes Feb 18, 2026

View reviewed changes

fix: use lazy.pd instead of pd for runtime pandas usage in tests

401ef32

The pd import is under TYPE_CHECKING, so runtime calls need lazy.pd.

andreatgretel dismissed nabinchha’s stale review via 401ef32 February 18, 2026 21:35

nabinchha approved these changes Feb 18, 2026

View reviewed changes

andreatgretel merged commit 70dc488 into main Feb 18, 2026
47 checks passed

greptile-apps bot reviewed Feb 18, 2026

View reviewed changes

andreatgretel mentioned this pull request Feb 19, 2026

chore: remove example_allow_resize.py from repo root #337

Merged

		return []


		_RESIZE_SPECS: dict[str, list[tuple[str, object, GenerationStrategy]]] = {

	def expand_topics(df: pd.DataFrame, params: None, models: dict) -> pd.DataFrame:
	def expand_topics(df: pd.DataFrame, generator_params: None, models: dict) -> pd.DataFrame:

Conversation

andreatgretel commented Feb 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Added

Changed

Fixed

Uh oh!

Uh oh!

greptile-apps bot commented Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

andreatgretel commented Feb 18, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot commented Feb 18, 2026

Uh oh!

nabinchha left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments

andreatgretel commented Feb 3, 2026 •

edited

Loading

greptile-apps bot commented Feb 12, 2026 •

edited

Loading