fix: make DropColumnsProcessorConfig idempotent and support reasoning columns by andreatgretel · Pull Request #334 · NVIDIA-NeMo/DataDesigner

andreatgretel · 2026-02-18T20:12:33Z

📋 Summary

Fixes two bugs in DropColumnsProcessorConfig that affect notebook workflows: re-running add_processor with the same name now replaces the old config (upsert), and reasoning/trace columns can now be dropped.

Fixes #332

🔄 Changes

🐛 Fixed

add_processor now uses upsert semantics — calling it with the same processor name replaces the existing processor and reverts stale drop=True flags on columns, making notebook cells safely re-runnable
validate_drop_columns_processor now includes side-effect columns (__reasoning_content, __trace) in the set of valid column names, so reasoning columns can be dropped without validation errors
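As a rough illustration of the second fix, the valid-name set can be pictured as the declared column names plus any side-effect columns those columns emit. This is a minimal sketch, not the code in validation.py: `ColumnSketch`, the `with_trace` flag, and the `<column>__suffix` naming are assumptions made for the example; only `extract_reasoning_content` and the two suffixes appear in this PR.

```python
from dataclasses import dataclass


@dataclass
class ColumnSketch:
    """Hypothetical stand-in for a column config; not the library's real class."""

    name: str
    extract_reasoning_content: bool = False
    with_trace: bool = False  # assumed flag name, for illustration only


def valid_drop_targets(columns):
    """Return declared column names plus the side-effect columns they emit."""
    names = set()
    for col in columns:
        names.add(col.name)
        if col.extract_reasoning_content:
            # Assumed naming: the side-effect column is "<column>__reasoning_content".
            names.add(f"{col.name}__reasoning_content")
        if col.with_trace:
            names.add(f"{col.name}__trace")
    return names


cols = [ColumnSketch("answer", extract_reasoning_content=True), ColumnSketch("topic")]
targets = valid_drop_targets(cols)
```

With this set, `answer__reasoning_content` is an acceptable drop target, while `topic__reasoning_content` is still rejected because the flag is off for that column.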

🧪 Tests

TestAddProcessorIdempotent: 3 tests covering upsert-replaces-by-name, different-names-append, and non-drop-processor replacement
test_validate_drop_columns_processor_accepts_reasoning_columns: reasoning column accepted when extract_reasoning_content=True
test_validate_drop_columns_processor_rejects_invalid_side_effect_column: still rejects __reasoning_content when the flag is not enabled

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:

config_builder.py#_remove_processor_by_name — New private method that removes an existing processor and undoes its drop=True side-effects. Verify the revert logic is correct when a DropColumnsProcessor listed columns that are not in _column_configs (e.g., reasoning columns).

🤖 Generated with AI

… columns

- add_processor now uses upsert semantics: re-adding a processor with the same name replaces the old one and reverts its drop=True side-effects, making notebook cells safely re-runnable.
- validate_drop_columns_processor now includes side-effect columns (reasoning_content, trace) so reasoning columns can be dropped.

Fixes #332

greptile-apps · 2026-02-18T20:15:34Z

Greptile Summary

This PR fixes two bugs in DropColumnsProcessorConfig that affect notebook workflows:

Idempotent add_processor: Calling add_processor with the same processor name now replaces the old processor (upsert semantics) and correctly reverts stale drop=True flags on columns, checking that no other processor still needs the flag before reverting. This makes notebook cells safely re-runnable.
Glob pattern support: Column names in DropColumnsProcessorConfig now support glob patterns (e.g., col_*) across all three layers — config builder, runtime processor, and validation.
Reasoning/trace column support: validate_drop_columns_processor now includes side-effect columns (__reasoning_content, __trace) in the set of valid column names, so reasoning columns can be dropped without validation errors.
Validation now accumulates all violations instead of returning early after the first invalid column, and differentiates glob non-matches (WARNING) from explicit non-matches (ERROR).
Good test coverage with 6 idempotent tests, 2 glob processor tests, and 4 validation tests.
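The validation behavior summarized above (accumulate every violation, WARNING for a glob that matches nothing, ERROR for an explicit name that does not exist) might look roughly like the following. This is a sketch under assumptions: `Violation` and `check_drop_columns` are hypothetical names, and the real validator's types and messages may differ.

```python
import fnmatch
from dataclasses import dataclass

GLOB_CHARS = "*?["  # same glob detection used in the PR's resolver


@dataclass(frozen=True)
class Violation:
    """Hypothetical violation record; the real validator has its own type."""

    level: str  # "WARNING" or "ERROR"
    message: str


def check_drop_columns(column_names, valid_names):
    """Collect every problem instead of returning on the first one."""
    violations = []
    for name in column_names:
        if any(ch in name for ch in GLOB_CHARS):
            # A pattern that matches nothing is suspicious but not fatal.
            if not fnmatch.filter(valid_names, name):
                violations.append(Violation("WARNING", f"pattern {name!r} matches no columns"))
        elif name not in valid_names:
            # An explicit name that does not exist is a hard error.
            violations.append(Violation("ERROR", f"column {name!r} does not exist"))
    return violations


found = check_drop_columns(["col_*", "zzz_*", "missing", "also_missing"], ["col_a", "col_b"])
```

Here `col_*` passes, `zzz_*` yields one warning, and both missing explicit names are reported as errors in a single pass.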

Confidence Score: 4/5

This PR is safe to merge with minimal risk — the logic changes are well-tested and the edge case of overlapping drop processors is handled correctly.
The code changes are logically sound: upsert semantics with proper flag revert, glob support across all layers, and side-effect column inclusion in validation. The overlapping drop processor edge case (flagged in a previous review) has been fixed. Test coverage is thorough. The only minor concern is duplicated glob detection logic across three files, but this is a style issue rather than a correctness problem.
config_builder.py deserves the most attention due to the _remove_processor_by_name side-effect revert logic, but it appears correct after careful review.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-config/src/data_designer/config/config_builder.py | Adds upsert semantics to `add_processor` via `_remove_processor_by_name` and glob-aware `_resolve_drop_column_names`. Correctly handles overlapping drop processors and reverts `drop` flags safely. |
| packages/data-designer-engine/src/data_designer/engine/processing/processors/drop_columns.py | Refactored to resolve column names (including globs) once via `_resolve_columns`, then pass the resolved list to both `_save_dropped_columns` and `data.drop`. Cleaner and avoids resolving twice. |
| packages/data-designer-engine/src/data_designer/engine/validation.py | Validation now includes `side_effect_columns` (reasoning/trace) in the valid column set and supports glob patterns with appropriate WARNING vs ERROR severity. Accumulates all violations instead of returning early. |
| packages/data-designer-config/tests/config/test_config_builder.py | Adds `TestAddProcessorIdempotent` class with 6 tests covering upsert, append, glob marking, glob revert, and overlapping drop processor scenarios. Good coverage of the new behavior. |
| packages/data-designer-engine/tests/engine/processing/processors/test_drop_columns.py | Adds two new parametrized test cases for glob patterns in `DropColumnsProcessor`: matching (`col`) and non-matching (`zzz`). |
| packages/data-designer-engine/tests/engine/test_validation.py | Adds tests for reasoning column validation and glob pattern validation. Import moved to module level. Good parametrized test coverage. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["add_processor(config)"] --> B{"Processor with\nsame name exists?"}
    B -- Yes --> C["_remove_processor_by_name"]
    B -- No --> F{"Is DropColumns\nprocessor?"}
    C --> D{"Is existing a\nDropColumns processor?"}
    D -- Yes --> E["Revert drop flags\n(unless other processor\nstill drops column)"]
    D -- No --> G["Remove from list"]
    E --> G
    G --> F
    F -- Yes --> H["_resolve_drop_column_names\n(expand globs)"]
    H --> I["Set drop=True on\nmatching column configs"]
    F -- No --> J["Append processor"]
    I --> J

Last reviewed commit: daed9b5
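The flow in the chart can be traced in a few lines of Python. This is a simplified sketch rather than the code in config_builder.py: `Column`, `DropProcessor`, and `BuilderSketch` are hypothetical stand-ins, and only the control flow (remove by name, revert flags unless another processor still drops the column, expand globs, append) follows the description.

```python
import fnmatch
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    drop: bool = False


@dataclass
class DropProcessor:
    name: str
    column_names: list


class BuilderSketch:
    """Hypothetical builder; mirrors the flowchart, not the real class."""

    def __init__(self, columns):
        self.columns = {c.name: c for c in columns}
        self.processors = []

    def _resolve(self, patterns):
        # Expand globs against known columns; unknown explicit names are skipped.
        resolved = []
        for p in patterns:
            if any(ch in p for ch in "*?["):
                resolved.extend(fnmatch.filter(self.columns, p))
            elif p in self.columns:
                resolved.append(p)
        return resolved

    def _remove_by_name(self, name):
        old = next((p for p in self.processors if p.name == name), None)
        if old is None:
            return
        self.processors = [p for p in self.processors if p is not old]
        if isinstance(old, DropProcessor):
            # Columns that some remaining drop processor still wants dropped.
            still_dropped = set()
            for p in self.processors:
                if isinstance(p, DropProcessor):
                    still_dropped.update(self._resolve(p.column_names))
            for col in self._resolve(old.column_names):
                if col not in still_dropped:
                    self.columns[col].drop = False

    def add_processor(self, proc):
        self._remove_by_name(proc.name)  # upsert: replace any same-named processor
        if isinstance(proc, DropProcessor):
            for col in self._resolve(proc.column_names):
                self.columns[col].drop = True
        self.processors.append(proc)


b = BuilderSketch([Column("col_a"), Column("col_b"), Column("keep")])
b.add_processor(DropProcessor("cleanup", ["col_*"]))
b.add_processor(DropProcessor("extra", ["col_b"]))
b.add_processor(DropProcessor("cleanup", ["col_b"]))  # re-run with a narrower list
```

After the re-run, `col_a` is no longer marked for dropping, while `col_b` stays marked throughout because the `extra` processor still references it.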


greptile-apps · 2026-02-18T21:54:47Z

packages/data-designer-config/src/data_designer/config/config_builder.py

+    def _resolve_drop_column_names(self, column_names: list[str]) -> list[str]:
+        """Resolve column names, expanding glob patterns against known column configs."""
+        resolved = []
+        for name in column_names:
+            if any(c in name for c in "*?["):
+                resolved.extend(fnmatch.filter(self._column_configs.keys(), name))
+            elif name in self._column_configs:
+                resolved.append(name)
+        return resolved


Possible duplicates from overlapping patterns

_resolve_drop_column_names can return duplicate column names when column_names contains both an explicit name and a glob that matches it (e.g., ["col_a", "col_*"]). This doesn't cause a bug in current usage — setting drop = True or False twice is harmless — but it could lead to subtle issues if this method is reused elsewhere. Consider deduplicating while preserving order:

Suggested change

-    def _resolve_drop_column_names(self, column_names: list[str]) -> list[str]:
-        """Resolve column names, expanding glob patterns against known column configs."""
-        resolved = []
-        for name in column_names:
-            if any(c in name for c in "*?["):
-                resolved.extend(fnmatch.filter(self._column_configs.keys(), name))
-            elif name in self._column_configs:
-                resolved.append(name)
-        return resolved
+    def _resolve_drop_column_names(self, column_names: list[str]) -> list[str]:
+        """Resolve column names, expanding glob patterns against known column configs."""
+        seen: set[str] = set()
+        resolved = []
+        for name in column_names:
+            if any(c in name for c in "*?["):
+                for match in fnmatch.filter(self._column_configs.keys(), name):
+                    if match not in seen:
+                        seen.add(match)
+                        resolved.append(match)
+            elif name in self._column_configs and name not in seen:
+                seen.add(name)
+                resolved.append(name)
+        return resolved
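The duplicate case described in this comment is easy to reproduce outside the builder. The snippet below is a standalone sketch with made-up column names; it also shows `dict.fromkeys` as a shorter order-preserving alternative to the `seen` set, which is a swapped-in technique and not what the suggestion proposes.

```python
import fnmatch

known = ["col_a", "col_b", "other"]  # stand-in for the known column config keys
requested = ["col_a", "col_*"]  # explicit name plus a glob that also matches it

resolved = []
for name in requested:
    if any(c in name for c in "*?["):
        resolved.extend(fnmatch.filter(known, name))
    elif name in known:
        resolved.append(name)

# dict keys keep insertion order (Python 3.7+), so this drops repeats in place.
deduped = list(dict.fromkeys(resolved))
```

`resolved` contains `col_a` twice here, while `deduped` keeps each name once in first-seen order.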


andreatgretel requested a review from a team as a code owner February 18, 2026 20:12

greptile-apps bot reviewed Feb 18, 2026


test: reduce duplication in drop-columns tests

48709f2

- Use parametrize for reasoning column validation cases
- Extract _add_sampler helper to avoid repeated SamplerColumnConfig setup
- Move validate_drop_columns_processor import to top of file

greptile-apps bot reviewed Feb 18, 2026


feat: support glob patterns in DropColumnsProcessorConfig column_names

37486c0

Patterns like "*__reasoning_content" or "col_*" are now expanded against available columns at validation time and at runtime. Validation emits a warning when a glob pattern matches no columns.

andreatgretel mentioned this pull request Feb 18, 2026

DropColumnsProcessorConfig: not idempotent on re-run and no support for reasoning columns #332

Open

fix: preserve drop flag when column is referenced by other processors

daed9b5

When removing a DropColumnsProcessor, only revert drop=True on columns that are not also dropped by another processor.

greptile-apps bot reviewed Feb 18, 2026

andreatgretel wants to merge 4 commits into main from andreatgretel/fix/drop-columns-processor
