fix: make DropColumnsProcessorConfig idempotent and support reasoning columns#334

Open
andreatgretel wants to merge 4 commits into main from andreatgretel/fix/drop-columns-processor

@andreatgretel
Contributor

📋 Summary

Fixes two bugs in DropColumnsProcessorConfig that affect notebook workflows: re-running add_processor with the same name now replaces the old config (upsert), and reasoning/trace columns can now be dropped.

Fixes #332

🔄 Changes

🐛 Fixed

  • add_processor now uses upsert semantics — calling it with the same processor name replaces the existing processor and reverts stale drop=True flags on columns, making notebook cells safely re-runnable
  • validate_drop_columns_processor now includes side-effect columns (__reasoning_content, __trace) in the set of valid column names, so reasoning columns can be dropped without validation errors
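
The second fix can be pictured with a minimal sketch. This is not the project's code: `LLMColumn`, `valid_drop_targets`, and `invalid_drop_columns` are hypothetical names, and the sketch assumes the reasoning side-effect column is named by appending `__reasoning_content` to the generating column's name.

```python
from dataclasses import dataclass

# Suffix named in the PR description; how it is attached to a column name is an assumption.
REASONING_SUFFIX = "__reasoning_content"


@dataclass
class LLMColumn:
    """Hypothetical stand-in for a column config."""
    name: str
    extract_reasoning_content: bool = False


def valid_drop_targets(columns: list[LLMColumn]) -> set[str]:
    """Names a drop processor may reference, including enabled side-effect columns."""
    valid = {c.name for c in columns}
    for c in columns:
        if c.extract_reasoning_content:
            valid.add(f"{c.name}{REASONING_SUFFIX}")
    return valid


def invalid_drop_columns(requested: list[str], columns: list[LLMColumn]) -> list[str]:
    """Return the requested names that are neither columns nor enabled side-effect columns."""
    valid = valid_drop_targets(columns)
    return [name for name in requested if name not in valid]
```

With this shape, a reasoning column is accepted only when the flag is enabled on its parent column, which mirrors the accept and reject cases listed under Tests.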

🧪 Tests

  • TestAddProcessorIdempotent: 3 tests covering upsert-replaces-by-name, different-names-append, and non-drop-processor replacement
  • test_validate_drop_columns_processor_accepts_reasoning_columns: reasoning column accepted when extract_reasoning_content=True
  • test_validate_drop_columns_processor_rejects_invalid_side_effect_column: still rejects __reasoning_content when the flag is not enabled

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:

  • config_builder.py#_remove_processor_by_name — New private method that removes an existing processor and undoes its drop=True side-effects. Verify the revert logic is correct when a DropColumnsProcessor listed columns that are not in _column_configs (e.g., reasoning columns).

🤖 Generated with AI

… columns

- add_processor now uses upsert semantics: re-adding a processor with the
  same name replaces the old one and reverts its drop=True side-effects,
  making notebook cells safely re-runnable.
- validate_drop_columns_processor now includes side-effect columns
  (__reasoning_content, __trace) so reasoning columns can be dropped.

Fixes #332
@andreatgretel andreatgretel requested a review from a team as a code owner February 18, 2026 20:12
@greptile-apps
Contributor

greptile-apps bot commented Feb 18, 2026

Greptile Summary

This PR fixes two bugs in DropColumnsProcessorConfig that affect notebook workflows:

  • Idempotent add_processor: Calling add_processor with the same processor name now replaces the old processor (upsert semantics) and correctly reverts stale drop=True flags on columns, checking that no other processor still needs the flag before reverting. This makes notebook cells safely re-runnable.
  • Glob pattern support: Column names in DropColumnsProcessorConfig now support glob patterns (e.g., col_*) across all three layers — config builder, runtime processor, and validation.
  • Reasoning/trace column support: validate_drop_columns_processor now includes side-effect columns (__reasoning_content, __trace) in the set of valid column names, so reasoning columns can be dropped without validation errors.
  • Validation now accumulates all violations instead of returning early after the first invalid column, and differentiates glob non-matches (WARNING) from explicit non-matches (ERROR).
  • Good test coverage with 6 idempotent tests, 2 glob processor tests, and 4 validation tests.
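
The validation behaviour summarised above (collect every violation, warn on globs that match nothing, error on explicit names that do not exist) might look roughly like the sketch below. `Level` and `check_drop_columns` are illustrative names, not the engine's actual API; only `fnmatch.filter` and the `"*?["` glob check come from the reviewed code.

```python
import fnmatch
from enum import Enum


class Level(Enum):
    """Hypothetical severity levels for a violation."""
    WARNING = "warning"
    ERROR = "error"


GLOB_CHARS = "*?["  # same detection rule as the snippet under review


def check_drop_columns(requested: list[str], available: set[str]) -> list[tuple[Level, str]]:
    """Collect all violations instead of stopping at the first one."""
    violations: list[tuple[Level, str]] = []
    for name in requested:
        if any(ch in name for ch in GLOB_CHARS):
            # A pattern that matches nothing is suspicious but not fatal.
            if not fnmatch.filter(sorted(available), name):
                violations.append((Level.WARNING, f"pattern {name!r} matches no columns"))
        elif name not in available:
            # An explicit name that does not exist is a hard error.
            violations.append((Level.ERROR, f"column {name!r} does not exist"))
    return violations
```

Returning the full list lets a caller report every problem in one pass and decide separately whether warnings should block a run.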

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk — the logic changes are well-tested and the edge case of overlapping drop processors is handled correctly.
  • The code changes are logically sound: upsert semantics with proper flag revert, glob support across all layers, and side-effect column inclusion in validation. The overlapping drop processor edge case (flagged in a previous review) has been fixed. Test coverage is thorough. The only minor concern is duplicated glob detection logic across three files, but this is a style issue rather than a correctness problem.
  • config_builder.py deserves the most attention due to the _remove_processor_by_name side-effect revert logic, but it appears correct after careful review.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-config/src/data_designer/config/config_builder.py | Adds upsert semantics to add_processor via _remove_processor_by_name and glob-aware _resolve_drop_column_names. Correctly handles overlapping drop processors and reverts drop flags safely. |
| packages/data-designer-engine/src/data_designer/engine/processing/processors/drop_columns.py | Refactored to resolve column names (including globs) once via _resolve_columns, then pass the resolved list to both _save_dropped_columns and data.drop. Cleaner and avoids resolving twice. |
| packages/data-designer-engine/src/data_designer/engine/validation.py | Validation now includes side_effect_columns (reasoning/trace) in the valid column set and supports glob patterns with appropriate WARNING vs ERROR severity. Accumulates all violations instead of returning early. |
| packages/data-designer-config/tests/config/test_config_builder.py | Adds TestAddProcessorIdempotent class with 6 tests covering upsert, append, glob marking, glob revert, and overlapping drop processor scenarios. Good coverage of the new behavior. |
| packages/data-designer-engine/tests/engine/processing/processors/test_drop_columns.py | Adds two new parametrized test cases for glob patterns in DropColumnsProcessor: matching (col*) and non-matching (zzz*). |
| packages/data-designer-engine/tests/engine/test_validation.py | Adds tests for reasoning column validation and glob pattern validation. Import moved to module level. Good parametrized test coverage. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["add_processor(config)"] --> B{"Processor with\nsame name exists?"}
    B -- Yes --> C["_remove_processor_by_name"]
    B -- No --> F{"Is DropColumns\nprocessor?"}
    C --> D{"Is existing a\nDropColumns processor?"}
    D -- Yes --> E["Revert drop flags\n(unless other processor\nstill drops column)"]
    D -- No --> G["Remove from list"]
    E --> G
    G --> F
    F -- Yes --> H["_resolve_drop_column_names\n(expand globs)"]
    H --> I["Set drop=True on\nmatching column configs"]
    F -- No --> J["Append processor"]
    I --> J

Last reviewed commit: daed9b5
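
For readers who prefer code to the flowchart, the same control flow can be sketched as follows. This is a simplified model under stated assumptions, not the contents of config_builder.py: `Column`, `Processor`, and `Builder` are hypothetical stand-ins, and a processor with `column_names=None` represents any non-drop processor.

```python
import fnmatch
from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    name: str
    drop: bool = False


@dataclass
class Processor:
    name: str
    column_names: Optional[list[str]] = None  # None means "not a drop-columns processor"


class Builder:
    def __init__(self, columns: list[Column]) -> None:
        self.columns = {c.name: c for c in columns}
        self.processors: list[Processor] = []

    def _resolve(self, names: list[str]) -> list[str]:
        """Expand glob patterns against known columns; skip unknown explicit names."""
        resolved: list[str] = []
        for name in names:
            if any(ch in name for ch in "*?["):
                resolved.extend(fnmatch.filter(self.columns, name))
            elif name in self.columns:
                resolved.append(name)
        return resolved

    def _remove_by_name(self, name: str) -> None:
        old = next((p for p in self.processors if p.name == name), None)
        if old is None:
            return
        self.processors.remove(old)
        if old.column_names is None:
            return
        # Columns that some remaining drop processor still wants dropped.
        still_dropped: set[str] = set()
        for p in self.processors:
            if p.column_names is not None:
                still_dropped.update(self._resolve(p.column_names))
        for col in self._resolve(old.column_names):
            if col not in still_dropped:
                self.columns[col].drop = False

    def add_processor(self, proc: Processor) -> None:
        self._remove_by_name(proc.name)  # upsert: replace any same-named processor
        if proc.column_names is not None:
            for col in self._resolve(proc.column_names):
                self.columns[col].drop = True
        self.processors.append(proc)
```

Because `_resolve` ignores names that are not known column configs, reverting a processor that listed a reasoning column is a no-op for that entry in this sketch. Whether the real `_remove_processor_by_name` behaves the same way is exactly the point flagged under Attention Areas.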

Contributor

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 2 comments


- Use parametrize for reasoning column validation cases
- Extract _add_sampler helper to avoid repeated SamplerColumnConfig setup
- Move validate_drop_columns_processor import to top of file
Contributor

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 2 comments


Patterns like "*__reasoning_content" or "col_*" are now expanded against
available columns at validation time and at runtime. Validation emits a
warning when a glob pattern matches no columns.
When removing a DropColumnsProcessor, only revert drop=True on columns
that are not also dropped by another processor.
Contributor

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 1 comment


Comment on lines +420 to +428
    def _resolve_drop_column_names(self, column_names: list[str]) -> list[str]:
        """Resolve column names, expanding glob patterns against known column configs."""
        resolved = []
        for name in column_names:
            if any(c in name for c in "*?["):
                resolved.extend(fnmatch.filter(self._column_configs.keys(), name))
            elif name in self._column_configs:
                resolved.append(name)
        return resolved
Contributor

Possible duplicates from overlapping patterns

_resolve_drop_column_names can return duplicate column names when column_names contains both an explicit name and a glob that matches it (e.g., ["col_a", "col_*"]). This doesn't cause a bug in current usage — setting drop = True or False twice is harmless — but it could lead to subtle issues if this method is reused elsewhere. Consider deduplicating while preserving order:

Suggested change
-    def _resolve_drop_column_names(self, column_names: list[str]) -> list[str]:
-        """Resolve column names, expanding glob patterns against known column configs."""
-        resolved = []
-        for name in column_names:
-            if any(c in name for c in "*?["):
-                resolved.extend(fnmatch.filter(self._column_configs.keys(), name))
-            elif name in self._column_configs:
-                resolved.append(name)
-        return resolved
+    def _resolve_drop_column_names(self, column_names: list[str]) -> list[str]:
+        """Resolve column names, expanding glob patterns against known column configs."""
+        seen: set[str] = set()
+        resolved = []
+        for name in column_names:
+            if any(c in name for c in "*?["):
+                for match in fnmatch.filter(self._column_configs.keys(), name):
+                    if match not in seen:
+                        seen.add(match)
+                        resolved.append(match)
+            elif name in self._column_configs and name not in seen:
+                seen.add(name)
+                resolved.append(name)
+        return resolved
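
As a possible alternative to tracking a `seen` set, the same order-preserving deduplication can be sketched with `dict.fromkeys`, which keeps the first occurrence of each key. The standalone function `resolve_unique` and its `known` parameter are illustrative only and are not part of the PR.

```python
import fnmatch


def resolve_unique(column_names: list[str], known: list[str]) -> list[str]:
    """Resolve names and globs against `known`, dropping duplicates but keeping order."""
    matches: list[str] = []
    for name in column_names:
        if any(ch in name for ch in "*?["):
            matches.extend(fnmatch.filter(known, name))
        elif name in known:
            matches.append(name)
    # dict preserves insertion order, so this keeps the first occurrence of each name.
    return list(dict.fromkeys(matches))
```

For example, resolving `["col_a", "col_*"]` against columns `col_a` and `col_b` would yield each column once rather than listing `col_a` twice.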


Development

Successfully merging this pull request may close these issues.

DropColumnsProcessorConfig: not idempotent on re-run and no support for reasoning columns
