fix: make DropColumnsProcessorConfig idempotent and support reasoning columns#334
Open
andreatgretel wants to merge 4 commits intomainfrom
Open
fix: make DropColumnsProcessorConfig idempotent and support reasoning columns#334andreatgretel wants to merge 4 commits intomainfrom
andreatgretel wants to merge 4 commits intomainfrom
Conversation
… columns - add_processor now uses upsert semantics: re-adding a processor with the same name replaces the old one and reverts its drop=True side-effects, making notebook cells safely re-runnable. - validate_drop_columns_processor now includes side-effect columns (reasoning_content, trace) so reasoning columns can be dropped. Fixes #332
Contributor
Greptile SummaryThis PR fixes two bugs in
|
| Filename | Overview |
|---|---|
| packages/data-designer-config/src/data_designer/config/config_builder.py | Adds upsert semantics to add_processor via _remove_processor_by_name and glob-aware _resolve_drop_column_names. Correctly handles overlapping drop processors and reverts drop flags safely. |
| packages/data-designer-engine/src/data_designer/engine/processing/processors/drop_columns.py | Refactored to resolve column names (including globs) once via _resolve_columns, then pass the resolved list to both _save_dropped_columns and data.drop. Cleaner and avoids resolving twice. |
| packages/data-designer-engine/src/data_designer/engine/validation.py | Validation now includes side_effect_columns (reasoning/trace) in the valid column set and supports glob patterns with appropriate WARNING vs ERROR severity. Accumulates all violations instead of returning early. |
| packages/data-designer-config/tests/config/test_config_builder.py | Adds TestAddProcessorIdempotent class with 6 tests covering upsert, append, glob marking, glob revert, and overlapping drop processor scenarios. Good coverage of the new behavior. |
| packages/data-designer-engine/tests/engine/processing/processors/test_drop_columns.py | Adds two new parametrized test cases for glob patterns in DropColumnsProcessor: matching (col*) and non-matching (zzz*). |
| packages/data-designer-engine/tests/engine/test_validation.py | Adds tests for reasoning column validation and glob pattern validation. Import moved to module level. Good parametrized test coverage. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["add_processor(config)"] --> B{"Processor with\nsame name exists?"}
B -- Yes --> C["_remove_processor_by_name"]
B -- No --> F{"Is DropColumns\nprocessor?"}
C --> D{"Is existing a\nDropColumns processor?"}
D -- Yes --> E["Revert drop flags\n(unless other processor\nstill drops column)"]
D -- No --> G["Remove from list"]
E --> G
G --> F
F -- Yes --> H["_resolve_drop_column_names\n(expand globs)"]
H --> I["Set drop=True on\nmatching column configs"]
F -- No --> J["Append processor"]
I --> J
Last reviewed commit: daed9b5
packages/data-designer-config/src/data_designer/config/config_builder.py
Show resolved
Hide resolved
- Use parametrize for reasoning column validation cases - Extract _add_sampler helper to avoid repeated SamplerColumnConfig setup - Move validate_drop_columns_processor import to top of file
Patterns like "*__reasoning_content" or "col_*" are now expanded against available columns at validation time and at runtime. Validation emits a warning when a glob pattern matches no columns.
When removing a DropColumnsProcessor, only revert drop=True on columns that are not also dropped by another processor.
Comment on lines
+420
to
+428
| def _resolve_drop_column_names(self, column_names: list[str]) -> list[str]: | ||
| """Resolve column names, expanding glob patterns against known column configs.""" | ||
| resolved = [] | ||
| for name in column_names: | ||
| if any(c in name for c in "*?["): | ||
| resolved.extend(fnmatch.filter(self._column_configs.keys(), name)) | ||
| elif name in self._column_configs: | ||
| resolved.append(name) | ||
| return resolved |
Contributor
There was a problem hiding this comment.
Possible duplicates from overlapping patterns
_resolve_drop_column_names can return duplicate column names when column_names contains both an explicit name and a glob that matches it (e.g., ["col_a", "col_*"]). This doesn't cause a bug in current usage — setting drop = True or False twice is harmless — but it could lead to subtle issues if this method is reused elsewhere. Consider deduplicating while preserving order:
Suggested change
| def _resolve_drop_column_names(self, column_names: list[str]) -> list[str]: | |
| """Resolve column names, expanding glob patterns against known column configs.""" | |
| resolved = [] | |
| for name in column_names: | |
| if any(c in name for c in "*?["): | |
| resolved.extend(fnmatch.filter(self._column_configs.keys(), name)) | |
| elif name in self._column_configs: | |
| resolved.append(name) | |
| return resolved | |
| def _resolve_drop_column_names(self, column_names: list[str]) -> list[str]: | |
| """Resolve column names, expanding glob patterns against known column configs.""" | |
| seen: set[str] = set() | |
| resolved = [] | |
| for name in column_names: | |
| if any(c in name for c in "*?["): | |
| for match in fnmatch.filter(self._column_configs.keys(), name): | |
| if match not in seen: | |
| seen.add(match) | |
| resolved.append(match) | |
| elif name in self._column_configs and name not in seen: | |
| seen.add(name) | |
| resolved.append(name) | |
| return resolved |
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/data-designer-config/src/data_designer/config/config_builder.py
Line: 420-428
Comment:
**Possible duplicates from overlapping patterns**
`_resolve_drop_column_names` can return duplicate column names when `column_names` contains both an explicit name and a glob that matches it (e.g., `["col_a", "col_*"]`). This doesn't cause a bug in current usage — setting `drop = True` or `False` twice is harmless — but it could lead to subtle issues if this method is reused elsewhere. Consider deduplicating while preserving order:
```suggestion
def _resolve_drop_column_names(self, column_names: list[str]) -> list[str]:
"""Resolve column names, expanding glob patterns against known column configs."""
seen: set[str] = set()
resolved = []
for name in column_names:
if any(c in name for c in "*?["):
for match in fnmatch.filter(self._column_configs.keys(), name):
if match not in seen:
seen.add(match)
resolved.append(match)
elif name in self._column_configs and name not in seen:
seen.add(name)
resolved.append(name)
return resolved
```
How can I resolve this? If you propose a fix, please make it concise.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
📋 Summary
Fixes two bugs in
DropColumnsProcessorConfigthat affect notebook workflows: re-runningadd_processorwith the same name now replaces the old config (upsert), and reasoning/trace columns can now be dropped.Fixes #332
🔄 Changes
🐛 Fixed
add_processornow uses upsert semantics — calling it with the same processor name replaces the existing processor and reverts staledrop=Trueflags on columns, making notebook cells safely re-runnablevalidate_drop_columns_processornow includes side-effect columns (__reasoning_content,__trace) in the set of valid column names, so reasoning columns can be dropped without validation errors🧪 Tests
TestAddProcessorIdempotent: 3 tests covering upsert-replaces-by-name, different-names-append, and non-drop-processor replacementtest_validate_drop_columns_processor_accepts_reasoning_columns: reasoning column accepted whenextract_reasoning_content=Truetest_validate_drop_columns_processor_rejects_invalid_side_effect_column: still rejects__reasoning_contentwhen the flag is not enabled🔍 Attention Areas
config_builder.py#_remove_processor_by_name— New private method that removes an existing processor and undoes itsdrop=Trueside-effects. Verify the revert logic is correct when aDropColumnsProcessorlisted columns that are not in_column_configs(e.g., reasoning columns).🤖 Generated with AI