
DropColumnsProcessorConfig: not idempotent on re-run and no support for reasoning columns #332

@mvansegbroeck


Priority Level

Medium (Annoying but has workaround)

Describe the bug

DropColumnsProcessorConfig has two issues in notebook workflows:

  1. Not idempotent: Re-running add_processor with the same name but different column_names does not replace the existing processor. The stale config persists alongside the new one.
  2. No support for reasoning columns: column_names cannot reference <column_name>__reasoning_content to drop a single reasoning column, or a pattern such as "*__reasoning_content" to drop all reasoning columns at once. The validator rejects the entry because the literal string does not match any column name.

Steps/Code to reproduce bug

Issue 1: Re-running does not update config

config_builder.add_processor(
    dd.DropColumnsProcessorConfig(
        name="cleanup",
        column_names=["col_a"],
    )
)
data_designer.validate(config_builder)  # OK

Now change column_names and re-run the cell:

config_builder.add_processor(
    dd.DropColumnsProcessorConfig(
        name="cleanup",
        column_names=["col_b"],  # changed
    )
)
data_designer.validate(config_builder)  # Now drops ["col_a", "col_b"] instead of only ["col_b"]

Issue 2: Reasoning column names are rejected by the validator

config_builder.add_processor(
    dd.DropColumnsProcessorConfig(
        name="cleanup",
        column_names=["col_a__reasoning_content"],
    )
)
data_designer.validate(config_builder)  # Error: column does not exist
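
To make the failure mode concrete, here is a minimal sketch of how column_names entries could be resolved if reasoning columns and glob patterns were taken into account. It is not the library's validator: resolve_drop_columns, REASONING_SUFFIX, and the columns_with_reasoning argument are hypothetical names, and the sketch assumes that a reasoning column is named by appending "__reasoning_content" to its source column. Matching uses fnmatchcase from the Python standard library.

```python
from fnmatch import fnmatchcase

# Assumption: reasoning columns are named "<column_name>__reasoning_content".
REASONING_SUFFIX = "__reasoning_content"


def resolve_drop_columns(requested, declared_columns, columns_with_reasoning):
    """Expand literal names and glob patterns into concrete column names.

    All three parameters are hypothetical inputs for this sketch:
    - requested: entries as they would appear in column_names
    - declared_columns: columns defined in the config
    - columns_with_reasoning: declared columns that also emit a reasoning column
    """
    # Candidates are the declared columns plus their derived reasoning columns.
    candidates = list(declared_columns) + [
        f"{name}{REASONING_SUFFIX}" for name in columns_with_reasoning
    ]

    resolved = []
    for entry in requested:
        # A literal name matches itself; "*" and "?" act as wildcards.
        matches = [c for c in candidates if fnmatchcase(c, entry)]
        if not matches:
            raise ValueError(f"No column matches {entry!r}")
        for match in matches:
            if match not in resolved:  # keep order, avoid duplicates
                resolved.append(match)
    return resolved
```

With this kind of resolution, both "col_a__reasoning_content" and "*__reasoning_content" would pass validation, while an entry that matches nothing would still raise an error.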

Expected behavior

  1. Calling add_processor with the same name should replace the existing processor config (upsert), so notebook cells are safely re-runnable.
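  2. column_names should support dropping reasoning columns, both by explicit name (<column_name>__reasoning_content) and by a pattern such as "*__reasoning_content".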
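
The upsert semantics requested in item 1 can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: DropColumnsConfig and ProcessorRegistry are hypothetical stand-ins for DropColumnsProcessorConfig and the config builder, and the only point is that processors are keyed by name, so re-adding one replaces it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DropColumnsConfig:
    """Hypothetical stand-in for DropColumnsProcessorConfig."""
    name: str
    column_names: list[str]


@dataclass
class ProcessorRegistry:
    """Hypothetical stand-in for the processor storage in the config builder."""
    _processors: dict[str, DropColumnsConfig] = field(default_factory=dict)

    def add_processor(self, config: DropColumnsConfig) -> None:
        # Keyed by name: adding a processor whose name already exists
        # replaces the old config instead of appending a second one.
        self._processors[config.name] = config

    def columns_to_drop(self) -> list[str]:
        # Collect the columns from every registered processor, in insertion order.
        columns: list[str] = []
        for processor in self._processors.values():
            for column in processor.column_names:
                if column not in columns:
                    columns.append(column)
        return columns


registry = ProcessorRegistry()
registry.add_processor(DropColumnsConfig(name="cleanup", column_names=["col_a"]))
# Simulates re-running the notebook cell with a changed column list.
registry.add_processor(DropColumnsConfig(name="cleanup", column_names=["col_b"]))
result = registry.columns_to_drop()
```

Because the second call overwrites the "cleanup" entry, result contains only "col_b", which is the behavior a re-run notebook cell would need.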

Additional context

No response
