
predict_as_dataframe iterates the dataloader twice, causing wasted I/O and silent misalignment risk #880

@sevmag

Description


Problem

EasySyntax.predict_as_dataframe currently makes two full passes over the user's DataLoader:

  1. Trainer.predict(self, dataloader) — produces the model outputs.
  2. A subsequent for batch in dataloader: ... loop in predict_as_dataframe itself — pulls additional_attributes (e.g. event_no, energy, azimuth) out of each batch and concatenates them.

The two passes are then assumed to align row-for-row by index. This has several real downsides:
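To make the shape of the problem concrete, here is a minimal sketch of the current two-pass pattern (function and variable names are illustrative, not the actual graphnet implementation):

```python
import numpy as np

def predict_as_dataframe_current(trainer_predict, dataloader, additional_attributes):
    # Pass 1: model forward over the whole loader.
    predictions = trainer_predict(dataloader)

    # Pass 2: re-iterate the same loader just to collect a few columns.
    attributes = {name: [] for name in additional_attributes}
    for batch in dataloader:  # re-reads every event from storage
        for name in additional_attributes:
            attributes[name].extend(np.asarray(getattr(batch, name)).tolist())

    # Rows of `predictions` and `attributes` are assumed to align by position.
    return predictions, attributes
```

Everything after the first pass is pure overhead: the attributes were already present in each batch the model consumed.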

1. Wasted I/O and time

The second loop re-reads every event from the underlying SQLite/Parquet/HDF5 file just to grab a handful of scalar columns. For large inference jobs this can roughly double the wall-clock time of predict_as_dataframe, with all of the extra cost going to disk reads and collate_fn work that the prediction pass has already paid for.

2. Sampler restriction is a workaround, not a fix

Because alignment depends on both passes producing the same batch order, the current code raises if the loader's sampler is not a SequentialSampler:

DataLoader has a sampler that is not SequentialSampler, indicating that shuffling is enabled. (...) Either call this method a dataloader which doesn't resample batches; or do not request additional_attributes.

This forces users to construct a second, non-shuffled DataLoader for evaluation — even when their existing one would work fine semantically — and silently rules out anything with stochastic sampling, weighted sampling, or distributed samplers.

3. Silent misalignment when the loader is non-deterministic in practice

The SequentialSampler check is necessary but not sufficient. Anything that makes the second pass differ from the first — e.g. a collate_fn that filters events, multi-worker loading where workers were re-seeded, an underlying dataset whose __getitem__ is not pure (re-shuffled per epoch, augmentations with a fresh RNG) — produces a DataFrame whose additional_attributes columns are misaligned with the predictions, with no error raised. This is a particularly nasty failure mode because the resulting DataFrame still looks valid and downstream analysis just produces wrong physics.
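A toy demonstration of this failure mode (ReshufflingLoader is a made-up stand-in, not a graphnet class): the two passes contain the same events, nothing raises, yet the row order differs.

```python
import random

class ReshufflingLoader:
    """Toy stand-in for a loader whose iteration order is not pure across passes."""
    def __init__(self, events):
        self.events = events

    def __iter__(self):
        order = list(self.events)
        random.shuffle(order)  # fresh RNG draw on every pass
        return iter(order)

events = [{"event_no": i, "energy": float(i)} for i in range(5)]
loader = ReshufflingLoader(events)

pass1 = [e["event_no"] for e in loader]  # order the model saw
pass2 = [e["event_no"] for e in loader]  # order the attributes were collected in
# Same events, (very likely) different order -- and no error is raised,
# so the attribute columns are silently misaligned with the predictions.
```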

4. Pulse-level / multi-dim attributes are awkward

The current code repeats event-level attributes by batch.n_pulses after the fact, and multi-dimensional attributes (e.g. direction = (x, y, z)) silently get flattened by np.asarray(values)[:, np.newaxis] in a way that doesn't produce sensible column names.
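A quick NumPy check shows why the [:, np.newaxis] step cannot produce sensible columns for a vector-valued attribute:

```python
import numpy as np

# Two events, each with a 3-vector direction attribute.
values = [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])]
arr = np.asarray(values)      # shape (2, 3): one row per event
column = arr[:, np.newaxis]   # shape (2, 1, 3): maps to no flat named column
```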

Proposed fix

Gather additional_attributes inside predict_step, in the same forward pass as the predictions. The trainer already iterates the loader once; emitting the requested batch fields alongside the model output is essentially free, and alignment is guaranteed by construction (predictions and attributes for batch i come out of the same batch object).
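A minimal sketch of the idea, with hypothetical names (ModelSketch and its attributes are illustrative, not the actual graphnet API):

```python
import numpy as np

class ModelSketch:
    """Gathers requested batch fields in the same pass as the predictions."""

    def __init__(self, tasks, additional_attributes):
        self._tasks = tasks
        self._additional_attributes = additional_attributes

    def predict_step(self, batch):
        task_outputs = [task(batch) for task in self._tasks]
        # Attributes come from the *same* batch object as the predictions,
        # so alignment holds by construction -- no second pass needed.
        attribute_arrays = [
            np.asarray(getattr(batch, name))
            for name in self._additional_attributes
        ]
        return [*task_outputs, *attribute_arrays]
```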

Concretely:

  • predict_step returns [*task_outputs, *attribute_arrays].
  • predict accepts additional_attributes=... and returns List[Union[Tensor, np.ndarray]] split on the same boundary.
  • predict_as_dataframe becomes a thin wrapper that just stitches the columns together, drops the SequentialSampler check, and supports multi-dim attributes by expanding to <name>_0, <name>_1, ... columns.
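The column-stitching step could look something like this (stitch_columns is a hypothetical helper sketching the intended behavior, including the <name>_0, <name>_1, ... expansion):

```python
import numpy as np
import pandas as pd

def stitch_columns(prediction_columns, predictions, attributes):
    """Build the result DataFrame from aligned predictions and attributes.

    predictions: (N, P) array; attributes: dict of name -> (N,) or (N, D) array.
    """
    data = {col: predictions[:, i] for i, col in enumerate(prediction_columns)}
    for name, values in attributes.items():
        values = np.asarray(values)
        if values.ndim == 1:
            data[name] = values
        else:
            # Expand multi-dimensional attributes into indexed columns.
            for d in range(values.shape[1]):
                data[f"{name}_{d}"] = values[:, d]
    return pd.DataFrame(data)
```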

This:

  • Removes the second pass (≈2× speedup on I/O-bound inference).
  • Makes shuffled / weighted / distributed samplers safe to use.
  • Eliminates the silent-misalignment failure mode.

A draft PR implementing this is up at #879.
