
predict_as_dataframe iterates the dataloader twice, causing wasted I/O and silent misalignment risk #880

@sevmag

Description


Problem

EasySyntax.predict_as_dataframe currently makes two full passes over the user's DataLoader:

  1. Trainer.predict(self, dataloader) — produces the model outputs.
  2. A subsequent for batch in dataloader: ... loop in predict_as_dataframe itself — pulls additional_attributes (e.g. event_no, energy, azimuth) out of each batch and concatenates them.

The two passes are then assumed to align row-for-row by index. This has several real downsides:
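To make the shape of the problem concrete, here is a minimal sketch of the current two-pass pattern (function and variable names are illustrative, not the actual graphnet implementation):

```python
import numpy as np

def predict_as_dataframe_current(trainer_predict, dataloader, additional_attributes):
    # Pass 1: model forward over the whole loader.
    predictions = trainer_predict(dataloader)

    # Pass 2: re-iterate the same loader just to collect a few columns.
    attributes = {name: [] for name in additional_attributes}
    for batch in dataloader:  # re-reads every event from storage
        for name in additional_attributes:
            attributes[name].extend(np.asarray(getattr(batch, name)).tolist())

    # Rows of `predictions` and `attributes` are assumed to align by position.
    return predictions, attributes
```

Everything after the first pass is pure overhead: the attributes were already present in each batch the model consumed.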

1. Wasted I/O and time

The second loop re-reads every event from the underlying SQLite/Parquet/HDF5 file just to grab a handful of scalar columns. For large inference jobs this can roughly double the wall-clock time of predict_as_dataframe, with all of the extra cost going to disk reads and collate_fn work that the prediction pass has already paid for.

2. Sampler restriction is a workaround, not a fix

Because alignment depends on both passes producing the same batch order, the current code raises if the loader's sampler is not a SequentialSampler:

DataLoader has a sampler that is not SequentialSampler, indicating that shuffling is enabled. (...) Either call this method a dataloader which doesn't resample batches; or do not request additional_attributes.

This forces users to construct a second, non-shuffled DataLoader for evaluation — even when their existing one would work fine semantically — and silently rules out anything with stochastic sampling, weighted sampling, or distributed samplers.

3. Silent misalignment when the loader is non-deterministic in practice

The SequentialSampler check is necessary but not sufficient. Anything that makes the second pass differ from the first — e.g. a collate_fn that filters events, multi-worker loading where workers were re-seeded, an underlying dataset whose __getitem__ is not pure (re-shuffled per epoch, augmentations with a fresh RNG) — produces a DataFrame whose additional_attributes columns are misaligned with the predictions, with no error raised. This is a particularly nasty failure mode because the resulting DataFrame still looks valid and downstream analysis just produces wrong physics.
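A toy demonstration of this failure mode (ReshufflingLoader is a made-up stand-in, not a graphnet class): the two passes contain the same events, nothing raises, yet the row order differs.

```python
import random

class ReshufflingLoader:
    """Toy stand-in for a loader whose iteration order is not pure across passes."""
    def __init__(self, events):
        self.events = events

    def __iter__(self):
        order = list(self.events)
        random.shuffle(order)  # fresh RNG draw on every pass
        return iter(order)

events = [{"event_no": i, "energy": float(i)} for i in range(5)]
loader = ReshufflingLoader(events)

pass1 = [e["event_no"] for e in loader]  # order the model saw
pass2 = [e["event_no"] for e in loader]  # order the attributes were collected in
# Same events, (very likely) different order -- and no error is raised,
# so the attribute columns are silently misaligned with the predictions.
```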

4. Pulse-level / multi-dim attributes are awkward

The current code repeats event-level attributes by batch.n_pulses after the fact, and multi-dimensional attributes (e.g. direction = (x, y, z)) silently get flattened by np.asarray(values)[:, np.newaxis] in a way that doesn't produce sensible column names.
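A quick NumPy check shows why the [:, np.newaxis] step cannot produce sensible columns for a vector-valued attribute:

```python
import numpy as np

# Two events, each with a 3-vector direction attribute.
values = [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])]
arr = np.asarray(values)      # shape (2, 3): one row per event
column = arr[:, np.newaxis]   # shape (2, 1, 3): maps to no flat named column
```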

Proposed fix

Gather additional_attributes inside predict_step, in the same forward pass as the predictions. The trainer already iterates the loader once; emitting the requested batch fields alongside the model output is essentially free, and alignment is guaranteed by construction (predictions and attributes for batch i come out of the same batch object).
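A minimal sketch of the idea, with hypothetical names (ModelSketch and its attributes are illustrative, not the actual graphnet API):

```python
import numpy as np

class ModelSketch:
    """Gathers requested batch fields in the same pass as the predictions."""

    def __init__(self, tasks, additional_attributes):
        self._tasks = tasks
        self._additional_attributes = additional_attributes

    def predict_step(self, batch):
        task_outputs = [task(batch) for task in self._tasks]
        # Attributes come from the *same* batch object as the predictions,
        # so alignment holds by construction -- no second pass needed.
        attribute_arrays = [
            np.asarray(getattr(batch, name))
            for name in self._additional_attributes
        ]
        return [*task_outputs, *attribute_arrays]
```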

Concretely:

  • predict_step returns [*task_outputs, *attribute_arrays].
  • predict accepts additional_attributes=... and returns List[Union[Tensor, np.ndarray]] split on the same boundary.
  • predict_as_dataframe becomes a thin wrapper that just stitches the columns together, drops the SequentialSampler check, and supports multi-dim attributes by expanding to <name>_0, <name>_1, ... columns.
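The column-stitching step could look something like this (stitch_columns is a hypothetical helper sketching the intended behavior, including the <name>_0, <name>_1, ... expansion):

```python
import numpy as np
import pandas as pd

def stitch_columns(prediction_columns, predictions, attributes):
    """Build the result DataFrame from aligned predictions and attributes.

    predictions: (N, P) array; attributes: dict of name -> (N,) or (N, D) array.
    """
    data = {col: predictions[:, i] for i, col in enumerate(prediction_columns)}
    for name, values in attributes.items():
        values = np.asarray(values)
        if values.ndim == 1:
            data[name] = values
        else:
            # Expand multi-dimensional attributes into indexed columns.
            for d in range(values.shape[1]):
                data[f"{name}_{d}"] = values[:, d]
    return pd.DataFrame(data)
```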

This:

  • Removes the second pass (≈2× speedup on I/O-bound inference).
  • Makes shuffled / weighted / distributed samplers safe to use.
  • Eliminates the silent-misalignment failure mode.

A draft PR implementing this is up at #879.
