Problem
`EasySyntax.predict_as_dataframe` currently makes two full passes over the user's `DataLoader`:
- `Trainer.predict(self, dataloader)` — produces the model outputs.
- A subsequent `for batch in dataloader: ...` loop in `predict_as_dataframe` itself — pulls `additional_attributes` (e.g. `event_no`, `energy`, `azimuth`) out of each batch and concatenates them.
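Schematically, the two-pass pattern looks like the sketch below (dict batches and a callable model as stand-ins; function and field names here are illustrative, not the library's actual code):

```python
import numpy as np

def predict_as_dataframe_current(model, dataloader, additional_attributes):
    """Illustrative stand-in for the current two-pass flow."""
    # Pass 1: the trainer iterates the loader to produce predictions.
    predictions = np.concatenate([model(batch) for batch in dataloader])

    # Pass 2: re-iterate the *same* loader just to pull out attributes.
    # Every event is read from disk and collated a second time.
    attrs = {name: [] for name in additional_attributes}
    for batch in dataloader:
        for name in additional_attributes:
            attrs[name].append(batch[name])
    attrs = {k: np.concatenate(v) for k, v in attrs.items()}

    # Alignment between predictions and attrs is assumed, never checked.
    return predictions, attrs
```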
The two passes are then assumed to align row-for-row by index. This has several real downsides:
1. Wasted I/O and time
The second loop re-reads every event from the underlying SQLite/Parquet/HDF5 file just to grab a handful of scalar columns. For large inference jobs this can roughly double the wall-clock time of `predict_as_dataframe`, with all of the extra cost spent on disk reads and `collate_fn` work that the model already paid for.
2. Sampler restriction is a workaround, not a fix
Because alignment depends on both passes producing the same batch order, the current code raises if the loader's sampler is not a `SequentialSampler`:

> DataLoader has a sampler that is not SequentialSampler, indicating that shuffling is enabled. (...) Either call this method a dataloader which doesn't resample batches; or do not request additional_attributes.

This forces users to construct a second, non-shuffled `DataLoader` for evaluation — even when their existing one would work fine semantically — and silently rules out anything with stochastic sampling, weighted sampling, or distributed samplers.
3. Silent misalignment when the loader is non-deterministic in practice
The `SequentialSampler` check is necessary but not sufficient. Anything that makes the second pass differ from the first — e.g. a `collate_fn` that filters events, multi-worker loading where workers were re-seeded, an underlying dataset whose `__getitem__` is not pure (re-shuffled per epoch, augmentations with a fresh RNG) — produces a DataFrame whose `additional_attributes` columns are misaligned with the predictions, with no error raised. This is a particularly nasty failure mode because the resulting DataFrame still looks valid and downstream analysis just produces wrong physics.
4. Pulse-level / multi-dim attributes are awkward
The current code repeats event-level attributes by `batch.n_pulses` after the fact, and multi-dimensional attributes (e.g. `direction = (x, y, z)`) silently get flattened by `np.asarray(values)[:, np.newaxis]` in a way that doesn't produce sensible column names.
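A toy illustration of the shape problem (hypothetical arrays, not the library's exact code path): once a 3-vector attribute has one row per event, appending `[:, np.newaxis]` makes it 3-D, which no longer maps onto DataFrame columns at all:

```python
import numpy as np

# Per-batch values for a 3-vector attribute, e.g. direction = (x, y, z).
values = [
    np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]]),  # batch 0: two events
    np.array([[0.0, 1.0, 0.0]]),  # batch 1: one event
]

stacked = np.concatenate(values)             # shape (3, 3): one row per event
column = np.asarray(stacked)[:, np.newaxis]  # shape (3, 1, 3): no longer 2-D

print(column.shape)  # (3, 1, 3)
```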
Proposed fix
Gather `additional_attributes` inside `predict_step`, in the same forward pass as the predictions. The trainer already iterates the loader once; emitting the requested batch fields alongside the model output is essentially free, and alignment is guaranteed by construction (predictions and attributes for batch `i` come out of the same `batch` object).
Concretely:
- `predict_step` returns `[*task_outputs, *attribute_arrays]`.
- `predict` accepts `additional_attributes=...` and returns `List[Union[Tensor, np.ndarray]]` split on the same boundary.
- `predict_as_dataframe` becomes a thin wrapper that just stitches the columns together, drops the `SequentialSampler` check, and supports multi-dim attributes by expanding to `<name>_0`, `<name>_1`, ... columns.
This:
- Removes the second pass (≈2× speedup on I/O-bound inference).
- Makes shuffled / weighted / distributed samplers safe to use.
- Eliminates the silent-misalignment failure mode.
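The single-pass flow can be sketched roughly as follows (numpy arrays and dict batches as stand-ins for the torch tensors and `Data` batches in the real code; `task_names` and the split on `n_tasks` are illustrative assumptions):

```python
import numpy as np
import pandas as pd

def predict_step(model, batch, additional_attributes):
    """One pass: task outputs and requested batch fields from the SAME batch."""
    task_outputs = [model(batch)]
    attribute_arrays = [batch[name] for name in additional_attributes]
    return [*task_outputs, *attribute_arrays]

def predict_as_dataframe(model, dataloader, additional_attributes, task_names):
    per_batch = [predict_step(model, b, additional_attributes) for b in dataloader]
    # Concatenate each position across batches; split on the task boundary.
    columns = [np.concatenate(parts) for parts in zip(*per_batch)]
    n_tasks = len(task_names)
    outputs, attrs = columns[:n_tasks], columns[n_tasks:]

    df = pd.DataFrame(dict(zip(task_names, outputs)))
    for name, col in zip(additional_attributes, attrs):
        if col.ndim == 1:
            df[name] = col
        else:
            # Multi-dim attribute: expand to <name>_0, <name>_1, ... columns.
            for i in range(col.shape[1]):
                df[f"{name}_{i}"] = col[:, i]
    return df
```

Because predictions and attributes come out of the same batch object, alignment holds for any sampler, with no second pass over the loader.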
A draft PR implementing this is up at #879.