
Cannot instantiate an infinite EpochDataset from current config access #884

Closed · kothasuhas opened this issue Feb 10, 2025 · 5 comments

Labels: bug (Something isn't working)

Comments

@kothasuhas (Contributor) commented Feb 10, 2025

I want to instantiate an infinite EpochDataset, which is done by passing max_epochs=None. Looking at LMDatasetConfig, I see that train_set receives an epochs argument that determines the number of epochs. However, if this epochs argument is None, the EpochDataset is never instantiated, leaving me with the standard single-pass dataset:

```python
def train_set(
    self,
    seq_len: int,
    monitors: Union[bool, List[MetricsMonitor]] = True,
    *,
    key: Optional[PRNGKeyArray] = None,
    epochs: Optional[int] = None,
) -> AsyncDataset[np.ndarray]:
    ds: AsyncDataset[np.ndarray] | None = self.token_seq_dataset("train", seq_len, monitors)
    # add epoch flag here.
    if ds is None:
        raise ValueError("No training set!")
    if epochs:
        logger.info("Wrapping dataset in epoch dataset")
        ds = EpochDataset(ds, max_epochs=epochs)
```

I think the interface needs to be changed. I can see two easy fixes:

  • By default, infinitely epoch any finite dataset
  • Have a separate flag for no epoching vs. infinite epoching (a rough sketch of this option follows below)
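
To make the second option concrete, here is a rough, hedged sketch. Only EpochDataset(ds, max_epochs=...) and the meaning of max_epochs=None come from the snippet above; the helper name, its signature, and the import path are illustrative, not Levanter's actual API.

```python
# Hedged sketch of the second option: distinguish "no epoching" from
# "infinite epoching" instead of overloading None for both.
from typing import Literal, Union

from levanter.data.dataset import AsyncDataset, EpochDataset  # import path assumed


def maybe_epoch(ds: AsyncDataset, epochs: Union[int, Literal["infinite"], None] = None) -> AsyncDataset:
    if epochs is None:
        return ds  # no epoching: a single pass over the data
    if epochs == "infinite":
        return EpochDataset(ds, max_epochs=None)  # loop indefinitely
    return EpochDataset(ds, max_epochs=epochs)  # loop a fixed number of epochs
```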

cc @Helw150

@kothasuhas added the bug (Something isn't working) label on Feb 10, 2025
@dlwh (Member) commented Feb 11, 2025 via email

@Helw150 (Collaborator) commented Feb 11, 2025

Yeah, creating a mixture dataset with a data mix weight of 1 would support this!
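
For concreteness, a minimal sketch of that idea in Python. LMMixtureDatasetConfig is the class mentioned later in this thread, but the import path and the field names used below (configs, train_weights, train_urls) are assumptions rather than a verified API, so check the class definitions before relying on them.

```python
# Illustrative only: wrap a single dataset in a mixture config with weight 1.0
# so the mixture's default looping behavior applies. Field names and import
# path are assumptions about Levanter's text data config classes.
from levanter.data.text import LMDatasetSourceConfig, LMMixtureDatasetConfig  # path assumed

mixture = LMMixtureDatasetConfig(
    configs={"my_data": LMDatasetSourceConfig(train_urls=["gs://my-bucket/train-*.jsonl.gz"])},  # hypothetical URLs
    train_weights={"my_data": 1.0},  # single component with weight 1
)
```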

@kothasuhas (Contributor, Author)

I see; my use case was epoching one dataset many times while making a single pass over the other dataset. If the mixture dataset loops indefinitely anyway, I don't need this as long as I crop my repetition dataset to the correct number of sequences (which I'm currently doing manually by modifying levanter).

@Helw150 (Collaborator) commented Feb 12, 2025

@kothasuhas If I understand correctly, you should be able to set up an experiment that supports that without modifying Levanter now!

You should be able to just call dataset.slice_dataset(num_sequences), which will return a slice containing only a fixed number of sequences:

https://github.com/stanford-crfm/levanter/blob/main/src/levanter/data/dataset.py#L381-L387
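
For reference, a minimal sketch of that suggestion, assuming ds is the AsyncDataset for the dataset being repeated. The single-argument call mirrors the comment above; check slice_dataset's exact parameters in the linked dataset.py.

```python
# Minimal sketch: crop the repetition dataset to a fixed number of sequences.
# `ds` is assumed to be an AsyncDataset built from an LMDatasetConfig; the
# single-argument form mirrors the suggestion above and may need adjusting to
# slice_dataset's actual parameters.
num_sequences = 10_000  # illustrative crop size
cropped_ds = ds.slice_dataset(num_sequences)
# cropped_ds can then be mixed with the single-pass dataset and looped as usual
```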

@kothasuhas (Contributor, Author)

I'm currently editing a local copy of levanter to do the slicing for me in LMMixtureDatasetConfig. Regardless, it works, and I imagine there's little need for infinite epoching if LMMixtureDatasetConfig loops infinitely by default. Thanks for the clarifications!
