Can not instantiate an infinite EpochDataset from current config access #884
Comments
As a lazy workaround, mixture datasets always loop so a singleton mixture
will epoch forever.
…On Mon, Feb 10, 2025 at 2:57 PM Suhas Kotha ***@***.***> wrote:
I want to instantiate an infinite EpochDataset. This is done by passing
max_epochs=None. When I look at LMDatasetConfig, I notice that
train_set receives an epochs argument determining the number of epochs.
However, if this epochs argument is None, then the EpochDataset is never
instantiated, limiting me to the standard dataset.
https://github.com/stanford-crfm/levanter/blob/b6dc7d41c537363ab3206c5f7840132a81888710/src/levanter/data/text.py#L1045-L1062
I think the interface needs to be changed. I can see two easy fixes
- By default, infinite epoch any finite dataset
- Have a separate flag for no epoching vs infinite epoching
cc @Helw150 <https://github.com/Helw150>
|
Yeah, creating a mixture dataset with data mix weights 1 would support this! |
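To illustrate why a singleton mixture with weight 1 behaves like infinite epoching, here is a toy sketch in plain Python. It is not Levanter's mixture implementation; `looping_mixture` and its parameters are hypothetical names used only to show the idea that each source wraps around when it runs out.

```python
import itertools
import random
from typing import Dict, Iterator, Sequence


def looping_mixture(
    datasets: Dict[str, Sequence],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator:
    """Toy mixture (hypothetical): sample a source by weight, wrap around at its end."""
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[name] for name in names]
    positions = {name: 0 for name in names}
    while True:
        # Pick which source the next item comes from.
        name = rng.choices(names, weights=probs, k=1)[0]
        data = datasets[name]
        # The modulo makes every source loop instead of ending.
        yield data[positions[name] % len(data)]
        positions[name] += 1


# A singleton mixture with weight 1 simply cycles through its one dataset.
first_seven = list(itertools.islice(looping_mixture({"a": [1, 2, 3]}, {"a": 1.0}), 7))
# first_seven == [1, 2, 3, 1, 2, 3, 1]
```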
I see, my use case was epoching one dataset many times while going one pass over the other dataset. If the mixture dataset gets arbitrarily looped, I don't need this as long as I crop my repetition dataset to the correct number of sequences (which I'm manually doing by modifying levanter right now). |
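One way to estimate the "correct number of sequences" for the repeated dataset is sketched below. It assumes the mixture draws from each source in proportion to its weight, so the result only holds in expectation; `cropped_length` and its arguments are hypothetical and not part of Levanter.

```python
def cropped_length(total_sequences: int, weight: float, epochs: int) -> int:
    """Hypothetical helper: crop size so a looped source is seen about `epochs` times.

    total_sequences: sequences consumed over the whole training run
    weight:          mixture weight of the repeated source (fraction of draws)
    epochs:          desired number of passes over the cropped source
    """
    if not 0.0 < weight <= 1.0:
        raise ValueError("weight must be in (0, 1]")
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    # Expected number of draws from this source across the run.
    expected_draws = total_sequences * weight
    return int(expected_draws // epochs)


# 1000 sequences total, 25% from the repeated source, 5 passes wanted.
crop = cropped_length(1000, 0.25, 5)  # 250 expected draws / 5 epochs = 50
```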
@kothasuhas If I understand correctly, you should be able to set up an experiment that supports that without modifying Levanter now! You should be able to just call dataset.slice_dataset(num_sequences), which returns a slice containing only a fixed number of sequences: https://github.com/stanford-crfm/levanter/blob/main/src%2Flevanter%2Fdata%2Fdataset.py#L381-L387 |
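As a rough picture of what slicing before looping does, here is a self-contained toy in plain Python. `SlicedView` is a hypothetical stand-in, not Levanter's `slice_dataset`; it only shows that a cropped view combined with wrap-around indexing repeats the first few sequences.

```python
from typing import Sequence


class SlicedView:
    """Hypothetical read-only view over the first `end` items of a sequence."""

    def __init__(self, data: Sequence, end: int):
        self._data = data
        # Never expose more items than the underlying data has.
        self._end = min(end, len(data))

    def __len__(self) -> int:
        return self._end

    def __getitem__(self, index: int):
        if not 0 <= index < self._end:
            raise IndexError(index)
        return self._data[index]


# Crop a 10-item dataset to 4 items, then read 6 items with wrap-around.
view = SlicedView(list(range(10)), 4)
looped = [view[i % len(view)] for i in range(6)]
# looped == [0, 1, 2, 3, 0, 1]
```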
I'm currently editing a local copy of levanter to do the slicing for me in LMMixtureDatasetConfig. Regardless, it works, and I imagine there's little use case for infinite epoching if the LMMixtureDatasetConfig does infinite looping by default. Thanks for the clarifications! |
I want to instantiate an infinite EpochDataset. This is done by passing max_epochs=None. When I look at LMDatasetConfig, I notice that train_set receives an epochs argument determining the number of epochs. However, if this epochs argument is None, then the EpochDataset is never instantiated, limiting me to the standard dataset.
levanter/src/levanter/data/text.py
Lines 1045 to 1062 in b6dc7d4
I think the interface needs to be changed. I can see two easy fixes
- By default, infinite epoch any finite dataset
- Have a separate flag for no epoching vs infinite epoching
cc @Helw150
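The second fix proposed in this issue (a separate flag for no epoching vs infinite epoching) could look roughly like the sketch below. It is plain Python, not Levanter code: `NO_EPOCHING` and `epoch_stream` are hypothetical names, and the point is only that a distinct sentinel lets None keep meaning "loop forever".

```python
import itertools
from typing import Iterator, Sequence

# Hypothetical sentinel: "do not epoch at all", distinct from None ("epoch forever").
NO_EPOCHING = object()


def epoch_stream(data: Sequence, max_epochs=NO_EPOCHING) -> Iterator:
    """Yield items from `data` according to the epoching mode.

    max_epochs=NO_EPOCHING -> one plain pass, no epoch wrapper
    max_epochs=None        -> loop forever
    max_epochs=k (int)     -> exactly k passes
    """
    if max_epochs is NO_EPOCHING:
        return iter(data)
    if max_epochs is None:
        return itertools.cycle(data)
    return itertools.chain.from_iterable(itertools.repeat(data, max_epochs))


single_pass = list(epoch_stream([1, 2, 3]))
two_epochs = list(epoch_stream([1, 2], max_epochs=2))
# The infinite stream must be truncated by the caller, e.g. by a step budget.
infinite_prefix = list(itertools.islice(epoch_stream([1, 2, 3], max_epochs=None), 7))
```

With this shape, omitting the argument and passing None are no longer conflated, which is the ambiguity the issue describes.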