Supporting varied mixtures over training #868
Conversation
awesome! thanks for knocking this out so quickly
src/levanter/data/mixture.py (Outdated)

```diff
 Args:
     datasets: A dict of datasets, where the key is the name of the dataset and the value is the dataset itself
-    weights: weights for each dataset
+    weights: Weights for each dataset. This can be provided in a list of stages, where each stage is a tuple of (start_index, weights).
```
Suggested change:

```diff
-    weights: Weights for each dataset. This can be provided in a list of stages, where each stage is a tuple of (start_index, weights).
+    weights: Weights for each dataset. This can be provided in a list of stages, where each stage is a tuple of (start_step, weights).
```
Changed. Also added a clarification that this corresponds to the sequence index at which you want to change the distribution, not the batch index. This method doesn't get to know the batch size, and I think that generality is good (eventually, if you want batch-size curricula like most real LMs).
Actually, because of this, I'm thinking it's better to keep it as `start_seq_index`?
Thanks for the review! The only final request is to name the variable `start_seq_index` instead of `start_step`, since "step" is often conflated with batch indices. And for maximum generality, I wanted this method to work without knowing the batch size. If this is good, I'll merge.
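To make the step/sequence distinction concrete, a small sketch (the helper name `step_to_seq_index` is hypothetical, not part of the PR):

```python
# Hypothetical helper, not part of the PR: convert an optimizer/batch step
# into the sequence index expected by the staged-weights API.
def step_to_seq_index(start_step: int, train_batch_size: int) -> int:
    return start_step * train_batch_size

# Switching the mixture at step 1000 with batch size 1024 corresponds to
# start_seq_index = 1_024_000.
assert step_to_seq_index(1000, 1024) == 1_024_000
```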
Awesome, thanks!
Description
Currently, the LM mixture dataset can only handle a static mixture over the course of training. This PR enables varying the mixture over datasets as training progresses: the user can now specify a list of stages and the sequence index at which each should start.
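A minimal usage sketch, assuming the interface described above (the dataset values are placeholders, and any other required constructor arguments are omitted):

```python
from levanter.data.mixture import MixtureDataset

wiki_ds = ...  # placeholder for a real token dataset
c4_ds = ...    # placeholder for a real token dataset

# Staged weights: each entry is (start_seq_index, weights).
mixture = MixtureDataset(
    datasets={"wikipedia": wiki_ds, "c4": c4_ds},
    weights=[
        (0, {"wikipedia": 0.7, "c4": 0.3}),        # Wikipedia-heavy at first
        (500_000, {"wikipedia": 0.1, "c4": 0.9}),  # shift toward C4 after 500k sequences
    ],
)
```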
Internally, we map each training block to its stage, which defines its mixing weights. To efficiently translate a data point's index within a block to an index into the corresponding source dataset, we precompute prefix sums that track how many data points are consumed by previous stages.
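A rough sketch of that prefix-sum bookkeeping (illustrative only; all names here are assumptions, not the PR's actual implementation):

```python
# Illustrative sketch of the prefix-sum idea described above.
def precompute_stage_offsets(stage_num_blocks, stage_counts_per_block):
    """For each stage, record how many examples of each dataset were consumed
    by all *previous* stages.

    stage_num_blocks[i]: number of blocks spanned by stage i
    stage_counts_per_block[i][name]: examples drawn from `name` per block in stage i
    """
    offsets = []
    consumed: dict[str, int] = {}
    for num_blocks, counts in zip(stage_num_blocks, stage_counts_per_block):
        offsets.append(dict(consumed))  # prefix sum up to (not including) this stage
        for name, per_block in counts.items():
            consumed[name] = consumed.get(name, 0) + per_block * num_blocks
    return offsets
```

A draw from dataset `name` in stage `i` then maps to a global index into that source dataset by adding its within-stage count to `offsets[i][name]`.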
Fixes Issues
https://github.com/stanford-crfm/marin/issues/81
Unit test coverage
There are new unit tests in test_varying_mixture.py to ensure that the varying mixture behaves as expected.
Known breaking changes/behaviors
The design preserves the traditional usage of the MixtureDataset class. However, some of the private quantities are different (e.g., the expected counts per block now depend on the block and are no longer a member variable). To my knowledge, these variables are not accessed outside of tests.
Additional context
I have some changes I want to make to Marin to enable usage of this new functionality, though those updates are modular and can be separate PRs. I have spot-checked that training proceeds as expected with these changes. This is my first PR, so feedback is appreciated :))