For a visual introduction, please see these introduction slides.
People consistently debate whether language models display "emergent abilities", but the phenomenon is difficult to define in the first place. One simple criterion is whether a model's next-token distribution changes predictably as we scale up the model's parameter count while keeping everything else constant. Interestingly, this lends itself to a clean emergence test: can one predict a larger model's activations from smaller models trained with the same algorithm and data distribution?
In this challenge, you will put this definition to the test! Thanks to EleutherAI's heroic effort with the Pythia model series, we have access to language models trained at various parameter counts using the same training and data recipe. We formulate our concrete problem as follows:
"Given the next-token probability distribution for a prefix for Pythia-{70m, 160m, 410m, 1b, 1.4b, 2.8b}, can you predict the most likely token given by Pythia-12b?"
This repo provides such datasets for 100 prefixes from the GLUE RTE dataset and 100 prefixes from the Wikipedia dataset, as the pickle files `glue-rte.pkl` and `wikipedia.pkl`. Each pickle file contains a tuple with the following information:
- `model_names`: names of the models we have logits for (our files have 7)
- `sentences`: all the prefixes we consider (our files have 100)
- `probs_per_sentence`: a `(NUM_SENTENCES x NUM_MODELS x NUM_TOKENS)` array, where `probs_per_sentence[i][j][k]` holds, for prefix `sentences[i]`, the probability `model_names[j]` assigns to token `k` (our arrays are 100 x 7 x 50277)
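For instance, here is a minimal loading sketch, assuming the tuple unpacks in the order listed above (see `challenge.py` for the canonical loading code):

```python
import pickle

import numpy as np

# Minimal loading sketch; assumes the tuple order described above.
with open("glue-rte.pkl", "rb") as f:
    model_names, sentences, probs_per_sentence = pickle.load(f)

probs = np.asarray(probs_per_sentence)
print(len(model_names))   # 7 models
print(len(sentences))     # 100 prefixes
print(probs.shape)        # (100, 7, 50277)

# Probability that model j assigns to token k for prefix i:
i, j, k = 0, 0, 42
print(probs[i, j, k])
```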
For the specific construction process and format, refer to how the datasets are generated and saved in `preprocess_train.py` and how they are loaded in `challenge.py`.
If you think you have a function that can predict the next-token distribution, put your idea to the test by writing a `strategy` in `challenge.py`! If you're happy with your method's train performance, email the author at [email protected] for a held-out test distribution. After running this challenge at my research group's weekly meeting, I have seen some initial strategies, and the current "SOTA" strategies happen to be quite heuristic. If you perform better than these strategies, I have some prizes available; if you perform particularly strongly, we can get a research paper out of this :)) Note that you shouldn't feel limited to this scope; I'd be excited to see variations that expand the training data or alter the prediction objective in a compelling way.
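As a hypothetical illustration (the actual strategy interface is whatever `challenge.py` defines), a naive baseline might simply trust the largest of the small models:

```python
import numpy as np

def largest_model_argmax(probs_for_sentence: np.ndarray) -> int:
    """Hypothetical baseline: return the token id that the largest of the
    small models (assumed here to be the last row) rates most likely.

    probs_for_sentence: (NUM_MODELS, NUM_TOKENS) array of next-token
    probabilities for one prefix, ordered from smallest to largest model.
    """
    return int(np.argmax(probs_for_sentence[-1]))
```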
This is meant to be a hard (and likely impossible) challenge, so don't get discouraged if you make little progress. Hopefully, this fun challenge builds some intuition for how language models behave across scale. Best of luck!
Running `challenge.py` requires only a few standard packages such as `torch`, Hugging Face's `transformers`, etc. All code in this repository was executed in a conda environment created with:

```bash
conda env create -f environment.yml
```
[Optional] This repo provides two next-token distribution datasets in pickle files. If you want to generate your own, modify `preprocess_train.py` and run:

```bash
python preprocess_train.py
```
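Under the hood, producing such a dataset amounts to running each prefix through each model and saving the softmaxed final-position logits. Here is a rough sketch with the standard `transformers` API; the model name and prefix are illustrative, and `preprocess_train.py` remains the source of truth:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough sketch, not the repo's exact pipeline: compute one model's
# next-token distribution for one prefix.
model_name = "EleutherAI/pythia-70m"        # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prefix = "The Eiffel Tower is located in"   # illustrative prefix
inputs = tokenizer(prefix, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # (1, seq_len, vocab)

# Softmax over the final position yields the next-token distribution.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
print(next_token_probs.shape)  # model vocab size (may be padded vs. the tokenizer)
```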
To test a strategy, add it to `challenge.py`, specify the desired pickle file in `DATASET`, and run:

```bash
python challenge.py
```