
Commit e2a33f9

Adding all files
1 parent 3528c1e commit e2a33f9


44 files changed: +2011 -0 lines changed

README.md

+66
@@ -0,0 +1,66 @@
# Document-level membership inference for Large Language Models

Given black-box access to a pretrained large language model, can we predict whether a document has been part of its training dataset?

This repo contains the source code to generate the results as published in the paper ["Did the Neurons Read your Book? Document-level Membership Inference for Large Language Models"](https://arxiv.org/pdf/2310.15007).

## 1. Install environment

Follow these steps to install the correct Python environment:
- `conda create --name doc_membership python=3.9`
- `conda activate doc_membership`
- `pip install -r requirements.txt`

## 2. Model setup

We now download the target model we consider. Use `scripts/download_model.sh` to do so for the desired model on Hugging Face. In the paper we used [OpenLLaMA](https://huggingface.co/openlm-research/open_llama_7b).

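For reference, a minimal sketch of loading the target model and tokenizer with the `transformers` library (the model name comes from the OpenLLaMA link above; the loading options are illustrative assumptions, not the repository's download script):

```python
# Sketch: load the target model and tokenizer from Hugging Face
# (assumes `transformers`, `torch` and `accelerate` are installed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openlm-research/open_llama_7b"  # target model used in the paper

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision to reduce GPU memory
    device_map="auto",          # place layers automatically across available devices
)
model.eval()
```
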
## 3. Dataset setup

First and foremost, textual data should be collected and split into 'member' and 'non-member' documents. In this project both books from Project Gutenberg and academic papers from ArXiv have been considered.

To reproduce the data collection we rely on the data download and preprocessing scripts provided by RedPajama (their first version, now an older branch [here](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1)). More specifically, we applied the following strategy for both data sources:
- Books from Project Gutenberg.
  - Members: we downloaded PG-19 from Hugging Face, for instance from [here](https://huggingface.co/datasets/deepmind/pg19).
  - Non-members: we scraped books from Project Gutenberg using [this code](https://github.com/kpully/gutenberg_scraper). You can find the scripts we utilized to do so in `data/raw_gutenberg/`. Note that the book index to start from was manually searched on [Project Gutenberg](https://www.gutenberg.org/).
- Academic papers from ArXiv.
  - Members: we downloaded all `jsonl` files as provided by the V1 version of RedPajama. For all details see `data/raw_arxiv/`.
  - Non-members: we downloaded all ArXiv papers at a small cost using the resources ArXiv provides [here](https://info.arxiv.org/help/bulk_data_s3.html) and the script to do so [here](https://github.com/togethercomputer/RedPajama-Data/blob/rp_v1/data_prep/arxiv/run_download.py).
  - All preprocessing for ArXiv has been done using [this script](https://github.com/togethercomputer/RedPajama-Data/blob/rp_v1/data_prep/arxiv/run_clean.py).

Next, we also tokenize the data using `python src/tokenize_data.py` or `scripts/tokenize_data.sh`.

Lastly, we create 'chunks' of documents, enabling us to run the entire pipeline multiple times (training on k-1 chunks and evaluating on the held-out chunk, repeating this k times).
For this we use `python src/split_chunks.py -c config/SOME_CONFIG.ini` with the appropriate input arguments.

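To make the chunking idea concrete, here is a small sketch of splitting documents into k chunks and iterating over train/held-out pairs (the helper names are hypothetical illustrations, not the actual code in `src/split_chunks.py`):

```python
# Illustrative sketch of the k-chunk evaluation loop; `split_into_chunks` and
# `train_eval_splits` are hypothetical helpers, not functions from this repository.
import random

def split_into_chunks(documents, n_chunks=5, seed=42):
    """Shuffle the documents and split them into n_chunks roughly equal parts."""
    rng = random.Random(seed)
    docs = list(documents)
    rng.shuffle(docs)
    return [docs[i::n_chunks] for i in range(n_chunks)]

def train_eval_splits(chunks):
    """Yield (train, heldout) pairs: train on k-1 chunks, evaluate on the remaining one."""
    for k, heldout in enumerate(chunks):
        train = [doc for i, chunk in enumerate(chunks) if i != k for doc in chunk]
        yield train, heldout
```
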
## 4. Computing the perplexity for all chunks

We now query the downloaded language model while running through each document, computing for each token its predicted probability and the top probabilities.
For this we use `python src/compute_perplexity.py` with the appropriate input arguments as in `scripts/compute_perplexity.sh`. Using GPUs is recommended for this.
The resulting token-level values are saved in `perplexity_results/`.

At the same time, the general probability for each token and the token frequency in the overall set of documents are computed and saved.

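A rough sketch of the per-token probability computation with a Hugging Face causal LM is shown below; it is an assumption-level illustration, while `src/compute_perplexity.py` additionally extracts the top probabilities and writes the results to `perplexity_results/`:

```python
# Sketch: probability the model assigns to each token of a document
# (assumes `transformers` + `torch`; max_length is an illustrative choice).
import torch

@torch.no_grad()
def token_probabilities(model, tokenizer, text, max_length=2048):
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    input_ids = enc["input_ids"].to(model.device)
    logits = model(input_ids).logits                        # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # position t predicts token t+1
    target_ids = input_ids[0, 1:]
    token_log_probs = log_probs[torch.arange(target_ids.numel()), target_ids]
    return token_log_probs.exp().tolist()                   # per-token predicted probabilities
```
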
## 5. Training and evaluating the meta-classifier for membership prediction

We run this with `python main.py -c config/SOME_CONFIG.ini`, where the exact setup should be specified in the config file (such as the path to the perplexity results, the normalization type, the meta-classifier type, etc.).
The evaluation results are then saved in `classifier_results/`. The folder `./config/` contains all setups used to generate the results in the paper (for one dataset, i.e. books).

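As an illustration of what such a meta-classifier looks like, here is a scikit-learn sketch with a simple, hypothetical feature aggregation; the repository's actual features are controlled by `feat_extraction_type` (e.g. `simple_agg`, `hist_1000`) and the config options above:

```python
# Sketch: meta-classifier on aggregated token-level probabilities (assumes numpy + scikit-learn).
# `simple_agg_features` is an illustrative aggregation, not the repository's feature extraction.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def simple_agg_features(token_probs):
    """Aggregate a document's token probabilities into a fixed-size feature vector."""
    p = np.asarray(token_probs)
    return np.array([p.mean(), p.std(), np.median(p), p.min(), p.max()])

def train_meta_classifier(docs_token_probs, labels):
    """docs_token_probs: per-document token-probability lists; labels: 1 = member, 0 = non-member."""
    X = np.stack([simple_agg_features(p) for p in docs_token_probs])
    y = np.asarray(labels)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=42)):
        clf.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
        print(f"{type(clf).__name__}: AUC = {auc:.3f}")
```
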
## 6. Compute baselines

We also provide the code we used to compute the baselines. For this we use `python src/compute_baselines.py` with the appropriate input arguments as in `scripts/compute_baselines.sh`. Note that the code comes from Shi et al. [here](https://github.com/swj0419/detect-pretrain-code).

For the neighborhood baseline introduced by Mattern et al., we adapt [their code](https://github.com/mireshghallah/neighborhood-curvature-mia) in `src/compute_baselines.py` and `scripts/compute_neighborhood_baselines.sh`. Note that its input requires a pickle file, but this could easily be adapted if needed.

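For instance, the Min-K% Prob score from Shi et al. can be sketched from saved token-level log probabilities as follows (an illustrative re-implementation, not the code in `src/compute_baselines.py`):

```python
# Sketch: Min-K% Prob membership score (Shi et al.); higher scores suggest membership.
import numpy as np

def min_k_percent_prob(token_log_probs, k=20):
    """Average log probability of the k% least likely tokens in the document."""
    lp = np.sort(np.asarray(token_log_probs))   # ascending: least likely tokens first
    n = max(1, int(len(lp) * k / 100))
    return lp[:n].mean()
```
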
## 7. Citation

If you found this code helpful for your research, kindly cite our work:

```
@article{meeus2023did,
  title={Did the neurons read your book? Document-level membership inference for large language models},
  author={Meeus, Matthieu and Jain, Shubham and Rei, Marek and de Montjoye, Yves-Alexandre},
  journal={arXiv preprint arXiv:2310.15007},
  year={2023}
}
```

classifier_results/.gitignore

+5
@@ -0,0 +1,5 @@
# Ignore everything in this directory
*
# Except this file
!.gitignore
+10
@@ -0,0 +1,10 @@
experiment_name='arxiv_128_just_token_freq_hist'
output_dir='./classifier_results/chunks/'
n_chunks=5
path_to_raw_data='./data/final_chunks/arxiv_XX_min_tokens5000_seed42'
path_to_labels='./data/final_chunks/arxiv_XX_labels.pickle'
path_to_perplexity_results='./perplexity_results/perplexity_open_llama_7b_open_llama_7b_arxiv_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
path_to_normalization_dict='./data/final_chunks/token_freq/token_freq_arxiv_XX.pickle'
norm_type='just_norm_val'
feat_extraction_type='hist_1000'
models='logistic_regression,random_forest'

config/books_128_base.ini

+10
@@ -0,0 +1,10 @@
experiment_name='books_128_base'
output_dir='./classifier_results/chunks/'
n_chunks=5
path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
path_to_normalization_dict='./data/final_chunks_date_filtered/general_proba/general_proba_gutenberg_7b_XX_128.pickle'
norm_type='none'
feat_extraction_type='simple_agg'
models='logistic_regression,random_forest'

config/books_128_base_hist.ini

+10
@@ -0,0 +1,10 @@
experiment_name='books_128_base_hist'
output_dir='./classifier_results/chunks/'
n_chunks=5
path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
path_to_normalization_dict='./data/final_chunks_date_filtered/general_proba/general_proba_gutenberg_7b_XX_128.pickle'
norm_type='none'
feat_extraction_type='hist_1000'
models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
experiment_name='books_128_general_proba_diff_max_agg'
output_dir='./classifier_results/chunks/'
n_chunks=5
path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
path_to_normalization_dict='./data/final_chunks_date_filtered/general_proba/general_proba_gutenberg_7b_XX_128.pickle'
norm_type='diff_max_token_proba'
feat_extraction_type='simple_agg'
models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
experiment_name='books_128_general_proba_diff_max_hist'
output_dir='./classifier_results/chunks/'
n_chunks=5
path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
path_to_normalization_dict='./data/final_chunks_date_filtered/general_proba/general_proba_gutenberg_7b_XX_128.pickle'
norm_type='diff_max_token_proba'
feat_extraction_type='hist_1000'
models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
experiment_name='books_128_general_proba_norm_agg'
output_dir='./classifier_results/chunks/'
n_chunks=5
path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
path_to_normalization_dict='./data/final_chunks_date_filtered/general_proba/general_proba_gutenberg_7b_XX_128.pickle'
norm_type='ratio'
feat_extraction_type='simple_agg'
models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
experiment_name='books_128_general_proba_norm_hist'
output_dir='./classifier_results/chunks/'
n_chunks=5
path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
path_to_normalization_dict='./data/final_chunks_date_filtered/general_proba/general_proba_gutenberg_7b_XX_128.pickle'
norm_type='ratio'
feat_extraction_type='hist_1000'
models='logistic_regression,random_forest'
+10
@@ -0,0 +1,10 @@
experiment_name='books_128_just_token_freq_hist'
output_dir='./classifier_results/chunks/'
n_chunks=5
path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
path_to_normalization_dict='./data/final_chunks_date_filtered/token_freq/token_freq_gutenberg_XX.pickle'
norm_type='just_norm_val'
feat_extraction_type='hist_1000'
models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
experiment_name='books_128_token_freq_diff_max_agg'
output_dir='./classifier_results/chunks/'
n_chunks=5
path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
path_to_normalization_dict='./data/final_chunks_date_filtered/token_freq/token_freq_gutenberg_XX.pickle'
norm_type='diff_max_token_proba'
feat_extraction_type='simple_agg'
models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
experiment_name='books_128_token_freq_diff_max_hist'
output_dir='./classifier_results/chunks/'
n_chunks=5
path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
path_to_normalization_dict='./data/final_chunks_date_filtered/token_freq/token_freq_gutenberg_XX.pickle'
norm_type='diff_max_token_proba'
feat_extraction_type='hist_1000'
models='logistic_regression,random_forest'
+10
@@ -0,0 +1,10 @@
experiment_name='books_128_token_freq_norm_agg'
output_dir='./classifier_results/chunks/'
n_chunks=5
path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
path_to_normalization_dict='./data/final_chunks_date_filtered/token_freq/token_freq_gutenberg_XX.pickle'
norm_type='ratio'
feat_extraction_type='simple_agg'
models='logistic_regression,random_forest'
+10
@@ -0,0 +1,10 @@
experiment_name='books_128_token_freq_norm_hist'
output_dir='./classifier_results/chunks/'
n_chunks=5
path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
path_to_normalization_dict='./data/final_chunks_date_filtered/token_freq/token_freq_gutenberg_XX.pickle'
norm_type='ratio'
feat_extraction_type='hist_1000'
models='logistic_regression,random_forest'

config/split_chunks_books.ini

+8
@@ -0,0 +1,8 @@
prefix='gutenberg'
output_dir='data/final_chunks'
path_to_member_data='./data/tokenized/open_llama_7b/pg19'
path_to_non_member_data='./data/tokenized/open_llama_7b/gutenberg_non_member_second_run_all'
min_tokens=5000
n_chunks=5
n_pos_chunk=200
seed=42

data/.gitignore

+11
@@ -0,0 +1,11 @@
# Ignore everything in this directory
*
# Except these files
!.gitignore
!raw_gutenberg/
!raw_arxiv_redpajama/

# Specifically unignore the files in those directories
raw_gutenberg/*
raw_arxiv_redpajama/*
@@ -0,0 +1,35 @@
import json
import os
from datasets import Dataset
from tqdm import tqdm

# Directory containing your jsonl files and path for the resulting dataset
files_directory = "XX"
path_to_result = "XX"

# List to store dataset entries
dataset_entries = []

# Loop through each file in the directory
for filename in tqdm(os.listdir(files_directory)):
    if filename.endswith(".jsonl"):
        # Extract the entries (one JSON object per line) from the jsonl file
        with open(os.path.join(files_directory, filename), 'r') as json_file:
            json_list = list(json_file)

        # Parse each line and collect the papers
        for json_str in tqdm(json_list):
            try:
                paper = json.loads(json_str)
                dataset_entries.append(paper)
            except Exception as e:
                print(e)

print('Number of arxiv papers: ', len(dataset_entries))

# Create the Hugging Face dataset from the collected entries
dataset = Dataset.from_dict({"meta": [entry["meta"] for entry in dataset_entries],
                             "text": [entry["text"] for entry in dataset_entries]})

# Save the dataset to disk
dataset.save_to_disk(path_to_result)
@@ -0,0 +1,7 @@
#!/bin/bash

# Read each URL from urls.txt and download the corresponding file
while read -r url; do
  echo "Downloading $url..."
  curl -O "$url"
done < urls.txt
