computationalprivacy
diff --git a/README.md
+66 b/README.md
+66
diff --git a/classifier_results/.gitignore
+5 b/classifier_results/.gitignore
+5
diff --git a/config/arxiv_128_just_token_freq_hist.ini
+10 b/config/arxiv_128_just_token_freq_hist.ini
+10
diff --git a/config/books_128_base.ini
+10 b/config/books_128_base.ini
+10
diff --git a/config/books_128_base_hist.ini
+10 b/config/books_128_base_hist.ini
+10
diff --git a/config/books_128_general_proba_diff_max_agg.ini
+10 b/config/books_128_general_proba_diff_max_agg.ini
+10
diff --git a/config/books_128_general_proba_diff_max_hist.ini
+10 b/config/books_128_general_proba_diff_max_hist.ini
+10
diff --git a/config/books_128_general_proba_norm_agg.ini
+10 b/config/books_128_general_proba_norm_agg.ini
+10
diff --git a/config/books_128_general_proba_norm_hist.ini
+10 b/config/books_128_general_proba_norm_hist.ini
+10
diff --git a/config/books_128_just_token_freq_hist.ini
+10 b/config/books_128_just_token_freq_hist.ini
+10
diff --git a/config/books_128_token_freq_diff_max_agg.ini
+10 b/config/books_128_token_freq_diff_max_agg.ini
+10
diff --git a/config/books_128_token_freq_diff_max_hist.ini
+10 b/config/books_128_token_freq_diff_max_hist.ini
+10
diff --git a/config/books_128_token_freq_norm_agg.ini
+10 b/config/books_128_token_freq_norm_agg.ini
+10
diff --git a/config/books_128_token_freq_norm_hist.ini
+10 b/config/books_128_token_freq_norm_hist.ini
+10
diff --git a/config/split_chunks_books.ini
+8 b/config/split_chunks_books.ini
+8
diff --git a/data/.gitignore
+11 b/data/.gitignore
+11
diff --git a/data/raw_arxiv_redpajama/create_hf_dataset_arxiv.py
+35 b/data/raw_arxiv_redpajama/create_hf_dataset_arxiv.py
+35
diff --git a/data/raw_arxiv_redpajama/download_arxiv.sh
+7 b/data/raw_arxiv_redpajama/download_arxiv.sh
+7
@@ -0,0 +1,66 @@
+# Document-level membership inference for Large Language Models
+
+Given black-box access to a pretrained large language model, can we predict whether a document has been part of its training dataset? 
+
+This repo contains the source code to generate the results as published in the paper ["Did the Neurons Read your Book? Document-level Membership Inference for Large Language Models"](https://arxiv.org/pdf/2310.15007). 
+
+## 1. Install environment
+
+Follow these steps to install the correct python environment:
+- `conda create --name doc_membership python=3.9`
+- `conda activate doc_membership`
+- `pip install -r requirements.txt`
+
+## 2. Model setup
+
+We now download the target model. Use `python src/download_model.py` or `scripts/download_model.sh` to do so for the desired model on Hugging Face. In the paper we used [OpenLLaMA](https://huggingface.co/openlm-research/open_llama_7b).
+
+## 3. Dataset setup
+
+First, textual data should be collected and split into 'member' and 'non-member' documents. In this project, both books from Project Gutenberg and academic papers from ArXiv have been considered.
+
+To reproduce the data collection we rely on the data download and preprocess scripts provided by RedPajama (their first version, so now an older branch [here](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1)). More specifically, we applied the following strategy for both data sources:
+- Books from Project Gutenberg.
+    - Members: we downloaded PG-19 from Hugging Face, for instance [here](https://huggingface.co/datasets/deepmind/pg19).
+    - Non-members: we scraped books from Project Gutenberg using [this public code](https://github.com/kpully/gutenberg_scraper). You can find the scripts we used to do so in `data/raw_gutenberg/`. Note that the book index to start from was found manually on [Project Gutenberg](https://www.gutenberg.org/).
+- Academic papers from ArXiv. 
+    - Members: we download all `jsonl` files as provided by the V1 version of RedPajama. For all details see `data/raw_arxiv_redpajama/`.
+    - Non-members: we download all ArXiv papers at a small cost using the resources ArXiv provides [here](https://info.arxiv.org/help/bulk_data_s3.html) and the script to do so [here](https://github.com/togethercomputer/RedPajama-Data/blob/rp_v1/data_prep/arxiv/run_download.py).
+    - All preprocessing for ArXiv has been done using [this script](https://github.com/togethercomputer/RedPajama-Data/blob/rp_v1/data_prep/arxiv/run_clean.py).
+
+Next, we also tokenize the data using `python src/tokenize_data.py` or `scripts/tokenize_data.sh`.
+
+Lastly, we create 'chunks' of documents, enabling us to run the entire pipeline multiple times (training on k-1 chunks and evaluating on the held-out chunk, repeating this k times).
+For this we use `python src/split_chunks.py -c config/SOME_CONFIG.ini` with the appropriate input arguments. 
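The chunk-based evaluation described above can be sketched as follows. This is a minimal illustration, not the logic of `src/split_chunks.py`: the helper names `split_into_chunks` and `leave_one_chunk_out` are hypothetical, and the sketch ignores member/non-member balancing (which the `n_pos_chunk` setting in `config/split_chunks_books.ini` suggests the real script handles).

```python
import random


def split_into_chunks(doc_ids, n_chunks, seed=42):
    """Shuffle document ids and deal them round-robin into n_chunks lists."""
    rng = random.Random(seed)
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    return [shuffled[i::n_chunks] for i in range(n_chunks)]


def leave_one_chunk_out(chunks):
    """Yield (train_ids, test_ids) pairs, holding out each chunk exactly once."""
    for held_out in range(len(chunks)):
        train = [d for i, chunk in enumerate(chunks) if i != held_out for d in chunk]
        yield train, chunks[held_out]


# Hypothetical example: 20 document ids split into k=5 chunks.
chunks = split_into_chunks(range(20), n_chunks=5)
folds = list(leave_one_chunk_out(chunks))
```

Each of the k folds trains on k-1 chunks and evaluates on the remaining one, so every document is used for evaluation exactly once.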
+
+## 4. Computing the perplexity for all chunks
+
+We will now query the downloaded language model while running through each document, computing for each token its predicted probability and the top probabilities. 
+For this we use `python src/compute_perplexity.py` with the appropriate input arguments as in `scripts/compute_perplexity.sh`. Using GPUs is recommended for this. 
+The resulting token-level values are saved in `perplexity_results/`. 
+
+At the same time, the general probability for each token and token frequency in the overall set of documents is computed and saved. 
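To make the quantities in this step concrete, here is a small sketch of how a perplexity and a token-frequency table could be derived once per-token probabilities are available. It does not query a model and is not the code in `src/compute_perplexity.py`; the function names `perplexity` and `token_frequencies` and the toy inputs are assumptions for illustration only.

```python
import math
from collections import Counter


def perplexity(token_probs):
    """Perplexity: exp of the mean negative log-probability of the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def token_frequencies(tokenized_docs):
    """Relative frequency of each token id across all documents."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


# Toy inputs: probabilities the model assigned to the true tokens of one
# document, and two tiny tokenized documents.
probs = [0.5, 0.25, 0.5, 0.25]
ppl = perplexity(probs)
freqs = token_frequencies([[1, 2, 2], [2, 3]])
```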
+
+## 5. Training and evaluating the meta-classifier for membership prediction
+
+We run this with `python main.py -c config/SOME_CONFIG.ini`, where the exact setup should be specified in the config file (such as the path to perplexity results, the normalization type, meta-classifier type etc). 
+The evaluation results are then saved in `classifier_results/`. The folder `./config/` contains all setups used to generate the results in the paper (for one dataset, i.e. books). 
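As a rough illustration of what the normalization and feature-extraction settings might do, the sketch below turns token-level probabilities into a fixed-length feature vector. The semantics are assumptions: `ratio` is taken to mean dividing each token probability by its normalization value, `simple_agg` to mean summary statistics, and `hist_1000` to mean a 1000-bin histogram. The function names are hypothetical and `main.py` may implement these differently.

```python
import statistics


def ratio_normalize(token_ids, token_probs, norm_values, eps=1e-12):
    """Divide each token probability by its normalization value (assumed 'ratio')."""
    return [p / max(norm_values.get(t, eps), eps)
            for t, p in zip(token_ids, token_probs)]


def simple_agg(values):
    """Summary statistics as document-level features (assumed 'simple_agg')."""
    return [min(values), max(values),
            statistics.mean(values), statistics.pstdev(values)]


def hist_features(values, n_bins, lo, hi):
    """Normalized histogram with n_bins equal-width bins over [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), n_bins - 1)] += 1  # clip out-of-range values
    return [c / len(values) for c in counts]


# Toy document with three tokens; norm_values could be token frequencies.
ratios = ratio_normalize([1, 2, 3], [0.2, 0.6, 0.1], {1: 0.2, 2: 0.6, 3: 0.2})
agg = simple_agg(ratios)
hist = hist_features(ratios, n_bins=4, lo=0.0, hi=2.0)
```

Either feature vector has the same length for every document, which is what allows a standard classifier such as logistic regression or a random forest to be trained on documents of different lengths.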
+
+## 6. Compute baselines
+
+We also provide the code we used to compute the baselines. For this we use `python src/compute_baselines.py` with the appropriate input arguments as in `scripts/compute_baselines.sh`. Note that the code comes from Shi et al. [here](https://github.com/swj0419/detect-pretrain-code).
+
+For the neighborhood baseline as introduced by Mattern et al., we adapt [their code](https://github.com/mireshghallah/neighborhood-curvature-mia) to `src/compute_baselines.py` and `scripts/compute_neighborhood_baselines.sh`. Note that its input requires a pickle file, but this could be easily adapted if needed.
+
+## 7. Citation
+
+If you found this code helpful for your research, kindly cite our work: 
+
+```
+@article{meeus2023did,
+  title={Did the neurons read your book? document-level membership inference for large language models},
+  author={Meeus, Matthieu and Jain, Shubham and Rei, Marek and de Montjoye, Yves-Alexandre},
+  journal={arXiv preprint arXiv:2310.15007},
+  year={2023}
+}
+```
@@ -0,0 +1,5 @@
+# Ignore everything in this directory
+*
+# Except this file
+!.gitignore
+
@@ -0,0 +1,10 @@
+experiment_name='arxiv_128_just_token_freq_hist'
+output_dir='./classifier_results/chunks/'
+n_chunks=5
+path_to_raw_data='./data/final_chunks/arxiv_XX_min_tokens5000_seed42'
+path_to_labels='./data/final_chunks/arxiv_XX_labels.pickle'
+path_to_perplexity_results='./perplexity_results/perplexity_open_llama_7b_open_llama_7b_arxiv_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
+path_to_normalization_dict='./data/final_chunks/token_freq/token_freq_arxiv_XX.pickle'
+norm_type='just_norm_val'
+feat_extraction_type='hist_1000'
+models='logistic_regression,random_forest'
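Config files like the one above are flat `key='value'` lists without a `[section]` header. A minimal sketch of reading such a file with the standard library follows; how `main.py` actually parses its config is not shown in this diff, and the helper name `load_flat_ini` is hypothetical.

```python
import configparser


def load_flat_ini(text):
    """Parse key='value' lines that have no [section] header."""
    parser = configparser.ConfigParser()
    # configparser requires a section, so prepend a dummy one.
    parser.read_string("[default]\n" + text)
    # Strip the surrounding quotes used in these config files.
    return {k: v.strip("'\"") for k, v in parser["default"].items()}


example = """
experiment_name='books_128_base'
n_chunks=5
models='logistic_regression,random_forest'
"""
cfg = load_flat_ini(example)
n_chunks = int(cfg["n_chunks"])
models = cfg["models"].split(",")
```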
@@ -0,0 +1,10 @@
+experiment_name='books_128_base'
+output_dir='./classifier_results/chunks/'
+n_chunks=5
+path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
+path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
+path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
+path_to_normalization_dict='./data/final_chunks_date_filtered/general_proba/general_proba_gutenberg_7b_XX_128.pickle'
+norm_type='none'
+feat_extraction_type='simple_agg'
+models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
+experiment_name='books_128_base_hist'
+output_dir='./classifier_results/chunks/'
+n_chunks=5
+path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
+path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
+path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
+path_to_normalization_dict='./data/final_chunks_date_filtered/general_proba/general_proba_gutenberg_7b_XX_128.pickle'
+norm_type='none'
+feat_extraction_type='hist_1000'
+models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
+experiment_name='books_128_general_proba_diff_max_agg'
+output_dir='./classifier_results/chunks/'
+n_chunks=5
+path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
+path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
+path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
+path_to_normalization_dict='./data/final_chunks_date_filtered/general_proba/general_proba_gutenberg_7b_XX_128.pickle'
+norm_type='diff_max_token_proba'
+feat_extraction_type='simple_agg'
+models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
+experiment_name='books_128_general_proba_diff_max_hist'
+output_dir='./classifier_results/chunks/'
+n_chunks=5
+path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
+path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
+path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
+path_to_normalization_dict='./data/final_chunks_date_filtered/general_proba/general_proba_gutenberg_7b_XX_128.pickle'
+norm_type='diff_max_token_proba'
+feat_extraction_type='hist_1000'
+models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
+experiment_name='books_128_general_proba_norm_agg'
+output_dir='./classifier_results/chunks/'
+n_chunks=5
+path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
+path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
+path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
+path_to_normalization_dict='./data/final_chunks_date_filtered/general_proba/general_proba_gutenberg_7b_XX_128.pickle'
+norm_type='ratio'
+feat_extraction_type='simple_agg'
+models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
+experiment_name='books_128_general_proba_norm_hist'
+output_dir='./classifier_results/chunks/'
+n_chunks=5
+path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
+path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
+path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
+path_to_normalization_dict='./data/final_chunks_date_filtered/general_proba/general_proba_gutenberg_7b_XX_128.pickle'
+norm_type='ratio'
+feat_extraction_type='hist_1000'
+models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
+experiment_name='books_128_just_token_freq_hist'
+output_dir='./classifier_results/chunks/'
+n_chunks=5
+path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
+path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
+path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
+path_to_normalization_dict='./data/final_chunks_date_filtered/token_freq/token_freq_gutenberg_XX.pickle'
+norm_type='just_norm_val'
+feat_extraction_type='hist_1000'
+models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
+experiment_name='books_128_token_freq_diff_max_agg'
+output_dir='./classifier_results/chunks/'
+n_chunks=5
+path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
+path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
+path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
+path_to_normalization_dict='./data/final_chunks_date_filtered/token_freq/token_freq_gutenberg_XX.pickle'
+norm_type='diff_max_token_proba'
+feat_extraction_type='simple_agg'
+models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
+experiment_name='books_128_token_freq_diff_max_hist'
+output_dir='./classifier_results/chunks/'
+n_chunks=5
+path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
+path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
+path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
+path_to_normalization_dict='./data/final_chunks_date_filtered/token_freq/token_freq_gutenberg_XX.pickle'
+norm_type='diff_max_token_proba'
+feat_extraction_type='hist_1000'
+models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
+experiment_name='books_128_token_freq_norm_agg'
+output_dir='./classifier_results/chunks/'
+n_chunks=5
+path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
+path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
+path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
+path_to_normalization_dict='./data/final_chunks_date_filtered/token_freq/token_freq_gutenberg_XX.pickle'
+norm_type='ratio'
+feat_extraction_type='simple_agg'
+models='logistic_regression,random_forest'
@@ -0,0 +1,10 @@
+experiment_name='books_128_token_freq_norm_hist'
+output_dir='./classifier_results/chunks/'
+n_chunks=5
+path_to_raw_data='./data/final_chunks_date_filtered/gutenberg_XX_min_tokens5000_seed42'
+path_to_labels='./data/final_chunks_date_filtered/gutenberg_XX_labels.pickle'
+path_to_perplexity_results='./perplexity_results/books_revisited/perplexity_open_llama_7b_open_llama_7b_gutenberg_XX_min_tokens5000_seed42__400_128_127_seed42.pickle'
+path_to_normalization_dict='./data/final_chunks_date_filtered/token_freq/token_freq_gutenberg_XX.pickle'
+norm_type='ratio'
+feat_extraction_type='hist_1000'
+models='logistic_regression,random_forest'
@@ -0,0 +1,8 @@
+prefix='gutenberg'
+output_dir='data/final_chunks'
+path_to_member_data='./data/tokenized/open_llama_7b/pg19'
+path_to_non_member_data='./data/tokenized/open_llama_7b/gutenberg_non_member_second_run_all'
+min_tokens=5000
+n_chunks=5
+n_pos_chunk=200
+seed=42
@@ -0,0 +1,11 @@
+# Ignore everything in this directory
+*
+# Except these files
+!.gitignore
+!raw_gutenberg/
+!raw_arxiv_redpajama/
+
+# Specifically unignore the files in those directories
+!raw_gutenberg/*
+!raw_arxiv_redpajama/*
+
@@ -0,0 +1,35 @@
+import json
+import os
+from datasets import Dataset
+from tqdm import tqdm 
+
+# Directory containing your jsonl files
+files_directory = "XX"
+path_to_result = "XX"
+
+# List to store dataset entries
+dataset_entries = []
+
+# Loop through each file in the directory
+for filename in tqdm(os.listdir(files_directory)):
+    if filename.endswith(".jsonl"):
+        # let's extract the entries in the jsonl file
+        with open(os.path.join(files_directory, filename), 'r') as json_file:
+            json_list = list(json_file)
+
+        # let's now add the data
+        for json_str in tqdm(json_list):
+            try:
+                paper = json.loads(json_str)
+                dataset_entries.append(paper)
+            except Exception as e:
+                print(e)
+
+print('Number of arxiv papers: ', len(dataset_entries))
+
+# Create the dataset
+dataset = Dataset.from_dict({"meta": [entry["meta"] for entry in dataset_entries],
+                             "text": [entry["text"] for entry in dataset_entries]})
+
+# Save the dataset
+dataset.save_to_disk(path_to_result)
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+# Read each URL from urls.txt and download the corresponding JSON file
+while read -r url; do
+    echo "Downloading $url..."
+    curl -O "$url"
+done < urls.txt