purple-problem

You can't stop a language model from saying purple 🤷

Install

To install the packages, you will have to (1) create an environment with the given environment.yml file and (2) install the modified llm-attacks library called llm-attacks-clone. llm-attacks-clone is a modified version of the llm-attacks repository that is edited to optimize GCG strings targeting 'Purple' with the corresponding prompt templates for each model.

Here is how to install the environment:

conda env create -f environment.yml

And here is how to install llm-attacks-clone within the environment:

cd llm-attacks-clone
pip install .

Purple Questions Dataset

datasets/ contains the Purple Questions dataset train, validation, and test splits in json. Each json file is a dictionary containing the questions (prompt) inducing the word 'purple' in the response, the chosen responses (chosen) which don't contain 'purple', and the rejected responses (rejected) which contain 'purple'. You can optionally create your own dataset using create_dataset.py with the desired flags and your OpenAI API key.

datasets/paraphrased contains prompts after being paraphrased in different ways for the paraphrase defense. You can optionally paraphrase your own prompts using [TODO].

Models

released_models contains the fine-tuned and adversarially trained models on the Purple Questions Dataset for Llama-IT, Vicuna, and Llama-2-chat as discussed in the paper. These are LoRA adapters that are loaded on top of the base models. The base model for Llama-IT is the sft10k model from Alpaca Farm which is not provided here and must be manually downloaded.

Fine-tuning

To fine-tune a model through DPO, run train_dpo.py with the required arguments. Here is an example for training Vicuna 7B from huggingface:

python train_dpo.py --base_model lmsys/vicuna-7b-v1.5 --learning_rate 3e-4 --kl_coef 0.3 --epochs 5

For training Llama models, make sure to reduce the batch size to 1 and use gradient accumulation instead.

python train_dpo.py --base_model meta-llama/Llama-2-7b-chat-hf --learning_rate 3e-4 --kl_coef 0.3 --epochs 5 --batch_size 1 --grad_accum 4

The trained model's LoRA adapter will be saved in a separate file called models.

Adversarial Training

To adversarially train a model, run train_dpo.py with a suffix json file passed as an argument. By default, this will append the selected adversarial suffixes to 50% of the prompts before training.

python train_dpo.py -bm lmsys/vicuna-7b-v1.5 -lr 3e-4 -kl 0.3 -e 5 -suf suffix/vicuna_suffix_train.json

GCG Optimization

suffix/ contains the adversarial suffixes optimized through GCG against our released models used in the paper. Each train set has 20 suffixes used for adversarial training while each validation set has 10 suffixes. gcg_suffix.json contains the corresponding string optimized on each model which results in the reported DSR (Defense Success Rate) for reproduction.

To optimize your own adversarial suffix, run optimize_gcg.sh by passing in the model directory, fastchat prompt template name, and initial string as an argument. To optimize against a different base model with a new template, you will have to modify llm-attacks-clone/llm_attacks/base/attack_manager.py.

Fastchat prompt template used for each model:

Llama-IT: alpaca

Vicuna: vicuna_v1.1

Llama-2-chat: llama-2

Here is an example for optimizing an adversarial string on fine-tuned vicuna with the initial string ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !:

bash optimize_gcg.sh released_models/vicuna-finetune vicuna_v1.1 '! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !'

Evaluate Safety

To evaluate a model, you can specify your desired attack method, defense method, and suffixes and run the following command

python3 evaluate.py --base_model released_models/vicuna-finetune --attack_system gcg --defense_system none --suffixes dpo

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

purple-problem

Contents

Install

Purple Questions Dataset

Models

Fine-tuning

Adversarial Training

GCG Optimization

Evaluate Safety

About

Releases

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
datasets		datasets
gcg_results		gcg_results
llm-attacks-clone		llm-attacks-clone
released_models		released_models
slurmjobs		slurmjobs
suffix		suffix
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
create_dataset.py		create_dataset.py
environment.yml		environment.yml
evaluate.py		evaluate.py
optimize_gcg.sh		optimize_gcg.sh
train_dpo.py		train_dpo.py

License

kothasuhas/purple-problem

Folders and files

Latest commit

History

Repository files navigation

purple-problem

Contents

Install

Purple Questions Dataset

Models

Fine-tuning

Adversarial Training

GCG Optimization

Evaluate Safety

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages