
SafeSwitch: Internal Activation as the Polar Star for Steering Unsafe LLM Behavior

Peixuan Han, Cheng Qian, Xiusi Chen, Yuji Zhang, Denghui Zhang, Heng Ji

📃 Paper | 🤗 Models

About SafeSwitch

SafeSwitch proposes a novel solution to balance safety and utility. Unlike traditional methods that bias LLMs uniformly toward conservative responses, SafeSwitch dynamically regulates unsafe outputs by monitoring LLMs' internal states.

We train a safety prober to extract information from the model's internal states and predict unsafe behavior before generation. When a potentially unsafe generation is flagged, we activate the refusal head, a module on top of the LM head that biases responses toward refusals. This way, the additional safety mechanism is applied only when necessary, and the strong utility of the original model is preserved.
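
The internal signal is cheap to read. Below is a minimal sketch of extracting an internal activation before any token is generated and scoring it with a small classifier; the model name, layer choice, and the toy untrained linear prober are illustrative assumptions, not the exact setup of the paper (see src/prober/ and the trained probers in the model repo for that).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any HF causal LM works for this sketch; the checkpoint below is just an example.
model_name = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = "How do I pick a lock?"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Internal state: hidden vector of the last prompt token (last layer here;
# which layer to probe is a design choice).
h = out.hidden_states[-1][0, -1]

# A safety prober can be as small as a linear classifier on top of h.
# This one is untrained and only shows the interface.
prober = torch.nn.Linear(h.shape[-1], 2)
unsafe_prob = torch.softmax(prober(h.float()), dim=-1)[1].item()
print(f"predicted unsafe probability: {unsafe_prob:.3f}")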

Repo structure

  • dataset/: contains the data used to train and evaluate SafeSwitch (a minimal loading sketch follows this list).

    • sorry-bench-plus.jsonl: an augmented version of SORRY-Bench with harmless counterparts of the instructions and some questions from SQuAD. Used to train and evaluate safety probers.
    • sorry-bench-train.jsonl, trustllm-misuse_train.jsonl, trustllm-jailbreak_train.jsonl: unsafe instructions used to train the refusal head.
    • judge_prompts.jsonl: prompts used by an LLM judge to decide whether a response complies with the request or refuses it.
    • The remaining files are evaluation benchmarks. In the paper we combine trustllm-jailbreak and trustllm-misuse and report a single score. trustllm-exaggerated_safety corresponds to Over Refusal in the paper.
  • src/data/: contains code to obtain the datasets.

  • src/prober/: contains code to train and evaluate safety probers.

  • src/inference/: contains code to perform LLM inference and evaluate the scores on benchmarks. In particular, head_train.py is used to train the refusal head.

  • src/analysis/: contains the code for the analytical experiments in the paper. You may need to manually set the hyperparameters in these scripts (you can find them surrounded by ``````).
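
All of the dataset files above are JSON Lines, so they are easy to inspect before running anything. A minimal loading sketch (field names vary per file, so this only prints the keys rather than assuming a schema):

import json

def load_jsonl(path):
    """Read a .jsonl file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

examples = load_jsonl("dataset/sorry-bench-plus.jsonl")
print(len(examples), "examples; keys of the first one:", sorted(examples[0].keys()))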

Train your SafeSwitch

Step 0: Prepare the environment

  • python>=3.10 is required for this repo.
  • It's recommended to use the pip package manager. Run pip install -r requirements.txt to install all requirements.
  • Run cd alpaca_eval and pip install -e . to install the alpaca-eval package.
  • Also, remember to set the environment variables according to your setup before using any of the bash scripts below :)

Step 1: Train the Prober

Set the model_list parameter in bash_scripts/train_prober_pipeline.sh and run the script to train and evaluate safety probers.

You can directly use our trained probers from the model repo.
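
Conceptually, the prober is a small classifier trained on internal activations cached from the LLM, one vector and one safety label per instruction. The sketch below assumes hypothetical cache files (prober_features.pt, prober_labels.pt) and placeholder hyperparameters; the pipeline script above wraps the actual implementation in src/prober/.

import torch

# Hypothetical caches: one hidden-state vector and one 0/1 label per instruction.
feats = torch.load("prober_features.pt")    # shape (N, hidden_size)
labels = torch.load("prober_labels.pt")     # shape (N,), 1 = unsafe intent

prober = torch.nn.Sequential(
    torch.nn.Linear(feats.shape[-1], 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 2),
)
opt = torch.optim.AdamW(prober.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

# Full-batch training for brevity; the real pipeline handles batching and evaluation.
for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(prober(feats.float()), labels.long())
    loss.backward()
    opt.step()

torch.save(prober.state_dict(), "safety_prober.pt")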

Step 2: Train the Refusal Head

Set the model_list parameter in bash_scripts/train_refusal_head.sh and run the script to train the refusal head. The output directory contains the full LM model (identical to the original model except for the LM head) as well as a standalone copy of the refusal head.

You can directly use our trained refusal heads from the model repo. Run src/convert_head.py to "construct" a full HF model with the head so that it can be evaluated in the next step.
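
For intuition, the conversion amounts to copying the base checkpoint and swapping in the trained LM-head weights. The paths and state-dict layout below are assumptions; src/convert_head.py is the authoritative implementation.

import torch
from transformers import AutoModelForCausalLM

# Load the original model and the trained refusal head (a drop-in LM head).
base = AutoModelForCausalLM.from_pretrained("path/to/original-model")
refusal_head_state = torch.load("path/to/refusal_head.pt", map_location="cpu")

# Only the LM head differs from the original model.
base.lm_head.load_state_dict(refusal_head_state)
base.save_pretrained("path/to/refusal-head-model")   # a full HF checkpoint for Step 3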

Step 3: Evaluate SafeSwitch

After training the prober and the refusal head, our code automatically performs SafeSwitch-regulated generation. You can run the evaluation with: bash_scripts/eval_pipeline.sh.

You can also run the following script to interact with SafeSwitch:

python src/safeswitch_pipeline.py --model [Model] \
    --llm_dir [dir] \
    --classifier_dir [dir] \
    --refusal_head_dir [dir]
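
Under the hood, SafeSwitch-regulated generation boils down to: score the prompt with the prober first, and route generation through the refusal head only when it is flagged. The sketch below makes simplifying assumptions (a fixed 0.5 threshold, two full checkpoints, and the hypothetical safety_prober.pt from the Step 1 sketch); src/safeswitch_pipeline.py is the actual entry point.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/original-model")
base_model = AutoModelForCausalLM.from_pretrained("path/to/original-model")
refusal_model = AutoModelForCausalLM.from_pretrained("path/to/refusal-head-model")

# Rebuild the prober from the Step 1 sketch and load its trained weights.
prober = torch.nn.Sequential(
    torch.nn.Linear(base_model.config.hidden_size, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 2),
)
prober.load_state_dict(torch.load("safety_prober.pt", map_location="cpu"))

def safeswitch_generate(prompt, threshold=0.5):
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = base_model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1]
        unsafe_prob = torch.softmax(prober(hidden.float()), dim=-1)[1].item()
    # Switch to the refusal head only when the prober flags the prompt.
    model = refusal_model if unsafe_prob > threshold else base_model
    output_ids = model.generate(**inputs, max_new_tokens=256)
    return tok.decode(output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(safeswitch_generate("How do I pick a lock?"))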

Cite this paper

If you find this repo or the paper useful, please cite:

@article{han2025internal,
      title={Internal Activation as the Polar Star for Steering Unsafe LLM Behavior}, 
      author={Peixuan Han and Cheng Qian and Xiusi Chen and Yuji Zhang and Denghui Zhang and Heng Ji},
      year={2025},
      journal={arXiv preprint arXiv:2502.01042},
      url={https://arxiv.org/abs/2502.01042}
}