SafeSwitch proposes a novel solution to balance safety and utility. Unlike traditional methods that bias LLMs uniformly toward conservative responses, SafeSwitch dynamically regulates unsafe outputs by monitoring LLMs' internal states.
We train a safety prober to extract information from internal states and predict unsafe behaviors before generation. When a potentially unsafe generation is flagged, we activate the refusal head, a module on top of the LM head that biases responses toward refusal. This way, the additional safety mechanism is applied only when necessary, and the strong utility of the original model is preserved.
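To make the control flow concrete, here is a minimal sketch of the two-stage logic. This is an illustration under assumptions, not the repo's actual API: the `prober` and `refusal_head` objects, the probed layer/position, and the 0.5 threshold are hypothetical stand-ins.

```python
# Hedged sketch of SafeSwitch's gating logic (hypothetical names, not the repo API).
import torch

def safeswitch_generate(model, tokenizer, prober, refusal_head, prompt, threshold=0.5):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Stage 1: inspect internal states BEFORE generating anything.
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    probe_input = out.hidden_states[-1][:, -1, :]  # last layer, final prompt token
    p_unsafe = torch.sigmoid(prober(probe_input)).item()

    # Stage 2: only when the prober flags the prompt, swap in the refusal head,
    # a drop-in replacement for lm_head that biases decoding toward refusals.
    original_head = model.lm_head
    if p_unsafe > threshold:
        model.lm_head = refusal_head
    try:
        gen_ids = model.generate(**inputs, max_new_tokens=256)
    finally:
        model.lm_head = original_head  # restore the unmodified model
    return tokenizer.decode(gen_ids[0], skip_special_tokens=True)
```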
- `dataset/`: contains the data used to train and evaluate SafeSwitch (a quick way to inspect these files is shown after this list).
  - `sorry-bench-plus.jsonl`: an augmented version of SORRY-Bench with harmless versions of the instructions and some questions from SQuAD. Used to train and evaluate safety probers.
  - `sorry-bench-train.jsonl`, `trustllm-misuse_train.jsonl`, `trustllm-jailbreak_train.jsonl`: unsafe instructions used to train the refusal head.
  - `judge_prompts.jsonl`: prompts used for an LLM judge to decide whether a response complies with the request or refuses it.
  - The rest are evaluation benchmarks. In the paper we combine `trustllm-jailbreak` and `trustllm-misuse` and report a single score; `trustllm-exaggerated_safety` corresponds to Over Refusal in the paper.
- `src/data/`: contains code to obtain the datasets.
- `src/prober/`: contains code to train and evaluate safety probers.
- `src/inference/`: contains code to perform LLM inference and evaluate scores on the benchmarks. In particular, `head_train.py` is used to train the refusal head.
- `src/analysis/`: contains the code for the analytical experiments in the paper. You may need to manually set the hyperparameters in these scripts (you can find them surrounded by `` ``` `` markers).
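The dataset files are in JSON Lines format. A quick, schema-agnostic way to inspect one (shown here for `sorry-bench-plus.jsonl`; we only print the keys, since the exact fields vary across files):

```python
import json

# Peek at one of the dataset files without assuming its schema.
with open("dataset/sorry-bench-plus.jsonl") as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} records; keys of the first record: {sorted(records[0])}")
```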
- `python>=3.10` is required for this repo.
- It's recommended to use the pip package manager. Run `pip install -r requirements.txt` to install all requirements.
- Run `cd alpaca_eval` and then `pip install -e .` to install the alpaca-eval package.
- Also, remember to set the system variables according to your environment before using any of the bash scripts below :)
Set the `model_list` parameter in `bash_scripts/train_prober_pipeline.sh` and run the script to train and evaluate safety probers. Alternatively, you can directly use our trained probers from the model repo.
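For intuition, a safety prober here can be as simple as a small classifier over hidden states extracted before generation. The sketch below is illustrative only; the architecture, probed layer, and hyperparameters are placeholder assumptions rather than the actual settings in `src/prober/`:

```python
import torch
import torch.nn as nn

class SafetyProber(nn.Module):
    """Toy probe: predicts an 'unsafe' logit from one hidden-state vector."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, h):               # h: (batch, hidden_size)
        return self.mlp(h).squeeze(-1)  # (batch,) unsafe logits

def train_prober(feats, labels, hidden_size, epochs=5, lr=1e-4):
    """feats: pre-extracted hidden states (N, hidden_size); labels: {0, 1} of shape (N,)."""
    prober = SafetyProber(hidden_size)
    opt = torch.optim.AdamW(prober.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(prober(feats), labels.float())
        loss.backward()
        opt.step()
    return prober
```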
Set the `model_list` parameter in `bash_scripts/train_refusal_head.sh` and run the script to train the refusal head. The output directory will contain the full LM (identical to the original model except for the LM head) as well as a copy of the refusal head alone.
Alternatively, you can use our trained refusal heads from the model repo; in that case, run `src/convert_head.py` to "construct" a full HF model with the head so it can be evaluated in the next step.
After training the prober and the refusal head, our code automatically performs SafeSwitch-regulated generation. You can run the evaluation with `bash_scripts/eval_pipeline.sh`.
You can also run the following script to interact with SafeSwitch:

```bash
python src/safeswitch_pipeline.py --model [Model] \
    --llm_dir [dir] \
    --classifier_dir [dir] \
    --refusal_head_dir [dir]
```
If you find this repo or the paper useful, please cite:
```bibtex
@article{han2025internal,
  title={Internal Activation as the Polar Star for Steering Unsafe LLM Behavior},
  author={Peixuan Han and Cheng Qian and Xiusi Chen and Yuji Zhang and Denghui Zhang and Heng Ji},
  year={2025},
  journal={arXiv preprint arXiv:2502.01042},
  url={https://arxiv.org/abs/2502.01042}
}
```