
Add EvilTwin optimizer for evil twin prompt optimization #7893


Open · wants to merge 20 commits into main

Conversation

ramisbahi

This PR introduces the EvilTwin optimizer to DSPy, implementing the Greedy Coordinate Gradient (GCG) algorithm from the "Prompts have evil twins" paper. EvilTwin generates "evil twin" prompts that appear garbled or obfuscated yet induce model outputs similar to those of the original prompt.
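To make the GCG search concrete, here is a minimal, model-free sketch of one greedy coordinate step. The `loss` function below is a toy stand-in for the KL-divergence objective the real optimizer would compute from model logits, and the random candidate sampling replaces GCG's gradient-based candidate ranking; names like `greedy_coordinate_step` are illustrative, not DSPy's API:

```python
import random

# Toy objective standing in for the KL-divergence loss the real optimizer
# computes from model logits: the number of positions where the candidate
# token sequence differs from a fixed target sequence.
def loss(tokens, target):
    return sum(1 for a, b in zip(tokens, target) if a != b)

def greedy_coordinate_step(tokens, target, vocab, top_k=8, seed=0):
    """One GCG-style coordinate step: at every position, try top_k candidate
    token substitutions and keep the single swap that lowers the loss most.
    (Real GCG ranks candidates by token gradients; this sketch samples them
    at random, which keeps the example model-free.)"""
    rng = random.Random(seed)
    best_tokens, best_loss = tokens, loss(tokens, target)
    for pos in range(len(tokens)):
        for cand in rng.sample(vocab, min(top_k, len(vocab))):
            trial = tokens[:pos] + [cand] + tokens[pos + 1:]
            trial_loss = loss(trial, target)
            if trial_loss < best_loss:
                best_tokens, best_loss = trial, trial_loss
    return best_tokens, best_loss

vocab = list(range(20))
target = [3, 7, 1, 9]
tokens = [0, 0, 0, 0]
for epoch in range(5):  # analogous to n_epochs in the optimizer settings
    tokens, cur_loss = greedy_coordinate_step(tokens, target, vocab, seed=epoch)
```

Because each step only accepts swaps that strictly lower the loss, the objective is non-increasing across epochs, mirroring the greedy acceptance rule of GCG.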

Key Features:

  • Uses KL divergence minimization to iteratively modify prompts to achieve a similar output distribution.
  • Runs on local models (default: "EleutherAI/gpt-neo-125M") since it requires gradients, logits, and token-level likelihoods, which API-based LLMs don’t expose.
  • Supports customizable optimization settings (e.g., n_epochs, batch_size, top_k, gamma for fluency penalty).
  • Provides an easy way to retrieve the final optimized prompt via optimizer.optimized_prompt.
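The KL-divergence objective in the first bullet can be illustrated with a tiny, self-contained sketch. The two log-probability lists stand in for the base model's next-token distributions under the original and candidate prompts; the numbers are hypothetical, and this is not DSPy's actual implementation:

```python
import math

def kl_divergence(p_logprobs, q_logprobs):
    """KL(P || Q) for two next-token distributions given as log-probabilities.
    In the optimizer, P would come from the model conditioned on the original
    prompt and Q from the candidate (evil twin) prompt; here the inputs are
    just example numbers."""
    return sum(math.exp(lp) * (lp - lq)
               for lp, lq in zip(p_logprobs, q_logprobs))

# Two toy 3-token distributions, expressed as log-probabilities.
p = [math.log(0.7), math.log(0.2), math.log(0.1)]
q = [math.log(0.6), math.log(0.3), math.log(0.1)]
divergence = kl_divergence(p, q)  # positive; zero only when P == Q
```

Minimizing this quantity over candidate prompts is what drives the twin's output distribution toward the original's.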

Example Usage:

import dspy
from dspy.teleprompt.evil_twin import EvilTwin

predictor = dspy.Predict('question -> answer')
q = "Describe the definition of artificial intelligence in one sentence."

optimizer = EvilTwin(question=q)
optimized_predictor = optimizer.compile(program=predictor)

print("Optimized Evil Twin Prompt:", optimizer.optimized_prompt)
original_response = predictor(question=q)
evil_twin_response = optimized_predictor(question=q)

print("Original Output:", original_response.answer)
print("Evil Twin Output:", evil_twin_response.answer)

Notes:

  • EvilTwin is best run on a GPU due to the computational cost of token gradient updates.
  • Future work may include warm start initialization, as proposed in the Evil Twins paper.

This PR enhances DSPy’s optimizer suite by enabling adversarial prompt exploration, making it a powerful tool for LLM evaluation and security research. 🚀
