A/B testing #793

boxabirds · 2025-02-17T14:20:29Z

Problem: output varies widely with both the LLM and the prompt, so small changes to either can produce significantly different results.

Solution: add the ability to specify a list of prompt variations and a list of LLMs to try.

You could use Optuna for efficient evaluation (cf. DSPy), along with Argilla for the human evaluation.
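
To make the idea concrete, here is a rough sketch of what the feature could look like. Everything in it is hypothetical and not part of any existing API: `Variant`, `ab_test`, and the `generate`/`score` callables are names made up for illustration, and the two stand-in functions only fake an LLM call and a metric so the snippet runs on its own.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Variant:
    """One prompt/model combination under test (hypothetical type)."""
    prompt: str
    model: str


def ab_test(
    prompts: Iterable[str],
    models: Iterable[str],
    generate: Callable[[str, str], str],
    score: Callable[[str], float],
) -> List[Tuple[Variant, float]]:
    """Run every prompt/model pair and return (variant, score), best first."""
    results = []
    for prompt, model in product(prompts, models):
        output = generate(model, prompt)  # caller supplies the real LLM call
        results.append((Variant(prompt, model), score(output)))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Stand-ins so the sketch is runnable; replace with real calls and metrics.
def fake_generate(model: str, prompt: str) -> str:
    return f"[{model}] {prompt}"


def length_score(output: str) -> float:
    return float(len(output))


ranked = ab_test(
    prompts=[
        "Summarise this page.",
        "Summarise this page in three bullet points.",
    ],
    models=["model-a", "model-b"],
    generate=fake_generate,
    score=length_score,
)

for variant, value in ranked:
    print(f"{value:6.1f}  {variant.model}  {variant.prompt}")
```

This tries the full grid, which gets expensive quickly. A smarter search (for example Optuna's `trial.suggest_categorical` over the prompt and model lists) could replace the `product` loop, and `score` could be swapped for human ratings collected elsewhere.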

sysradium · 2025-02-18T12:30:00Z

@boxabirds have you got an API in mind?
