
fasteval


A decorator-first evaluation library for testing LLMs and AI agents. Stack decorators to define evaluation criteria, then run them with pytest.

Features

  • Decorator-based metrics -- stack @fe.correctness, @fe.relevance, @fe.hallucination, and 30+ more
  • pytest native -- run evaluations with pytest, get familiar pass/fail output
  • LLM-as-judge + deterministic -- semantic LLM metrics alongside ROUGE, exact match, JSON schema, regex
  • Multi-modal -- evaluate vision, audio, and image generation models
  • Conversation metrics -- context retention, topic drift, consistency for multi-turn agents
  • RAG metrics -- faithfulness, contextual precision, contextual recall, answer correctness
  • Tool trajectory -- verify agent tool calls, argument matching, call sequences
  • Pluggable providers -- OpenAI (default), Anthropic, Azure OpenAI, Ollama

Quick Start

pip install fasteval-core

Set your LLM provider key:

export OPENAI_API_KEY=sk-your-key-here

Write your first evaluation test:

import fasteval as fe

@fe.correctness(threshold=0.8)
@fe.relevance(threshold=0.7)
def test_qa_agent():
    response = my_agent("What is the capital of France?")
    fe.score(response, expected_output="Paris", input="What is the capital of France?")

Run it:

pytest test_qa_agent.py -v
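Because each evaluation is an ordinary pytest test, the usual pytest selection flags work too, for example to run a single test function or stop at the first failure:

# Run a single test function
pytest test_qa_agent.py::test_qa_agent -v

# Stop at the first failing evaluation
pytest test_qa_agent.py -x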

Installation

# pip
pip install fasteval-core

# uv
uv add fasteval-core

Optional Extras

# Anthropic provider
pip install fasteval-core[anthropic]

# Vision-language evaluation (GPT-4V, Claude Vision)
pip install fasteval-core[vision]

# Audio/speech evaluation (Whisper, ASR)
pip install fasteval-core[audio]

# Image generation evaluation (DALL-E, Stable Diffusion)
pip install fasteval-core[image-gen]

# All multi-modal features
pip install fasteval-core[multimodal]
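As an illustrative sketch only of how a multi-modal check might look, assuming the vision metrics follow the same decorator pattern as the text metrics (the @fe.image_relevance name and the image= argument are assumptions, not confirmed API -- check the docs for the actual vision metrics):

import fasteval as fe

# Hypothetical vision metric -- decorator name and image= argument are assumptions
@fe.image_relevance(threshold=0.7)
def test_caption_quality():
    caption = vision_model.describe("photo.jpg")  # your own vision model call
    fe.score(actual_output=caption, input="Describe the photo", image="photo.jpg")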

Usage Examples

Deterministic Metrics

import fasteval as fe

@fe.contains()
def test_keyword_present():
    fe.score("The answer is 42", expected_output="42")

@fe.rouge(threshold=0.6, rouge_type="rougeL")
def test_summary_quality():
    # summary is your model's generated summary; reference is the gold-standard text
    fe.score(actual_output=summary, expected_output=reference)
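The feature list also mentions JSON schema and regex checks among the deterministic metrics. A rough sketch, assuming they follow the same decorator pattern (the @fe.json_schema name and schema= argument are assumptions, not confirmed API):

# Hypothetical deterministic metric -- decorator name and schema= argument are assumptions
@fe.json_schema(schema={"type": "object", "required": ["answer"]})
def test_structured_output():
    fe.score(actual_output='{"answer": "42"}')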

RAG Evaluation

@fe.faithfulness(threshold=0.8)
@fe.contextual_precision(threshold=0.7)
def test_rag_pipeline():
    result = rag_pipeline("How does photosynthesis work?")
    fe.score(
        actual_output=result.answer,
        context=result.retrieved_docs,
        input="How does photosynthesis work?",
    )

Tool Trajectory

@fe.tool_call_accuracy(threshold=0.9)
def test_agent_tools():
    result = agent.run("Book a flight to Paris")
    fe.score(
        actual_tools=result.tool_calls,
        expected_tools=[
            {"name": "search_flights", "args": {"destination": "Paris"}},
            {"name": "book_flight"},
        ],
    )

Metric Stacks

@fe.correctness(threshold=0.8, weight=2.0)
@fe.relevance(threshold=0.7, weight=1.0)
@fe.coherence(threshold=0.6, weight=1.0)
def test_comprehensive():
    response = agent("Explain quantum computing")
    fe.score(response, expected_output=reference_answer, input="Explain quantum computing")
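The feature list also mentions conversation metrics for multi-turn agents. A minimal sketch, assuming those metrics follow the same decorator pattern (the @fe.context_retention name and the conversation= argument are assumptions, not confirmed API):

# Hypothetical conversation metric -- decorator name and conversation= argument are assumptions
@fe.context_retention(threshold=0.7)
def test_multi_turn_memory():
    turns = [
        {"role": "user", "content": "My name is Ada."},
        {"role": "assistant", "content": "Nice to meet you, Ada!"},
        {"role": "user", "content": "What is my name?"},
    ]
    reply = chat_agent(turns)  # your own multi-turn agent
    fe.score(actual_output=reply, conversation=turns)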

Plugins

  • fasteval-langfuse -- evaluate Langfuse production traces with fasteval metrics (pip install fasteval-langfuse)
  • fasteval-langgraph -- test harness for LangGraph agents (pip install fasteval-langgraph)
  • fasteval-observe -- runtime monitoring with async sampling (pip install fasteval-observe)

Local Development

# Install uv (macOS via Homebrew; see the uv documentation for other platforms)
brew install uv

# Create virtual environment and install dependencies
uv sync --all-extras

# Run the test suite
uv run tox

# Format code
uv run black .
uv run isort .

# Type checking
uv run mypy .
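During development it is often quicker to run pytest directly instead of the full tox matrix (the tests/ path below is an assumption about the repository layout):

# Run tests directly, stopping at the first failure
uv run pytest tests/ -x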

Documentation

Full documentation is available in the docs/ directory.

Contributing

See CONTRIBUTING.md for development setup, coding standards, and how to submit pull requests.

License

Apache License 2.0 -- see LICENSE for details.
