A comprehensive testing framework for evaluating LLM performance on semantic parsing tasks in the Roster Assistant project.
OdyTest provides a modular, sequential testing approach for fairly comparing LLM models across multiple languages and prompt variants. The name "OdyTest" reflects the journey-like nature of testing models one after another, like an odyssey through different AI capabilities.
- Sequential Model Testing: Test one model at a time for fair resource allocation
- Multilingual Support: Test cases in English, German, French, and Italian
- Multiple Prompt Variants: Compare different prompt engineering approaches
- Comprehensive Metrics: JSON validity, intent accuracy, entity extraction, timing
- Automated Reporting: Generate detailed comparative analysis reports
- Resource Monitoring: Track CPU and memory usage during testing
- Modular Architecture: Clean separation of concerns for easy maintenance
```
odytest/
├── config.py                # Model configurations and settings
├── test_cases.py            # Comprehensive multilingual test scenarios
├── prompt_manager.py        # Prompt variants and management
├── model_evaluator.py       # Core testing and evaluation logic
├── results_manager.py       # Result storage and analysis
├── test_single_model.py     # Single model test runner
├── run_sequential_tests.py  # Sequential test orchestrator
├── demo.py                  # Demonstration script
├── run.bat                  # Windows batch runner with interactive menu
├── setup.bat                # Environment setup script
└── README.md                # This documentation
```
```
# Navigate to the odytest directory
cd tests/odytest

# Run setup script to create virtual environment
setup.bat

# Interactive menu
run.bat

# Quick commands
run.bat demo            # Show demonstration
run.bat list-models     # List available models
run.bat test gemma3_1b  # Test single model
run.bat sequential      # Test all models
run.bat report          # Generate comparative report
```

```
# Test with production prompt
python test_single_model.py gemma3_1b

# Test with different prompt variant
python test_single_model.py qwen3_1_7b --prompt multilingual

# List available models and prompts
python test_single_model.py --list-models
python test_single_model.py --list-prompts

# Run sequential tests for all models
python run_sequential_tests.py

# Test specific models with custom prompt
python run_sequential_tests.py --models gemma3_1b qwen3_1_7b --prompt multilingual

# Generate report from existing results
python run_sequential_tests.py --generate-report-only
```

- Gemma3-1B: Simple, lightweight, and fast; ideal for quick testing
- Qwen3-1.7B: Superior human preference alignment with balanced performance
- Qwen3-0.6B: Compact model with superior human preference alignment
- JOSIEFIED-Qwen3-0.6B: Modified version of Qwen3 0.6B with enhanced capabilities
- Employee unavailability scenarios
- Urgent shift coverage requests
- Multilingual emergency contexts
- Full schedule generation requests
- Optimization preference handling
- Time period specifications
- Employee availability queries
- Staff capability inquiries
- General information requests
- Current schedule display requests
- Specific time period views
- Weekend/day-specific queries
- Shift swapping requests
- Employee reassignments
- Schedule adjustments
- Hypothetical scenario evaluation
- Impact analysis requests
- What-if questions
- Empty inputs
- Ambiguous requests
- Complex multilingual scenarios
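The contributing guidelines below reference the TestCase structure defined in `test_cases.py`. As a rough illustration only, a test case record might look like the following sketch; the field names here are assumptions, not the actual definition:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a test case record; the real TestCase in
# test_cases.py may use different field names.
@dataclass
class TestCase:
    text: str                      # user input sent to the model
    language: str                  # "en", "de", "fr", or "it"
    expected_intent: str           # intent label the model should return
    expected_entities: dict = field(default_factory=dict)
    category: str = "general"      # e.g. shift coverage, edge case
    difficulty: str = "easy"       # "easy", "medium", or "hard"

# Example multilingual case (German: "Can Anna fill in on Friday?")
case = TestCase(
    text="Kann Anna am Freitag einspringen?",
    language="de",
    expected_intent="shift_coverage",
    expected_entities={"employee": "Anna", "day": "Friday"},
    category="unavailability",
    difficulty="medium",
)
```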
- Current production prompt from the semantic parser configuration
- Enhanced with explicit multilingual support and examples
- Streamlined for faster inference with essential features only
- Highly organized with clear sections and comprehensive rules
- Encourages step-by-step reasoning for complex scenarios
- Intent Accuracy: Correct classification of user intent
- Entity Extraction: Accuracy of extracted entities
- JSON Validity: Structural correctness of output
- Inference Time: Average, median, min/max response times
- Success Rate: Percentage of successful API calls
- Confidence Scores: Model confidence in predictions
- Per-language accuracy breakdown
- Multilingual capability assessment
- Language-specific recommendations
- Performance across different complexity levels
- Edge case handling capabilities
- Robustness evaluation
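As an illustration, the JSON-validity, intent-accuracy, and timing metrics above could be aggregated from per-case results roughly like this. The result keys (`raw_output`, `predicted_intent`, `expected_intent`, `inference_time`) are assumptions for the sketch, not the actual schema used by `results_manager.py`:

```python
import json
from statistics import mean, median

def summarize(results: list[dict]) -> dict:
    """Aggregate per-case results into headline metrics.

    Assumes each result dict carries 'raw_output', 'predicted_intent',
    'expected_intent', and 'inference_time' keys (hypothetical schema).
    """
    def json_valid(raw: str) -> bool:
        # A case counts as valid if the raw model output parses as JSON.
        try:
            json.loads(raw)
            return True
        except (json.JSONDecodeError, TypeError):
            return False

    times = [r["inference_time"] for r in results]
    return {
        "json_validity": mean(json_valid(r["raw_output"]) for r in results),
        "intent_accuracy": mean(
            r["predicted_intent"] == r["expected_intent"] for r in results
        ),
        "avg_time": mean(times),
        "median_time": median(times),
        "min_time": min(times),
        "max_time": max(times),
    }
```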
1. Load the model in Ollama: `ollama pull gemma3:1b`
2. Run the single-model test: `python test_single_model.py gemma3_1b`
3. Unload the model and load the next one: `ollama rm gemma3:1b`, then `ollama pull qwen3:1.7b`
4. Continue testing: `python test_single_model.py qwen3_1_7b`
5. Generate the comparative report: `python run_sequential_tests.py --generate-report-only`
```
from test_single_model import test_single_model
from run_sequential_tests import SequentialTestRunner
from test_cases import get_test_summary
from demo import demo_test_suite_overview

# Print package information
demo_test_suite_overview()

# Get test suite summary
summary = get_test_summary()
print(f"Total test cases: {summary['total_cases']}")

# Test single model
result_file = test_single_model("gemma3_1b", "multilingual")

# Run sequential tests
runner = SequentialTestRunner("production")
success = runner.run_complete_evaluation(["gemma3_1b", "qwen3_1_7b"])
```

`results/gemma3_1b_production_20250603_143000.json`
Contains:
- Test metadata (model, prompt, timestamp)
- Summary statistics
- Detailed results for each test case
- Performance metrics
`results/comparative_analysis_20250603_143000.json`
Contains:
- Model rankings by different criteria
- Performance comparison matrix
- Language-specific analysis
- Deployment recommendations
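A result file like the ones above can be inspected with a few lines of Python. The top-level keys used here (`metadata`, `summary`) are assumptions based on the structure listed above, not the exact schema:

```python
import json
from pathlib import Path

def load_result_summary(path: str) -> dict:
    """Load a result file and return its summary block.

    Assumes top-level 'metadata' and 'summary' keys (hypothetical
    schema mirroring the structure described above).
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    meta = data.get("metadata", {})
    print(f"Model: {meta.get('model')}  Prompt: {meta.get('prompt')}")
    return data.get("summary", {})
```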
```
MODEL_CONFIGS = {
    "gemma3_1b": ModelConfig(
        name="gemma3:1b",
        temperature=0.1,
        top_p=0.95,
        timeout=10,
        max_retries=3,
        description="gemma3 1b - Simple, lightweight and fast"
    )
}
```

```
TEST_CONFIG = {
    "max_retries": 2,
    "retry_delay": 1.0,
    "save_raw_outputs": True,
    "validate_checksums": True,
    "monitor_resources": True
}
```

- Test one model at a time to ensure fair resource allocation
- Use identical test cases across all models
- Monitor system resources during testing
- Validate model availability before testing
- Compare models using multiple metrics, not just accuracy
- Consider language-specific performance differences
- Evaluate speed vs. accuracy trade-offs
- Review edge case handling capabilities
- Use comparative analysis for model selection
- Consider specific language requirements
- Balance accuracy and inference speed
- Test with production-like workloads
Model Not Available
- Ensure Ollama is running
- Verify the model is loaded: `ollama list`
- Check the model name spelling
Import Errors
- Ensure you're in the project root directory
- Check Python path configuration
- Verify all dependencies are installed
Memory Issues
- Test one model at a time
- Unload models between tests
- Monitor system resources
JSON Validation Failures
- Check prompt formatting
- Verify the model output structure
- Review the validation criteria

Install the dependencies with:

```
pip install requests psutil ollama httpx
```

When adding new test cases:
- Follow the existing TestCase structure
- Include expected entities and intents
- Add appropriate difficulty and category labels
- Test across multiple languages when applicable
When adding new models:
- Update MODEL_CONFIGS in config.py
- Test with existing test suite
- Verify timeout and parameter settings
- Document model-specific considerations
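Registering a new model would look something like the following sketch, mirroring the `ModelConfig` fields shown in the configuration section above. The local dataclass and the new entry here are illustrative assumptions; the real `ModelConfig` lives in `config.py`:

```python
from dataclasses import dataclass

# Stand-in for the ModelConfig defined in config.py; same fields as the
# configuration example shown earlier in this README.
@dataclass
class ModelConfig:
    name: str
    temperature: float
    top_p: float
    timeout: int
    max_retries: int
    description: str

MODEL_CONFIGS: dict[str, ModelConfig] = {}

# Hypothetical new entry, for illustration only.
MODEL_CONFIGS["llama3_2_1b"] = ModelConfig(
    name="llama3.2:1b",
    temperature=0.1,
    top_p=0.95,
    timeout=15,       # larger or slower models may need a longer timeout
    max_retries=3,
    description="Llama 3.2 1B - example entry for illustration",
)
```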
This testing framework is licensed under the MIT license.
Copyright 2025 Jean-Marie Heck
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.