A comprehensive testing framework for evaluating LLM performance on semantic parsing tasks in the Roster Assistant project.
OdyTest provides a modular, sequential testing approach for fairly comparing LLM models across multiple languages and prompt variants. The name "OdyTest" reflects the journey-like nature of testing models one after another, like an odyssey through different AI capabilities.
- Sequential Model Testing: Test one model at a time for fair resource allocation
- Multilingual Support: Test cases in English, German, French, and Italian
- Multiple Prompt Variants: Compare different prompt engineering approaches
- Comprehensive Metrics: JSON validity, intent accuracy, entity extraction, timing
- Automated Reporting: Generate detailed comparative analysis reports
- Resource Monitoring: Track CPU and memory usage during testing
- Modular Architecture: Clean separation of concerns for easy maintenance
```
odytest/
├── config.py                # Model configurations and settings
├── test_cases.py            # Comprehensive multilingual test scenarios
├── prompt_manager.py        # Prompt variants and management
├── model_evaluator.py       # Core testing and evaluation logic
├── results_manager.py       # Result storage and analysis
├── test_single_model.py     # Single model test runner
├── run_sequential_tests.py  # Sequential test orchestrator
├── demo.py                  # Demonstration script
├── run.bat                  # Windows batch runner with interactive menu
├── setup.bat                # Environment setup script
└── README.md                # This documentation
```
```
# Navigate to the odytest directory
cd tests/odytest

# Run setup script to create virtual environment
setup.bat

# Interactive menu
run.bat

# Quick commands
run.bat demo            # Show demonstration
run.bat list-models     # List available models
run.bat test gemma3_1b  # Test single model
run.bat sequential      # Test all models
run.bat report          # Generate comparative report
```

```
# Test with production prompt
python test_single_model.py gemma3_1b

# Test with different prompt variant
python test_single_model.py qwen3_1_7b --prompt multilingual

# List available models and prompts
python test_single_model.py --list-models
python test_single_model.py --list-prompts

# Run sequential tests for all models
python run_sequential_tests.py

# Test specific models with custom prompt
python run_sequential_tests.py --models gemma3_1b qwen3_1_7b --prompt multilingual

# Generate report from existing results
python run_sequential_tests.py --generate-report-only
```

- Gemma3-1B: Simple, lightweight, and fast; ideal for quick testing
- Qwen3-1.7B: Superior human preference alignment with balanced performance
- Qwen3-0.6B: Compact model with superior human preference alignment
- JOSIEFIED-Qwen3-0.6B: Modified version of Qwen3 0.6B with enhanced capabilities
- Employee unavailability scenarios
- Urgent shift coverage requests
- Multilingual emergency contexts
- Full schedule generation requests
- Optimization preference handling
- Time period specifications
- Employee availability queries
- Staff capability inquiries
- General information requests
- Current schedule display requests
- Specific time period views
- Weekend/day-specific queries
- Shift swapping requests
- Employee reassignments
- Schedule adjustments
- Hypothetical scenario evaluation
- Impact analysis requests
- What-if questions
- Empty inputs
- Ambiguous requests
- Complex multilingual scenarios
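The contributing guidelines below reference the TestCase structure defined in `test_cases.py`. As a rough illustration only, a test case record might look like the following sketch; the field names here are assumptions, not the actual definition:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a test case record; the real TestCase in
# test_cases.py may use different field names.
@dataclass
class TestCase:
    text: str                      # user input sent to the model
    language: str                  # "en", "de", "fr", or "it"
    expected_intent: str           # intent label the model should return
    expected_entities: dict = field(default_factory=dict)
    category: str = "general"      # e.g. shift coverage, edge case
    difficulty: str = "easy"       # "easy", "medium", or "hard"

# Example multilingual case (German: "Can Anna fill in on Friday?")
case = TestCase(
    text="Kann Anna am Freitag einspringen?",
    language="de",
    expected_intent="shift_coverage",
    expected_entities={"employee": "Anna", "day": "Friday"},
    category="unavailability",
    difficulty="medium",
)
```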
- Current production prompt from the semantic parser configuration
- Enhanced with explicit multilingual support and examples
- Streamlined for faster inference with essential features only
- Highly organized with clear sections and comprehensive rules
- Encourages step-by-step reasoning for complex scenarios
- Intent Accuracy: Correct classification of user intent
- Entity Extraction: Accuracy of extracted entities
- JSON Validity: Structural correctness of output
- Inference Time: Average, median, min/max response times
- Success Rate: Percentage of successful API calls
- Confidence Scores: Model confidence in predictions
- Per-language accuracy breakdown
- Multilingual capability assessment
- Language-specific recommendations
- Performance across different complexity levels
- Edge case handling capabilities
- Robustness evaluation
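As an illustration, the JSON-validity, intent-accuracy, and timing metrics above could be aggregated from per-case results roughly like this. The result keys (`raw_output`, `predicted_intent`, `expected_intent`, `inference_time`) are assumptions for the sketch, not the actual schema used by `results_manager.py`:

```python
import json
from statistics import mean, median

def summarize(results: list[dict]) -> dict:
    """Aggregate per-case results into headline metrics.

    Assumes each result dict carries 'raw_output', 'predicted_intent',
    'expected_intent', and 'inference_time' keys (hypothetical schema).
    """
    def json_valid(raw: str) -> bool:
        # A case counts as valid if the raw model output parses as JSON.
        try:
            json.loads(raw)
            return True
        except (json.JSONDecodeError, TypeError):
            return False

    times = [r["inference_time"] for r in results]
    return {
        "json_validity": mean(json_valid(r["raw_output"]) for r in results),
        "intent_accuracy": mean(
            r["predicted_intent"] == r["expected_intent"] for r in results
        ),
        "avg_time": mean(times),
        "median_time": median(times),
        "min_time": min(times),
        "max_time": max(times),
    }
```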
1. Load the model in Ollama: `ollama pull gemma3:1b`
2. Run the single-model test: `python test_single_model.py gemma3_1b`
3. Unload the model and load the next one: `ollama rm gemma3:1b`, then `ollama pull qwen3:1.7b`
4. Continue testing: `python test_single_model.py qwen3_1_7b`
5. Generate the comparative report: `python run_sequential_tests.py --generate-report-only`
```
from test_single_model import test_single_model
from run_sequential_tests import SequentialTestRunner
from test_cases import get_test_summary
from demo import demo_test_suite_overview

# Print package information
demo_test_suite_overview()

# Get test suite summary
summary = get_test_summary()
print(f"Total test cases: {summary['total_cases']}")

# Test single model
result_file = test_single_model("gemma3_1b", "multilingual")

# Run sequential tests
runner = SequentialTestRunner("production")
success = runner.run_complete_evaluation(["gemma3_1b", "qwen3_1_7b"])
```

`results/gemma3_1b_production_20250603_143000.json`
Contains:
- Test metadata (model, prompt, timestamp)
- Summary statistics
- Detailed results for each test case
- Performance metrics
`results/comparative_analysis_20250603_143000.json`
Contains:
- Model rankings by different criteria
- Performance comparison matrix
- Language-specific analysis
- Deployment recommendations
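A result file like the ones above can be inspected with a few lines of Python. The top-level keys used here (`metadata`, `summary`) are assumptions based on the structure listed above, not the exact schema:

```python
import json
from pathlib import Path

def load_result_summary(path: str) -> dict:
    """Load a result file and return its summary block.

    Assumes top-level 'metadata' and 'summary' keys (hypothetical
    schema mirroring the structure described above).
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    meta = data.get("metadata", {})
    print(f"Model: {meta.get('model')}  Prompt: {meta.get('prompt')}")
    return data.get("summary", {})
```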
```
MODEL_CONFIGS = {
    "gemma3_1b": ModelConfig(
        name="gemma3:1b",
        temperature=0.1,
        top_p=0.95,
        timeout=10,
        max_retries=3,
        description="gemma3 1b - Simple, lightweight and fast"
    )
}
```

```
TEST_CONFIG = {
    "max_retries": 2,
    "retry_delay": 1.0,
    "save_raw_outputs": True,
    "validate_checksums": True,
    "monitor_resources": True
}
```

- Test one model at a time to ensure fair resource allocation
- Use identical test cases across all models
- Monitor system resources during testing
- Validate model availability before testing
- Compare models using multiple metrics, not just accuracy
- Consider language-specific performance differences
- Evaluate speed vs. accuracy trade-offs
- Review edge case handling capabilities
- Use comparative analysis for model selection
- Consider specific language requirements
- Balance accuracy and inference speed
- Test with production-like workloads
Model Not Available
- Ensure Ollama is running
- Verify the model is loaded: `ollama list`
- Check the model name spelling
Import Errors
- Ensure you're in the project root directory
- Check Python path configuration
- Verify all dependencies are installed
Memory Issues
- Test one model at a time
- Unload models between tests
- Monitor system resources
JSON Validation Failures
- Check prompt formatting
- Verify the model output structure
- Review the validation criteria

Install the dependencies with:

```
pip install requests psutil ollama httpx
```

When adding new test cases:
- Follow the existing TestCase structure
- Include expected entities and intents
- Add appropriate difficulty and category labels
- Test across multiple languages when applicable
When adding new models:
- Update MODEL_CONFIGS in config.py
- Test with existing test suite
- Verify timeout and parameter settings
- Document model-specific considerations
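Registering a new model would look something like the following sketch, mirroring the `ModelConfig` fields shown in the configuration section above. The local dataclass and the new entry here are illustrative assumptions; the real `ModelConfig` lives in `config.py`:

```python
from dataclasses import dataclass

# Stand-in for the ModelConfig defined in config.py; same fields as the
# configuration example shown earlier in this README.
@dataclass
class ModelConfig:
    name: str
    temperature: float
    top_p: float
    timeout: int
    max_retries: int
    description: str

MODEL_CONFIGS: dict[str, ModelConfig] = {}

# Hypothetical new entry, for illustration only.
MODEL_CONFIGS["llama3_2_1b"] = ModelConfig(
    name="llama3.2:1b",
    temperature=0.1,
    top_p=0.95,
    timeout=15,       # larger or slower models may need a longer timeout
    max_retries=3,
    description="Llama 3.2 1B - example entry for illustration",
)
```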
This testing framework is licensed under the MIT license.
Copyright 2025 Jean-Marie Heck
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.