This project provides a comprehensive toolkit for evaluating the performance of language models, particularly in the context of question-answering tasks using Amazon Bedrock and other AI services.
The toolkit offers a set of utility functions that enable developers to fetch and process text data, create datasets, interact with Amazon Bedrock Knowledge Bases, and perform various evaluations on language model outputs. It leverages popular libraries such as LangChain, LlamaIndex, and Hugging Face Datasets to provide a robust evaluation framework.
- `utils.py`: Core utility functions for text processing, dataset creation, and model evaluation.
- `requirements.txt`: Python package dependencies required for the project.
- `CODE_OF_CONDUCT.md`: Guidelines for contributor behavior and community standards.
- `CONTRIBUTING.md`: Instructions for contributing to the project.
- `README.md`: This file, providing an overview and usage instructions for the project.
- Ensure you have Python 3.7 or later installed.
- Clone the repository to your local machine.
- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up an Amazon Bedrock Knowledge Base and note its ID.
- Update the `knowledge_base_id` variable in `utils.py` with your Knowledge Base ID.
- Configure your AWS credentials to allow access to Bedrock and other AWS services.
To split a document from a URL into chunks:

```python
from utils import split_document_from_url

chunks = split_document_from_url("https://example.com", chunk_size=1000, chunk_overlap=100)
```

To create an Amazon Bedrock Knowledge Base retriever:

```python
from utils import get_bedrock_retriever

retriever = get_bedrock_retriever(text_chunks, region_name="us-west-2")
```

To create a dataset for evaluation:

```python
from utils import build_dataset

dataset = build_dataset(eval_questions, ground_truth, predictions, text_content)
```

To evaluate the model using various metrics:

```python
from utils import evaluate_llama_index_metric
from llama_index.core.evaluation import FaithfulnessEvaluator

evaluator = FaithfulnessEvaluator()
results = evaluate_llama_index_metric(evaluator, dataset)
```
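To see how the `chunk_size` and `chunk_overlap` parameters interact, here is a minimal character-based splitter. This is an illustrative sketch only: `utils.py` delegates splitting to a library text splitter, and the `split_text` helper below is a hypothetical stand-in, not the toolkit's actual implementation.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    where consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new chunk starts chunk_size - chunk_overlap characters after the last.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("a" * 2500, chunk_size=1000, chunk_overlap=100)
print(len(chunks))                   # 3 chunks cover the 2500-character input
print(max(len(c) for c in chunks))   # no chunk exceeds 1000 characters
```

The overlap keeps sentences that straddle a chunk boundary visible to both chunks, which generally improves retrieval quality at the cost of some duplicated storage.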
- **AWS Credentials Not Found**
  - Problem: `botocore.exceptions.NoCredentialsError`
  - Solution: Ensure AWS credentials are properly configured in `~/.aws/credentials` or as environment variables.
- **Knowledge Base Not Found**
  - Problem: `botocore.exceptions.ClientError: An error occurred (ResourceNotFoundException) when calling the Retrieve operation`
  - Solution: Verify the `knowledge_base_id` in `utils.py` is correct and the Knowledge Base exists in your AWS account.
To enable verbose logging:

- Add the following at the beginning of your script:

  ```python
  import logging

  logging.basicConfig(level=logging.DEBUG)
  ```

- Look for log files in your current working directory or check the console output for detailed information.
- Monitor the time taken for document splitting and retrieval operations.
- For large documents, consider increasing the chunk size to reduce the number of API calls.
- Use batch processing when evaluating multiple queries to improve throughput.
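As a sketch of the batching tip above, the helper below groups queries so that retrieval and evaluation calls can be issued per batch rather than one query at a time. `batched` is a hypothetical helper, not part of `utils.py`; a real pipeline would pass each batch to its retrieval or evaluation call.

```python
from typing import Iterator

def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield successive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

queries = [f"question {n}" for n in range(10)]
batches = list(batched(queries, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```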
The toolkit processes data through the following steps:
- Document Retrieval: Fetch documents from web URLs.
- Text Chunking: Split documents into manageable chunks.
- Knowledge Base Integration: Store chunks in Amazon Bedrock Knowledge Base.
- Query Processing: Use the Knowledge Base to retrieve relevant information for queries.
- Answer Generation: Generate answers using the retrieved information.
- Evaluation: Assess the quality of generated answers using various metrics.
```text
[Web Document] -> [Text Chunker] -> [Bedrock Knowledge Base]
                                               |
                                               v
[User Query] -> [Retriever] -> [Answer Generator] -> [Evaluator]
                                                          |
                                                          v
                                                [Evaluation Results]
```
Note: The actual answer generation step is not explicitly included in the provided code but is assumed to be part of the workflow when using this evaluation toolkit.
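To make the intended workflow concrete, here is a minimal end-to-end sketch with stubbed components in place of Bedrock and LlamaIndex. Every function here is a hypothetical stand-in for the corresponding stage, including the answer-generation step that, as noted above, is not part of the provided code.

```python
def retrieve(query: str, knowledge_base: dict[str, str]) -> str:
    """Stub retriever: return the stored chunk whose topic appears in the query."""
    for topic, chunk in knowledge_base.items():
        if topic in query:
            return chunk
    return ""

def generate_answer(query: str, context: str) -> str:
    """Stub generator: a real pipeline would call an LLM with the query and context."""
    return context or "I don't know."

def evaluate(answer: str, ground_truth: str) -> float:
    """Stub metric: 1.0 on exact match, else 0.0."""
    return 1.0 if answer == ground_truth else 0.0

knowledge_base = {"capital": "Paris is the capital of France."}
query = "What is the capital of France?"
context = retrieve(query, knowledge_base)
answer = generate_answer(query, context)
score = evaluate(answer, "Paris is the capital of France.")
print(score)  # 1.0
```

In the actual toolkit, the retriever is backed by a Bedrock Knowledge Base and the evaluator by LlamaIndex metrics such as `FaithfulnessEvaluator`; only the shape of the data flow is the same.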