Skip to content

eea/observia-chatbot-dataset

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

66 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Golden Dataset generation process

Step 0: install the package

poetry install --no-root
poetry shell

Step 1: download the Vespa database

This works by connecting to the Postgresql database, getting a list of all document ids, then downloading all chunks for that document id.

The database connection information is hardcoded in the script, it may need adjusting.

python step1-download-vespa-database.py

Step 2: generate topic-based questions

python step2.0-topic-generation.py data-download/GS_CEMS/ datasets/GS_CEMS-topics.json 200

Step 2.1: filter only English questions (optional)

python step2.1-filter-for-english.py datasets/GS_CEMS-topics.json datasets/GS_CEMS-topics-en.json

Step 2.2: extract the questions to a new text file

python step2.1-extract-primary-questions.py datasets/GS_CEMS-topics-en.json datasets/GS_CEMS-questions.txt

Step 3: generate GoldenSet dataset

python step3.0-generate-danswer-dataset.py datasets/GS_CEMS-questions.txt datasets/GS_CEMS-goldenset.json

Step 3.5: load multiple goldenset-style datasets in local Trulens

python step3.5-dataset-to-virtual-trulens.py dataset.json datasets/GS_CEMS-goldenset.json

Step 4: convert GoldenSet to Excel

python step4.0-dataset2xls.py datasets/GS_CEMS-goldenset.json datasets/GS_CEMS-goldenset.xls

About

An evaluation dataset for RAG projects for Observia

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published