Golden Dataset generation process

Step 0: install the package

poetry install --no-root
poetry shell

Step 1: download the Vespa database

This step connects to the PostgreSQL database, fetches the list of all document IDs, then downloads all chunks for each document ID.

The database connection information is hardcoded in the script and may need adjusting.
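One way to avoid editing hardcoded values is to read the connection settings from the environment. The sketch below is not taken from the script: `DbSettings`, `load_db_settings` and the default values are hypothetical, and only the variable names (`PGHOST`, `PGPORT`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`) follow the standard PostgreSQL client conventions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DbSettings:
    """Hypothetical container for the connection parameters."""
    host: str
    port: int
    dbname: str
    user: str
    password: str


def load_db_settings(env=None) -> DbSettings:
    """Build connection settings from environment variables.

    The defaults are assumptions for a local setup, not the values
    used by the step 1 script.
    """
    env = os.environ if env is None else env
    return DbSettings(
        host=env.get("PGHOST", "localhost"),
        port=int(env.get("PGPORT", "5432")),
        dbname=env.get("PGDATABASE", "postgres"),
        user=env.get("PGUSER", "postgres"),
        password=env.get("PGPASSWORD", ""),
    )
```

With something like this in place, the settings could be overridden per run, for example by exporting `PGHOST` before starting the script.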

python step1.0-download-vespa-database.py
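The download flow described in this step (list the document IDs, then fetch the chunks for each one) can be sketched roughly as follows. This is an illustration under assumptions, not the script's code: `download_all`, the two callables and the one-JSON-file-per-document layout are hypothetical, and the real database and Vespa queries are left to the caller.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def download_all(
    list_document_ids: Callable[[], Iterable[str]],
    fetch_chunks: Callable[[str], list],
    out_dir: str,
) -> int:
    """Write one JSON file of chunks per document ID.

    `list_document_ids` stands in for the database query and
    `fetch_chunks` for the per-document chunk download; both are
    hypothetical hooks. Returns the number of files written.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = 0
    for doc_id in list_document_ids():
        chunks = fetch_chunks(doc_id)
        # Replace path separators so the ID is usable as a file name.
        safe_name = doc_id.replace("/", "_")
        with open(target / f"{safe_name}.json", "w", encoding="utf-8") as fh:
            json.dump({"document_id": doc_id, "chunks": chunks}, fh, ensure_ascii=False)
        written += 1
    return written
```

Passing the two lookups in as functions keeps the loop testable without a live database.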

Step 2: generate topic-based questions

python step2.0-topic-generation.py data-download/GS_CEMS/ datasets/GS_CEMS-topics.json 200

Step 2.1: keep only English questions (optional)

python step2.1-filter-for-english.py datasets/GS_CEMS-topics.json datasets/GS_CEMS-topics-en.json

Step 2.2: extract the questions to a new text file

python step2.2-extract-primary-questions.py datasets/GS_CEMS-topics-en.json datasets/GS_CEMS-questions.txt
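As a rough picture of what this extraction involves, the sketch below reads a topics file and writes one question per line. The file schema is an assumption: it treats the JSON as a list of objects with a `question` field, which may not match the real output of step 2. The function names are hypothetical.

```python
import json


def extract_questions(topics: list, key: str = "question") -> list:
    """Return the non-empty question strings, in order, without duplicates.

    `key` is an assumed field name; adjust it to the actual topics schema.
    """
    seen = set()
    questions = []
    for record in topics:
        text = str(record.get(key) or "").strip()
        if text and text not in seen:
            seen.add(text)
            questions.append(text)
    return questions


def write_questions(json_path: str, txt_path: str, key: str = "question") -> int:
    """Read the topics JSON and write one question per line; return the count."""
    with open(json_path, encoding="utf-8") as fh:
        topics = json.load(fh)
    questions = extract_questions(topics, key)
    with open(txt_path, "w", encoding="utf-8") as fh:
        for question in questions:
            fh.write(question + "\n")
    return len(questions)
```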

Step 3: generate GoldenSet dataset

python step3.0-generate-danswer-dataset.py datasets/GS_CEMS-questions.txt datasets/GS_CEMS-goldenset.json

Step 3.5: load multiple goldenset-style datasets into local TruLens

python step3.5-datasets-to-virtual-trulens.py dataset.json datasets/GS_CEMS-goldenset.json

Step 4: convert GoldenSet to Excel

python step4.0-dataset2xls.py datasets/GS_CEMS-goldenset.json datasets/GS_CEMS-goldenset.xls
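Steps 2 to 4 pass each output file to the next command, so they can be chained from one place. The helper below is a hypothetical sketch of that chaining: `build_pipeline` and `run_pipeline` are not part of the repository, the script names and paths are assumptions to check against your checkout, and the meaning of the trailing `200` argument is not stated here, so it is passed through unchanged. Step 3.5 is left out because it takes a different set of inputs.

```python
import subprocess
import sys


def build_pipeline(name: str, count: int = 200) -> list:
    """Return the argument lists for steps 2.0, 2.1, 2.2, 3.0 and 4.0.

    `name` is the dataset name (for example "GS_CEMS"). `count` is the
    last argument of step 2.0; its meaning is not documented here.
    """
    base = f"datasets/{name}"
    return [
        ["step2.0-topic-generation.py", f"data-download/{name}/", f"{base}-topics.json", str(count)],
        ["step2.1-filter-for-english.py", f"{base}-topics.json", f"{base}-topics-en.json"],
        ["step2.2-extract-primary-questions.py", f"{base}-topics-en.json", f"{base}-questions.txt"],
        ["step3.0-generate-danswer-dataset.py", f"{base}-questions.txt", f"{base}-goldenset.json"],
        ["step4.0-dataset2xls.py", f"{base}-goldenset.json", f"{base}-goldenset.xls"],
    ]


def run_pipeline(name: str, python: str = sys.executable) -> None:
    """Run each step in order and stop at the first failure."""
    for args in build_pipeline(name):
        subprocess.run([python, *args], check=True)
```

Calling `run_pipeline("GS_CEMS")` from the repository root would then run the same commands as above, assuming step 1 has already populated `data-download/GS_CEMS/`.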
