The Semantic Catalogue is a project designed to enhance the search capabilities of data catalogues for research purposes. By moving beyond traditional keyword-based searches, it provides users with more accurate and relevant results through semantic understanding.
- Semantic Search: Leverages OpenAI embeddings in Pinecone for context-aware dataset discovery
- Retrieval Augmented Generation (RAG): Generates explanations with inline citations using GPT-4o mini
- Content Quality Assurance: Built-in moderation and hallucination detection
- Automated Data Pipeline: Dagster-managed continuous updates to the vector database
The system uses Retrieval Augmented Generation (RAG) with two core components:
- Retrieval: Finds relevant datasets using semantic similarity
- Generation: Explains dataset relevance using retrieved information
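As an illustration, the two stages can be sketched with toy stand-ins. The vectors, dataset names, and helper functions below are hypothetical, not the project's actual code; in the real system, embeddings come from OpenAI, retrieval runs against Pinecone, and generation is performed by an LLM:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec: list[float], index: dict[str, list[float]], k: int = 1) -> list[str]:
    """Retrieval stage: rank datasets by semantic similarity to the query embedding."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
    return ranked[:k]

def explain(query: str, datasets: list[str]) -> str:
    """Generation stage: in the real system an LLM writes a cited explanation."""
    return f"Datasets relevant to '{query}': {', '.join(datasets)}"

# Toy two-dimensional "embeddings" for two datasets.
index = {"population-census": [0.9, 0.1], "bus-timetables": [0.1, 0.9]}
hits = retrieve([0.8, 0.2], index)
print(explain("census data", hits))  # -> Datasets relevant to 'census data': population-census
```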
Key components:
- Backend: FastAPI & LangGraph for structured processing and API integration
- Data Pipeline: Dagster for automated data lifecycle management
- Vector Database: Pinecone for efficient semantic search operations
- AI Models: OpenAI for semantic understanding and generation
Dagster automates the entire data pipeline - from ingestion to indexing - ensuring the Pinecone database stays current with new datasets.
- Semantic Search: Queries are converted to embeddings and matched against Pinecone's vector database
- RAG Pipeline: Retrieved datasets are used to generate context-aware explanations
- Quality Control: Moderation and hallucination detection ensure reliable outputs
FastAPI is used to create a RESTful API, allowing the RAG graphs to interact with external services. The following endpoints are provided:
- POST `/query`: Accepts a search query and returns relevant documents alongside a unique thread ID associated with the query, using the LangGraph `search_graph`.
- GET `/explain/{thread_id}`: Accepts a `thread_id` and `docid` to provide an explanation using the LangGraph `generation_graph`.
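Assuming the service is running on its default local port, the endpoints could be exercised like this. The host, port, request field names, and passing `docid` as a query parameter are assumptions about the deployment, so this sketch only builds the requests rather than sending them:

```python
import json
from urllib.request import Request

BASE = "http://localhost:8001"  # assumed host/port for the search API

def build_query_request(text: str) -> Request:
    """POST /query with a JSON body; the 'query' field name is an assumption."""
    payload = json.dumps({"query": text}).encode()
    return Request(
        f"{BASE}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def build_explain_request(thread_id: str, docid: str) -> Request:
    """GET /explain/{thread_id}; passing docid in the query string is an assumption."""
    return Request(f"{BASE}/explain/{thread_id}?docid={docid}", method="GET")

req = build_query_request("population change in coastal towns")
print(req.method, req.full_url)  # -> POST http://localhost:8001/query
# To actually send the request: urllib.request.urlopen(req)
```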
The entire project is containerised using Docker or Podman Compose (tested with Podman). The `compose.yml` file gives more detail.
Ensure you have the following installed:
- Python >=3.12,<3.13
- Docker or Podman
- Git
To contribute, please follow these steps:

1. Clone the repository:

   ```bash
   git clone https://github.com/cjber/semantic-catalogue.git
   ```

2. Navigate to the project directory:

   ```bash
   cd semantic-catalogue
   ```

3. Install dependencies:

   This project uses `uv`:

   ```bash
   uv sync
   ```

   Alternatively, `pip` can be used to install from the `pyproject.toml`:

   ```bash
   pip install .  # ensure you are using a venv
   ```

4. Configure the system:

   Edit the `config/config.toml` file to customise model settings.
> [!WARNING]
> This project is not intended for public use and requires access to a private database.
To run the project successfully, it expects a `.env` file with the following environment variables defined:

```bash
#!/bin/bash
export CDRC_USERNAME=""
export CDRC_PASSWORD=""
export CDRC_FORM_BUILD_ID=""
export PINECONE_API_KEY=""
export OPENAI_API_KEY=""
export LLAMA_CLOUD_API_KEY=""

# optional
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
export LANGCHAIN_API_KEY=""
export LANGCHAIN_PROJECT=""
```
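A quick startup check along these lines can confirm the required variables are present once the `.env` file has been sourced (a generic sketch, not part of the project):

```python
import os

# The first six variables are required; the LANGCHAIN_* ones are optional.
REQUIRED = [
    "CDRC_USERNAME", "CDRC_PASSWORD", "CDRC_FORM_BUILD_ID",
    "PINECONE_API_KEY", "OPENAI_API_KEY", "LLAMA_CLOUD_API_KEY",
]

def missing_vars(env=os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```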
The project is fully containerised using `podman`/`docker` `compose`. To run the full system, execute:

```bash
podman compose up -d
```
Once up and running, access the semantic catalogue search at `http://localhost:8001`.

The Dagster UI is available at `http://localhost:3000`. Adjust the Auto Materialise and Sensor settings to start the automation.
All scripts can be run independently for debugging purposes. For example, to run the full LangGraph pipeline:

```bash
python -m semantic_catalogue.model.main
```

This submits a test query and prints the outputs. Chains also provide outputs for testing:

```bash
python -m semantic_catalogue.model.chains.hallucination
```

This prints the output of a test generation that contains hallucinations.

You can also start the API server independently:

```bash
fastapi dev semantic_catalogue/search_api/api.py
```