The Semantic Catalogue is a project designed to enhance the search capabilities of data catalogues for research purposes. By moving beyond traditional keyword-based searches, it provides users with more accurate and relevant results through semantic understanding.
- Semantic Search: Leverages OpenAI embeddings in Pinecone for context-aware dataset discovery
- Retrieval Augmented Generation (RAG): Generates explanations with inline citations using GPT-4o-mini
- Content Quality Assurance: Built-in moderation and hallucination detection
- Automated Data Pipeline: Dagster-managed continuous updates to the vector database
The system uses Retrieval Augmented Generation (RAG) with two core components:
- Retrieval: Finds relevant datasets using semantic similarity
- Generation: Explains dataset relevance using retrieved information
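As a rough illustration of the retrieval half, a query can be embedded and matched against the vector database. This is a minimal sketch rather than the project's own code; the embedding model and index names are assumptions:

```python
# Minimal retrieval sketch: embed the query with OpenAI, then match it
# against a Pinecone index. Model and index names are assumptions.
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone()    # reads PINECONE_API_KEY from the environment


def retrieve(query: str, top_k: int = 5) -> list:
    """Return the datasets most semantically similar to a query."""
    embedding = (
        client.embeddings.create(
            model="text-embedding-3-small",  # assumed embedding model
            input=query,
        )
        .data[0]
        .embedding
    )
    index = pc.Index("semantic-catalogue")  # assumed index name
    return index.query(vector=embedding, top_k=top_k, include_metadata=True).matches
```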
Key components:
- Backend: FastAPI & LangGraph for structured processing and API integration
- Data Pipeline: Dagster for automated data lifecycle management
- Vector Database: Pinecone for efficient semantic search operations
- AI Models: OpenAI for semantic understanding and generation
Dagster automates the entire data pipeline - from ingestion to indexing - ensuring the Pinecone database stays current with new datasets.
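The real assets live in `semantic_catalogue/datastore`; as a hedged sketch of the pattern only (asset names and bodies here are illustrative assumptions, not the project's code):

```python
# Sketch of the Dagster pattern: an ingestion asset feeding an indexing
# asset. Names and bodies are illustrative assumptions.
from dagster import Definitions, asset


@asset
def catalogue_documents() -> list[str]:
    """Ingest raw catalogue records (assumed source)."""
    return ["dataset description 1", "dataset description 2"]


@asset
def vector_index(catalogue_documents: list[str]) -> None:
    """Embed each document and upsert it into Pinecone (omitted here)."""
    ...


defs = Definitions(assets=[catalogue_documents, vector_index])
```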
- Semantic Search: Queries are converted to embeddings and matched against Pinecone's vector database
- RAG Pipeline: Retrieved datasets are used to generate context-aware explanations
- Quality Control: Moderation and hallucination detection ensure reliable outputs
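For the moderation half of quality control, OpenAI's moderation endpoint can flag unsafe text before it reaches users. A minimal sketch, not the project's exact code (the hallucination check is a separate chain, illustrated later):

```python
# Sketch of a moderation check using OpenAI's moderation endpoint.
from openai import OpenAI

client = OpenAI()


def is_flagged(text: str) -> bool:
    """Return True if the moderation endpoint flags the text."""
    return client.moderations.create(input=text).results[0].flagged
```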
FastAPI is used to create a RESTful API that allows the RAG graphs to interact with external services. The following endpoints are provided:
- POST Query (`/query`): Accepts a search query and returns relevant documents, alongside a unique thread ID associated with the query, using the LangGraph `search_graph`.
- GET Explain (`/explain/{thread_id}`): Accepts a `thread_id` and `docid` to provide an explanation using the LangGraph `generation_graph`.
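As a usage sketch, assuming the API is running locally on port 8000 (the request and response field names below are assumptions; check http://localhost:8000/docs for the actual schema):

```python
# Hypothetical client for the two endpoints; field names are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/query", json={"query": "population health"}
)
resp.raise_for_status()
result = resp.json()
thread_id = result["thread_id"]        # assumed field name
docid = result["documents"][0]["id"]   # assumed field name

explanation = requests.get(
    f"http://localhost:8000/explain/{thread_id}", params={"docid": docid}
)
print(explanation.json())
```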
The entire project is containerised using Docker or Podman Compose (tested with Podman). The `compose.yml` file gives more detail.
```
semantic_catalogue/      # main project directory
├── common               # shared utility functions and settings
├── datastore            # dagster configuration for data processing
│   ├── assets           # dagster assets (e.g. datafiles)
│   ├── jobs.py          # dagster jobs to combine and automate assets
│   ├── loaders.py       # langchain loaders to create document objects
│   ├── resources.py     # dagster-openai configuration
│   └── schedules.py     # automate creation of assets
├── model                # langgraph functions
│   ├── chains           # chains that use prompts and llms (agents)
│   ├── graph.py         # main graph definitions
│   ├── llms             # openai llm and others if needed
│   ├── logging.py       # basic logging definition
│   ├── main.py          # wraps graphs for external use
│   ├── nodes            # uses chains to define graph nodes
│   ├── retrievers       # contains retrieval code using pinecone
│   └── states.py        # defines graph states
└── search_api           # fastapi code

13 directories, 32 files
```
Ensure you have the following installed:
- Python >=3.12,<3.13
- Docker or Podman
- Git
To develop this project, please follow these steps:

- Clone the repository:

  ```sh
  git clone https://github.com/cjber/semantic-catalogue.git
  ```

- Navigate to the project directory:

  ```sh
  cd semantic-catalogue
  ```

- Install dependencies. This project uses `uv`:

  ```sh
  uv sync
  ```

  Alternatively, `pip` can also be used to install from the `pyproject.toml`:

  ```sh
  pip install .  # ensure you are using a venv
  ```

- Configure the system: edit the `config/config.toml` file to customise model settings.
> **Warning**
> This project is not intended for public use and requires access to a private database.

To run the project successfully, it expects a `.env` file with the following environment variables defined:
```sh
#!/bin/bash
export CDRC_USERNAME=""
export CDRC_PASSWORD=""
export CDRC_FORM_BUILD_ID=""
export PINECONE_API_KEY=""
export OPENAI_API_KEY=""
# optional
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
export LANGCHAIN_API_KEY=""
export LANGCHAIN_PROJECT=""
```
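Compose reads this file directly; if you run the standalone scripts described later outside the containers, something like `python-dotenv` (an assumption, not necessarily a project dependency) can load the same variables:

```python
# Optional: load the .env file into os.environ for standalone runs.
# python-dotenv is an assumption, not necessarily a project dependency.
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory
```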
The `CDRC` environment variables allow you to log in as a normal user and download the documents. The first two are self-explanatory, but the third is more difficult to find. To obtain the `CDRC_FORM_BUILD_ID`:

- Go to https://data.cdrc.ac.uk/user/login
- Right click anywhere and choose `Inspect` (on Firefox it says Inspect (Q))
- Click `Network` in the inspect window, then type in your username and password and log in
- At the top of the `Network` window there should be a 'File' called `login`; select this, then click the 'Request' tab
- This should show you the `form_build_id` that corresponds with your user, which can be set as an environment variable
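If you prefer to script this, Drupal login pages typically embed the token as a hidden `form_build_id` input, so something like the following may work. This is an assumption about the page structure, not a documented interface, and the value it finds may differ from the one tied to your logged-in session, so verify it against the manual procedure above:

```python
# Hedged sketch: scrape the hidden form_build_id input from the login page.
# Assumes Drupal's usual markup; verify against the manual procedure above.
import re
import requests

page = requests.get("https://data.cdrc.ac.uk/user/login")
page.raise_for_status()
match = re.search(r'name="form_build_id"[^>]*value="([^"]+)"', page.text)
if match:
    print(match.group(1))  # export this as CDRC_FORM_BUILD_ID
```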
The project is fully containerised using `podman`/`docker` `compose`. To run the full system, execute:

```sh
podman compose up -d
```

Once up and running, access the semantic catalogue search API at http://localhost:8000.
Test endpoints at http://localhost:8000/docs.

The Dagster UI is available at http://localhost:3000. Adjust the Auto Materialise and Sensor settings to start the automation.
All scripts run independently, for debugging purposes. For example, to run the full LangGraph pipeline:

```sh
python -m semantic_catalogue.model.main
```

This submits a test query and prints the outputs. Chains also provide outputs for testing:

```sh
python -m semantic_catalogue.model.chains.hallucination
```

This prints the output of a test generation that contains hallucinations.
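The actual chain lives in `semantic_catalogue/model/chains`; purely as an illustration of the pattern such a check can follow (the model name and prompt below are assumptions, not the project's code):

```python
# Illustrative hallucination grader: ask an LLM whether a generation is
# supported by its source documents, via LangChain structured output.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel


class HallucinationGrade(BaseModel):
    grounded: bool  # True if the answer is fully supported by the documents


prompt = ChatPromptTemplate.from_template(
    "Documents:\n{documents}\n\nAnswer:\n{generation}\n\n"
    "Is the answer fully supported by the documents?"
)
llm = ChatOpenAI(model="gpt-4o-mini").with_structured_output(HallucinationGrade)
chain = prompt | llm

grade = chain.invoke({"documents": "...", "generation": "..."})
print(grade.grounded)
```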
You can also start the API server independently:

```sh
fastapi dev semantic_catalogue/search_api/api.py
```