The Semantic Catalogue is a project designed to enhance the search capabilities of data catalogues for research purposes. By moving beyond traditional keyword-based searches, it provides users with more accurate and relevant results through semantic understanding.
- Semantic Search: Leverages OpenAI embeddings in Pinecone for context-aware dataset discovery
- Retrieval Augmented Generation (RAG): Generates explanations with inline citations using GPT-4o-mini
- Content Quality Assurance: Built-in moderation and hallucination detection
- Automated Data Pipeline: Dagster-managed continuous updates to the vector database
The system uses Retrieval Augmented Generation (RAG) with two core components:
- Retrieval: Finds relevant datasets using semantic similarity
- Generation: Explains dataset relevance using retrieved information
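As a rough illustration of the retrieval half, a query can be embedded and matched against the vector database. This is a minimal sketch rather than the project's own code; the embedding model and index names are assumptions:

```python
# Minimal retrieval sketch: embed the query with OpenAI, then match it
# against a Pinecone index. Model and index names are assumptions.
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone()    # reads PINECONE_API_KEY from the environment


def retrieve(query: str, top_k: int = 5) -> list:
    """Return the datasets most semantically similar to a query."""
    embedding = (
        client.embeddings.create(
            model="text-embedding-3-small",  # assumed embedding model
            input=query,
        )
        .data[0]
        .embedding
    )
    index = pc.Index("semantic-catalogue")  # assumed index name
    return index.query(vector=embedding, top_k=top_k, include_metadata=True).matches
```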
Key components:
- Backend: FastAPI & LangGraph for structured processing and API integration
- Data Pipeline: Dagster for automated data lifecycle management
- Vector Database: Pinecone for efficient semantic search operations
- AI Models: OpenAI for semantic understanding and generation
Dagster automates the entire data pipeline - from ingestion to indexing - ensuring the Pinecone database stays current with new datasets.
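The real assets live in `semantic_catalogue/datastore`; as a hedged sketch of the pattern only (asset names and bodies here are illustrative assumptions, not the project's code):

```python
# Sketch of the Dagster pattern: an ingestion asset feeding an indexing
# asset. Names and bodies are illustrative assumptions.
from dagster import Definitions, asset


@asset
def catalogue_documents() -> list[str]:
    """Ingest raw catalogue records (assumed source)."""
    return ["dataset description 1", "dataset description 2"]


@asset
def vector_index(catalogue_documents: list[str]) -> None:
    """Embed each document and upsert it into Pinecone (omitted here)."""
    ...


defs = Definitions(assets=[catalogue_documents, vector_index])
```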
- Semantic Search: Queries are converted to embeddings and matched against Pinecone's vector database
- RAG Pipeline: Retrieved datasets are used to generate context-aware explanations
- Quality Control: Moderation and hallucination detection ensure reliable outputs
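For the moderation half of quality control, OpenAI's moderation endpoint can flag unsafe text before it reaches users. A minimal sketch, not the project's exact code (the hallucination check is a separate chain, illustrated later):

```python
# Sketch of a moderation check using OpenAI's moderation endpoint.
from openai import OpenAI

client = OpenAI()


def is_flagged(text: str) -> bool:
    """Return True if the moderation endpoint flags the text."""
    return client.moderations.create(input=text).results[0].flagged
```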
FastAPI is used to create a RESTful API that allows the RAG graphs to interact with external services. The following endpoints are provided:
- POST Query (`/query`): Accepts a search query and returns relevant documents, alongside a unique thread ID associated with the query, using the LangGraph `search_graph`.
- GET Explain (`/explain/{thread_id}`): Accepts a `thread_id` and `docid` to provide an explanation using the LangGraph `generation_graph`.
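As a usage sketch, assuming the API is running locally on port 8000 (the request and response field names below are assumptions; check http://localhost:8000/docs for the actual schema):

```python
# Hypothetical client for the two endpoints; field names are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/query", json={"query": "population health"}
)
resp.raise_for_status()
result = resp.json()
thread_id = result["thread_id"]        # assumed field name
docid = result["documents"][0]["id"]   # assumed field name

explanation = requests.get(
    f"http://localhost:8000/explain/{thread_id}", params={"docid": docid}
)
print(explanation.json())
```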
The entire project is containerised using Docker or Podman Compose (tested with Podman). The `compose.yml` file gives more detail.
```
semantic_catalogue/      # main project directory
├── common               # shared utility functions and settings
├── datastore            # dagster configuration for data processing
│   ├── assets           # dagster assets (e.g. datafiles)
│   ├── jobs.py          # dagster jobs to combine and automate assets
│   ├── loaders.py       # langchain loaders to create document objects
│   ├── resources.py     # dagster-openai configuration
│   └── schedules.py     # automate creation of assets
├── model                # langgraph functions
│   ├── chains           # chains that use prompts and llms (agents)
│   ├── graph.py         # main graph definitions
│   ├── llms             # openai llm and others if needed
│   ├── logging.py       # basic logging definition
│   ├── main.py          # wraps graphs for external use
│   ├── nodes            # uses chains to define graph nodes
│   ├── retrievers       # contains retrieval code using pinecone
│   └── states.py        # defines graph states
└── search_api           # fastapi code

13 directories, 32 files
```
Ensure you have the following installed:
- Python >=3.12,<3.13
- Docker or Podman
- Git
To develop this project, please follow these steps:

- Clone the repository:

  ```sh
  git clone https://github.com/cjber/semantic-catalogue.git
  ```

- Navigate to the project directory:

  ```sh
  cd semantic-catalogue
  ```

- Install dependencies. This project uses `uv`:

  ```sh
  uv sync
  ```

  Alternatively, `pip` can also be used to install from the `pyproject.toml`:

  ```sh
  pip install .  # ensure you are using a venv
  ```

- Configure the system: edit the `config/config.toml` file to customise model settings.
> **Warning**
> This project is not intended for public use and requires access to a private database.

To run the project successfully, it expects a `.env` file with the following environment variables defined:
```sh
#!/bin/bash
export CDRC_USERNAME=""
export CDRC_PASSWORD=""
export CDRC_FORM_BUILD_ID=""
export PINECONE_API_KEY=""
export OPENAI_API_KEY=""
# optional
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
export LANGCHAIN_API_KEY=""
export LANGCHAIN_PROJECT=""
```
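Compose reads this file directly; if you run the standalone scripts described later outside the containers, something like `python-dotenv` (an assumption, not necessarily a project dependency) can load the same variables:

```python
# Optional: load the .env file into os.environ for standalone runs.
# python-dotenv is an assumption, not necessarily a project dependency.
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory
```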
The `CDRC` environment variables allow you to log in as a normal user and download the documents. The first two are self-explanatory, but the third is more difficult to find. To obtain the `CDRC_FORM_BUILD_ID`:

- Go to https://data.cdrc.ac.uk/user/login
- Right click anywhere and choose `Inspect` (on Firefox it says Inspect (Q))
- Click `Network` in the inspect window, then type in your username and password and log in
- At the top of the `Network` window there should be a 'File' called `login`; select this, then click the 'Request' tab
- This should show you the `form_build_id` that corresponds with your user, which can be set as an environment variable
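If you prefer to script this, Drupal login pages typically embed the token as a hidden `form_build_id` input, so something like the following may work. This is an assumption about the page structure, not a documented interface, and the value it finds may differ from the one tied to your logged-in session, so verify it against the manual procedure above:

```python
# Hedged sketch: scrape the hidden form_build_id input from the login page.
# Assumes Drupal's usual markup; verify against the manual procedure above.
import re
import requests

page = requests.get("https://data.cdrc.ac.uk/user/login")
page.raise_for_status()
match = re.search(r'name="form_build_id"[^>]*value="([^"]+)"', page.text)
if match:
    print(match.group(1))  # export this as CDRC_FORM_BUILD_ID
```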
The project is fully containerised using `podman`/`docker` `compose`. To run the full system, execute:

```sh
podman compose up -d
```

Once up and running, access the semantic catalogue search API at http://localhost:8000.
Test endpoints at http://localhost:8000/docs.

The Dagster UI is available at http://localhost:3000. Adjust the Auto Materialise and Sensor settings to start the automation.
All scripts run independently, for debugging purposes. For example, to run the full LangGraph pipeline:

```sh
python -m semantic_catalogue.model.main
```

This submits a test query and prints the outputs. Chains also provide outputs for testing:

```sh
python -m semantic_catalogue.model.chains.hallucination
```

This prints the output of a test generation that contains hallucinations.
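The actual chain lives in `semantic_catalogue/model/chains`; purely as an illustration of the pattern such a check can follow (the model name and prompt below are assumptions, not the project's code):

```python
# Illustrative hallucination grader: ask an LLM whether a generation is
# supported by its source documents, via LangChain structured output.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel


class HallucinationGrade(BaseModel):
    grounded: bool  # True if the answer is fully supported by the documents


prompt = ChatPromptTemplate.from_template(
    "Documents:\n{documents}\n\nAnswer:\n{generation}\n\n"
    "Is the answer fully supported by the documents?"
)
llm = ChatOpenAI(model="gpt-4o-mini").with_structured_output(HallucinationGrade)
chain = prompt | llm

grade = chain.invoke({"documents": "...", "generation": "..."})
print(grade.grounded)
```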
You can also start the API server independently:

```sh
fastapi dev semantic_catalogue/search_api/api.py
```