The Semantic Catalogue is a project designed to enhance the search capabilities of data catalogues for research purposes. By moving beyond traditional keyword-based searches, it provides users with more accurate and relevant results through semantic understanding.
- Semantic Search: Leverages OpenAI embeddings in Pinecone for context-aware dataset discovery
- Retrieval Augmented Generation (RAG): Generates explanations with inline citations using GPT-4o mini
- Content Quality Assurance: Built-in moderation and hallucination detection
- Automated Data Pipeline: Dagster-managed continuous updates to the vector database
The system uses Retrieval Augmented Generation (RAG) with two core components:
- Retrieval: Finds relevant datasets using semantic similarity
- Generation: Explains dataset relevance using retrieved information
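As an illustration, the two stages can be sketched with toy stand-ins. The vectors, dataset names, and helper functions below are hypothetical, not the project's actual code; in the real system, embeddings come from OpenAI, retrieval runs against Pinecone, and generation is performed by an LLM:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec: list[float], index: dict[str, list[float]], k: int = 1) -> list[str]:
    """Retrieval stage: rank datasets by semantic similarity to the query embedding."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
    return ranked[:k]

def explain(query: str, datasets: list[str]) -> str:
    """Generation stage: in the real system an LLM writes a cited explanation."""
    return f"Datasets relevant to '{query}': {', '.join(datasets)}"

# Toy two-dimensional "embeddings" for two datasets.
index = {"population-census": [0.9, 0.1], "bus-timetables": [0.1, 0.9]}
hits = retrieve([0.8, 0.2], index)
print(explain("census data", hits))  # -> Datasets relevant to 'census data': population-census
```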
Key components:
- Backend: FastAPI & LangGraph for structured processing and API integration
- Data Pipeline: Dagster for automated data lifecycle management
- Vector Database: Pinecone for efficient semantic search operations
- AI Models: OpenAI for semantic understanding and generation
Dagster automates the entire data pipeline - from ingestion to indexing - ensuring the Pinecone database stays current with new datasets.
- Semantic Search: Queries are converted to embeddings and matched against Pinecone's vector database
- RAG Pipeline: Retrieved datasets are used to generate context-aware explanations
- Quality Control: Moderation and hallucination detection ensure reliable outputs
FastAPI is used to create a RESTful API, allowing the RAG graphs to interact with external services. The following endpoints are provided:
- POST `/query`: Accepts a search query and returns relevant documents alongside a unique thread ID associated with the query, using the LangGraph `search_graph`.
- GET `/explain/{thread_id}`: Accepts a `thread_id` and `docid` to provide an explanation using the LangGraph `generation_graph`.
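Assuming the service is running on its default local port, the endpoints could be exercised like this. The host, port, request field names, and passing `docid` as a query parameter are assumptions about the deployment, so this sketch only builds the requests rather than sending them:

```python
import json
from urllib.request import Request

BASE = "http://localhost:8001"  # assumed host/port for the search API

def build_query_request(text: str) -> Request:
    """POST /query with a JSON body; the 'query' field name is an assumption."""
    payload = json.dumps({"query": text}).encode()
    return Request(
        f"{BASE}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def build_explain_request(thread_id: str, docid: str) -> Request:
    """GET /explain/{thread_id}; passing docid in the query string is an assumption."""
    return Request(f"{BASE}/explain/{thread_id}?docid={docid}", method="GET")

req = build_query_request("population change in coastal towns")
print(req.method, req.full_url)  # -> POST http://localhost:8001/query
# To actually send the request: urllib.request.urlopen(req)
```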
The entire project is containerised using Docker or Podman Compose (tested with Podman). The `compose.yml` file gives more detail.
Ensure you have the following installed:
- Python >=3.12,<3.13
- Docker or Podman
- Git
To contribute, please follow these steps:

1. Clone the repository:

   ```bash
   git clone https://github.com/cjber/semantic-catalogue.git
   ```

2. Navigate to the project directory:

   ```bash
   cd semantic-catalogue
   ```

3. Install dependencies:

   This project uses `uv`:

   ```bash
   uv sync
   ```

   Alternatively, `pip` can be used to install from the `pyproject.toml`:

   ```bash
   pip install .  # ensure you are using a venv
   ```

4. Configure the system:

   Edit the `config/config.toml` file to customise model settings.
> [!WARNING]
> This project is not intended for public use and requires access to a private database.
To run the project successfully, it expects a `.env` file with the following environment variables defined:

```bash
#!/bin/bash
export CDRC_USERNAME=""
export CDRC_PASSWORD=""
export CDRC_FORM_BUILD_ID=""
export PINECONE_API_KEY=""
export OPENAI_API_KEY=""
export LLAMA_CLOUD_API_KEY=""

# optional
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
export LANGCHAIN_API_KEY=""
export LANGCHAIN_PROJECT=""
```
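A quick startup check along these lines can confirm the required variables are present once the `.env` file has been sourced (a generic sketch, not part of the project):

```python
import os

# The first six variables are required; the LANGCHAIN_* ones are optional.
REQUIRED = [
    "CDRC_USERNAME", "CDRC_PASSWORD", "CDRC_FORM_BUILD_ID",
    "PINECONE_API_KEY", "OPENAI_API_KEY", "LLAMA_CLOUD_API_KEY",
]

def missing_vars(env=os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```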
The project is fully containerised using `podman`/`docker` `compose`. To run the full system, execute:

```bash
podman compose up -d
```
Once up and running, access the semantic catalogue search at `http://localhost:8001`.

The Dagster UI is available at `http://localhost:3000`. Adjust the Auto Materialise and Sensor settings to start the automation.
All scripts can be run independently for debugging purposes. For example, to run the full LangGraph pipeline:

```bash
python -m semantic_catalogue.model.main
```

This submits a test query and prints the outputs. Chains also provide outputs for testing:

```bash
python -m semantic_catalogue.model.chains.hallucination
```

This prints the output of a test generation that contains hallucinations.

You can also start the API server independently:

```bash
fastapi dev semantic_catalogue/search_api/api.py
```