cjber/semantic-catalogue
The Semantic Catalogue is a project designed to enhance the search capabilities of data catalogues for research purposes. By moving beyond traditional keyword-based searches, it provides users with more accurate and relevant results through semantic understanding.


Methodology · View Demo · Getting Started

Features

  • Semantic Search: Leverages OpenAI embeddings in Pinecone for context-aware dataset discovery
  • Retrieval Augmented Generation (RAG): Generates explanations with inline citations using GPT-4o mini
  • Content Quality Assurance: Built-in moderation and hallucination detection
  • Automated Data Pipeline: Dagster-managed continuous updates to the vector database

System Architecture

The system uses Retrieval Augmented Generation (RAG) with two core components:

  1. Retrieval: Finds relevant datasets using semantic similarity
  2. Generation: Explains dataset relevance using retrieved information

System Architecture

Key components:

  • Backend: FastAPI & LangGraph for structured processing and API integration
  • Data Pipeline: Dagster for automated data lifecycle management
  • Vector Database: Pinecone for efficient semantic search operations
  • AI Models: OpenAI for semantic understanding and generation

Detailed Functionality

Data Management

Dagster automates the entire data pipeline - from ingestion to indexing - ensuring the Pinecone database stays current with new datasets.

Global Asset Lineage
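
To make the pipeline's shape concrete, below is a minimal sketch of a Dagster asset pair; the asset names (catalogue_records, pinecone_index) and their bodies are illustrative placeholders rather than the repository's actual assets.

# Illustrative Dagster assets: names and bodies are placeholders,
# not the repository's actual pipeline code.
from dagster import Definitions, asset


@asset
def catalogue_records() -> list[dict]:
    """Ingest raw catalogue metadata (placeholder)."""
    return [{"id": "dataset-1", "description": "Example dataset description."}]


@asset
def pinecone_index(catalogue_records: list[dict]) -> None:
    """Embed each record and upsert it into the Pinecone index (placeholder)."""
    for record in catalogue_records:
        ...  # embed record["description"] and upsert it here


defs = Definitions(assets=[catalogue_records, pinecone_index])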

Search & Generation

  1. Semantic Search: Queries are converted to embeddings and matched against Pinecone's vector database
  2. RAG Pipeline: Retrieved datasets are used to generate context-aware explanations
  3. Quality Control: Moderation and hallucination detection ensure reliable outputs

Search Graph
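
As an illustration of the retrieval step, the sketch below embeds a query with OpenAI and matches it against a Pinecone index. The index name, embedding model, and example query are assumptions; the project's actual settings are defined in config/config.toml.

# Sketch of the retrieval step. The index name and embedding model are
# assumptions, not necessarily those configured in config/config.toml.
import os

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pinecone_client = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

query = "small area indicators of access to green space"
embedding = (
    openai_client.embeddings.create(model="text-embedding-3-small", input=query)
    .data[0]
    .embedding
)

index = pinecone_client.Index("semantic-catalogue")  # assumed index name
results = index.query(vector=embedding, top_k=5, include_metadata=True)
for match in results.matches:
    print(match.id, match.score)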

Generation Graph
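
The generation step can be pictured as a small LangGraph state graph that drafts an explanation and then grades it for hallucinations, regenerating if the check fails. The node names, state fields, and routing below are illustrative assumptions, not the repository's actual generation_graph.

# Illustrative LangGraph generation graph with a hallucination check.
# Node names and state fields are assumptions; the model calls are stubbed.
from typing import TypedDict

from langgraph.graph import END, START, StateGraph


class GenerationState(TypedDict):
    document: str
    explanation: str
    grounded: bool


def generate(state: GenerationState) -> dict:
    # Placeholder: call the LLM with the retrieved document as context.
    return {"explanation": f"Explanation grounded in: {state['document']}"}


def check_hallucination(state: GenerationState) -> dict:
    # Placeholder: grade whether the explanation is supported by the document.
    return {"grounded": True}


def route(state: GenerationState) -> str:
    # Finish if the explanation is grounded, otherwise regenerate.
    return END if state["grounded"] else "generate"


builder = StateGraph(GenerationState)
builder.add_node("generate", generate)
builder.add_node("check_hallucination", check_hallucination)
builder.add_edge(START, "generate")
builder.add_edge("generate", "check_hallucination")
builder.add_conditional_edges("check_hallucination", route)
generation_graph = builder.compile()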

FastAPI Endpoints

FastAPI provides a RESTful API that allows external services to interact with the RAG graphs. The following endpoints are provided:

  • POST Query (/query): Accepts a search query and, using the LangGraph search_graph, returns relevant documents alongside a unique thread ID associated with the query.
  • GET Explain (/explain/{thread_id}): Accepts a thread_id and docid and returns an explanation generated by the LangGraph generation_graph.
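
For illustration, the snippet below shows how an external client might call these endpoints. The base URL, request body, and response fields are assumptions based on the descriptions above, not the API's actual schema.

# Hypothetical client calls; base URL, payload, and response fields are
# assumptions inferred from the endpoint descriptions.
import requests

BASE_URL = "http://localhost:8000"  # assumed address of the FastAPI service

# Submit a search query; assume the response holds the matched documents
# and a thread ID for follow-up explanation requests.
response = requests.post(f"{BASE_URL}/query", json={"query": "access to green space"})
response.raise_for_status()
result = response.json()

# Ask why the first returned document is relevant to the query.
explanation = requests.get(
    f"{BASE_URL}/explain/{result['thread_id']}",
    params={"docid": result["documents"][0]["id"]},  # assumed field names
)
print(explanation.json())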

Containerisation

The entire project is containerised with Docker or Podman Compose (tested with Podman). See the compose.yml file for details.

Getting Started

Prerequisites

Ensure you have the following installed:

  • Python >=3.12,<3.13
  • Docker or Podman
  • Git

Contribution Setup

To contribute, please follow these steps:

  1. Clone the repository:

    git clone https://github.com/cjber/semantic-catalogue.git
  2. Navigate to the project directory:

    cd semantic-catalogue
  3. Install dependencies:

    This project uses uv:

    uv sync

    Alternatively, pip can be used to install from the pyproject.toml:

    pip install . # ensure you are using a venv
  4. Configure the system:

    Edit the config/config.toml file to customise model settings.

Running the Project

Warning

This project is not intended for public use and requires access to a private database.

The project expects a .env file with the following environment variables defined:

#!/bin/bash

export CDRC_USERNAME=""
export CDRC_PASSWORD=""
export CDRC_FORM_BUILD_ID=""
export PINECONE_API_KEY=""
export OPENAI_API_KEY=""
export LLAMA_CLOUD_API_KEY=""

# optional
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
export LANGCHAIN_API_KEY=""
export LANGCHAIN_PROJECT=""

To run the full system with Podman or Docker Compose, execute:

podman compose up -d

Accessing the Frontend

Once up and running, access the semantic catalogue search at http://localhost:8001.

Running the Dagster Pipeline

The Dagster UI is available at http://localhost:3000. Adjust the Auto Materialise and Sensor settings to start the automation.

Alternative Methods

All scripts can be run independently for debugging purposes. For example, to run the full LangGraph pipeline:

python -m semantic_catalogue.model.main

This submits a test query and prints the outputs. Chains also provide outputs for testing:

python -m semantic_catalogue.model.chains.hallucination

This prints the output of a test generation that has hallucinations.

You can also start the API server independently:

fastapi dev semantic_catalogue/search_api/api.py
