Meeting Summaries Data Ingestion Pipeline

A Python-based data ingestion pipeline that downloads meeting summary JSON data from multiple GitHub URLs, validates structure compatibility, and stores it in a normalized PostgreSQL/Supabase schema.

Features

  • Multi-Source JSON Data Ingestion: Downloads and processes meeting summaries from multiple sources (2022-2025)
    • Supports historic data from 2022, 2023, 2024, and current data from 2025
    • Processes multiple JSON sources sequentially with transaction integrity
    • Continues processing remaining sources if one source fails
  • Structure Validation: Validates JSON compatibility before ingestion (see the sketch after this list)
    • Validates structure compatibility for all historic sources before processing any records
    • Handles missing optional fields and accepts additional fields for schema flexibility
  • Normalized Storage: Stores data in PostgreSQL with normalized relational tables and JSONB columns
  • Idempotent Processing: UPSERT operations prevent duplicates on re-runs
    • Last-write-wins strategy for overlapping records across sources
    • Safe to run multiple times without data corruption
  • Structured Logging: JSON-formatted logs with detailed error information
    • Includes source URL, error type, error message, and timestamp for all errors
  • GitHub Actions Deployment: Automated scheduled execution with zero infrastructure management
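
The validation behaviour described above under Structure Validation (tolerating missing optional fields and accepting extra fields) can be pictured with a minimal sketch using Pydantic v2 syntax. The field names below are illustrative assumptions only; the real schemas live in src/models/.

from typing import Any, Dict, List, Optional

from pydantic import BaseModel, ConfigDict

class MeetingSummary(BaseModel):
    """Illustrative record schema; field names are hypothetical."""
    model_config = ConfigDict(extra="allow")  # additional fields from newer sources are accepted

    workgroup: str
    meeting_date: str
    summary: Optional[str] = None          # optional fields may be absent in older sources
    tags: Optional[List[str]] = None

def validate_records(records: List[Dict[str, Any]]) -> List[MeetingSummary]:
    # Raises pydantic.ValidationError on the first structurally incompatible record,
    # so an incompatible source is rejected before any rows are written.
    return [MeetingSummary.model_validate(r) for r in records]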

Prerequisites

  • Python 3.8+
  • PostgreSQL 12+ (or Supabase)
  • pip

Installation

Local Development

  1. Clone the repository:
git clone <repository-url>
cd data-ingestion
  2. Create a virtual environment:
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Set up environment variables:
cp .env.example .env
# Edit .env with your database credentials
  5. Create database schema:
psql -U postgres -d meeting_summaries -f scripts/setup_db.sql

Usage

Basic Usage

# Ingest from default URLs (includes 2022, 2023, 2024, 2025 historic and current data)
python -m src.cli.ingest

# Specify custom URLs (processes multiple sources sequentially)
python -m src.cli.ingest \
  https://raw.githubusercontent.com/.../2025/meeting-summaries-array.json \
  https://raw.githubusercontent.com/.../2024/meeting-summaries-array.json \
  https://raw.githubusercontent.com/.../2023/meeting-summaries-array.json \
  https://raw.githubusercontent.com/.../2022/meeting-summaries-array.json

# Dry run (validate without inserting)
python -m src.cli.ingest --dry-run

# Verbose logging
python -m src.cli.ingest --verbose

Historic Data Support

The pipeline supports ingestion of historic meeting summary data from multiple years:

  • Default Sources: The CLI includes default URLs for 2022, 2023, 2024, and 2025 data
  • Sequential Processing: Sources are processed sequentially to maintain transaction integrity
  • Error Handling: If one source fails to download or validate, processing continues with remaining sources
  • Structure Validation: Each historic source is validated for structure compatibility before processing
  • Idempotent: Running ingestion multiple times updates existing records without creating duplicates (see the sketch after this list)
  • Referential Integrity: All workgroups are processed first, then meetings, ensuring proper relationships
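
A minimal sketch of the idempotent write described above, using psycopg2 and PostgreSQL's INSERT ... ON CONFLICT. The table and column names are illustrative assumptions (the real DDL is in scripts/setup_db.sql), and the actual implementation may use a different driver.

import psycopg2

def upsert_meeting(conn, meeting):
    # Hypothetical table/column names for illustration; see scripts/setup_db.sql for the real schema.
    # ON CONFLICT makes the write idempotent: re-running ingestion updates the existing row
    # (last write wins) instead of inserting a duplicate.
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO meetings (meeting_id, workgroup_id, meeting_date, summary)
            VALUES (%(meeting_id)s, %(workgroup_id)s, %(meeting_date)s, %(summary)s)
            ON CONFLICT (meeting_id)
            DO UPDATE SET
                workgroup_id = EXCLUDED.workgroup_id,
                meeting_date = EXCLUDED.meeting_date,
                summary      = EXCLUDED.summary
            """,
            meeting,
        )
    conn.commit()

Because the conflict target is the record's key, re-ingesting an overlapping source simply overwrites the existing row, which is what gives the last-write-wins behaviour.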

Command Options

  • --database-url, -d: PostgreSQL connection string (default: $DATABASE_URL)
  • --db-password, -p: Database password (if not in URL)
  • --dry-run: Validate and parse data without inserting to database
  • --verbose, -v: Enable verbose logging
  • --log-format: Log format (json or text, default: json)
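
The options can be combined, for example:

# Verbose dry run with plain-text logs (connection string is a placeholder)
python -m src.cli.ingest --dry-run --verbose --log-format text \
  --database-url postgresql://postgres:password@localhost:5432/meeting_summaries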

Environment Variables

  • DATABASE_URL: PostgreSQL connection string (required)
  • DB_PASSWORD: Database password (if not in URL)
  • LOG_LEVEL: Logging level (DEBUG, INFO, WARNING, ERROR, default: INFO)
  • LOG_FORMAT: Log format (json or text, default: json)
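
For local runs these are typically provided via the .env file created during installation. The values below are placeholders:

DATABASE_URL=postgresql://postgres:<your-password>@localhost:5432/meeting_summaries
# DB_PASSWORD=<your-password>   # only needed if the password is not embedded in DATABASE_URL
LOG_LEVEL=INFO
LOG_FORMAT=json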

Project Structure

.
├── README.md                    # Project overview and quick start
├── CHANGELOG.md                 # Version history and release notes
├── pyproject.toml               # Python project configuration
├── requirements.txt             # Python dependencies
├── .gitignore                  # Git ignore patterns
│
├── src/                         # Application source code
│   ├── models/                  # Database models and Pydantic schemas
│   ├── services/                # Business logic and data processing
│   ├── db/                      # Database connection and utilities
│   ├── cli/                     # Command-line interface
│   └── lib/                     # Shared utilities (logging, validation)
│
├── tests/                       # Test suite
│   ├── contract/                # Contract tests (schema validation)
│   ├── integration/             # Integration tests (end-to-end)
│   └── unit/                    # Unit tests (component isolation)
│
├── scripts/                     # Utility scripts
│   ├── setup_db.sql             # Database schema DDL
│   ├── check_duplicates.py      # Duplicate detection utility
│   ├── find_missing_meetings.py # Missing data analysis
│   ├── run_migration.py         # Database migration runner
│   └── verify_schema.py         # Schema validation utility
│
├── docs/                        # Project documentation
│   ├── README.md                # Documentation index
│   ├── operations/              # Operational guides
│   │   ├── runbook.md           # Operations runbook
│   │   ├── production-checklist.md
│   │   ├── troubleshooting.md
│   │   └── duplicate-diagnosis.md
│   ├── deployment/              # Deployment guides
│   │   ├── supabase-setup.md
│   │   ├── supabase-quickstart.md
│   │   └── deployment-options.md
│   └── archive/                 # Historical/archived files
│
└── specs/                       # Feature specifications and plans
    └── [feature-specs]/         # Individual feature specifications

Development

Running Tests

pytest
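
Individual suites can be run by pointing pytest at the directories listed under Project Structure:

pytest tests/unit -v        # unit tests (component isolation)
pytest tests/integration    # integration tests (end-to-end)
pytest tests/contract       # contract tests (schema validation)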

Code Formatting

black src/ tests/
ruff check src/ tests/

Type Checking

mypy src/

Deployment

Production Deployment Options

The ingestion pipeline can be deployed to Supabase using several methods:

Option 1: GitHub Actions (Recommended - No Containers)

Best for: Automated scheduled runs, zero infrastructure management

  1. Set up GitHub Secrets

    • Go to repository Settings → Secrets and variables → Actions
    • Add SUPABASE_DATABASE_URL with your Supabase connection string
  2. Workflow is Ready

    • The workflow file .github/workflows/ingest-meetings.yml is already configured
    • Runs daily at 2 AM UTC (or can be triggered manually from the Actions tab)
  3. Verify

    • Check Actions tab for successful runs
    • Review logs in workflow artifacts
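
If the GitHub CLI is installed, the workflow can also be dispatched and checked from a terminal (this assumes manual dispatch is enabled, as implied by step 2):

gh workflow run ingest-meetings.yml
gh run list --workflow ingest-meetings.yml   # check the latest run status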

Option 2: Server/VM with Cron

For traditional server deployment:

  1. Set up a server (EC2, GCE, Azure VM, etc.)
  2. Install Python and dependencies
  3. Configure a cron job for scheduled execution (see the example after this list)
  4. Set environment variables in a .env file
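
For example, a crontab entry that runs the ingestion daily at 2 AM (paths and the log location are placeholders):

# m h dom mon dow  command
0 2 * * * cd /opt/data-ingestion && ./venv/bin/python -m src.cli.ingest >> /var/log/ingest.log 2>&1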

Option 3: Serverless Functions

Deploy as a serverless function:

  • Google Cloud Functions: See docs/deployment/deployment-options.md
  • AWS Lambda: See docs/deployment/deployment-options.md

Detailed Deployment Guide

  • Supabase Setup Guide: docs/deployment/supabase-setup.md - Step-by-step instructions for setting up Supabase and GitHub Actions
  • Quick Start Checklist: docs/deployment/supabase-quickstart.md - Quick reference checklist for setup
  • Deployment Options: docs/deployment/deployment-options.md - Comprehensive guide for all deployment options

Production Checklist

See docs/operations/production-checklist.md for detailed production deployment checklist.

Local Development

See specs/001-meeting-summaries-ingestion/quickstart.md for local development setup instructions.

Documentation

Core Documentation

  • Specification: specs/001-meeting-summaries-ingestion/spec.md
  • Implementation Plan: specs/001-meeting-summaries-ingestion/plan.md
  • Data Model: specs/001-meeting-summaries-ingestion/data-model.md
  • Quickstart Guide: specs/001-meeting-summaries-ingestion/quickstart.md

Operations Documentation

  • Deployment Options: docs/deployment/deployment-options.md - Comprehensive guide for deployment options (GitHub Actions recommended, serverless, etc.)
  • Production Checklist: docs/operations/production-checklist.md - Environment configuration and pre-deployment verification
  • Troubleshooting Guide: docs/operations/troubleshooting.md - Common issues and solutions
  • Operations Runbook: docs/operations/runbook.md - Step-by-step operational procedures
  • Changelog: CHANGELOG.md - Version history and release notes

License

[Add license information]
