Meeting Summaries Data Ingestion Pipeline

A Python-based data ingestion pipeline that downloads meeting summary JSON data from multiple GitHub URLs, validates structure compatibility, and stores it in a normalized PostgreSQL/Supabase schema.

Features

  • Multi-Source JSON Data Ingestion: Downloads and processes meeting summaries from multiple sources (2022-2025)
    • Supports historic data from 2022, 2023, 2024, and current data from 2025
    • Processes multiple JSON sources sequentially with transaction integrity
    • Continues processing remaining sources if one source fails
  • Structure Validation: Validates JSON compatibility before ingestion (see the sketch after this list)
    • Validates structure compatibility for all historic sources before processing any records
    • Handles missing optional fields and accepts additional fields for schema flexibility
  • Normalized Storage: Stores data in PostgreSQL with normalized relational tables and JSONB columns
  • Idempotent Processing: UPSERT operations prevent duplicates on re-runs
    • Last-write-wins strategy for overlapping records across sources
    • Safe to run multiple times without data corruption
  • Structured Logging: JSON-formatted logs with detailed error information
    • Includes source URL, error type, error message, and timestamp for all errors
  • GitHub Actions Deployment: Automated scheduled execution with zero infrastructure management
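
The validation behaviour described above under Structure Validation (tolerating missing optional fields and accepting extra fields) can be pictured with a minimal sketch using Pydantic v2 syntax. The field names below are illustrative assumptions only; the real schemas live in src/models/.

from typing import Any, Dict, List, Optional

from pydantic import BaseModel, ConfigDict

class MeetingSummary(BaseModel):
    """Illustrative record schema; field names are hypothetical."""
    model_config = ConfigDict(extra="allow")  # additional fields from newer sources are accepted

    workgroup: str
    meeting_date: str
    summary: Optional[str] = None          # optional fields may be absent in older sources
    tags: Optional[List[str]] = None

def validate_records(records: List[Dict[str, Any]]) -> List[MeetingSummary]:
    # Raises pydantic.ValidationError on the first structurally incompatible record,
    # so an incompatible source is rejected before any rows are written.
    return [MeetingSummary.model_validate(r) for r in records]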

Prerequisites

  • Python 3.8+
  • PostgreSQL 12+ (or Supabase)
  • pip

Installation

Local Development

  1. Clone the repository:
git clone <repository-url>
cd data-ingestion
  2. Create a virtual environment:
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Set up environment variables:
cp .env.example .env
# Edit .env with your database credentials
  5. Create database schema:
psql -U postgres -d meeting_summaries -f scripts/setup_db.sql

Usage

Basic Usage

# Ingest from default URLs (includes 2022, 2023, 2024, 2025 historic and current data)
python -m src.cli.ingest

# Specify custom URLs (processes multiple sources sequentially)
python -m src.cli.ingest \
  https://raw.githubusercontent.com/.../2025/meeting-summaries-array.json \
  https://raw.githubusercontent.com/.../2024/meeting-summaries-array.json \
  https://raw.githubusercontent.com/.../2023/meeting-summaries-array.json \
  https://raw.githubusercontent.com/.../2022/meeting-summaries-array.json

# Dry run (validate without inserting)
python -m src.cli.ingest --dry-run

# Verbose logging
python -m src.cli.ingest --verbose

Historic Data Support

The pipeline supports ingestion of historic meeting summary data from multiple years:

  • Default Sources: The CLI includes default URLs for 2022, 2023, 2024, and 2025 data
  • Sequential Processing: Sources are processed sequentially to maintain transaction integrity
  • Error Handling: If one source fails to download or validate, processing continues with remaining sources
  • Structure Validation: Each historic source is validated for structure compatibility before processing
  • Idempotent: Running ingestion multiple times updates existing records without creating duplicates (see the sketch after this list)
  • Referential Integrity: All workgroups are processed first, then meetings, ensuring proper relationships
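
A minimal sketch of the idempotent write described above, using psycopg2 and PostgreSQL's INSERT ... ON CONFLICT. The table and column names are illustrative assumptions (the real DDL is in scripts/setup_db.sql), and the actual implementation may use a different driver.

import psycopg2

def upsert_meeting(conn, meeting):
    # Hypothetical table/column names for illustration; see scripts/setup_db.sql for the real schema.
    # ON CONFLICT makes the write idempotent: re-running ingestion updates the existing row
    # (last write wins) instead of inserting a duplicate.
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO meetings (meeting_id, workgroup_id, meeting_date, summary)
            VALUES (%(meeting_id)s, %(workgroup_id)s, %(meeting_date)s, %(summary)s)
            ON CONFLICT (meeting_id)
            DO UPDATE SET
                workgroup_id = EXCLUDED.workgroup_id,
                meeting_date = EXCLUDED.meeting_date,
                summary      = EXCLUDED.summary
            """,
            meeting,
        )
    conn.commit()

Because the conflict target is the record's key, re-ingesting an overlapping source simply overwrites the existing row, which is what gives the last-write-wins behaviour.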

Command Options

  • --database-url, -d: PostgreSQL connection string (default: $DATABASE_URL)
  • --db-password, -p: Database password (if not in URL)
  • --dry-run: Validate and parse data without inserting to database
  • --verbose, -v: Enable verbose logging
  • --log-format: Log format (json or text, default: json)
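
The options can be combined, for example:

# Verbose dry run with plain-text logs (connection string is a placeholder)
python -m src.cli.ingest --dry-run --verbose --log-format text \
  --database-url postgresql://postgres:password@localhost:5432/meeting_summaries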

Environment Variables

  • DATABASE_URL: PostgreSQL connection string (required)
  • DB_PASSWORD: Database password (if not in URL)
  • LOG_LEVEL: Logging level (DEBUG, INFO, WARNING, ERROR, default: INFO)
  • LOG_FORMAT: Log format (json or text, default: json)
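
For local runs these are typically provided via the .env file created during installation. The values below are placeholders:

DATABASE_URL=postgresql://postgres:<your-password>@localhost:5432/meeting_summaries
# DB_PASSWORD=<your-password>   # only needed if the password is not embedded in DATABASE_URL
LOG_LEVEL=INFO
LOG_FORMAT=json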

Project Structure

.
├── README.md                    # Project overview and quick start
├── CHANGELOG.md                 # Version history and release notes
├── pyproject.toml               # Python project configuration
├── requirements.txt             # Python dependencies
├── .gitignore                  # Git ignore patterns
│
├── src/                         # Application source code
│   ├── models/                  # Database models and Pydantic schemas
│   ├── services/                # Business logic and data processing
│   ├── db/                      # Database connection and utilities
│   ├── cli/                     # Command-line interface
│   └── lib/                     # Shared utilities (logging, validation)
│
├── tests/                       # Test suite
│   ├── contract/                # Contract tests (schema validation)
│   ├── integration/             # Integration tests (end-to-end)
│   └── unit/                    # Unit tests (component isolation)
│
├── scripts/                     # Utility scripts
│   ├── setup_db.sql             # Database schema DDL
│   ├── check_duplicates.py      # Duplicate detection utility
│   ├── find_missing_meetings.py # Missing data analysis
│   ├── run_migration.py         # Database migration runner
│   └── verify_schema.py         # Schema validation utility
│
├── docs/                        # Project documentation
│   ├── README.md                # Documentation index
│   ├── operations/              # Operational guides
│   │   ├── runbook.md           # Operations runbook
│   │   ├── production-checklist.md
│   │   ├── troubleshooting.md
│   │   └── duplicate-diagnosis.md
│   ├── deployment/              # Deployment guides
│   │   ├── supabase-setup.md
│   │   ├── supabase-quickstart.md
│   │   └── deployment-options.md
│   └── archive/                 # Historical/archived files
│
└── specs/                       # Feature specifications and plans
    └── [feature-specs]/         # Individual feature specifications

Development

Running Tests

pytest
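
Individual suites can be run by pointing pytest at the directories listed under Project Structure:

pytest tests/unit -v        # unit tests (component isolation)
pytest tests/integration    # integration tests (end-to-end)
pytest tests/contract       # contract tests (schema validation)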

Code Formatting

black src/ tests/
ruff check src/ tests/

Type Checking

mypy src/

Deployment

Production Deployment Options

The ingestion pipeline can be deployed to Supabase using several methods:

Option 1: GitHub Actions (Recommended - No Containers)

Best for: Automated scheduled runs, zero infrastructure management

  1. Set up GitHub Secrets

    • Go to repository Settings → Secrets and variables → Actions
    • Add SUPABASE_DATABASE_URL with your Supabase connection string
  2. Workflow is Ready

    • The workflow file .github/workflows/ingest-meetings.yml is already configured
    • Runs daily at 2 AM UTC (or can be triggered manually from the Actions tab)
  3. Verify

    • Check Actions tab for successful runs
    • Review logs in workflow artifacts
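
If the GitHub CLI is installed, the workflow can also be dispatched and checked from a terminal (this assumes manual dispatch is enabled, as implied by step 2):

gh workflow run ingest-meetings.yml
gh run list --workflow ingest-meetings.yml   # check the latest run status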

Option 2: Server/VM with Cron

For traditional server deployment:

  1. Set up a server (EC2, GCE, Azure VM, etc.)
  2. Install Python and dependencies
  3. Configure a cron job for scheduled execution (see the example after this list)
  4. Set environment variables in a .env file
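
For example, a crontab entry that runs the ingestion daily at 2 AM (paths and the log location are placeholders):

# m h dom mon dow  command
0 2 * * * cd /opt/data-ingestion && ./venv/bin/python -m src.cli.ingest >> /var/log/ingest.log 2>&1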

Option 3: Serverless Functions

Deploy as a serverless function:

  • Google Cloud Functions: See docs/deployment/deployment-options.md
  • AWS Lambda: See docs/deployment/deployment-options.md

Detailed Deployment Guide

  • Supabase Setup Guide: docs/deployment/supabase-setup.md - Step-by-step instructions for setting up Supabase and GitHub Actions
  • Quick Start Checklist: docs/deployment/supabase-quickstart.md - Quick reference checklist for setup
  • Deployment Options: docs/deployment/deployment-options.md - Comprehensive guide for all deployment options

Production Checklist

See docs/operations/production-checklist.md for detailed production deployment checklist.

Local Development

See specs/001-meeting-summaries-ingestion/quickstart.md for local development setup instructions.

Documentation

Core Documentation

  • Specification: specs/001-meeting-summaries-ingestion/spec.md
  • Implementation Plan: specs/001-meeting-summaries-ingestion/plan.md
  • Data Model: specs/001-meeting-summaries-ingestion/data-model.md
  • Quickstart Guide: specs/001-meeting-summaries-ingestion/quickstart.md

Operations Documentation

  • Deployment Options: docs/deployment/deployment-options.md - Comprehensive guide for deployment options (GitHub Actions recommended, serverless, etc.)
  • Production Checklist: docs/operations/production-checklist.md - Environment configuration and pre-deployment verification
  • Troubleshooting Guide: docs/operations/troubleshooting.md - Common issues and solutions
  • Operations Runbook: docs/operations/runbook.md - Step-by-step operational procedures
  • Changelog: CHANGELOG.md - Version history and release notes

License

[Add license information]
