A Python-based data ingestion pipeline that downloads meeting summary JSON data from multiple GitHub URLs, validates structure compatibility, and stores it in a normalized PostgreSQL/Supabase schema.
- Multi-Source JSON Data Ingestion: Downloads and processes meeting summaries from multiple sources (2022-2025)
  - Supports historic data from 2022, 2023, and 2024, plus current data from 2025
  - Processes multiple JSON sources sequentially with transaction integrity
  - Continues processing remaining sources if one source fails
- Structure Validation: Validates JSON compatibility before ingestion
  - Validates structure compatibility for all historic sources before processing any records
  - Handles missing optional fields and accepts additional fields for schema flexibility
- Normalized Storage: Stores data in PostgreSQL with normalized relational tables and JSONB columns
- Idempotent Processing: UPSERT operations prevent duplicates on re-runs (see the sketch after this list)
  - Last-write-wins strategy for overlapping records across sources
  - Safe to run multiple times without data corruption
- Structured Logging: JSON-formatted logs with detailed error information
  - Includes source URL, error type, error message, and timestamp for all errors
- GitHub Actions Deployment: Automated scheduled execution with zero infrastructure management
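The idempotent, last-write-wins behaviour boils down to PostgreSQL UPSERTs. Below is a minimal sketch of the idea; the table name, column names, and example values are assumptions for illustration only (the real schema lives in `scripts/setup_db.sql` and `src/models/`):

```python
import os

import psycopg2
from psycopg2.extras import Json

# Hypothetical table and column names for illustration only;
# the real schema is defined in scripts/setup_db.sql.
UPSERT_MEETING = """
    INSERT INTO meetings (meeting_id, workgroup_id, summary, updated_at)
    VALUES (%s, %s, %s, now())
    ON CONFLICT (meeting_id) DO UPDATE SET
        workgroup_id = EXCLUDED.workgroup_id,
        summary      = EXCLUDED.summary,   -- last write wins for overlapping records
        updated_at   = now();
"""

def upsert_meeting(conn, meeting_id, workgroup_id, summary):
    """Insert or update a single meeting; safe to call on every re-run."""
    with conn.cursor() as cur:
        cur.execute(UPSERT_MEETING, (meeting_id, workgroup_id, Json(summary)))

if __name__ == "__main__":
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    upsert_meeting(conn, "2025-01-07-governance", "governance", {"agenda": []})
    conn.commit()
```

Because the conflict target is a stable meeting key, re-running the ingestion overwrites the existing row rather than creating a duplicate.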
- Python 3.8+
- PostgreSQL 12+ (or Supabase)
- pip
- Clone the repository:
git clone <repository-url>
cd data-ingestion
- Create a virtual environment:
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables:
cp .env.example .env
# Edit .env with your database credentials
- Create database schema:
psql -U postgres -d meeting_summaries -f scripts/setup_db.sql

# Ingest from default URLs (includes 2022, 2023, 2024, 2025 historic and current data)
python -m src.cli.ingest
# Specify custom URLs (processes multiple sources sequentially)
python -m src.cli.ingest \
https://raw.githubusercontent.com/.../2025/meeting-summaries-array.json \
https://raw.githubusercontent.com/.../2024/meeting-summaries-array.json \
https://raw.githubusercontent.com/.../2023/meeting-summaries-array.json \
https://raw.githubusercontent.com/.../2022/meeting-summaries-array.json
# Dry run (validate without inserting)
python -m src.cli.ingest --dry-run
# Verbose logging
python -m src.cli.ingest --verbose

The pipeline supports ingestion of historic meeting summary data from multiple years:
- Default Sources: The CLI includes default URLs for 2022, 2023, 2024, and 2025 data
- Sequential Processing: Sources are processed sequentially to maintain transaction integrity
- Error Handling: If one source fails to download or validate, processing continues with the remaining sources (see the sketch after this list)
- Structure Validation: Each historic source is validated for structure compatibility before processing
- Idempotent: Running ingestion multiple times updates existing records without creating duplicates
- Referential Integrity: All workgroups are processed first, then meetings, ensuring proper relationships
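A rough sketch of the per-source loop described above. The three callables are hypothetical stand-ins for the real helpers under `src/services/`; each failure is logged as a JSON record with the source URL, error type, message, and timestamp, and processing moves on to the next source:

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("ingest")

def ingest_all(source_urls, conn, download_source, validate_structure, ingest_records):
    """Process sources one at a time; a failure in one source never aborts the rest.

    The three callables stand in for the real helpers under src/services/.
    """
    for url in source_urls:
        try:
            raw = download_source(url)          # fetch the JSON payload for this source
            records = validate_structure(raw)   # structure check before touching the database
            ingest_records(conn, records)       # UPSERT workgroups first, then meetings
            conn.commit()                       # one transaction per source
        except Exception as exc:
            conn.rollback()
            logger.error(json.dumps({
                "source_url": url,
                "error_type": type(exc).__name__,
                "error_message": str(exc),
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }))
            # continue with the remaining sources
```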
- `--database-url`, `-d`: PostgreSQL connection string (default: `$DATABASE_URL`)
- `--db-password`, `-p`: Database password (if not in URL)
- `--dry-run`: Validate and parse data without inserting into the database
- `--verbose`, `-v`: Enable verbose logging
- `--log-format`: Log format (`json` or `text`, default: `json`)
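The documented flags can be combined; the connection string below is a placeholder:

```bash
python -m src.cli.ingest --dry-run --verbose --log-format text \
  --database-url postgresql://postgres:<password>@localhost:5432/meeting_summaries
```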
- `DATABASE_URL`: PostgreSQL connection string (required)
- `DB_PASSWORD`: Database password (if not in URL)
- `LOG_LEVEL`: Logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`; default: `INFO`)
- `LOG_FORMAT`: Log format (`json` or `text`, default: `json`)
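An illustrative `.env` (values are placeholders; `.env.example` in the repository is the authoritative template):

```ini
DATABASE_URL=postgresql://postgres:<your-password>@localhost:5432/meeting_summaries
LOG_LEVEL=INFO
LOG_FORMAT=json
```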
.
├── README.md # Project overview and quick start
├── CHANGELOG.md # Version history and release notes
├── pyproject.toml # Python project configuration
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore patterns
│
├── src/ # Application source code
│ ├── models/ # Database models and Pydantic schemas
│ ├── services/ # Business logic and data processing
│ ├── db/ # Database connection and utilities
│ ├── cli/ # Command-line interface
│ └── lib/ # Shared utilities (logging, validation)
│
├── tests/ # Test suite
│ ├── contract/ # Contract tests (schema validation)
│ ├── integration/ # Integration tests (end-to-end)
│ └── unit/ # Unit tests (component isolation)
│
├── scripts/ # Utility scripts
│ ├── setup_db.sql # Database schema DDL
│ ├── check_duplicates.py # Duplicate detection utility
│ ├── find_missing_meetings.py # Missing data analysis
│ ├── run_migration.py # Database migration runner
│ └── verify_schema.py # Schema validation utility
│
├── docs/ # Project documentation
│ ├── README.md # Documentation index
│ ├── operations/ # Operational guides
│ │ ├── runbook.md # Operations runbook
│ │ ├── production-checklist.md
│ │ ├── troubleshooting.md
│ │ └── duplicate-diagnosis.md
│ ├── deployment/ # Deployment guides
│ │ ├── supabase-setup.md
│ │ ├── supabase-quickstart.md
│ │ └── deployment-options.md
│ └── archive/ # Historical/archived files
│
└── specs/ # Feature specifications and plans
└── [feature-specs]/ # Individual feature specifications
pytest
black src/ tests/
ruff check src/ tests/
mypy src/

The ingestion pipeline can be deployed to Supabase using several methods:
Best for: Automated scheduled runs, zero infrastructure management
- Set up GitHub Secrets
  - Go to repository Settings → Secrets and variables → Actions
  - Add `SUPABASE_DATABASE_URL` with your Supabase connection string
- Workflow is Ready
  - The workflow file `.github/workflows/ingest-meetings.yml` is already configured
  - Runs daily at 2 AM UTC (or trigger manually from the Actions tab)
- Verify
  - Check the Actions tab for successful runs
  - Review logs in workflow artifacts
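For orientation, a minimal sketch of what such a workflow typically contains; the actual `.github/workflows/ingest-meetings.yml` shipped with the repository is authoritative, and the action versions and Python version below are assumptions:

```yaml
name: Ingest meeting summaries
on:
  schedule:
    - cron: "0 2 * * *"     # daily at 2 AM UTC
  workflow_dispatch:         # manual runs from the Actions tab
jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"   # assumed; any supported 3.8+ works
      - run: pip install -r requirements.txt
      - run: python -m src.cli.ingest
        env:
          DATABASE_URL: ${{ secrets.SUPABASE_DATABASE_URL }}
```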
For traditional server deployment:
- Set up a server (EC2, GCE, Azure VM, etc.)
- Install Python and dependencies
- Configure a cron job for scheduled execution
- Set environment variables in the `.env` file
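A crontab entry for the scheduled run might look like the following; the install path and log file are assumptions:

```
0 2 * * * cd /opt/data-ingestion && ./venv/bin/python -m src.cli.ingest >> /var/log/meeting-ingest.log 2>&1
```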
Deploy as a serverless function:
- Google Cloud Functions: See docs/deployment/deployment-options.md
- AWS Lambda: See docs/deployment/deployment-options.md
- Supabase Setup Guide: docs/deployment/supabase-setup.md - Step-by-step instructions for setting up Supabase and GitHub Actions
- Quick Start Checklist: docs/deployment/supabase-quickstart.md - Quick reference checklist for setup
- Deployment Options: docs/deployment/deployment-options.md - Comprehensive guide for all deployment options
See docs/operations/production-checklist.md for detailed production deployment checklist.
See specs/001-meeting-summaries-ingestion/quickstart.md for local development setup instructions.
- Specification: specs/001-meeting-summaries-ingestion/spec.md
- Implementation Plan: specs/001-meeting-summaries-ingestion/plan.md
- Data Model: specs/001-meeting-summaries-ingestion/data-model.md
- Quickstart Guide: specs/001-meeting-summaries-ingestion/quickstart.md
- Deployment Options: docs/deployment/deployment-options.md - Comprehensive guide for deployment options (GitHub Actions recommended, serverless, etc.)
- Production Checklist: docs/operations/production-checklist.md - Environment configuration and pre-deployment verification
- Troubleshooting Guide: docs/operations/troubleshooting.md - Common issues and solutions
- Operations Runbook: docs/operations/runbook.md - Step-by-step operational procedures
- Changelog: CHANGELOG.md - Version history and release notes
[Add license information]