
dathere/pose-ckanext-metadata



CKAN Ecosystem Metadata Pipelines

Data pipeline workflows for continuously cataloging metadata from CKAN instances and extensions worldwide. Powers the CKAN Ecosystem Catalog with real-time insights into the open data infrastructure landscape.



Pipeline Details

Extensions Pipeline

Trigger: Every Sunday at 02:00 UTC (or manual dispatch)

Stages:

  1. Discovery (1getURL.py)

    • Queries CKAN catalog for extension repositories
    • Outputs: url_list.csv with GitHub URLs
  2. Metadata Collection (2refresh.py)

    • Fetches GitHub metrics via REST API
    • Metrics: stars, forks, releases, contributors, issues
    • Outputs: dynamic_metadata_update.csv
  3. Catalog Sync (3updateCatalog.py)

    • Updates CKAN package metadata
    • Atomic updates with rollback on failure
  4. Time-Series Storage (datapump.py)

    • Appends daily snapshots to datastore
    • Enables historical trend analysis
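
The metadata collection stage can be sketched roughly as follows. This is a minimal illustration, not the contents of 2refresh.py: the helper names (parse_repo, extract_metrics, fetch_repo) are hypothetical, and only the fields returned by GitHub's "get a repository" endpoint are covered. Releases and contributors would need additional endpoint calls, and GitHub's open_issues_count also counts open pull requests.

```python
import json
import urllib.request
from urllib.parse import urlparse

GITHUB_API = "https://api.github.com"


def parse_repo(url):
    """Split a URL like https://github.com/owner/repo into (owner, repo)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"not a repository URL: {url}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[:-4]
    return owner, repo


def extract_metrics(repo_json):
    """Pick the metric fields out of a GitHub repository API response."""
    return {
        "stars": repo_json.get("stargazers_count", 0),
        "forks": repo_json.get("forks_count", 0),
        "open_issues": repo_json.get("open_issues_count", 0),
    }


def fetch_repo(owner, repo, token):
    """Fetch repository metadata (hypothetical helper; performs a network call)."""
    request = urllib.request.Request(
        f"{GITHUB_API}/repos/{owner}/{repo}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request, timeout=15) as response:
        return json.load(response)
```

With a valid token, `extract_metrics(fetch_repo(*parse_repo(url), token))` would yield one row of metrics per extension URL.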

CKAN Instance Data Collection (sites-data-fetch/)

Work in Progress

Sites Pipeline

Trigger: Every Sunday at 03:00 UTC (1 hour after extensions)

Stages:

  1. Site Discovery (1getSitesURL.py)

    • Extracts known CKAN instances from catalog
    • Outputs: site_urls.csv
  2. Instance Profiling (2CKANActionAPI.py)

    • Queries CKAN Action API (/api/3/action/status_show)
    • Fetches: datasets, groups, organizations, version, extensions
    • Concurrent processing: 10 workers, 15s timeout
    • Outputs: ckan_stats.csv
  3. Catalog Update (3updateSitesCatalog.py)

    • Syncs instance metadata to catalog
  4. Time-Series Storage (datapump.py)

    • Appends instance snapshots to datastore
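
The concurrent profiling in stage 2 might look something like the sketch below. It assumes a thread pool and the standard library HTTP client; the function names and the shape of the returned rows are hypothetical, and only the version and extensions fields of the status_show response are read here. Counts of datasets, groups, and organizations would need further Action API calls.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 10       # worker count stated in the pipeline description
TIMEOUT_SECONDS = 15   # per-request timeout stated in the pipeline description


def status_url(site_url):
    """Build the status_show endpoint URL for a CKAN site."""
    return site_url.rstrip("/") + "/api/3/action/status_show"


def fetch_status(site_url):
    """Call status_show and return its 'result' object (network call)."""
    with urllib.request.urlopen(status_url(site_url), timeout=TIMEOUT_SECONDS) as resp:
        return json.load(resp)["result"]


def profile_site(site_url, fetch=fetch_status):
    """Profile one site; a failure becomes a row instead of aborting the run."""
    try:
        result = fetch(site_url)
    except Exception as exc:  # timeouts, DNS errors, non-CKAN responses
        return {"url": site_url, "ok": False, "error": str(exc)}
    return {
        "url": site_url,
        "ok": True,
        "version": result.get("ckan_version"),
        "extensions": result.get("extensions", []),
    }


def profile_sites(site_urls, fetch=fetch_status):
    """Profile all sites concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(lambda url: profile_site(url, fetch), site_urls))
```

Passing `fetch` as a parameter keeps the concurrency logic testable without network access; one slow or unreachable site only costs its own worker up to the timeout.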

Getting Started

Prerequisites

  • Python 3.9+
  • CKAN API access with write permissions
  • GitHub Personal Access Token (for extensions pipeline)

Configuration

Set up the following GitHub secrets:

CKAN_API_KEY="your-ckan-api-key"
GITHUB_TOKEN="your-github-token"  # For extensions pipeline
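
A script could read these values as shown in the sketch below. It assumes the secrets are exposed to the scripts as environment variables with the same names, which is an assumption about the workflow setup; load_config is a hypothetical helper.

```python
import os

# Names taken from the configuration above; exposure as environment
# variables is an assumption about how the workflows pass them in.
REQUIRED = ("CKAN_API_KEY", "GITHUB_TOKEN")


def load_config(env=None):
    """Return the required settings, failing early if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}
```

Failing at startup with the list of missing names is usually easier to debug than an authentication error halfway through a run.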

Automation

GitHub Actions Workflows

Both pipelines run automatically via GitHub Actions:

  • Extensions: Sundays at 02:00 UTC
  • Sites: Sundays at 03:00 UTC (staggered to avoid resource contention)

Manual Triggering:

  1. Navigate to the Actions tab in GitHub
  2. Select the workflow
  3. Click "Run workflow"

Monitoring:

  • Workflow status badges in README
  • Artifact uploads on success (CSV files, 30-day retention)
  • Debug artifact uploads on failure (logs, 7-day retention)
  • Detailed execution summaries with file metrics

Data Access

Public Catalog

Browse and download data via the CKAN Ecosystem Catalog:

  • Extensions Dataset: ckan-extensions-metadata
  • Sites Dataset: ckan-sites-metadata
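
For programmatic access, the datasets can in principle be looked up through the standard CKAN Action API. The sketch below assumes that the names above are the dataset identifiers and takes the catalog base URL as a parameter, since it is not given here; the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed to be the dataset identifiers (slugs) in the catalog.
DATASETS = {
    "extensions": "ckan-extensions-metadata",
    "sites": "ckan-sites-metadata",
}


def package_show_url(catalog_url, dataset_id):
    """Build the package_show URL for a dataset on a CKAN catalog."""
    query = urlencode({"id": dataset_id})
    return f"{catalog_url.rstrip('/')}/api/3/action/package_show?{query}"


def resource_urls(package_show_response):
    """List the download URLs of a dataset's resources from a package_show response."""
    resources = package_show_response["result"]["resources"]
    return [r["url"] for r in resources if r.get("url")]
```

Fetching the URL returned by `package_show_url` yields a JSON document whose `result.resources` entries carry the download links for each file.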

Project managed by

[logos]

Funding provided through the National Science Foundation's Pathways to Enable Open Source Ecosystems (POSE) program.

