Data pipeline workflows for continuously cataloging metadata from CKAN instances and extensions worldwide. Powers the CKAN Ecosystem Catalog with regularly refreshed insights into the open data infrastructure landscape.
Trigger: Every Sunday at 02:00 UTC (or manual dispatch)
Stages:
- Discovery (1getURL.py)
  - Queries CKAN catalog for extension repositories
  - Outputs: url_list.csv with GitHub URLs
- Metadata Collection (2refresh.py)
  - Fetches GitHub metrics via REST API
  - Metrics: stars, forks, releases, contributors, issues
  - Outputs: dynamic_metadata_update.csv
- Catalog Sync (3updateCatalog.py)
  - Updates CKAN package metadata
  - Atomic updates with rollback on failure
- Time-Series Storage (datapump.py)
  - Appends daily snapshots to datastore
  - Enables historical trend analysis
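The metadata collection stage above can be sketched roughly as follows. This is a minimal illustration, not the code in 2refresh.py: the helper names (`repo_api_url`, `extract_metrics`, `write_rows`) and the example repository are hypothetical. It assumes the stage maps each GitHub URL to the REST endpoint `GET /repos/{owner}/{repo}` and reads the counters that payload exposes directly. Release and contributor counts are not part of that payload and would need further requests, which the sketch leaves out.

```python
import csv
import io
from urllib.parse import urlparse

GITHUB_API = "https://api.github.com"


def repo_api_url(github_url: str) -> str:
    """Map https://github.com/<owner>/<repo> to the REST API repository endpoint."""
    parts = [p for p in urlparse(github_url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"not a repository URL: {github_url!r}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[:-4]
    return f"{GITHUB_API}/repos/{owner}/{repo}"


def extract_metrics(repo_json: dict) -> dict:
    """Pick the counters that the repository payload carries directly."""
    return {
        "stars": repo_json.get("stargazers_count", 0),
        "forks": repo_json.get("forks_count", 0),
        # GitHub counts open pull requests in this field as well.
        "open_issues": repo_json.get("open_issues_count", 0),
    }


def write_rows(rows, fieldnames, out) -> None:
    """Write metric rows as CSV with a header line."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Made-up payload standing in for a real API response (hypothetical repository).
sample_payload = {"stargazers_count": 12, "forks_count": 3, "open_issues_count": 5}
row = {
    "url": "https://github.com/example-org/ckanext-example",
    **extract_metrics(sample_payload),
}
buffer = io.StringIO()
write_rows([row], ["url", "stars", "forks", "open_issues"], buffer)
print(buffer.getvalue())
```

Keeping URL parsing and payload parsing free of network calls makes both easy to test without a token. An authenticated request would add an `Authorization` header carrying the token.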
Work in Progress
Trigger: Every Sunday at 03:00 UTC (1 hour after extensions)
Stages:
- Site Discovery (1getSitesURL.py)
  - Extracts known CKAN instances from catalog
  - Outputs: site_urls.csv
- Instance Profiling (2CKANActionAPI.py)
  - Queries CKAN Action API (/api/3/action/status_show)
  - Fetches: datasets, groups, organizations, version, extensions
  - Concurrent processing: 10 workers, 15s timeout
  - Outputs: ckan_stats.csv
- Catalog Update (3updateSitesCatalog.py)
  - Syncs instance metadata to catalog
- Time-Series Storage (datapump.py)
  - Appends instance snapshots to datastore
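The concurrent profiling step above (10 workers, 15s timeout) could look something like the sketch below. It is an assumption about the approach, not the contents of 2CKANActionAPI.py: the function names are hypothetical, and it uses only the standard library. It reads `ckan_version` and `extensions` from the `status_show` result. Dataset, group, and organization counts are not in that result, so they would presumably come from other Action API calls that are omitted here.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

MAX_WORKERS = 10       # worker count stated in the pipeline description
TIMEOUT_SECONDS = 15   # per-request timeout stated in the pipeline description


def fetch_status(site_url: str) -> dict:
    """Call status_show on one CKAN instance and return its 'result' object."""
    endpoint = site_url.rstrip("/") + "/api/3/action/status_show"
    with urlopen(endpoint, timeout=TIMEOUT_SECONDS) as response:
        payload = json.load(response)
    return payload["result"]


def profile_sites(site_urls, fetch=fetch_status):
    """Profile sites concurrently; a failing site yields an error entry
    instead of aborting the whole run."""

    def safe_fetch(url):
        try:
            result = fetch(url)
            return {
                "url": url,
                "ok": True,
                "ckan_version": result.get("ckan_version"),
                "extensions": result.get("extensions", []),
            }
        except Exception as exc:  # timeouts, HTTP errors, malformed JSON
            return {"url": url, "ok": False, "error": str(exc)}

    # pool.map returns results in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(safe_fetch, site_urls))
```

The `fetch` parameter exists so the function can be exercised with a stub, without contacting real instances. Recording failures per site keeps one unreachable instance from discarding the rest of the batch.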
- Python 3.9+
- CKAN API access with write permissions
- GitHub Personal Access Token (for extensions pipeline)
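A small preflight check for the prerequisites above might look like this. It is a sketch under one assumption: that the credentials reach the scripts as environment variables named CKAN_API_KEY and GITHUB_TOKEN. The function names are hypothetical and not part of the repository.

```python
import os
import sys

REQUIRED_PYTHON = (3, 9)


def missing_settings(env, need_github=True):
    """Return the names of required variables that are unset or empty."""
    names = ["CKAN_API_KEY"]
    if need_github:  # only the extensions pipeline needs the GitHub token
        names.append("GITHUB_TOKEN")
    return [name for name in names if not env.get(name)]


def check_prerequisites(env=os.environ, version=sys.version_info, need_github=True):
    """Collect human-readable problems; an empty list means ready to run."""
    problems = []
    if tuple(version[:2]) < REQUIRED_PYTHON:
        problems.append("Python 3.9+ is required")
    for name in missing_settings(env, need_github):
        problems.append(f"environment variable {name} is not set")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    for issue in issues:
        print(f"error: {issue}", file=sys.stderr)
    sys.exit(1 if issues else 0)
```

Running a check like this first makes a misconfigured secret fail early with a clear message rather than partway through a pipeline stage.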
Set up GitHub secret variables:
CKAN_API_KEY="your-ckan-api-key"
GITHUB_TOKEN="your-github-token" # For extensions pipeline
Both pipelines run automatically via GitHub Actions:
- Extensions: Sundays at 02:00 UTC
- Sites: Sundays at 03:00 UTC (staggered to avoid resource contention)
Manual Triggering:
- Navigate to Actions tab in GitHub
- Select workflow
- Click "Run workflow"
Monitoring:
- Workflow status badges in README
- Artifact uploads on success (CSV files, 30-day retention)
- Debug artifact uploads on failure (logs, 7-day retention)
- Detailed execution summaries with file metrics
Browse and download data via the CKAN Ecosystem Catalog:
- Extensions Dataset: ckan-extensions-metadata
- Sites Dataset: ckan-sites-metadata
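Besides browsing, the datasets can likely be read programmatically through the CKAN Action API. The sketch below assumes the two names above are the dataset identifiers and that the catalog exposes the standard `package_show` action. The base URL is a placeholder, since the catalog address is not given here, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder: replace with the real catalog address.
CATALOG_URL = "https://catalog.example.org"


def package_show_url(dataset_id: str, base_url: str = CATALOG_URL) -> str:
    """Build the package_show request URL for one dataset."""
    query = urlencode({"id": dataset_id})
    return f"{base_url.rstrip('/')}/api/3/action/package_show?{query}"


def resource_urls(package_show_response: dict) -> list:
    """Extract (name, download URL) pairs from a package_show response."""
    if not package_show_response.get("success"):
        raise RuntimeError("CKAN reported a failed request")
    resources = package_show_response["result"].get("resources", [])
    return [(res.get("name"), res.get("url")) for res in resources]


def fetch_resource_urls(dataset_id: str, base_url: str = CATALOG_URL) -> list:
    """Query the catalog and list the downloadable resources of a dataset."""
    with urlopen(package_show_url(dataset_id, base_url), timeout=15) as response:
        return resource_urls(json.load(response))
```

For example, `fetch_resource_urls("ckan-sites-metadata", base_url=...)` would return the resource links for the sites dataset once pointed at the actual catalog.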
Project managed by
Funding provided through the National Science Foundation's Pathways to Enable Open Source Ecosystems (POSE) program.