helpers-no/software-scrape

Software Scrape

A TypeScript tool for building a curated catalog of European and open-source software alternatives.

Purpose

This project scrapes government-vetted and expert-curated software catalogs to build a dataset for sovereignsky.no, a website that helps organizations find software alternatives that support digital sovereignty.

Priority Sources

We prioritize government-vetted catalogs (trust level 5) for quality and legal compliance:

Source             Country  Description
SILL France        FR       French government open source catalog
openCode.de        DE       German public sector software catalog
Developers Italia  IT       Italian government software catalog
EU OSS Catalogue   EU       European Commission open source catalog

See catalog/software-sources.json for the complete source registry with trust levels.
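
As a rough illustration of how trust levels could drive source selection, the sketch below filters a small in-memory registry down to government-vetted entries. The `SourceEntry` shape, its field names, the `id` values, and the level-3 sample entry are assumptions made for this example; the actual schema is defined in catalog/software-sources.json.

```typescript
// Hypothetical shape of one registry entry. Field names are assumptions
// for illustration, not the schema used in catalog/software-sources.json.
interface SourceEntry {
  id: string;
  displayName: string;
  country: string;    // country code, or "EU" for EU-level sources
  trustLevel: number; // 5 = government-vetted, per the section above
}

// Sample data: two sources from the table above plus one made-up
// lower-trust entry so the filter has something to exclude.
const sampleSources: SourceEntry[] = [
  { id: "sill-france", displayName: "SILL France", country: "FR", trustLevel: 5 },
  { id: "opencode-de", displayName: "openCode.de", country: "DE", trustLevel: 5 },
  { id: "example-community-list", displayName: "Example community list", country: "EU", trustLevel: 3 },
];

// Keep only sources at or above the requested trust level, preserving order.
function prioritySources(all: SourceEntry[], minTrust: number = 5): SourceEntry[] {
  return all.filter((s) => s.trustLevel >= minTrust);
}

const prioritized = prioritySources(sampleSources);
console.log(prioritized.map((s) => s.id)); // [ 'sill-france', 'opencode-de' ]
```

Passing a lower `minTrust` would widen the selection to expert-curated sources as well.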

Key Features

  • Multi-taxonomy categorization - Business, technical, developer, and platform categories
  • Alternative mapping - Connect proprietary tools to open-source alternatives
  • Government focus - Prioritize EU-based, GDPR-compliant solutions
  • Automated validation - 135 tests ensure data quality
  • Modular architecture - Domain-based structure for multi-domain support
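
To make the alternative-mapping idea concrete, here is a minimal sketch that builds a lookup from a proprietary tool name to the open-source entries that can replace it. The `CatalogEntry` shape, the `alternativeTo` field, and the sample entries are assumptions for illustration and are not taken from the project's catalog schema.

```typescript
// Hypothetical catalog entry; the project's real schema is documented
// in docs/software-catalog-spec.md and may look quite different.
interface CatalogEntry {
  name: string;
  openSource: boolean;
  alternativeTo: string[]; // proprietary tools this entry can replace
}

// Illustrative sample data, not an excerpt from the catalog.
const sampleCatalog: CatalogEntry[] = [
  { name: "Mattermost", openSource: true, alternativeTo: ["Slack"] },
  { name: "Nextcloud", openSource: true, alternativeTo: ["Dropbox", "Google Drive"] },
  { name: "Rocket.Chat", openSource: true, alternativeTo: ["Slack"] },
];

// Build an index keyed by the lower-cased proprietary tool name.
function buildAlternativeIndex(entries: CatalogEntry[]): Record<string, string[]> {
  const index: Record<string, string[]> = {};
  for (const entry of entries) {
    if (!entry.openSource) continue; // only map to open-source alternatives
    for (const target of entry.alternativeTo) {
      const key = target.toLowerCase();
      if (!Object.prototype.hasOwnProperty.call(index, key)) {
        index[key] = [];
      }
      index[key].push(entry.name);
    }
  }
  return index;
}

const altIndex = buildAlternativeIndex(sampleCatalog);
console.log(altIndex["slack"]); // [ 'Mattermost', 'Rocket.Chat' ]
```

Lower-casing the key is one simple way to make lookups tolerant of capitalization differences between sources.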

Quick Start

All commands run inside the devcontainer. See CLAUDE.md for devcontainer setup.

# Install dependencies
npm install

# Run a scraper
npm run scrape euro-stack

# Run tests
npm test

# Build catalog
npm run catalog:build

# Validate
npm run catalog:validate

Documentation

Document                                  Description
CLAUDE.md                                 LLM instructions, devcontainer setup
docs/DATA-FLOW.md                         Data pipeline, scripts, test suite, weekly workflow
docs/FOLDER-STRUCTURE.md                  Project directory organization
docs/software-catalog-spec.md             Schema specification
src/domains/software/scrapers/README.md   Scraper development guide
catalog/README.md                         Catalog schema documentation

Project Structure

src/
├── domains/software/   # Software catalog domain
│   ├── scrapers/       # Individual scraper implementations
│   ├── catalog/        # Modular catalog building (sources, enrichments, vendors)
│   └── lib/            # Domain utilities (category-resolver)
├── lib/                # Shared utilities (http, logger, text, dates)
├── types/              # Shared TypeScript interfaces
├── commands/           # CLI tools (check-duplicates, suggest-canonical)
└── catalog/            # Catalog build orchestration

scraped/normalized/     # Scraper output (JSON files)
catalog/                # Canonical catalog data and schemas
data/software/          # Deliverable output (for Hugo/PWA)
tests/                  # Integration tests
docs/                   # Documentation

See docs/FOLDER-STRUCTURE.md for complete structure.

Implemented Scrapers

Scraper             Products  Description
euro-stack          1,103     European software alternatives
switching-software  130       Privacy-focused alternatives
cncf-landscape      1,354     Cloud-native infrastructure
cloud-service-map   575       AWS/Azure/GCP equivalents
openalternative     656       Open source alternatives
sill-france         625       French government catalog

Adding a New Scraper

  1. Create folder src/domains/software/scrapers/<source>/
  2. Implement index.ts with run() returning Software[]
  3. Use CategoryResolver for category mapping
  4. Add to test suite in src/domains/software/scrapers/output-validation.test.ts
  5. Run npm test to validate

See src/domains/software/scrapers/SCRAPER-TEMPLATE.md for the full checklist.
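
The steps above could be sketched roughly as follows. This is a self-contained illustration, not the project's template: the `Software` fields, the `CategoryResolver` class and its `resolve` method, the `RawEntry` type, and the sample data are all stand-ins invented for this example, and the fetch-and-parse step is replaced by an in-memory array. A real scraper would import the project's own types and resolver, and may well be asynchronous.

```typescript
// Hypothetical minimal shape; the project's real Software interface
// likely has more fields.
interface Software {
  name: string;
  url: string;
  categories: string[];
  source: string;
}

// Stand-in for the project's CategoryResolver; the real API may differ.
class CategoryResolver {
  private readonly mapping: Record<string, string>;
  constructor(mapping: Record<string, string>) {
    this.mapping = mapping;
  }
  // Map a raw source tag to a canonical category, with a fallback.
  resolve(rawTag: string): string {
    const key = rawTag.trim().toLowerCase();
    return Object.prototype.hasOwnProperty.call(this.mapping, key)
      ? this.mapping[key]
      : "uncategorized";
  }
}

// Hypothetical record as it might come out of a source's listing.
interface RawEntry {
  title: string;
  link: string;
  tags: string[];
}

// In-memory sample standing in for the fetch-and-parse step.
const sampleRawEntries: RawEntry[] = [
  { title: " Example Editor ", link: "https://example.org/editor", tags: ["Office", "office", "Docs"] },
  { title: "", link: "https://example.org/broken", tags: ["Office"] },
];

// Drop incomplete entries, trim fields, and de-duplicate resolved categories.
function normalize(entries: RawEntry[], resolver: CategoryResolver, source: string): Software[] {
  return entries
    .filter((e) => e.title.trim() !== "" && e.link.trim() !== "")
    .map((e) => {
      const resolved = e.tags.map((t) => resolver.resolve(t));
      const categories = resolved.filter((c, i) => resolved.indexOf(c) === i);
      return { name: e.title.trim(), url: e.link.trim(), categories, source };
    });
}

// Entry point in the spirit of step 2: run() returns Software[].
function run(): Software[] {
  const resolver = new CategoryResolver({ office: "productivity", docs: "productivity" });
  return normalize(sampleRawEntries, resolver, "example-source");
}

const scraped = run();
console.log(JSON.stringify(scraped, null, 2));
```

Keeping `normalize` as a pure function separate from `run` makes it straightforward to cover in an output-validation test without touching the network.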

Testing

npm test                    # Run all tests
npm run test:watch          # Watch mode
npm run test:coverage       # With coverage

See docs/DATA-FLOW.md#test-suite for test details.

In Development

  • SILL France scraper (government catalog) - Implemented
  • openCode.de scraper (government catalog)
  • awesome-selfhosted scraper
  • CI/CD pipeline
