DfE Apprenticeship Data Extraction Scripts

This repository contains Python scripts for extracting and analysing apprenticeship data from UK Department for Education (DfE) statistical releases.

Recent Updates

Latest: Intelligent file discovery now automatically selects the most recent data files based on academic year and quarter/month patterns. See FILE_DISCOVERY.md for details.

Refactored: The codebase has been restructured to improve maintainability and reduce duplication. See REFACTORING.md for details.

Scripts

vacancies.py

Extracts Software Developer (Level 4) apprenticeship vacancy data from DfE vacancy CSV files and presents it in various formats suitable for analysis.

Features:

  • Automatic file discovery: Finds and uses the most recent vacancy data file
  • Filters vacancy data specifically for Software Developer apprenticeships
  • Groups data by training provider and employer
  • Provides multiple output formats (table, CSV, Markdown, TSV)
  • Cleans company names (removes legal suffixes such as "Ltd" and "PLC")
  • Separates London vs other UK locations
  • Aggregates small providers for better data presentation
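
As an illustration of the company name cleaning mentioned above, here is a minimal sketch. The suffix list, the regular expression, and the function name `clean_company_name` are assumptions made for this example; the actual implementation lives in utils.py and may differ.

```python
import re
from typing import Tuple

# Hypothetical suffix list; the real one may be defined in config.py.
LEGAL_SUFFIXES: Tuple[str, ...] = ("LIMITED", "LTD", "PLC", "LLP")

# Match one legal suffix at the end of the name, preceded by whitespace
# or a comma and optionally followed by a full stop.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(LEGAL_SUFFIXES) + r")\.?\s*$",
    re.IGNORECASE,
)


def clean_company_name(name: str) -> str:
    """Strip a single trailing legal suffix and surrounding whitespace."""
    return _SUFFIX_RE.sub("", name.strip()).strip()
```

Because the pattern requires whitespace or a comma before the suffix and is anchored to the end of the string, a name that merely contains the letters "ltd" (for example "Altdorf Systems") is left unchanged.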

Usage:

# Automatic discovery (uses most recent file)
python3 vacancies.py [options]

# Specify a file explicitly
python3 vacancies.py [options] [input_file]

Options:

  • --csv, -c: Output in CSV format (suitable for importing into databases)
  • --table: Output in table format (console-friendly aligned tables)
  • --tsv, -t: Output in tab-separated format (for copy-paste into spreadsheets)
  • --help, -h: Show help message

Default behaviour: Markdown table format using the most recent vacancy file

Output Format: Two tables showing:

  1. Providers Table: Training providers with employer count and total vacancies
  2. Employers Table: Detailed breakdown with employer, provider, location, and positions

The script varies the level of detail by provider size:

  • More than 10 apprenticeships: detailed breakdown
  • 4-10 apprenticeships: summary
  • 3 or fewer apprenticeships: aggregated into a single total
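
The three tiers above can be sketched as follows. The two threshold constants mirror the values documented for config.py; the function name `tier_providers`, the tier labels, and the provider names in the usage example are hypothetical.

```python
from typing import Dict, List

# Values as documented in the Configuration section of this README.
VACANCY_LARGE_PROVIDER_THRESHOLD = 10  # more than 10 -> detailed breakdown
VACANCY_SMALL_PROVIDER_MAX = 3         # 3 or fewer -> aggregated total


def tier_providers(totals: Dict[str, int]) -> Dict[str, List[str]]:
    """Assign each provider to a presentation tier by its vacancy count."""
    tiers: Dict[str, List[str]] = {"detailed": [], "summary": [], "aggregated": []}
    # Largest providers first; ties are broken alphabetically.
    for provider, count in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0])):
        if count > VACANCY_LARGE_PROVIDER_THRESHOLD:
            tiers["detailed"].append(provider)
        elif count > VACANCY_SMALL_PROVIDER_MAX:
            tiers["summary"].append(provider)
        else:
            tiers["aggregated"].append(provider)
    return tiers


example = tier_providers({"Provider A": 25, "Provider B": 7, "Provider C": 3})
```

Note that a provider with exactly 10 positions falls into the summary tier, since the detailed tier requires more than 10.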

Examples:

python3 vacancies.py                    # Markdown format, latest file
python3 vacancies.py --table            # Console table format
python3 vacancies.py --csv              # CSV format for import
python3 vacancies.py data/file.csv      # Use specific file

starts.py

Extracts apprenticeship starts data for a specific standard and presents it as a league table with years as columns and providers as rows.

Features:

  • Automatic file discovery: Finds and uses the most recent starts data file
  • Quarterly breakdown: Most recent year is broken down into Q1, Q2, Q3, Q4 columns
  • Filters data for any apprenticeship standard code (defaults to ST0116)
  • Creates year-over-year comparison tables
  • Lists providers with 3+ starts in the most recent year individually and aggregates the rest
  • Includes total row showing all starts across providers
  • Automatically extracts from zip files if needed
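
For the zip handling mentioned in the last feature, one possible approach is sketched below using only the standard library. The function name `extract_csv_if_missing` and its parameters are hypothetical; how starts.py actually locates and unpacks archives is not documented here.

```python
import zipfile
from pathlib import Path
from typing import Optional


def extract_csv_if_missing(csv_name: str, zip_path: Path, dest_dir: Path) -> Optional[Path]:
    """Return the path to csv_name, extracting it from zip_path if needed.

    Returns None when the CSV is neither on disk nor inside the archive.
    """
    target = dest_dir / csv_name
    if target.exists():
        return target  # already extracted, nothing to do
    if not zip_path.exists():
        return None
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Compare on the base name so files in nested folders are found.
            if Path(member).name == csv_name:
                # Write the bytes directly, flattening any folder structure.
                target.write_bytes(archive.read(member))
                return target
    return None
```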

Usage:

# Automatic discovery (uses most recent file)
python3 starts.py [options] [standard_code]

# Specify a file explicitly
python3 starts.py [options] [standard_code] [input_file]

Options:

  • --csv, -c: Output in CSV format
  • --table: Output in console table format
  • --tsv, -t: Output in tab-separated format
  • --help, -h: Show help message

Default Standard: ST0116 (Software Developer)

Output Format: League table showing:

  1. Total row: Combined starts across all providers by year and quarter
  2. Major providers: Providers with 3+ total starts in most recent year
  3. All other providers: Aggregated smaller providers
  4. Most recent year: Broken down into Q1, Q2, Q3, Q4 columns for detailed analysis
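
The row structure of the league table can be sketched as below. This is a simplified illustration: it omits the quarterly breakdown, the input shape `(provider, year, starts)` and the function name `league_table` are assumptions, and only the threshold value comes from the documented config.py setting.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

STARTS_MIN_THRESHOLD = 3  # value documented in the Configuration section

Row = Tuple[str, Dict[str, int]]


def league_table(records: Iterable[Tuple[str, str, int]], latest_year: str) -> List[Row]:
    """Build rows: Total, major providers, then all other providers."""
    by_provider: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for provider, year, starts in records:
        by_provider[provider][year] += starts

    total: Dict[str, int] = defaultdict(int)
    others: Dict[str, int] = defaultdict(int)
    major: List[Row] = []
    for provider, years in by_provider.items():
        for year, starts in years.items():
            total[year] += starts
        if years.get(latest_year, 0) >= STARTS_MIN_THRESHOLD:
            major.append((provider, dict(years)))
        else:
            # Below the threshold: fold into the aggregated row.
            for year, starts in years.items():
                others[year] += starts

    # Rank major providers by starts in the most recent year.
    major.sort(key=lambda row: (-row[1].get(latest_year, 0), row[0]))
    return [("Total", dict(total))] + major + [("All other providers", dict(others))]
```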

Examples:

python3 starts.py                       # ST0116 (Software Developer), latest file
python3 starts.py ST0113                # ST0113, latest file
python3 starts.py ST0116 data.csv       # ST0116, specific file
python3 starts.py --table ST0116        # Console table format
python3 starts.py --csv ST0113          # CSV output

monthly.py

Extracts monthly apprenticeship starts data for a specific standard and presents it as a table with years as columns and months as rows (in academic year order: Aug-Jul).

Features:

  • Automatic file discovery: Finds and uses the most recent monthly starts file
  • Filters data for any apprenticeship standard code (defaults to ST0116)
  • Creates month-by-month comparison across years
  • Displays months in academic year order (August to July)
  • Includes total row showing annual totals
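
A minimal sketch of the academic-year layout follows. The input shape (a dictionary keyed by `(year, month)`) and the function name `monthly_rows` are assumptions made for illustration, not the script's actual interface.

```python
from typing import Dict, List, Tuple

# Academic year order: August first, July last.
ACADEMIC_MONTH_ORDER = ["Aug", "Sep", "Oct", "Nov", "Dec", "Jan",
                        "Feb", "Mar", "Apr", "May", "Jun", "Jul"]


def monthly_rows(starts: Dict[Tuple[str, str], int], years: List[str]) -> List[List[object]]:
    """Return one row per month (Aug-Jul) plus a final Total row.

    Each row is [label, value for years[0], value for years[1], ...].
    Missing (year, month) combinations count as zero.
    """
    rows: List[List[object]] = []
    for month in ACADEMIC_MONTH_ORDER:
        rows.append([month] + [starts.get((year, month), 0) for year in years])
    totals = [sum(starts.get((year, m), 0) for m in ACADEMIC_MONTH_ORDER) for year in years]
    rows.append(["Total"] + totals)
    return rows
```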

Usage:

# Automatic discovery (uses most recent file)
python3 monthly.py [options] [standard_code]

# Specify a file explicitly
python3 monthly.py [options] [standard_code] [input_file]

Options:

  • --csv, -c: Output in CSV format
  • --table: Output in console table format
  • --tsv, -t: Output in tab-separated format
  • --help, -h: Show help message

Default Standard: ST0116 (Software Developer)

Examples:

python3 monthly.py                      # ST0116, latest file
python3 monthly.py ST0113               # ST0113, latest file
python3 monthly.py ST0116 data.csv      # ST0116, specific file
python3 monthly.py --table ST0113       # ST0113, table format

provider.py

Extracts apprenticeship starts for a specific training provider and presents them by standard (apprenticeship type) with years as columns.

Features:

  • Automatic file discovery: Finds and uses the most recent starts data file
  • Filters data for any training provider (defaults to "FOUNDERS & CODERS")
  • Shows all standards as individual rows
  • Creates year-over-year comparison tables
  • Includes total row showing all starts across standards

Usage:

# Automatic discovery (uses most recent file)
python3 provider.py [options] [provider_name]

# Specify a file explicitly
python3 provider.py [options] [provider_name] [input_file]

Options:

  • --csv, -c: Output in CSV format
  • --table: Output in console table format
  • --tsv, -t: Output in tab-separated format
  • --help, -h: Show help message

Default Provider: FOUNDERS & CODERS

Examples:

python3 provider.py                          # FOUNDERS & CODERS, latest file
python3 provider.py "QA"                     # QA, latest file
python3 provider.py "MAKERS ACADEMY"         # MAKERS ACADEMY, latest file
python3 provider.py --csv "MULTIVERSE GROUP" # MULTIVERSE GROUP, CSV format

regions.py

Extracts apprenticeship starts by region for a specific standard, showing geographic distribution of apprenticeships.

Features:

  • Automatic file discovery: Finds and uses the most recent starts data file
  • Filters data for any apprenticeship standard code (defaults to ST0116)
  • Shows all regions individually (sorted by most recent year)
  • Uses learner home region as proxy for employer location
  • Includes total row showing all starts across regions

Usage:

# Automatic discovery (uses most recent file)
python3 regions.py [options] [standard_code]

# Specify a file explicitly
python3 regions.py [options] [standard_code] [input_file]

Options:

  • --csv, -c: Output in CSV format
  • --table: Output in console table format
  • --tsv, -t: Output in tab-separated format
  • --help, -h: Show help message

Default Standard: ST0116 (Software Developer)

Examples:

python3 regions.py              # ST0116, latest file
python3 regions.py ST0113       # ST0113, latest file
python3 regions.py --table      # ST0116, table format

london_sme.py

Extracts London-based SME apprenticeship starts for a specific standard, filtered by learner home region (London) and funding type (SME/other funding).

Features:

  • Automatic file discovery: Finds and uses the most recent underlying starts file
  • Filters for London learners with SME (non-levy) funding
  • Includes manual adjustments for FOUNDERS & CODERS employer-provider apprenticeships
  • Shows all providers sorted by most recent year starts
  • Identifies and separates closed/rogue providers

Usage:

# Automatic discovery (uses most recent file)
python3 london_sme.py [options] [standard_code]

# Specify a file explicitly
python3 london_sme.py [options] [standard_code] [input_file]

Options:

  • --csv, -c: Output in CSV format
  • --table: Output in console table format
  • --tsv, -t: Output in tab-separated format
  • --help, -h: Show help message

Default Standard: ST0116 (Software Developer)

Examples:

python3 london_sme.py           # ST0116, latest file
python3 london_sme.py ST0113    # ST0113, latest file
python3 london_sme.py --table   # ST0116, table format

funding.py

Extracts apprenticeship starts by funding type (employer size) for a specific standard, showing the split between large employers (levy-funded) and SMEs (other funding).

Features:

  • Automatic file discovery: Finds and uses the most recent underlying starts file
  • Filters data for any apprenticeship standard code (defaults to ST0116)
  • Shows funding type as proxy for employer size:
    • "Large employers (levy-funded)" = Companies with £3m+ annual payroll
    • "SMEs (other funding)" = Small/medium employers with government co-investment
  • Includes total row showing all starts

Usage:

# Automatic discovery (uses most recent file)
python3 funding.py [options] [standard_code]

# Specify a file explicitly
python3 funding.py [options] [standard_code] [input_file]

Options:

  • --csv, -c: Output in CSV format
  • --table: Output in console table format
  • --tsv, -t: Output in tab-separated format
  • --help, -h: Show help message

Default Standard: ST0116 (Software Developer)

Examples:

python3 funding.py              # ST0116, latest file
python3 funding.py ST0113       # ST0113, latest file
python3 funding.py --table      # ST0116, table format

Intelligent File Discovery

All scripts automatically discover and use the most recent data files based on:

  • Academic year (e.g., 2024-25 is newer than 2023-24)
  • Quarter/month (e.g., Q3 is newer than Q2; Mar is newer than Nov within the same academic year, which runs Aug-Jul)

Files are found in:

  1. Current directory
  2. apprenticeships_*/supporting-files/ folders

Supported filename patterns:

  • Quarterly: app-underlying-data-{type}-{year}-q{1-4}.csv
    • Example: app-underlying-data-vacancies-202425-q2.csv
  • Monthly: app-underlying-data-{type}-{year}-{month}.csv
    • Example: app-underlying-data-monthly-202425-mar.csv

See FILE_DISCOVERY.md for complete documentation.
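
To make the ranking concrete, here is a minimal sketch that parses the filename patterns above and picks the most recent file from a list of names. The names `PATTERN`, `sort_key`, and `pick_latest` are hypothetical, and ranking months in academic-year order (Aug first) is an assumption; the real logic is in utils.py and is described in FILE_DISCOVERY.md.

```python
import re
from typing import List, Optional, Tuple

# Assumed month ranking: academic year order, so "aug" is the oldest.
ACADEMIC_MONTHS = ["aug", "sep", "oct", "nov", "dec", "jan",
                   "feb", "mar", "apr", "may", "jun", "jul"]

# app-underlying-data-{type}-{year}-q{1-4}.csv or ...-{month}.csv
PATTERN = re.compile(
    r"app-underlying-data-(?P<type>[a-z-]+?)-(?P<year>\d{6})"
    r"-(?P<period>q[1-4]|[a-z]{3})\.csv$"
)


def sort_key(filename: str) -> Optional[Tuple[int, int]]:
    """Return (year, period rank), or None if the name does not match."""
    match = PATTERN.search(filename)
    if match is None:
        return None
    period = match.group("period")
    if len(period) == 2:              # quarterly file: q1..q4
        rank = int(period[1])
    elif period in ACADEMIC_MONTHS:   # monthly file: aug..jul
        rank = ACADEMIC_MONTHS.index(period)
    else:
        return None
    return int(match.group("year")), rank


def pick_latest(filenames: List[str]) -> Optional[str]:
    """Pick the filename with the highest (year, period) key."""
    keyed = [(sort_key(name), name) for name in filenames]
    valid = [(key, name) for key, name in keyed if key is not None]
    return max(valid)[1] if valid else None
```

The candidate list would typically come from globbing the current directory and the apprenticeships_*/supporting-files/ folders; the sketch leaves that step out so the ranking can be tested in isolation.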

Code Architecture

The project uses a modular architecture with shared utilities:

  • vacancies.py - Vacancy data analysis (employer/provider breakdown)
  • starts.py - Starts by provider (league table format)
  • provider.py - Starts by standard for a specific provider
  • monthly.py - Monthly starts breakdown
  • regions.py - Geographic distribution of starts
  • london_sme.py - London SME apprenticeships analysis
  • funding.py - Funding type (employer size) analysis
  • utils.py - Shared utilities (name cleaning, file discovery, table formatting)
  • config.py - Configuration constants (thresholds, field names, patterns)
  • test_utils.py - Unit tests for utility functions
  • test_file_discovery.py - Tests for file discovery logic

Data Sources

These scripts work with CSV files downloaded from the DfE's apprenticeship statistics releases:

Download the "Underlying data" files and place them in:

  • Root directory, or
  • apprenticeships_YYYY-YY/supporting-files/ folders

Scripts automatically find and use the most recent files.

Output Formats

All scripts support multiple output formats optimised for different use cases:

| Format | Use Case | Option |
| --- | --- | --- |
| Markdown | Documentation, reports, Notion inline tables | Default |
| CSV | Import into databases, spreadsheets | --csv |
| TSV | Copy-paste into existing tables | --tsv |
| Table | Console viewing, terminal output | --table |
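
As a rough illustration of how one set of rows can be rendered in all four formats with the standard library only, consider the sketch below. The function name `render` and its `fmt` values are hypothetical and do not describe the actual helpers in utils.py.

```python
import csv
import io
from typing import Sequence


def render(headers: Sequence[str], rows: Sequence[Sequence[object]], fmt: str = "markdown") -> str:
    """Render rows as 'markdown', 'csv', 'tsv', or an aligned 'table'."""
    if fmt in ("csv", "tsv"):
        buffer = io.StringIO()
        delimiter = "," if fmt == "csv" else "\t"
        # csv.writer quotes any cell that contains the delimiter.
        writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
        writer.writerow(headers)
        writer.writerows(rows)
        return buffer.getvalue()

    cells = [[str(c) for c in headers]] + [[str(c) for c in row] for row in rows]
    if fmt == "markdown":
        lines = ["| " + " | ".join(row) + " |" for row in cells]
        lines.insert(1, "| " + " | ".join("---" for _ in headers) + " |")
        return "\n".join(lines) + "\n"
    if fmt == "table":
        # Pad every column to the width of its longest cell.
        widths = [max(len(row[i]) for row in cells) for i in range(len(headers))]
        lines = ["  ".join(c.ljust(w) for c, w in zip(row, widths)).rstrip() for row in cells]
        return "\n".join(lines) + "\n"
    raise ValueError("Unknown format: " + fmt)
```

Routing CSV and TSV through `csv.writer` means employer names that contain commas are quoted correctly instead of splitting into extra columns.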

Requirements

Runtime:

  • Python 3.6+
  • Standard library only (no external dependencies)

Development (optional):

pip3 install -r requirements.txt

Includes:

  • pytest - for running tests
  • mypy - for type checking (optional)
  • black - for code formatting (optional)
  • flake8 - for linting (optional)

Testing

Run the test suite to verify functionality:

# Run all tests
python3 test_utils.py
python3 test_file_discovery.py

# Or with pytest (if installed)
pytest test_*.py -v

Configuration

Thresholds and settings can be adjusted in config.py:

# Provider categorisation thresholds
VACANCY_LARGE_PROVIDER_THRESHOLD = 10  # Providers with >10 positions
VACANCY_MEDIUM_PROVIDER_MIN = 4        # Providers with 4-10 positions
VACANCY_SMALL_PROVIDER_MAX = 3         # Providers with ≤3 positions

# Starts analysis
STARTS_MIN_THRESHOLD = 3               # Minimum starts to show separately

# Standard codes
DEFAULT_STANDARD_CODE = 'ST0116'       # Software Developer Level 4

Documentation

  • README.md (this file) - Overview and usage
  • CLAUDE.md - Instructions for Claude Code development
  • REFACTORING.md - Details of refactoring improvements
  • FILE_DISCOVERY.md - Intelligent file discovery documentation
  • requirements.txt - Development dependencies

Examples

Typical Workflow

# 1. Download latest DfE data files
# Place in root or apprenticeships_2024-25/supporting-files/

# 2. Run analysis scripts (automatically use latest files)
python3 vacancies.py --table
python3 starts.py --csv ST0116
python3 monthly.py --tsv

# 3. Output can be redirected to files
python3 vacancies.py --csv > vacancies_output.csv
python3 starts.py --table ST0116 > starts_report.txt

Analysing Different Standards

# Software Developer (Level 4) - ST0116
python3 starts.py ST0116
python3 monthly.py ST0116
python3 regions.py ST0116
python3 funding.py ST0116
python3 london_sme.py ST0116

# Machine Learning Engineer (Level 7) - ST1398
python3 starts.py ST1398
python3 regions.py ST1398
python3 funding.py ST1398

# Data Analyst (Level 4) - ST0118
python3 starts.py ST0118
python3 monthly.py ST0118

# Cyber Security Technologist (Level 3) - ST0622
python3 starts.py ST0622
python3 monthly.py ST0622

Analysing Specific Providers

# View all standards for a provider
python3 provider.py "FOUNDERS & CODERS"
python3 provider.py "QA"
python3 provider.py "MAKERS ACADEMY"
python3 provider.py "MULTIVERSE GROUP"

Historical Data Analysis

# Use specific older file
python3 vacancies.py apprenticeships_2023-24/supporting-files/app-underlying-data-vacancies-202324-q4.csv

# Compare different quarters
python3 starts.py ST0116 app-underlying-data-starts-202324-q4.csv > q4_2023.txt
python3 starts.py ST0116 app-underlying-data-starts-202425-q2.csv > q2_2024.txt
diff q4_2023.txt q2_2024.txt

Troubleshooting

"No vacancy/starts data files found"

Solution:

  1. Ensure files are named correctly: app-underlying-data-{type}-{year}-{quarter}.csv
  2. Check files are in root directory or apprenticeships_*/supporting-files/
  3. Verify year format: 202425 not 2024-25

Script uses wrong file

Debug:

from utils import find_latest_file
print(find_latest_file('app-underlying-data-vacancies-*.csv'))

Solution: Specify file explicitly:

python3 vacancies.py path/to/specific/file.csv

No data in output

Check:

  1. Verify standard code is correct (e.g., ST0116 not ST116)
  2. Ensure CSV file contains data for the specified standard
  3. Check CSV field names match expected format

Contributing

When modifying the code:

  1. Add configuration to config.py (not hardcoded in scripts)
  2. Add shared logic to utils.py
  3. Write tests for new functionality
  4. Use type hints on all functions
  5. Follow the coding standards in CLAUDE.md

Licence

This code is provided for analysing publicly available DfE apprenticeship statistics.

Contact

For questions about the DfE data:

For issues with these scripts:

  • Review the documentation files in this repository
  • Check the test files for usage examples
