This repository contains Python scripts for extracting and analysing apprenticeship data from UK Department for Education (DfE) statistical releases.
Latest: Intelligent file discovery now automatically selects the most recent data files based on academic year and quarter/month patterns. See FILE_DISCOVERY.md for details.
Refactored: Codebase refactored for improved maintainability, reduced duplication, and better code quality. See REFACTORING.md for details.
Extracts Software Developer (Level 4) apprenticeship vacancy data from DfE vacancy CSV files and presents it in various formats suitable for analysis.
Features:
- Automatic file discovery: Finds and uses the most recent vacancy data file
- Filters vacancy data specifically for Software Developer apprenticeships
- Groups data by training provider and employer
- Provides multiple output formats (table, CSV, Markdown, TSV)
- Clean company name processing (removes legal suffixes like "Ltd", "PLC")
- Separates London vs other UK locations
- Aggregates small providers for better data presentation
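The company-name cleaning mentioned above can be illustrated with a minimal sketch. The suffix list and the function name `clean_company_name` are assumptions for illustration only; the actual implementation in utils.py may differ.

```python
import re

# Hypothetical suffix list; the real list is presumably defined in config.py.
LEGAL_SUFFIXES = ("LIMITED", "LTD", "PLC", "LLP")

# Matches one trailing suffix preceded by whitespace or a comma,
# with an optional final full stop (e.g. " Ltd", ", Ltd.").
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(LEGAL_SUFFIXES) + r")\.?$",
    re.IGNORECASE,
)


def clean_company_name(name: str) -> str:
    """Strip surrounding whitespace and any trailing legal suffixes."""
    cleaned = name.strip()
    # Loop so that stacked suffixes are removed one at a time.
    while True:
        stripped = _SUFFIX_RE.sub("", cleaned)
        if stripped == cleaned:
            return cleaned
        cleaned = stripped
```

A name that consists only of a suffix (for example "Ltd") is left untouched, because the pattern requires a separator before the suffix.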
Usage:
# Automatic discovery (uses most recent file)
python3 vacancies.py [options]
# Specify a file explicitly
python3 vacancies.py [options] [input_file]

Options:
- --csv, -c: Output in CSV format (suitable for importing into databases)
- --table: Output in table format (console-friendly aligned tables)
- --tsv, -t: Output in tab-separated format (for copy-paste into spreadsheets)
- --help, -h: Show help message
Default behaviour: Markdown table format using the most recent vacancy file
Output Format: Two tables showing:
- Providers Table: Training providers with employer count and total vacancies
- Employers Table: Detailed breakdown with employer, provider, location, and positions
The script intelligently groups data by:
- Detailed breakdown for providers with >10 apprenticeships
- Summary for providers with 4-10 apprenticeships
- Aggregated total for providers with ≤3 apprenticeships
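As a rough sketch of this three-tier grouping, assuming the input is a sequence of (provider, positions) pairs: the function name `categorise_providers` and the return shape are hypothetical, while the threshold names mirror those listed in config.py.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Threshold names as listed in config.py; values copied from this README.
VACANCY_LARGE_PROVIDER_THRESHOLD = 10  # detailed breakdown above this
VACANCY_SMALL_PROVIDER_MAX = 3         # aggregated at or below this


def categorise_providers(
    rows: Iterable[Tuple[str, int]],
) -> Tuple[Dict[str, int], Dict[str, int], int]:
    """Return (detailed, summary, other_total) from (provider, positions) rows."""
    totals: Counter = Counter()
    for provider, positions in rows:
        totals[provider] += positions

    detailed = {p: n for p, n in totals.items()
                if n > VACANCY_LARGE_PROVIDER_THRESHOLD}
    summary = {p: n for p, n in totals.items()
               if VACANCY_SMALL_PROVIDER_MAX < n <= VACANCY_LARGE_PROVIDER_THRESHOLD}
    # Small providers are not listed individually, only summed.
    other_total = sum(n for n in totals.values()
                      if n <= VACANCY_SMALL_PROVIDER_MAX)
    return detailed, summary, other_total
```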
Examples:
python3 vacancies.py # Markdown format, latest file
python3 vacancies.py --table # Console table format
python3 vacancies.py --csv # CSV format for import
python3 vacancies.py data/file.csv # Use specific file

Extracts apprenticeship starts data for a specific standard and presents it as a league table with years as columns and providers as rows.
Features:
- Automatic file discovery: Finds and uses the most recent starts data file
- Quarterly breakdown: Most recent year is broken down into Q1, Q2, Q3, Q4 columns
- Filters data for any apprenticeship standard code (defaults to ST0116)
- Creates year-over-year comparison tables
- Shows providers with 3+ starts in most recent year separately
- Includes total row showing all starts across providers
- Automatically extracts from zip files if needed
Usage:
# Automatic discovery (uses most recent file)
python3 starts.py [options] [standard_code]
# Specify a file explicitly
python3 starts.py [options] [standard_code] [input_file]

Options:
- --csv, -c: Output in CSV format
- --table: Output in console table format
- --tsv, -t: Output in tab-separated format
- --help, -h: Show help message
Default Standard: ST0116 (Software Developer)
Output Format: League table showing:
- Total row: Combined starts across all providers by year and quarter
- Major providers: Providers with 3+ total starts in most recent year
- All other providers: Aggregated smaller providers
- Most recent year: Broken down into Q1, Q2, Q3, Q4 columns for detailed analysis
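The league-table layout described above could be assembled roughly as follows. This is a simplified sketch that omits the quarterly breakdown; the function name, the input shape (provider mapped to starts per year) and the sort order of major providers are assumptions, not the script's confirmed behaviour.

```python
from typing import Dict, List, Tuple

STARTS_MIN_THRESHOLD = 3  # name and value as listed in config.py

Row = Tuple[str, Dict[str, int]]


def build_league_table(starts: Dict[str, Dict[str, int]], latest_year: str) -> List[Row]:
    """Return rows: Total, major providers, then 'All other providers'."""
    years = sorted({year for per_year in starts.values() for year in per_year})
    total = {year: 0 for year in years}
    others = {year: 0 for year in years}
    major: List[Row] = []

    for provider, per_year in starts.items():
        # Fill missing years with zero so every row has the same columns.
        row = {year: per_year.get(year, 0) for year in years}
        for year in years:
            total[year] += row[year]
        if row.get(latest_year, 0) >= STARTS_MIN_THRESHOLD:
            major.append((provider, row))
        else:
            for year in years:
                others[year] += row[year]

    # Assumed ordering: largest provider in the most recent year first.
    major.sort(key=lambda item: item[1].get(latest_year, 0), reverse=True)
    return [("Total", total)] + major + [("All other providers", others)]
```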
Examples:
python3 starts.py # ST0116 (Software Developer), latest file
python3 starts.py ST0113 # ST0113, latest file
python3 starts.py ST0116 data.csv # ST0116, specific file
python3 starts.py --table ST0116 # Console table format
python3 starts.py --csv ST0113 # CSV output

Extracts monthly apprenticeship starts data for a specific standard and presents it as a table with years as columns and months as rows (in academic year order: Aug-Jul).
Features:
- Automatic file discovery: Finds and uses the most recent monthly starts file
- Filters data for any apprenticeship standard code (defaults to ST0116)
- Creates month-by-month comparison across years
- Displays months in academic year order (August to July)
- Includes total row showing annual totals
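The academic-year ordering can be sketched as below, assuming records arrive as (academic_year, month, starts) tuples with three-letter month names. The function name and record shape are illustrative only.

```python
from typing import Dict, Iterable, List, Tuple

# Academic year runs August to July.
ACADEMIC_MONTHS = ["Aug", "Sep", "Oct", "Nov", "Dec", "Jan",
                   "Feb", "Mar", "Apr", "May", "Jun", "Jul"]


def monthly_table(
    records: Iterable[Tuple[str, str, int]],
) -> List[Tuple[str, Dict[str, int]]]:
    """Return one row per month (Aug first) plus a final 'Total' row."""
    cells: Dict[str, Dict[str, int]] = {month: {} for month in ACADEMIC_MONTHS}
    totals: Dict[str, int] = {}
    for year, month, starts in records:
        cells[month][year] = cells[month].get(year, 0) + starts
        totals[year] = totals.get(year, 0) + starts

    rows = [(month, cells[month]) for month in ACADEMIC_MONTHS]
    rows.append(("Total", totals))
    return rows
```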
Usage:
# Automatic discovery (uses most recent file)
python3 monthly.py [options] [standard_code]
# Specify a file explicitly
python3 monthly.py [options] [standard_code] [input_file]

Options:
- --csv, -c: Output in CSV format
- --table: Output in console table format
- --tsv, -t: Output in tab-separated format
- --help, -h: Show help message
Default Standard: ST0116 (Software Developer)
Examples:
python3 monthly.py # ST0116, latest file
python3 monthly.py ST0113 # ST0113, latest file
python3 monthly.py ST0116 data.csv # ST0116, specific file
python3 monthly.py --table ST0113 # ST0113, table format

Extracts apprenticeship starts for a specific training provider and presents them by standard (apprenticeship type) with years as columns.
Features:
- Automatic file discovery: Finds and uses the most recent starts data file
- Filters data for any training provider (defaults to "FOUNDERS & CODERS")
- Shows all standards as individual rows
- Creates year-over-year comparison tables
- Includes total row showing all starts across standards
Usage:
# Automatic discovery (uses most recent file)
python3 provider.py [options] [provider_name]
# Specify a file explicitly
python3 provider.py [options] [provider_name] [input_file]

Options:
- --csv, -c: Output in CSV format
- --table: Output in console table format
- --tsv, -t: Output in tab-separated format
- --help, -h: Show help message
Default Provider: FOUNDERS & CODERS
Examples:
python3 provider.py # FOUNDERS & CODERS, latest file
python3 provider.py "QA" # QA, latest file
python3 provider.py "MAKERS ACADEMY" # MAKERS ACADEMY, latest file
python3 provider.py --csv "MULTIVERSE GROUP" # MULTIVERSE GROUP, CSV format

Extracts apprenticeship starts by region for a specific standard, showing the geographic distribution of apprenticeships.
Features:
- Automatic file discovery: Finds and uses the most recent starts data file
- Filters data for any apprenticeship standard code (defaults to ST0116)
- Shows all regions individually (sorted by most recent year)
- Uses learner home region as proxy for employer location
- Includes total row showing all starts across regions
Usage:
# Automatic discovery (uses most recent file)
python3 regions.py [options] [standard_code]
# Specify a file explicitly
python3 regions.py [options] [standard_code] [input_file]

Options:
- --csv, -c: Output in CSV format
- --table: Output in console table format
- --tsv, -t: Output in tab-separated format
- --help, -h: Show help message
Default Standard: ST0116 (Software Developer)
Examples:
python3 regions.py # ST0116, latest file
python3 regions.py ST0113 # ST0113, latest file
python3 regions.py --table # ST0116, table format

Extracts London-based SME apprenticeship starts for a specific standard, filtered by learner home region (London) and funding type (SME/other funding).
Features:
- Automatic file discovery: Finds and uses the most recent underlying starts file
- Filters for London learners with SME (non-levy) funding
- Includes manual adjustments for FOUNDERS & CODERS employer-provider apprenticeships
- Shows all providers sorted by most recent year starts
- Identifies and separates closed/rogue providers
Usage:
# Automatic discovery (uses most recent file)
python3 london_sme.py [options] [standard_code]
# Specify a file explicitly
python3 london_sme.py [options] [standard_code] [input_file]

Options:
- --csv, -c: Output in CSV format
- --table: Output in console table format
- --tsv, -t: Output in tab-separated format
- --help, -h: Show help message
Default Standard: ST0116 (Software Developer)
Examples:
python3 london_sme.py # ST0116, latest file
python3 london_sme.py ST0113 # ST0113, latest file
python3 london_sme.py --table # ST0116, table format

Extracts apprenticeship starts by funding type (employer size) for a specific standard, showing the split between large employers (levy-funded) and SMEs (other funding).
Features:
- Automatic file discovery: Finds and uses the most recent underlying starts file
- Filters data for any apprenticeship standard code (defaults to ST0116)
- Shows funding type as proxy for employer size:
  - "Large employers (levy-funded)" = Companies with £3m+ annual payroll
  - "SMEs (other funding)" = Small/medium employers with government co-investment
- Includes total row showing all starts
Usage:
# Automatic discovery (uses most recent file)
python3 funding.py [options] [standard_code]
# Specify a file explicitly
python3 funding.py [options] [standard_code] [input_file]

Options:
- --csv, -c: Output in CSV format
- --table: Output in console table format
- --tsv, -t: Output in tab-separated format
- --help, -h: Show help message
Default Standard: ST0116 (Software Developer)
Examples:
python3 funding.py # ST0116, latest file
python3 funding.py ST0113 # ST0113, latest file
python3 funding.py --table # ST0116, table format

All scripts automatically discover and use the most recent data files based on:
- Academic year (e.g., 2024-25 is newer than 2023-24)
- Quarter/month (e.g., Q3 is newer than Q2, Nov is newer than Mar)
Files are found in:
- Current directory
- apprenticeships_*/supporting-files/ folders
Supported filename patterns:
- Quarterly: app-underlying-data-{type}-{year}-q{1-4}.csv
  - Example: app-underlying-data-vacancies-202425-q2.csv
- Monthly: app-underlying-data-{type}-{year}-{month}.csv
  - Example: app-underlying-data-monthly-202425-mar.csv
See FILE_DISCOVERY.md for complete documentation.
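To illustrate the idea, here is a minimal sketch of ranking quarterly files by (year, quarter). It covers only the quarterly pattern and leaves monthly files out; the names `recency_key` and `pick_latest_quarterly` are hypothetical and are not the `find_latest_file` helper in utils.py, whose behaviour is documented in FILE_DISCOVERY.md.

```python
import re
from typing import Iterable, Optional, Tuple

# Quarterly pattern from this README: app-underlying-data-{type}-{year}-q{1-4}.csv
QUARTERLY_RE = re.compile(
    r"app-underlying-data-(?P<type>[a-z-]+)-(?P<year>\d{6})-q(?P<quarter>[1-4])\.csv$"
)


def recency_key(filename: str) -> Optional[Tuple[int, int]]:
    """Return (year, quarter) for a quarterly file, or None if it does not match."""
    match = QUARTERLY_RE.search(filename)
    if match is None:
        return None
    return int(match.group("year")), int(match.group("quarter"))


def pick_latest_quarterly(filenames: Iterable[str]) -> Optional[str]:
    """Return the filename with the highest (year, quarter), ignoring non-matches."""
    dated = [(recency_key(name), name) for name in filenames]
    dated = [(key, name) for key, name in dated if key is not None]
    return max(dated)[1] if dated else None
```

Comparing tuples means the academic year always takes precedence, with the quarter only breaking ties within a year.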
The project uses a modular architecture with shared utilities:
- vacancies.py - Vacancy data analysis (employer/provider breakdown)
- starts.py - Starts by provider (league table format)
- provider.py - Starts by standard for a specific provider
- monthly.py - Monthly starts breakdown
- regions.py - Geographic distribution of starts
- london_sme.py - London SME apprenticeships analysis
- funding.py - Funding type (employer size) analysis
- utils.py - Shared utilities (name cleaning, file discovery, table formatting)
- config.py - Configuration constants (thresholds, field names, patterns)
- test_utils.py - Unit tests for utility functions
- test_file_discovery.py - Tests for file discovery logic
These scripts work with CSV files downloaded from the DfE's apprenticeship statistics releases.
Download the "Underlying data" files and place them in:
- Root directory, or
- apprenticeships_YYYY-YY/supporting-files/ folders
Scripts automatically find and use the most recent files.
All scripts support multiple output formats optimised for different use cases:
| Format | Use Case | Option |
|---|---|---|
| Markdown | Documentation, reports, Notion inline tables | Default |
| CSV | Import into databases, spreadsheets | --csv |
| TSV | Copy-paste into existing tables | --tsv |
| Table | Console viewing, terminal output | --table |
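A hedged sketch of how one set of rows might be rendered in three of these formats, using only the standard library. The `render` function is hypothetical and the aligned console table format is omitted for brevity; the real formatting helpers live in utils.py.

```python
import csv
import io
from typing import Sequence


def render(headers: Sequence[str], rows: Sequence[Sequence[object]],
           fmt: str = "markdown") -> str:
    """Render rows as 'markdown', 'csv' or 'tsv' text."""
    if fmt in ("csv", "tsv"):
        buffer = io.StringIO()
        delimiter = "," if fmt == "csv" else "\t"
        # csv.writer handles quoting of values that contain the delimiter.
        writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
        writer.writerow(headers)
        writer.writerows(rows)
        return buffer.getvalue()
    if fmt == "markdown":
        lines = ["| " + " | ".join(headers) + " |",
                 "|" + "---|" * len(headers)]
        lines += ["| " + " | ".join(str(cell) for cell in row) + " |"
                  for row in rows]
        return "\n".join(lines) + "\n"
    raise ValueError(f"Unsupported format: {fmt}")
```

Delegating CSV and TSV to the `csv` module avoids hand-rolled escaping for provider names that contain commas or quotes.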
Runtime:
- Python 3.6+
- Standard library only (no external dependencies)
Development (optional):
pip3 install -r requirements.txt

Includes:
- pytest - for running tests
- mypy - for type checking (optional)
- black - for code formatting (optional)
- flake8 - for linting (optional)
Run the test suite to verify functionality:
# Run all tests
python3 test_utils.py
python3 test_file_discovery.py
# Or with pytest (if installed)
pytest test_*.py -v

Thresholds and settings can be adjusted in config.py:
# Provider categorisation thresholds
VACANCY_LARGE_PROVIDER_THRESHOLD = 10 # Providers with >10 positions
VACANCY_MEDIUM_PROVIDER_MIN = 4 # Providers with 4-10 positions
VACANCY_SMALL_PROVIDER_MAX = 3 # Providers with ≤3 positions
# Starts analysis
STARTS_MIN_THRESHOLD = 3 # Minimum starts to show separately
# Standard codes
DEFAULT_STANDARD_CODE = 'ST0116' # Software Developer Level 4

- README.md (this file) - Overview and usage
- CLAUDE.md - Instructions for Claude Code development
- REFACTORING.md - Details of refactoring improvements
- FILE_DISCOVERY.md - Intelligent file discovery documentation
- requirements.txt - Development dependencies
# 1. Download latest DfE data files
# Place in root or apprenticeships_2024-25/supporting-files/
# 2. Run analysis scripts (automatically use latest files)
python3 vacancies.py --table
python3 starts.py --csv ST0116
python3 monthly.py --tsv
# 3. Output can be redirected to files
python3 vacancies.py --csv > vacancies_output.csv
python3 starts.py --table ST0116 > starts_report.txt

# Software Developer (Level 4) - ST0116
python3 starts.py ST0116
python3 monthly.py ST0116
python3 regions.py ST0116
python3 funding.py ST0116
python3 london_sme.py ST0116
# Machine Learning Engineer (Level 7) - ST1398
python3 starts.py ST1398
python3 regions.py ST1398
python3 funding.py ST1398
# Data Analyst (Level 4) - ST0118
python3 starts.py ST0118
python3 monthly.py ST0118
# Cyber Security Technologist (Level 3) - ST0622
python3 starts.py ST0622
python3 monthly.py ST0622

# View all standards for a provider
python3 provider.py "FOUNDERS & CODERS"
python3 provider.py "QA"
python3 provider.py "MAKERS ACADEMY"
python3 provider.py "MULTIVERSE GROUP"

# Use specific older file
python3 vacancies.py apprenticeships_2023-24/supporting-files/app-underlying-data-vacancies-202324-q4.csv
# Compare different quarters
python3 starts.py ST0116 app-data-starts-202324-q4.csv > q4_2023.txt
python3 starts.py ST0116 app-data-starts-202425-q2.csv > q2_2024.txt
diff q4_2023.txt q2_2024.txt

Solution:
- Ensure files are named correctly: app-underlying-data-{type}-{year}-{quarter}.csv
- Check files are in root directory or apprenticeships_*/supporting-files/
- Verify year format: 202425 not 2024-25
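When renaming files, a small helper such as the following could convert a hyphenated academic year into the compact form expected in filenames. The function `compact_academic_year` is hypothetical and not part of utils.py.

```python
import re


def compact_academic_year(label: str) -> str:
    """Convert '2024-25' (or '202425') to '202425'; raise ValueError otherwise."""
    match = re.fullmatch(r"(\d{4})-?(\d{2})", label.strip())
    if match is None:
        raise ValueError(f"Not an academic year: {label!r}")
    start, end = match.groups()
    # The two-digit end year must directly follow the start year.
    if (int(start) + 1) % 100 != int(end):
        raise ValueError(f"Years are not consecutive: {label!r}")
    return start + end
```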
Debug:
from utils import find_latest_file
print(find_latest_file('app-underlying-data-vacancies-*.csv'))

Solution: Specify file explicitly:
python3 vacancies.py path/to/specific/file.csv

Check:
- Verify standard code is correct (e.g., ST0116 not ST116)
- Ensure CSV file contains data for the specified standard
- Check CSV field names match expected format
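These checks can be scripted. The sketch below assumes standard codes follow the "ST" plus four digits shape seen in the examples in this README, and it takes the CSV column name as a parameter because the real field names are defined in config.py and are not reproduced here. Both function names are hypothetical.

```python
import csv
import io
import re

# Assumed shape: "ST" followed by exactly four digits, e.g. ST0116.
STANDARD_CODE_RE = re.compile(r"ST\d{4}")


def is_valid_standard_code(code: str) -> bool:
    """True for codes such as 'ST0116'; False for 'ST116' or 'st0116'."""
    return STANDARD_CODE_RE.fullmatch(code) is not None


def count_rows_for_standard(csv_text: str, code: str, column: str) -> int:
    """Count rows whose `column` equals `code`; raise KeyError if the column is absent."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if column not in (reader.fieldnames or []):
        raise KeyError(f"Column {column!r} not found in CSV header")
    return sum(1 for row in reader if row[column] == code)
```

A count of zero suggests the file holds no data for that standard, while a `KeyError` points to a field-name mismatch.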
When modifying the code:
- Add configuration to config.py (not hardcoded in scripts)
- Add shared logic to utils.py
- Write tests for new functionality
- Use type hints on all functions
- Follow the coding standards in CLAUDE.md
This code is provided for analysing publicly available DfE apprenticeship statistics.
For questions about the DfE data:
For issues with these scripts:
- Review the documentation files in this repository
- Check the test files for usage examples