"Define declarative contracts, run pytest-style assertions on your image datasets, and prevent training regressions in CI/CD."
imgshape is a CI/CD-native developer tool and Python library for vision ML dataset governance. It moves dataset checks from ad hoc heuristics to a strict, contract-driven validation framework.
By running deterministic audits across spatial, signal, distribution, quality, and semantic dimensions, imgshape ensures that your training and validation pipelines run on verified, regression-free, and high-fidelity datasets.
- Declarative Dataset Contracts: Write YAML schemas enforcing channel configurations, format restrictions, resolution bounds, entropy ranges, and maximum allowed corruption or duplicate rates.
- pytest-style Dataset Testing (`imgshape test`): Run automated assertions on image folders with clean, tabular console summaries and Markdown/JSON output renderers.
- Git-style Dataset Diffing (`imgshape diff`): Compare a baseline and candidate dataset to detect statistical shifts, class imbalances, and semantic drift using DINOv2 embeddings.
- Cryptographic Audit Trails: Generates content-hashed `provenance_id` metadata and writes cross-platform `.fingerprint_lock` lockfiles to track and seal dataset state.
- Ultra-lean & Portable: 100% Python library with a CLI. No Node.js, React UI, Streamlit, or local web servers. Fits into GitHub Actions, GitLab CI, or local terminals.
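
The idea behind a content-hashed identifier can be illustrated with a short standard-library sketch. This is not imgshape's actual `provenance_id` algorithm, which the README does not describe; the function name `dataset_content_hash` and the choice of SHA-256 are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def dataset_content_hash(root: Path) -> str:
    """Return a SHA-256 hex digest covering every file under `root`.

    Hypothetical helper: illustrates content hashing in general,
    not the scheme imgshape uses internally.
    """
    digest = hashlib.sha256()
    # Sort by POSIX-style relative path so the result does not depend on
    # filesystem listing order or on the platform's path separator.
    entries = sorted(
        (p.relative_to(root).as_posix(), p)
        for p in root.rglob("*")
        if p.is_file()
    )
    for rel_path, path in entries:
        digest.update(rel_path.encode("utf-8"))
        digest.update(b"\0")  # separator between path and content hash
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()
```

Because both the relative paths and the file bytes feed the digest, renaming, adding, removing, or modifying any file changes the result, which is the property a lockfile needs in order to seal a dataset state.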

    # Install core package (minimal dependencies)
    pip install imgshape

    # Install with PyTorch support for semantic drift & GPU acceleration
    pip install "imgshape[full]"

Create a contract file to define the expected boundaries of your dataset:

    schema_version: "5.0"
    dataset:
      expected_channels: 3
      allowed_formats: [png, jpg]
      resolution_min: [224, 224]
      resolution_max: [1024, 1024]
    quality:
      blur_threshold: 1.5
      corruption_max: 0.01
      duplicate_max: {value: 0.05, severity: warning}
    distribution:
      entropy_min: 3.5
      imbalance_ratio_max: 2.0

Validate the dataset against the contract (exits with a non-zero code on error):

    imgshape validate ./my_dataset_directory contract.yaml --lock

The `--lock` flag automatically writes a `.fingerprint_lock` metadata file alongside the contract to secure your dataset version signature.
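
The contract above writes limits in two shapes: a bare number (`corruption_max: 0.01`) and a mapping with a severity (`duplicate_max: {value: 0.05, severity: warning}`). The sketch below shows one way such "maximum" clauses could be normalized and checked. It is not imgshape's validator: the helper names, the `Violation` dataclass, and the default severity of `"error"` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Violation:
    # Hypothetical record; field names are illustrative only.
    clause: str
    severity: str
    message: str


def normalize_limit(raw, default_severity="error"):
    """Accept either a bare number or a {value, severity} mapping."""
    if isinstance(raw, dict):
        return float(raw["value"]), raw.get("severity", default_severity)
    return float(raw), default_severity


def check_max_clauses(limits, measured):
    """Return one Violation for each measured value above its limit."""
    violations = []
    for clause, raw in limits.items():
        limit, severity = normalize_limit(raw)
        value = measured[clause]
        if value > limit:
            message = f"measured {value} exceeds limit {limit}"
            violations.append(Violation(clause, severity, message))
    return violations


# Limits copied from the example contract; measured values are made up.
quality_limits = {
    "corruption_max": 0.01,
    "duplicate_max": {"value": 0.05, "severity": "warning"},
}
measured = {"corruption_max": 0.0, "duplicate_max": 0.08}

violations = check_max_clauses(quality_limits, measured)
# Only error-severity violations would block a build in this sketch.
blocking = [v for v in violations if v.severity == "error"]
```

Here the duplicate rate exceeds its limit, but because that clause is marked `warning` it is reported without appearing in `blocking`.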
Compare a candidate dataset against a baseline fingerprint to check for drift:

    # Compare candidate folder against baseline fingerprint
    imgshape diff baseline_fingerprint.json ./new_candidate_dataset/ --save diff_report.md

You can embed contract governance directly into your training scripts or data preparation notebooks:

    from pathlib import Path

    from imgshape.atlas import Atlas
    from imgshape.contract import ContractLoader, ContractValidator

    # 1. Profile the dataset
    atlas = Atlas()
    fingerprint = atlas.extract(Path("./my_dataset"))

    # 2. Load the contract
    contract = ContractLoader.load_yaml(Path("contract.yaml"))

    # 3. Validate
    validator = ContractValidator(contract)
    report = validator.validate(fingerprint)

    if report.passed:
        print(f"✅ Dataset validated successfully! Provenance ID: {report.provenance_id}")
    else:
        print("❌ Dataset contract validation failed:")
        for violation in report.violations:
            print(f"  - [{violation.severity.upper()}] {violation.clause}: {violation.message}")

imgshape operates as a strict quality gate between your data storage layers and model training environments.

    graph TD
        subgraph "Data & Spec"
            A[Raw Image Dataset]
            B[YAML Dataset Contract]
        end
        subgraph "imgshape Core (Atlas Engine)"
            C[Atlas Profilers] -->|Extract Metrics| D[Dataset Fingerprint]
            B -->|Parse Schema| E[Contract Validator]
            D --> E
        end
        subgraph "Outputs & Actions"
            E -->|Exit 0 / 1 / 2| F[CI/CD Build Verdict]
            E -->|Lockfile| G[.fingerprint_lock]
            E -->|Renderers| H[Tabular / JSON / MD Reports]
        end
        A --> C
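
The "exit code to build verdict" edge in the diagram can be wired into a pipeline script with the standard library alone. The sketch below is an assumption-laden illustration: it only uses the `imgshape validate <dataset> <contract> --lock` invocation shown earlier, and it treats any non-zero exit code as a failure because the README does not say how codes 1 and 2 differ. The helper names are hypothetical.

```python
import subprocess


def build_validate_command(dataset_dir, contract_path, lock=False):
    """Assemble the argv list for the `imgshape validate` CLI call."""
    cmd = ["imgshape", "validate", str(dataset_dir), str(contract_path)]
    if lock:
        cmd.append("--lock")
    return cmd


def verdict_from_exit_code(code):
    """Map an exit code to a coarse verdict.

    Assumption: 0 means the contract held and anything else should fail
    the build; the distinct meanings of 1 and 2 are not documented here.
    """
    return "pass" if code == 0 else "fail"


def run_gate(dataset_dir, contract_path, lock=False):
    """Run the CLI and return its exit code without raising on failure."""
    cmd = build_validate_command(dataset_dir, contract_path, lock=lock)
    result = subprocess.run(cmd, check=False)
    return result.returncode


# Example use inside a CI step (requires imgshape to be installed):
#   code = run_gate("./my_dataset_directory", "contract.yaml", lock=True)
#   print(verdict_from_exit_code(code))
#   raise SystemExit(code)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with dataset paths that contain spaces.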
| Group | Command | Use Case |
|---|---|---|
| Core | `pip install imgshape` | Lightweight CI/CD verification & basic profiling (~10MB) |
| Torch | `pip install "imgshape[torch]"` | Adds PyTorch-based GPU acceleration & semantic feature extraction |
| Full | `pip install "imgshape[full]"` | Standard installation containing PDF reports, Plotly viz, and Torch extras |
- Issues: Found a bug? Open an issue.
- Discussions: Want to share ideas or workflows? Join the discussion.
Built by Stifler for the ML and AI Engineering community.
Star on GitHub