AI Infrastructure Platform with Managed API Service
A complete AI foundry platform for orchestrating data generation, model distillation, training, and evaluation on ephemeral GPU compute.
We use uv for dependency management. The repo is configured to work with the Homebrew install at /opt/homebrew/bin/uv.
```bash
# optional: create a project virtualenv
/opt/homebrew/bin/uv venv .venv
source .venv/bin/activate

# install all runtime + dev dependencies deterministically
/opt/homebrew/bin/uv pip sync requirements/requirements-dev.lock

# install git hooks
pre-commit install
```

Prerequisites: You must bring your own compute from your preferred vendor (AWS, GCP, Lambda, Prime Intellect, etc.).
- Spin up a GPU instance (Ubuntu 22.04 recommended).
- Copy `.env.example` to `.env` and configure your services (WandB, HuggingFace, S3).
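A minimal pre-flight check can catch a half-configured `.env` before a run starts. The variable names below (`WANDB_API_KEY`, `HF_TOKEN`, `S3_BUCKET`) are illustrative assumptions; check `.env.example` for the repo's actual schema:

```python
import os

# Hypothetical variable names -- see .env.example for the real ones.
REQUIRED_VARS = ["WANDB_API_KEY", "HF_TOKEN", "S3_BUCKET"]

def check_env(env: dict) -> list:
    """Return the names of required variables that are missing or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = check_env(dict(os.environ))
if missing:
    print(f"Missing configuration: {', '.join(missing)}")
```

Running such a check at the top of a training entrypoint fails fast on the login node instead of minutes into a GPU job.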
Deploy a Training Node: SSH into your node and run the turn-key deployment script. This will sync your code, install dependencies, and set up a persistent workspace.
```bash
./nexa_infra/scripts/provision/deploy.sh ubuntu@gpu-node-ip
```

Start Training (Remote or Local): Once attached to the remote session (tmux), you can start training immediately:
```bash
# Run V1 Stability Plan
python nexa_train/train.py --config-mode v1 --run-name my_stability_run

# Run V2 Performance Plan (Distributed)
torchrun --nproc_per_node=8 nexa_train/train.py --config-mode v2 --dry-run true
```

Start Backend & Dashboard:
```bash
# Using the orchestrator script
./nexa_infra/scripts/orchestration/start_forge.sh
```

```
Nexa_compute/
├── nexa_data/          # Data Engineering (MS/MS, Tool Use, Distillation)
├── nexa_train/         # Training Engine (Axolotl, HF Trainer)
├── nexa_distill/       # Knowledge Distillation Pipeline
├── nexa_eval/          # Evaluation & LLM-as-a-Judge
├── nexa_inference/     # vLLM Serving & Tool Controller
├── nexa_infra/         # IaC (Terraform), Monitoring, Provisioning
├── nexa_ui/            # Dashboards (Streamlit/Next.js)
├── src/
│   └── nexa_compute/
│       ├── api/            # FastAPI backend
│       ├── cli/            # CLI Entrypoint
│       ├── core/           # Core Primitives (DAG, Registry, Artifacts)
│       ├── data/           # DataOps (Versioning, Lineage)
│       ├── models/         # ModelOps (Registry, Versioning)
│       ├── monitoring/     # Observability (Alerts, Metrics, Drift)
│       └── orchestration/  # Workflow Engine (Scheduler, Templates)
├── docs/
│   ├── compute_plans/  # Training Configuration Templates (V1/V2/V3)
│   ├── pipelines/      # Detailed Architecture Docs
│   ├── platform/       # Platform Guide & Best Practices
│   ├── api/            # API Reference
│   └── projects/       # Active Research Projects
├── sdk/                # Python Client SDK
└── pyproject.toml      # Dependencies & Config
```
- Unified Training CLI: `nexa_train/train.py` supports flexible overrides and configuration modes (V1 Stability, V2 Performance, V3 Full).
- Infrastructure as Code: Terraform modules for AWS GPU clusters.
- Observability: Distributed tracing (OpenTelemetry), Prometheus metrics, and real-time cost tracking.
- Automated Provisioning: One-command deployment to bare metal or cloud instances with Spot instance support.
- Workflows: Declarative pipeline orchestration (DAGs) with resume capability.
- 6 Job Types: Generate, Audit, Distill, Train, Evaluate, Deploy.
- Worker Orchestration: Pull-based job queue for ephemeral workers.
- Security: SHA256 API keys and metered billing.
- Model Registry: Full lineage tracking from dataset to deployed model.
- Data Versioning: Content-addressable storage for datasets.
- Monitoring: Automated drift detection and A/B testing framework.
For detailed instructions on how the platform works and what each component does, please refer to the documentation:
- Documentation Map: Central index for all documentation.
- Platform Guide: Overview of platform capabilities.
- API Reference: API endpoints and usage.
- Infrastructure Guide: Docker, Provisioning, and Hardware.
- Training Pipeline: Configuration and Execution.
- Data Refinery: MS/MS and Synthetic Data.
- Compute Plans: Run Configurations.
We welcome contributions! Please review our guidelines before submitting pull requests.
See docs/conventions/ for:
- Coding Standards
- Data Organization
- Naming Conventions
- Linting: `ruff check .`
- Testing: `pytest tests/`
- Infrastructure: validate Terraform with `terraform validate`.
Tags: machine-learning, distributed-training, infrastructure-as-code, mlops, knowledge-distillation, fastapi, pytorch, spectral-analysis