Nexa Compute & Nexa Forge

AI Infrastructure Platform with Managed API Service


A complete AI foundry platform for orchestrating data generation, model distillation, training, and evaluation on ephemeral GPU compute.


Quick Start

1. Set Up the Environment

We use uv for dependency management. The repo is configured to work with the Homebrew install at /opt/homebrew/bin/uv.

# optional: create a project virtualenv
/opt/homebrew/bin/uv venv .venv
source .venv/bin/activate

# install all runtime + dev dependencies deterministically
/opt/homebrew/bin/uv pip sync requirements/requirements-dev.lock

# install git hooks
pre-commit install

2. Training Workflow

Prerequisites: You must bring your own compute from your preferred vendor (AWS, GCP, Lambda, Prime Intellect, etc.).

  1. Spin up a GPU instance (Ubuntu 22.04 recommended).
  2. Copy .env.example to .env and configure your services (WandB, HuggingFace, S3).

Deploy a Training Node: From your workstation, run the turnkey deployment script against the node. It syncs your code, installs dependencies, and sets up a persistent workspace.

./nexa_infra/scripts/provision/deploy.sh ubuntu@gpu-node-ip

Start Training (Remote or Local): Once attached to the remote session (tmux), you can start training immediately:

# Run V1 Stability Plan
python nexa_train/train.py --config-mode v1 --run-name my_stability_run

# Run V2 Performance Plan (Distributed)
torchrun --nproc_per_node=8 nexa_train/train.py --config-mode v2 --dry-run true

3. Run Infrastructure

Start Backend & Dashboard:

# Using the orchestrator script
./nexa_infra/scripts/orchestration/start_forge.sh
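
Once the script is running, you can sanity-check the backend before opening the dashboard. A minimal sketch, assuming the FastAPI backend listens on localhost:8000 and exposes a /health route (both are assumptions about this deployment, not confirmed defaults):

# check_backend.py -- host, port, and /health route are assumptions
import requests

resp = requests.get("http://localhost:8000/health", timeout=5)
resp.raise_for_status()
print("backend up:", resp.json())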

Project Structure

Nexa_Compute/
├── nexa_data/           # Data Engineering (MS/MS, Tool Use, Distillation)
├── nexa_train/          # Training Engine (Axolotl, HF Trainer)
├── nexa_distill/        # Knowledge Distillation Pipeline
├── nexa_eval/           # Evaluation & LLM-as-a-Judge
├── nexa_inference/      # vLLM Serving & Tool Controller
├── nexa_infra/          # IaC (Terraform), Monitoring, Provisioning
├── nexa_ui/             # Dashboards (Streamlit/Next.js)
├── src/
│   └── nexa_compute/
│       ├── api/         # FastAPI backend
│       ├── cli/         # CLI Entrypoint
│       ├── core/        # Core Primitives (DAG, Registry, Artifacts)
│       ├── data/        # DataOps (Versioning, Lineage)
│       ├── models/      # ModelOps (Registry, Versioning)
│       ├── monitoring/  # Observability (Alerts, Metrics, Drift)
│       └── orchestration/ # Workflow Engine (Scheduler, Templates)
├── docs/
│   ├── compute_plans/   # Training Configuration Templates (V1/V2/V3)
│   ├── pipelines/       # Detailed Architecture Docs
│   ├── platform/        # Platform Guide & Best Practices
│   ├── api/             # API Reference
│   └── projects/        # Active Research Projects
├── sdk/                 # Python Client SDK
└── pyproject.toml       # Dependencies & Config
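
The workflow engine in src/nexa_compute/core and src/nexa_compute/orchestration is built around a DAG primitive. As a rough illustration of the idea only (the real primitives are richer and support resume), a minimal pipeline that runs the six job types in dependency order could look like:

# dag_sketch.py -- illustrative only; not the actual core DAG implementation
from graphlib import TopologicalSorter

# map each stage to the set of stages it depends on
pipeline = {
    "generate": set(),
    "audit": {"generate"},
    "distill": {"audit"},
    "train": {"distill"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}

def run(stage: str) -> None:
    print(f"running {stage}")  # stand-in for dispatching the real job

for stage in TopologicalSorter(pipeline).static_order():
    run(stage)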

Core Features

Compute Engine

  • Unified Training CLI: nexa_train/train.py supports flexible overrides and configuration modes (V1 Stability, V2 Performance, V3 Full).
  • Infrastructure as Code: Terraform modules for AWS GPU clusters.
  • Observability: Distributed tracing (OpenTelemetry), Prometheus metrics, and real-time cost tracking (see the sketch after this list).
  • Automated Provisioning: One-command deployment to bare metal or cloud instances with Spot instance support.
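
As context for the Prometheus bullet above, exporting training metrics with the standard prometheus_client library looks roughly like the following; the metric names here are illustrative, not necessarily the ones Nexa Compute emits:

# metrics_sketch.py -- illustrative metric names, not Nexa's actual ones
from prometheus_client import Counter, Gauge, start_http_server

steps = Counter("train_steps_total", "Optimizer steps completed")
loss = Gauge("train_loss", "Most recent training loss")

start_http_server(9090)  # exposes /metrics for Prometheus to scrape
for step in range(100):
    steps.inc()
    loss.set(1.0 / (step + 1))  # placeholder loss curve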

Managed API (Nexa Forge)

  • Workflows: Declarative pipeline orchestration (DAGs) with resume capability.
  • 6 Job Types: Generate, Audit, Distill, Train, Evaluate, Deploy.
  • Worker Orchestration: Pull-based job queue for ephemeral workers (sketched after this list).
  • Security: SHA256 API keys and metered billing.
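
To make the pull model concrete, here is a hedged sketch of an ephemeral worker's loop. The endpoint paths, payload fields, and header name are hypothetical; only the pattern (claim a job, run it, report back) reflects the design described above:

# worker_sketch.py -- endpoints, fields, and header name are hypothetical
import time
import requests

FORGE = "http://localhost:8000"      # assumed backend address
HEADERS = {"X-API-Key": "nexa_..."}  # server stores only the SHA256 of this key

while True:
    # ask the queue for work; workers pull, the server never pushes
    resp = requests.post(f"{FORGE}/jobs/claim", headers=HEADERS, timeout=10)
    if resp.status_code == 204:      # queue empty, back off and retry
        time.sleep(5)
        continue
    job = resp.json()
    print("running", job["type"])    # generate, audit, distill, train, evaluate, or deploy
    # ... execute the job, then report the result
    requests.post(f"{FORGE}/jobs/{job['id']}/complete", headers=HEADERS, timeout=10)

Storing only the SHA256 of each API key server-side means a leaked database does not expose usable credentials, which is the point of the security bullet above.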

MLOps & DataOps

  • Model Registry: Full lineage tracking from dataset to deployed model.
  • Data Versioning: Content-addressable storage for datasets (see the sketch after this list).
  • Monitoring: Automated drift detection and A/B testing framework.
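
Content-addressable storage means a dataset's address is derived from its bytes: identical content dedupes automatically, and any mutation yields a new address, which is what makes lineage tracking reliable. A minimal sketch of the technique, independent of Nexa's actual storage layout:

# cas_sketch.py -- minimal content-addressable store; layout is illustrative
import hashlib
import shutil
from pathlib import Path

STORE = Path("datastore")

def put(src: Path) -> str:
    """Copy a file into the store under its SHA256 digest and return the address."""
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    dest = STORE / digest[:2] / digest
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():  # identical content is stored exactly once
        shutil.copyfile(src, dest)
    return digest

addr = put(Path("train.jsonl"))
print("dataset address:", addr)

For large datasets you would hash in streaming chunks rather than reading the whole file into memory, but the addressing scheme is the same.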

Documentation

For detailed instructions on how the platform works and what each component does, see the docs/ directory: compute plan templates, pipeline architecture docs, the platform guide, and the API reference.


Contributing

We welcome contributions! Please review our guidelines before submitting pull requests.

See docs/conventions/ for:

  • Coding Standards
  • Data Organization
  • Naming Conventions

Development Commands

  1. Linting: ruff check .
  2. Testing: pytest tests/
  3. Infrastructure: Validate Terraform with terraform validate.

