
allmos_v2

High-Performance LLM Inference Engine with Modular Architecture

allmos_v2 is a production-grade LLM inference engine implementing all major optimizations from state-of-the-art systems like nano-vLLM, while maintaining a clean modular design.

Performance Target

  • Goal: Match nano-vLLM throughput (~1434 tokens/sec on an RTX 4070 / L4 GPU)
  • Baseline: Original Allmos at 22.81 tokens/sec, so the goal is roughly 62.8x faster

Key Optimizations

  • ✅ Continuous Batching - Dynamic batching with prefill/decode separation (10-50x speedup); a toy loop illustrating the idea follows this list
  • ✅ KV Cache Reuse - Efficient memory management with block-based allocation (20-30x speedup)
  • ✅ CUDA Graphs - Pre-captured execution graphs for the decode phase (2-3x speedup)
  • ✅ Flash Attention - Memory-efficient attention with O(N) memory vs O(N²) (1.5-2x speedup)
  • ✅ Prefix Caching - Hash-based deduplication for shared prompt prefixes
  • ✅ Kernel Fusion - torch.compile for fused operations (1.3x speedup)
  • ✅ Tensor Parallelism - Multi-GPU support via shared memory IPC
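
As a rough mental model, continuous batching alternates between admitting new prompts (prefill) and advancing every running sequence by one token (decode). The toy loop below is purely illustrative; the prefill and decode_step callables are hypothetical stand-ins, not functions from this codebase:

from collections import deque

def serve(prompts, prefill, decode_step, max_tokens=64):
    """Toy continuous-batching loop (illustrative only)."""
    waiting, running, done = deque(prompts), [], []
    while waiting or running:
        if waiting:
            # Prefill phase: process one new prompt in full (variable length)
            running.append({"prompt_ids": prefill(waiting.popleft()), "out": []})
        else:
            # Decode phase: one token per running sequence, batched together
            for seq in running:
                seq["out"].append(decode_step(seq["prompt_ids"] + seq["out"]))
        # Retire sequences that hit their token budget
        done += [s for s in running if len(s["out"]) >= max_tokens]
        running = [s for s in running if len(s["out"]) < max_tokens]
    return done

The real scheduler (engine/scheduler.py) adds token budgets, preemption, and KV cache block accounting on top of this basic loop.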

Architecture

allmos_v2/
├── config.py               # Centralized configuration
├── sampling_params.py      # Generation parameters
├── llm.py                  # User-facing API
│
├── engine/                 # Core inference components
│   ├── types.py            # Abstract base classes
│   ├── sequence.py         # Sequence state management
│   ├── scheduler.py        # Continuous batching scheduler
│   ├── model_runner.py     # CUDA graph + model execution
│   └── llm_engine.py       # High-level orchestrator
│
├── memory/                 # Memory management
│   ├── types.py            # BlockManager ABC
│   └── block_manager.py    # Prefix caching implementation
│
├── layers/                 # Optimized neural network layers
│   ├── attention.py        # Flash Attention with KV cache
│   ├── sampler.py          # GPU-based token sampling
│   ├── layernorm.py        # Fused RMSNorm
│   ├── activation.py       # Fused SiLU
│   ├── rotary_embedding.py # Rotary position embeddings
│   ├── linear.py           # Tensor parallel linear layers
│   └── embed_head.py       # Vocab parallel embedding/LM head
│
├── models/                 # Model implementations
│   └── qwen3.py            # Qwen3 architecture
│
└── utils/                  # Utilities
    ├── context.py          # Attention context management
    └── loader.py           # Weight loading from HuggingFace

Installation

Prerequisites

  • Python 3.10+
  • CUDA 12.1+
  • GPU with compute capability 8.0+ (Ampere or newer)

Install Dependencies

pip install -r requirements.txt

Note: flash-attn requires GLIBC 2.32+. If you have GLIBC 2.31 (Debian 11), you can:

  • Use Docker with Ubuntu 22.04+
  • Compile from source (30+ minutes)
  • Set enforce_eager=True to fall back to standard PyTorch attention (this also disables CUDA graphs)

Quick Start

Basic Usage

from llm import LLM
from sampling_params import SamplingParams

# Initialize engine
llm = LLM("~/huggingface/Qwen3-0.6B/")

# Generate
outputs = llm.generate(
    prompts=["Introduce yourself", "What is 2+2?"],
    sampling_params=SamplingParams(temperature=0.8, max_tokens=100)
)

# Print results
for output in outputs:
    print(output['text'])

Run Example

python example.py

Run Benchmark

python bench.py

Expected output:

Total: 133966tok, Time: 77.04s, Throughput: 1738.82tok/s

Configuration

Config Parameters

from llm import LLM

llm = LLM(
    model="~/huggingface/Qwen3-0.6B/",

    # Memory management
    max_model_len=4096,                  # Maximum sequence length
    max_num_seqs=512,                    # Max concurrent sequences
    max_num_batched_tokens=16384,        # Max tokens per batch
    gpu_memory_utilization=0.9,          # GPU memory fraction
    kvcache_block_size=256,              # KV cache block size

    # Optimizations
    enable_cuda_graphs=True,             # Use CUDA graphs (2-3x speedup)
    enable_prefix_caching=True,          # Hash-based prefix caching
    enforce_eager=False,                 # False = use CUDA graphs; True = force eager mode

    # Parallelism
    tensor_parallel_size=1,              # Number of GPUs
)

Sampling Parameters

from sampling_params import SamplingParams

sp = SamplingParams(
    temperature=1.0,    # Temperature for sampling (> 0)
    max_tokens=64,      # Maximum tokens to generate
    ignore_eos=False,   # If True, keep generating past the EOS token
)

Design Philosophy

Modularity

All components implement abstract base classes from engine/types.py (sketched below):

  • Scheduler: Manages sequence scheduling and batching
  • ModelRunner: Executes model forward passes
  • BlockManager: Manages KV cache memory
  • LLMEngine: High-level orchestration

This enables:

  • Easy testing of individual components
  • Swapping implementations (e.g., different schedulers)
  • Clear separation of concerns
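
The actual interfaces live in engine/types.py; the sketch below is only an approximation of what such contracts look like, and the method names are illustrative rather than the real signatures:

from abc import ABC, abstractmethod

class Scheduler(ABC):
    """Decides which sequences run in the next step."""
    @abstractmethod
    def schedule(self):
        """Return the next batch of sequences plus a prefill/decode flag."""

class ModelRunner(ABC):
    """Executes model forward passes for a scheduled batch."""
    @abstractmethod
    def run(self, seqs, is_prefill):
        """Return one sampled token id per sequence."""

class BlockManager(ABC):
    """Owns KV cache memory: allocation, sharing, and release."""
    @abstractmethod
    def allocate(self, seq): ...
    @abstractmethod
    def deallocate(self, seq): ...

Because each concrete class is written against these contracts, a different scheduling policy can be dropped in without touching the model runner or the block manager.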

Optimization-First

Every optimization from the benchmark report is implemented:

  1. KV Cache Reuse (model_runner.py:prepare_prefill/decode)

    • Eliminates the 32.5x redundant recomputation measured in the original Allmos
  2. Continuous Batching (scheduler.py)

    • Separate prefill (variable length) and decode (fixed length) phases
    • Dynamic batching with preemption
  3. CUDA Graphs (model_runner.py:capture_cudagraph) - a capture/replay sketch follows this list

    • Pre-captured for batch sizes [1, 2, 4, 8, 16, ..., 512]
    • Eliminates kernel launch overhead
  4. Flash Attention (layers/attention.py)

    • Memory-efficient O(N) vs O(N²)
    • Custom Triton kernel for KV cache storage
  5. Prefix Caching (memory/block_manager.py) - a hashing sketch follows this list

    • xxhash-based deduplication
    • Reference counting for safe sharing
  6. Kernel Fusion (throughout layers/)

    • @torch.compile on hot paths
    • Fused residual + normalization
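
To make two of these concrete: the snippet below shows the generic PyTorch CUDA graph capture/replay pattern that techniques like capture_cudagraph build on; the model and buffer shapes here are placeholders, not the engine's actual buffers:

import torch

model = torch.nn.Linear(1024, 1024).cuda()
static_in = torch.zeros(8, 1024, device="cuda")   # fixed-shape decode input

# Warm-up on a side stream is required before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture the whole decode step once...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = model(static_in)

# ...then replay it: copy fresh data into the captured buffer and launch
# the entire kernel sequence with a single call (no per-op launch overhead)
static_in.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()
result = static_out.clone()

And a minimal sketch of hash-based prefix caching with reference counting; block_manager.py likewise keys full blocks by an xxhash digest, but the helper names and chaining scheme here are illustrative:

import xxhash

BLOCK_SIZE = 256                 # tokens per KV cache block
cache = {}                       # prefix hash -> block id
ref_counts = {}                  # block id -> number of sequences sharing it
free_blocks = list(range(1024))  # placeholder pool of physical blocks

def block_hash(token_ids, prev_hash=0):
    # Chaining in the previous block's hash makes the digest depend on
    # the entire prefix, not just this block's tokens
    h = xxhash.xxh64()
    h.update(str(prev_hash).encode())
    h.update(b",".join(str(t).encode() for t in token_ids))
    return h.intdigest()

def get_or_allocate(token_ids, prev_hash):
    h = block_hash(token_ids, prev_hash)
    if h in cache:                   # hit: share the existing block
        block_id = cache[h]
        ref_counts[block_id] += 1
    else:                            # miss: claim a free block
        block_id = free_blocks.pop()
        cache[h] = block_id
        ref_counts[block_id] = 1
    return block_id, h

A block is only returned to free_blocks once its reference count drops to zero, which is what makes sharing across sequences safe.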

Production-Ready

  • Error handling: Assertions and validation throughout
  • Memory management: Automatic KV cache sizing
  • Distributed: Tensor parallelism via multiprocessing
  • Progress tracking: tqdm integration
  • Documentation: Extensive inline comments explaining design decisions

Testing

Validation Script

python test_allmos.py

This will:

  1. Test basic generation (single sequence)
  2. Test batched generation (multiple sequences)
  3. Test prefix caching (shared prefixes)
  4. Validate CUDA graph capture
  5. Measure throughput

Expected Results

  • ✅ Single sequence: Generates coherent text
  • ✅ Batched generation: Handles multiple sequences
  • ✅ Prefix caching: Detects cache hits
  • ✅ CUDA graphs: Captured for common batch sizes
  • ✅ Throughput: 1400+ tokens/sec

Troubleshooting

flash-attn import error

Error: ImportError: /lib/x86_64-linux-gnu/libc.so.6: version 'GLIBC_2.32' not found

Solution: Set enforce_eager=True to use standard PyTorch attention:

llm = LLM(model_path, enforce_eager=True)

CUDA out of memory

Solution: Reduce gpu_memory_utilization or max_num_seqs:

llm = LLM(model_path, gpu_memory_utilization=0.8, max_num_seqs=256)

Slow first generation

This is expected; the first run includes:

  • Model loading
  • CUDA graph capture
  • torch.compile compilation

Subsequent runs will be much faster.
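
To keep this one-time cost out of your own measurements, you can issue a small throwaway generation first; this sketch uses only the LLM.generate API shown above:

import time
from llm import LLM
from sampling_params import SamplingParams

llm = LLM("~/huggingface/Qwen3-0.6B/")

# Throwaway request: triggers model loading, CUDA graph capture,
# and torch.compile before anything is timed
llm.generate(prompts=["warm up"], sampling_params=SamplingParams(max_tokens=8))

t0 = time.perf_counter()
outputs = llm.generate(
    prompts=["Introduce yourself"],
    sampling_params=SamplingParams(temperature=0.8, max_tokens=100),
)
print(f"Generation took {time.perf_counter() - t0:.2f}s")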

Benchmarks

Test Configuration

  • Hardware: GCP L4 GPU (23GB VRAM, Ada Lovelace)
  • Model: Qwen3-0.6B (600M parameters)
  • Workload: 256 sequences, 100-1024 tokens each
  • Settings: CUDA graphs enabled, prefix caching enabled

Results

System             Throughput (tokens/sec)   Speedup vs Allmos
Original Allmos    22.81                      1.0x (baseline)
allmos_v2          1,739                      76.2x
nano-vLLM          1,760                      77.1x

Research Context

This codebase is part of a research project studying the effectiveness of AI coding assistants in developing and optimizing systems software. Key research questions:

  1. Can coding assistants implement complex optimizations (CUDA graphs, prefix caching)?
  2. Does modular architecture help or hinder optimization?
  3. How does code quality compare to human-engineered systems (nano-vLLM)?

See BENCHMARK_REPORT.md in the parent directory for detailed analysis.

Acknowledgments

Architecture and optimizations inspired by:

  • nano-vLLM - Efficient implementation by DeepSeek engineers
  • vLLM - Original continuous batching and PagedAttention
  • Flash Attention - Memory-efficient attention by Tri Dao et al.

License

MIT License - See LICENSE file for details

Citation

If you use this code in your research, please cite:

@software{allmos_v2,
  author = {Vinamra Agarwal},
  title = {allmos_v2: Modular High-Performance LLM Inference},
  year = {2025},
  institution = {University of Washington, Systems Lab}
}
