High-Performance LLM Inference Engine with Modular Architecture
allmos_v2 is a production-grade LLM inference engine implementing all major optimizations from state-of-the-art systems like nano-vLLM, while maintaining a clean modular design.
Goal: Match nano-vLLM throughput (~1,434 tokens/sec on an RTX 4070 / L4 GPU).

Baseline: the original Allmos at 22.81 tokens/sec, so the target is roughly a 62.8x speedup.
- ✅ Continuous Batching - Dynamic batching with prefill/decode separation (10-50x speedup)
- ✅ KV Cache Reuse - Efficient memory management with block-based allocation (20-30x speedup)
- ✅ CUDA Graphs - Pre-captured execution graphs for the decode phase (2-3x speedup)
- ✅ Flash Attention - Memory-efficient attention with O(N) memory vs O(N²) (1.5-2x speedup)
- ✅ Prefix Caching - Hash-based deduplication for shared prompt prefixes
- ✅ Kernel Fusion - torch.compile for fused operations (1.3x speedup)
- ✅ Tensor Parallelism - Multi-GPU support via shared memory IPC
```
allmos_v2/
├── config.py                 # Centralized configuration
├── sampling_params.py        # Generation parameters
├── llm.py                    # User-facing API
│
├── engine/                   # Core inference components
│   ├── types.py              # Abstract base classes
│   ├── sequence.py           # Sequence state management
│   ├── scheduler.py          # Continuous batching scheduler
│   ├── model_runner.py       # CUDA graph + model execution
│   └── llm_engine.py         # High-level orchestrator
│
├── memory/                   # Memory management
│   ├── types.py              # BlockManager ABC
│   └── block_manager.py      # Prefix caching implementation
│
├── layers/                   # Optimized neural network layers
│   ├── attention.py          # Flash Attention with KV cache
│   ├── sampler.py            # GPU-based token sampling
│   ├── layernorm.py          # Fused RMSNorm
│   ├── activation.py         # Fused SiLU
│   ├── rotary_embedding.py   # Rotary position embeddings
│   ├── linear.py             # Tensor parallel linear layers
│   └── embed_head.py         # Vocab parallel embedding/LM head
│
├── models/                   # Model implementations
│   └── qwen3.py              # Qwen3 architecture
│
└── utils/                    # Utilities
    ├── context.py            # Attention context management
    └── loader.py             # Weight loading from HuggingFace
```
Requirements:

- Python 3.10+
- CUDA 12.1+
- GPU with compute capability 8.0+ (Ampere or newer)
```
pip install -r requirements.txt
```

Note: flash-attn requires GLIBC 2.32+. If you have GLIBC 2.31 (Debian 11), you can:

- Use Docker with Ubuntu 22.04+
- Compile flash-attn from source (30+ minutes)
- Set `enforce_eager=True` to disable CUDA graphs and use standard attention
Quick start:

```python
from llm import LLM
from sampling_params import SamplingParams

# Initialize engine
llm = LLM("~/huggingface/Qwen3-0.6B/")

# Generate
outputs = llm.generate(
    prompts=["Introduce yourself", "What is 2+2?"],
    sampling_params=SamplingParams(temperature=0.8, max_tokens=100),
)

# Print results
for output in outputs:
    print(output['text'])
```

Run the example with `python example.py`, or run the benchmark with `python bench.py`. Expected benchmark output:

```
Total: 133966tok, Time: 77.04s, Throughput: 1738.82tok/s
```
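For reference, a minimal throughput measurement in the spirit of `bench.py` could look like the sketch below. The prompt set, token counts, and the `token_ids` output field are assumptions for illustration, not necessarily the benchmark's exact workload or output format.

```python
import time

from llm import LLM
from sampling_params import SamplingParams

llm = LLM("~/huggingface/Qwen3-0.6B/")

# Hypothetical workload: 256 prompts, up to 256 generated tokens each.
prompts = [f"Write a short paragraph about topic {i}." for i in range(256)]
sp = SamplingParams(temperature=0.8, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts=prompts, sampling_params=sp)
elapsed = time.perf_counter() - start

# Count generated tokens only (assumes each output dict carries its token ids).
total = sum(len(o["token_ids"]) for o in outputs)
print(f"Total: {total}tok, Time: {elapsed:.2f}s, Throughput: {total / elapsed:.2f}tok/s")
```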
Engine configuration:

```python
from llm import LLM

llm = LLM(
    model="~/huggingface/Qwen3-0.6B/",

    # Memory management
    max_model_len=4096,            # Maximum sequence length
    max_num_seqs=512,              # Max concurrent sequences
    max_num_batched_tokens=16384,  # Max tokens per batch
    gpu_memory_utilization=0.9,    # GPU memory fraction
    kvcache_block_size=256,        # KV cache block size

    # Optimizations
    enable_cuda_graphs=True,       # Use CUDA graphs (2-3x speedup)
    enable_prefix_caching=True,    # Hash-based prefix caching
    enforce_eager=False,           # Set True to disable CUDA graphs

    # Parallelism
    tensor_parallel_size=1,        # Number of GPUs
)
```

Sampling parameters:

```python
from sampling_params import SamplingParams

sp = SamplingParams(
    temperature=1.0,    # Sampling temperature (> 0)
    max_tokens=64,      # Maximum tokens to generate
    ignore_eos=False,   # If True, keep generating past the EOS token
)
```

All components implement abstract base classes from `engine/types.py`:
- Scheduler: Manages sequence scheduling and batching
- ModelRunner: Executes model forward passes
- BlockManager: Manages KV cache memory
- LLMEngine: High-level orchestration
This enables:
- Easy testing of individual components
- Swapping implementations (e.g., different schedulers)
- Clear separation of concerns
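As a rough illustration of the shape of these interfaces (method names and signatures here are guesses for illustration; `engine/types.py` is the authoritative definition):

```python
from abc import ABC, abstractmethod


class Scheduler(ABC):
    """Decides which sequences run next and whether the step is prefill or decode."""

    @abstractmethod
    def add(self, seq) -> None:
        """Queue a new sequence for execution."""

    @abstractmethod
    def schedule(self) -> tuple[list, bool]:
        """Return the next batch of sequences and an is_prefill flag."""


class ModelRunner(ABC):
    """Owns the model weights and executes forward passes (optionally via CUDA graphs)."""

    @abstractmethod
    def run(self, seqs: list, is_prefill: bool) -> list[int]:
        """Execute one step and return the sampled token id per sequence."""


class BlockManager(ABC):
    """Allocates, shares, and frees fixed-size KV cache blocks."""

    @abstractmethod
    def allocate(self, seq) -> None: ...

    @abstractmethod
    def deallocate(self, seq) -> None: ...


class LLMEngine(ABC):
    """Ties scheduler, model runner, and block manager into a generate() loop."""

    @abstractmethod
    def generate(self, prompts: list[str], sampling_params) -> list[dict]: ...
```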
Every optimization from the benchmark report is implemented:
- KV Cache Reuse (`model_runner.py:prepare_prefill/decode`)
  - Eliminates the 32.5x redundant computation of the original Allmos
- Continuous Batching (`scheduler.py`)
  - Separate prefill (variable length) and decode (fixed length) phases
  - Dynamic batching with preemption
- CUDA Graphs (`model_runner.py:capture_cudagraph`), sketched below
  - Pre-captured for batch sizes [1, 2, 4, 8, 16, ..., 512]
  - Eliminates kernel launch overhead
- Flash Attention (`layers/attention.py`)
  - Memory-efficient O(N) vs O(N²)
  - Custom Triton kernel for KV cache storage
- Prefix Caching (`memory/block_manager.py`), sketched below
  - xxhash-based deduplication
  - Reference counting for safe sharing
- Kernel Fusion (throughout `layers/`)
  - `@torch.compile` on hot paths
  - Fused residual + normalization
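The CUDA-graph idea, in its simplest form, is to record the decode-step kernels once per batch size into fixed buffers and then replay them with fresh inputs copied in. The sketch below shows that capture-and-replay pattern with PyTorch's `torch.cuda.CUDAGraph`; the function name, the stand-in `model`, and the buffer shapes are illustrative rather than the engine's actual `capture_cudagraph` code.

```python
import torch


@torch.no_grad()
def capture_decode_graph(model, batch_size: int, hidden_size: int):
    """Capture one decode step for a fixed batch size, then replay it with new inputs."""
    # Static buffer: graph replay always reads this exact tensor.
    static_in = torch.zeros(batch_size, hidden_size, device="cuda")

    # Warm up on a side stream before capture (the standard CUDA graph recipe).
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        model(static_in)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = model(static_in)  # kernels recorded here, not re-launched later

    def run(x: torch.Tensor) -> torch.Tensor:
        static_in.copy_(x)   # copy fresh inputs into the captured buffer
        graph.replay()       # relaunch all recorded kernels in one shot
        return static_out

    return run
```

With graphs captured once per batch size in [1, 2, 4, ..., 512], the decode loop can replay the graph matching the current batch instead of paying per-kernel launch overhead on every step.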
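Prefix caching hinges on giving each full KV cache block a content hash that also folds in the previous block's hash, so identical prompt prefixes resolve to identical block chains and simply bump a reference count. The following is a simplified sketch of that hashing-plus-refcount scheme, not the actual `block_manager.py` code.

```python
import xxhash

BLOCK_SIZE = 256  # tokens per KV cache block (cf. kvcache_block_size)


def block_hash(token_ids: list[int], prefix_hash: int = -1) -> int:
    """Hash one full block of token ids, chained to the previous block's hash."""
    h = xxhash.xxh64()
    h.update(prefix_hash.to_bytes(8, "little", signed=True))
    h.update(str(token_ids).encode())
    return h.intdigest()


class SimplePrefixCache:
    """Toy block allocator: blocks with identical hashes share one physical block."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.hash_to_block: dict[int, int] = {}  # content hash -> block id
        self.block_to_hash: dict[int, int] = {}  # block id -> content hash
        self.ref_counts: dict[int, int] = {}     # block id -> active users

    def allocate(self, h: int) -> tuple[int, bool]:
        """Return (block_id, cache_hit); reuse an existing block when hashes match."""
        if h in self.hash_to_block:
            block_id = self.hash_to_block[h]
            self.ref_counts[block_id] += 1
            return block_id, True
        block_id = self.free_blocks.pop()
        self.hash_to_block[h] = block_id
        self.block_to_hash[block_id] = h
        self.ref_counts[block_id] = 1
        return block_id, False

    def free(self, block_id: int) -> None:
        """Drop one reference; reclaim the block only when no sequence uses it."""
        self.ref_counts[block_id] -= 1
        if self.ref_counts[block_id] == 0:
            del self.hash_to_block[self.block_to_hash.pop(block_id)]
            self.free_blocks.append(block_id)
```

Because each hash is chained to its predecessor, a match on block k implies the entire prefix up to block k is identical, which is what makes the sharing safe.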
The engine also includes:

- Error handling: Assertions and validation throughout
- Memory management: Automatic KV cache sizing
- Distributed: Tensor parallelism via multiprocessing
- Progress tracking: tqdm integration
- Documentation: Extensive inline comments explaining design decisions
Run the test suite with `python test_allmos.py`. This will:
- Test basic generation (single sequence)
- Test batched generation (multiple sequences)
- Test prefix caching (shared prefixes)
- Validate CUDA graph capture
- Measure throughput
Expected results:

- ✅ Single sequence: Generates coherent text
- ✅ Batched generation: Handles multiple sequences
- ✅ Prefix caching: Detects cache hits
- ✅ CUDA graphs: Captured for common batch sizes
- ✅ Throughput: 1400+ tokens/sec
Troubleshooting:

Error: `ImportError: /lib/x86_64-linux-gnu/libc.so.6: version 'GLIBC_2.32' not found`
Solution: Set enforce_eager=True to use standard PyTorch attention:
```python
llm = LLM(model_path, enforce_eager=True)
```

Error: GPU out of memory

Solution: Reduce `gpu_memory_utilization` or `max_num_seqs`:

```python
llm = LLM(model_path, gpu_memory_utilization=0.8, max_num_seqs=256)
```

A slow first run is expected, since it includes:
- Model loading
- CUDA graph capture
- torch.compile compilation
Subsequent runs will be much faster.
Benchmark setup:

- Hardware: GCP L4 GPU (23GB VRAM, Ada Lovelace)
- Model: Qwen3-0.6B (600M parameters)
- Workload: 256 sequences, 100-1024 tokens each
- Settings: CUDA graphs enabled, prefix caching enabled
| System | Throughput (tokens/sec) | Speedup vs Allmos |
|---|---|---|
| Original Allmos | 22.81 | 1.0x (baseline) |
| allmos_v2 | 1,739 | 76.2x |
| nano-vLLM | 1,760 | 77.1x |
This codebase is part of a research project studying the effectiveness of AI coding assistants in developing and optimizing systems software. Key research questions:
- Can coding assistants implement complex optimizations (CUDA graphs, prefix caching)?
- Does modular architecture help or hinder optimization?
- How does code quality compare to human-engineered systems (nano-vLLM)?
See BENCHMARK_REPORT.md in the parent directory for detailed analysis.
Architecture and optimizations inspired by:
- nano-vLLM - Efficient implementation by DeepSeek engineers
- vLLM - Original continuous batching and PagedAttention
- Flash Attention - Memory-efficient attention by Tri Dao et al.
MIT License - See LICENSE file for details
If you use this code in your research, please cite:
```bibtex
@software{allmos_v2,
  author      = {Vinamra Agarwal},
  title       = {allmos_v2: Modular High-Performance LLM Inference},
  year        = {2025},
  institution = {University of Washington, Systems Lab}
}
```