High-Performance LLM Inference Engine with Modular Architecture
allmos_v2 is a production-grade LLM inference engine implementing all major optimizations from state-of-the-art systems like nano-vLLM, while maintaining a clean modular design.
Goal: Match nano-vLLM throughput (~1,434 tokens/sec on an RTX 4070 / L4 GPU).

Baseline: the original Allmos at 22.81 tokens/sec, so the target is roughly a 62.8x speedup.
- ✅ Continuous Batching - Dynamic batching with prefill/decode separation (10-50x speedup)
- ✅ KV Cache Reuse - Efficient memory management with block-based allocation (20-30x speedup)
- ✅ CUDA Graphs - Pre-captured execution graphs for the decode phase (2-3x speedup)
- ✅ Flash Attention - Memory-efficient attention with O(N) memory vs O(N²) (1.5-2x speedup)
- ✅ Prefix Caching - Hash-based deduplication for shared prompt prefixes
- ✅ Kernel Fusion - torch.compile for fused operations (1.3x speedup)
- ✅ Tensor Parallelism - Multi-GPU support via shared memory IPC
```
allmos_v2/
├── config.py                 # Centralized configuration
├── sampling_params.py        # Generation parameters
├── llm.py                    # User-facing API
│
├── engine/                   # Core inference components
│   ├── types.py              # Abstract base classes
│   ├── sequence.py           # Sequence state management
│   ├── scheduler.py          # Continuous batching scheduler
│   ├── model_runner.py       # CUDA graph + model execution
│   └── llm_engine.py         # High-level orchestrator
│
├── memory/                   # Memory management
│   ├── types.py              # BlockManager ABC
│   └── block_manager.py      # Prefix caching implementation
│
├── layers/                   # Optimized neural network layers
│   ├── attention.py          # Flash Attention with KV cache
│   ├── sampler.py            # GPU-based token sampling
│   ├── layernorm.py          # Fused RMSNorm
│   ├── activation.py         # Fused SiLU
│   ├── rotary_embedding.py   # Rotary position embeddings
│   ├── linear.py             # Tensor parallel linear layers
│   └── embed_head.py         # Vocab parallel embedding/LM head
│
├── models/                   # Model implementations
│   └── qwen3.py              # Qwen3 architecture
│
└── utils/                    # Utilities
    ├── context.py            # Attention context management
    └── loader.py             # Weight loading from HuggingFace
```
Requirements:

- Python 3.10+
- CUDA 12.1+
- GPU with compute capability 8.0+ (Ampere or newer)
```
pip install -r requirements.txt
```

Note: flash-attn requires GLIBC 2.32+. If you have GLIBC 2.31 (Debian 11), you can:

- Use Docker with Ubuntu 22.04+
- Compile flash-attn from source (30+ minutes)
- Set `enforce_eager=True` to disable CUDA graphs and use standard attention
Quick start:

```python
from llm import LLM
from sampling_params import SamplingParams

# Initialize engine
llm = LLM("~/huggingface/Qwen3-0.6B/")

# Generate
outputs = llm.generate(
    prompts=["Introduce yourself", "What is 2+2?"],
    sampling_params=SamplingParams(temperature=0.8, max_tokens=100),
)

# Print results
for output in outputs:
    print(output['text'])
```

Run the example with `python example.py`, or run the benchmark with `python bench.py`. Expected benchmark output:

```
Total: 133966tok, Time: 77.04s, Throughput: 1738.82tok/s
```
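For reference, a minimal throughput measurement in the spirit of `bench.py` could look like the sketch below. The prompt set, token counts, and the `token_ids` output field are assumptions for illustration, not necessarily the benchmark's exact workload or output format.

```python
import time

from llm import LLM
from sampling_params import SamplingParams

llm = LLM("~/huggingface/Qwen3-0.6B/")

# Hypothetical workload: 256 prompts, up to 256 generated tokens each.
prompts = [f"Write a short paragraph about topic {i}." for i in range(256)]
sp = SamplingParams(temperature=0.8, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts=prompts, sampling_params=sp)
elapsed = time.perf_counter() - start

# Count generated tokens only (assumes each output dict carries its token ids).
total = sum(len(o["token_ids"]) for o in outputs)
print(f"Total: {total}tok, Time: {elapsed:.2f}s, Throughput: {total / elapsed:.2f}tok/s")
```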
Engine configuration:

```python
from llm import LLM

llm = LLM(
    model="~/huggingface/Qwen3-0.6B/",

    # Memory management
    max_model_len=4096,            # Maximum sequence length
    max_num_seqs=512,              # Max concurrent sequences
    max_num_batched_tokens=16384,  # Max tokens per batch
    gpu_memory_utilization=0.9,    # GPU memory fraction
    kvcache_block_size=256,        # KV cache block size

    # Optimizations
    enable_cuda_graphs=True,       # Use CUDA graphs (2-3x speedup)
    enable_prefix_caching=True,    # Hash-based prefix caching
    enforce_eager=False,           # Set True to disable CUDA graphs

    # Parallelism
    tensor_parallel_size=1,        # Number of GPUs
)
```

Sampling parameters:

```python
from sampling_params import SamplingParams

sp = SamplingParams(
    temperature=1.0,    # Sampling temperature (> 0)
    max_tokens=64,      # Maximum tokens to generate
    ignore_eos=False,   # If True, keep generating past the EOS token
)
```

All components implement abstract base classes from `engine/types.py`:
- Scheduler: Manages sequence scheduling and batching
- ModelRunner: Executes model forward passes
- BlockManager: Manages KV cache memory
- LLMEngine: High-level orchestration
This enables:
- Easy testing of individual components
- Swapping implementations (e.g., different schedulers)
- Clear separation of concerns
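As a rough illustration of the shape of these interfaces (method names and signatures here are guesses for illustration; `engine/types.py` is the authoritative definition):

```python
from abc import ABC, abstractmethod


class Scheduler(ABC):
    """Decides which sequences run next and whether the step is prefill or decode."""

    @abstractmethod
    def add(self, seq) -> None:
        """Queue a new sequence for execution."""

    @abstractmethod
    def schedule(self) -> tuple[list, bool]:
        """Return the next batch of sequences and an is_prefill flag."""


class ModelRunner(ABC):
    """Owns the model weights and executes forward passes (optionally via CUDA graphs)."""

    @abstractmethod
    def run(self, seqs: list, is_prefill: bool) -> list[int]:
        """Execute one step and return the sampled token id per sequence."""


class BlockManager(ABC):
    """Allocates, shares, and frees fixed-size KV cache blocks."""

    @abstractmethod
    def allocate(self, seq) -> None: ...

    @abstractmethod
    def deallocate(self, seq) -> None: ...


class LLMEngine(ABC):
    """Ties scheduler, model runner, and block manager into a generate() loop."""

    @abstractmethod
    def generate(self, prompts: list[str], sampling_params) -> list[dict]: ...
```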
Every optimization from the benchmark report is implemented:
- KV Cache Reuse (`model_runner.py:prepare_prefill/decode`)
  - Eliminates the 32.5x redundant computation of the original Allmos
- Continuous Batching (`scheduler.py`)
  - Separate prefill (variable length) and decode (fixed length) phases
  - Dynamic batching with preemption
- CUDA Graphs (`model_runner.py:capture_cudagraph`), sketched below
  - Pre-captured for batch sizes [1, 2, 4, 8, 16, ..., 512]
  - Eliminates kernel launch overhead
- Flash Attention (`layers/attention.py`)
  - Memory-efficient O(N) vs O(N²)
  - Custom Triton kernel for KV cache storage
- Prefix Caching (`memory/block_manager.py`), sketched below
  - xxhash-based deduplication
  - Reference counting for safe sharing
- Kernel Fusion (throughout `layers/`)
  - `@torch.compile` on hot paths
  - Fused residual + normalization
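The CUDA-graph idea, in its simplest form, is to record the decode-step kernels once per batch size into fixed buffers and then replay them with fresh inputs copied in. The sketch below shows that capture-and-replay pattern with PyTorch's `torch.cuda.CUDAGraph`; the function name, the stand-in `model`, and the buffer shapes are illustrative rather than the engine's actual `capture_cudagraph` code.

```python
import torch


@torch.no_grad()
def capture_decode_graph(model, batch_size: int, hidden_size: int):
    """Capture one decode step for a fixed batch size, then replay it with new inputs."""
    # Static buffer: graph replay always reads this exact tensor.
    static_in = torch.zeros(batch_size, hidden_size, device="cuda")

    # Warm up on a side stream before capture (the standard CUDA graph recipe).
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        model(static_in)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = model(static_in)  # kernels recorded here, not re-launched later

    def run(x: torch.Tensor) -> torch.Tensor:
        static_in.copy_(x)   # copy fresh inputs into the captured buffer
        graph.replay()       # relaunch all recorded kernels in one shot
        return static_out

    return run
```

With graphs captured once per batch size in [1, 2, 4, ..., 512], the decode loop can replay the graph matching the current batch instead of paying per-kernel launch overhead on every step.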
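Prefix caching hinges on giving each full KV cache block a content hash that also folds in the previous block's hash, so identical prompt prefixes resolve to identical block chains and simply bump a reference count. The following is a simplified sketch of that hashing-plus-refcount scheme, not the actual `block_manager.py` code.

```python
import xxhash

BLOCK_SIZE = 256  # tokens per KV cache block (cf. kvcache_block_size)


def block_hash(token_ids: list[int], prefix_hash: int = -1) -> int:
    """Hash one full block of token ids, chained to the previous block's hash."""
    h = xxhash.xxh64()
    h.update(prefix_hash.to_bytes(8, "little", signed=True))
    h.update(str(token_ids).encode())
    return h.intdigest()


class SimplePrefixCache:
    """Toy block allocator: blocks with identical hashes share one physical block."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.hash_to_block: dict[int, int] = {}  # content hash -> block id
        self.block_to_hash: dict[int, int] = {}  # block id -> content hash
        self.ref_counts: dict[int, int] = {}     # block id -> active users

    def allocate(self, h: int) -> tuple[int, bool]:
        """Return (block_id, cache_hit); reuse an existing block when hashes match."""
        if h in self.hash_to_block:
            block_id = self.hash_to_block[h]
            self.ref_counts[block_id] += 1
            return block_id, True
        block_id = self.free_blocks.pop()
        self.hash_to_block[h] = block_id
        self.block_to_hash[block_id] = h
        self.ref_counts[block_id] = 1
        return block_id, False

    def free(self, block_id: int) -> None:
        """Drop one reference; reclaim the block only when no sequence uses it."""
        self.ref_counts[block_id] -= 1
        if self.ref_counts[block_id] == 0:
            del self.hash_to_block[self.block_to_hash.pop(block_id)]
            self.free_blocks.append(block_id)
```

Because each hash is chained to its predecessor, a match on block k implies the entire prefix up to block k is identical, which is what makes the sharing safe.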
The engine also includes:

- Error handling: Assertions and validation throughout
- Memory management: Automatic KV cache sizing
- Distributed: Tensor parallelism via multiprocessing
- Progress tracking: tqdm integration
- Documentation: Extensive inline comments explaining design decisions
Run the test suite with `python test_allmos.py`. This will:
- Test basic generation (single sequence)
- Test batched generation (multiple sequences)
- Test prefix caching (shared prefixes)
- Validate CUDA graph capture
- Measure throughput
Expected results:

- ✅ Single sequence: Generates coherent text
- ✅ Batched generation: Handles multiple sequences
- ✅ Prefix caching: Detects cache hits
- ✅ CUDA graphs: Captured for common batch sizes
- ✅ Throughput: 1400+ tokens/sec
Troubleshooting:

Error: `ImportError: /lib/x86_64-linux-gnu/libc.so.6: version 'GLIBC_2.32' not found`
Solution: Set enforce_eager=True to use standard PyTorch attention:
```python
llm = LLM(model_path, enforce_eager=True)
```

Error: GPU out of memory

Solution: Reduce `gpu_memory_utilization` or `max_num_seqs`:

```python
llm = LLM(model_path, gpu_memory_utilization=0.8, max_num_seqs=256)
```

A slow first run is expected, since it includes:
- Model loading
- CUDA graph capture
- torch.compile compilation
Subsequent runs will be much faster.
Benchmark setup:

- Hardware: GCP L4 GPU (23GB VRAM, Ada Lovelace)
- Model: Qwen3-0.6B (600M parameters)
- Workload: 256 sequences, 100-1024 tokens each
- Settings: CUDA graphs enabled, prefix caching enabled
| System | Throughput (tokens/sec) | Speedup vs Allmos |
|---|---|---|
| Original Allmos | 22.81 | 1.0x (baseline) |
| allmos_v2 | 1,739 | 76.2x |
| nano-vLLM | 1,760 | 77.1x |
This codebase is part of a research project studying the effectiveness of AI coding assistants in developing and optimizing systems software. Key research questions:
- Can coding assistants implement complex optimizations (CUDA graphs, prefix caching)?
- Does modular architecture help or hinder optimization?
- How does code quality compare to human-engineered systems (nano-vLLM)?
See BENCHMARK_REPORT.md in the parent directory for detailed analysis.
Architecture and optimizations inspired by:
- nano-vLLM - Efficient implementation by DeepSeek engineers
- vLLM - Original continuous batching and PagedAttention
- Flash Attention - Memory-efficient attention by Tri Dao et al.
MIT License - See LICENSE file for details
If you use this code in your research, please cite:
```bibtex
@software{allmos_v2,
  author      = {Vinamra Agarwal},
  title       = {allmos_v2: Modular High-Performance LLM Inference},
  year        = {2025},
  institution = {University of Washington, Systems Lab}
}
```