I bridge the gap between Python's flexibility and hardware's raw power. While others fine-tune models, I optimize the infrastructure they run on. My focus is squeezing every last FLOP out of GPUs by writing custom kernels in CUDA C++ and OpenAI Triton.
I recently identified a critical memory bottleneck in NVIDIA's cuDF library (Pandas on GPU). Standard string splitting operations were causing massive memory pressure.
- Problem: `split().get(n)` was materializing millions of unnecessary tokens.
- Solution: Engineered a custom fused "Lazy Extraction" CUDA kernel using `CuPy` and `Cython` bindings.
- Result: ~160x Speedup (137ms → 0.85ms) on Tesla T4 GPUs with Zero Memory Overhead.
- Status: Proposed and contributed the fix to `libcudf`.
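Below is a minimal sketch of the lazy-extraction idea, not the actual patch proposed to `libcudf`: instead of splitting every string and keeping all the tokens, each thread scans its row once and records only the byte span of the requested token. The column layout (one contiguous character buffer plus a row-offsets array) mirrors how libcudf stores strings; the kernel name and parameters are illustrative.

```cuda
// Sketch of "lazy extraction": each thread handles one row, walks its
// characters, and records only the [start, end) span of the n-th token
// instead of materializing every token produced by split().
// Kernel and parameter names are illustrative, not the upstream API.
__global__ void lazy_get_token(const char* chars,    // all rows' characters, back to back
                               const int*  offsets,  // row i occupies [offsets[i], offsets[i+1])
                               int num_rows,
                               char delimiter,
                               int n,                 // token index to extract
                               int* out_start,        // per-row start offset of token n, -1 if absent
                               int* out_end)          // per-row end offset of token n
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    int begin = offsets[row];
    int end   = offsets[row + 1];

    out_start[row] = -1;
    out_end[row]   = -1;

    int token_idx   = 0;
    int token_start = begin;
    for (int i = begin; i <= end; ++i) {
        // A token ends at a delimiter or at the end of the row.
        if (i == end || chars[i] == delimiter) {
            if (token_idx == n) {
                out_start[row] = token_start;
                out_end[row]   = i;
                return;  // stop early: later tokens are never touched
            }
            ++token_idx;
            token_start = i + 1;
        }
    }
}
```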
| Languages & Compute | AI & Deep Learning | Systems & Tools |
|---|---|---|
I don't just use libraries; I understand how they work down at the metal.
- GPU Architecture: Memory Coalescing (toy kernel after this list), Shared Memory Banking (avoiding conflicts), Warp Divergence, Tensor Cores, Occupancy tuning.
- Operating Systems: Virtual Memory & Paging, Concurrency (Mutex/Semaphores/Deadlocks), Process vs Thread memory models.
- Model Optimization: Kernel Fusion, KV-Cache Management (PagedAttention), Quantization (INT4/INT8), FlashAttention logic.
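To make the first of those concrete, here is a toy pair of kernels (written for this README, names are mine) contrasting coalesced and strided global-memory access: adjacent threads touching adjacent addresses let the hardware merge a warp's loads into a few wide transactions, while the strided version scatters them across many cache lines.

```cuda
// Coalesced: thread t of a warp reads element base + t, so a warp's 32
// 4-byte loads can be serviced by a handful of wide memory transactions.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: adjacent threads read addresses `stride` elements apart, so the
// same warp now touches many separate cache lines and effective bandwidth drops.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```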
| Project | Description | Impact |
|---|---|---|
| FastInfer | Custom CUDA inference backend for Llama-3.2 | 30.3 tokens/sec on a T4 GPU (broke the 30 TPS barrier) |
| Triton-Kernels | Re-implementation of RMSNorm & MatMul (RMSNorm sketch below) | Benchmarking Python-based GPU programming vs Raw CUDA |
| Hy-LoRA | Hybrid SVD-LoRA fine-tuning strategy | 51% reduction in trainable params with minimal accuracy loss |
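For a flavour of the kind of kernel Triton-Kernels benchmarks, here is a minimal CUDA RMSNorm forward pass. It is a sketch written for this README rather than the repository's code, and it assumes a row-major `[rows, cols]` float input, one block per row, and a power-of-two block size.

```cuda
// y[r, :] = x[r, :] / sqrt(mean(x[r, :]^2) + eps) * weight
// Illustrative launch: rmsnorm_forward<<<rows, 256, 256 * sizeof(float)>>>(x, w, y, cols, 1e-6f);
__global__ void rmsnorm_forward(const float* __restrict__ x,
                                const float* __restrict__ weight,
                                float* __restrict__ y,
                                int cols, float eps)
{
    extern __shared__ float partial[];                 // one slot per thread
    const float* row_in  = x + (size_t)blockIdx.x * cols;
    float*       row_out = y + (size_t)blockIdx.x * cols;

    // Each thread accumulates the sum of squares over a strided slice of the row.
    float acc = 0.0f;
    for (int c = threadIdx.x; c < cols; c += blockDim.x) {
        float v = row_in[c];
        acc += v * v;
    }
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x assumed to be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    float inv_rms = rsqrtf(partial[0] / cols + eps);

    // Fused normalize-and-scale pass: no intermediate tensor is written out.
    for (int c = threadIdx.x; c < cols; c += blockDim.x) {
        row_out[c] = row_in[c] * inv_rms * weight[c];
    }
}
```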