I bridge the gap between Python's flexibility and hardware's raw power. While others fine-tune models, I optimize the infrastructure they run on. My focus is squeezing every last FLOP out of GPUs by writing custom kernels in CUDA C++ and OpenAI Triton.
I recently identified a critical memory bottleneck in NVIDIA's cuDF library (Pandas on GPU). Standard string splitting operations were causing massive memory pressure.
- Problem: `split().get(n)` was materializing millions of unnecessary tokens.
- Solution: Engineered a custom fused "Lazy Extraction" CUDA kernel using `CuPy` and `Cython` bindings.
- Result: ~160x Speedup (137ms → 0.85ms) on Tesla T4 GPUs with Zero Memory Overhead.
- Status: Proposed and contributed the fix to `libcudf`.
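Below is a minimal sketch of the lazy-extraction idea, not the actual patch proposed to `libcudf`: instead of splitting every string and keeping all the tokens, each thread scans its row once and records only the byte span of the requested token. The column layout (one contiguous character buffer plus a row-offsets array) mirrors how libcudf stores strings; the kernel name and parameters are illustrative.

```cuda
// Sketch of "lazy extraction": each thread handles one row, walks its
// characters, and records only the [start, end) span of the n-th token
// instead of materializing every token produced by split().
// Kernel and parameter names are illustrative, not the upstream API.
__global__ void lazy_get_token(const char* chars,    // all rows' characters, back to back
                               const int*  offsets,  // row i occupies [offsets[i], offsets[i+1])
                               int num_rows,
                               char delimiter,
                               int n,                 // token index to extract
                               int* out_start,        // per-row start offset of token n, -1 if absent
                               int* out_end)          // per-row end offset of token n
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    int begin = offsets[row];
    int end   = offsets[row + 1];

    out_start[row] = -1;
    out_end[row]   = -1;

    int token_idx   = 0;
    int token_start = begin;
    for (int i = begin; i <= end; ++i) {
        // A token ends at a delimiter or at the end of the row.
        if (i == end || chars[i] == delimiter) {
            if (token_idx == n) {
                out_start[row] = token_start;
                out_end[row]   = i;
                return;  // stop early: later tokens are never touched
            }
            ++token_idx;
            token_start = i + 1;
        }
    }
}
```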
| Languages & Compute | AI & Deep Learning | Systems & Tools |
|---|---|---|
I don't just use libraries; I understand how they work down at the metal.
- GPU Architecture: Memory Coalescing (toy kernel after this list), Shared Memory Banking (avoiding conflicts), Warp Divergence, Tensor Cores, Occupancy tuning.
- Operating Systems: Virtual Memory & Paging, Concurrency (Mutex/Semaphores/Deadlocks), Process vs Thread memory models.
- Model Optimization: Kernel Fusion, KV-Cache Management (PagedAttention), Quantization (INT4/INT8), FlashAttention logic.
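To make the first of those concrete, here is a toy pair of kernels (written for this README, names are mine) contrasting coalesced and strided global-memory access: adjacent threads touching adjacent addresses let the hardware merge a warp's loads into a few wide transactions, while the strided version scatters them across many cache lines.

```cuda
// Coalesced: thread t of a warp reads element base + t, so a warp's 32
// 4-byte loads can be serviced by a handful of wide memory transactions.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: adjacent threads read addresses `stride` elements apart, so the
// same warp now touches many separate cache lines and effective bandwidth drops.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```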
| Project | Description | Impact |
|---|---|---|
| FastInfer | Custom CUDA inference backend for Llama-3.2 | 30.3 tokens/sec on a T4 GPU (broke the 30 TPS barrier) |
| Triton-Kernels | Re-implementation of RMSNorm & MatMul (RMSNorm sketch below) | Benchmarking Python-based GPU programming vs Raw CUDA |
| Hy-LoRA | Hybrid SVD-LoRA fine-tuning strategy | 51% reduction in trainable params with minimal accuracy loss |
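For a flavour of the kind of kernel Triton-Kernels benchmarks, here is a minimal CUDA RMSNorm forward pass. It is a sketch written for this README rather than the repository's code, and it assumes a row-major `[rows, cols]` float input, one block per row, and a power-of-two block size.

```cuda
// y[r, :] = x[r, :] / sqrt(mean(x[r, :]^2) + eps) * weight
// Illustrative launch: rmsnorm_forward<<<rows, 256, 256 * sizeof(float)>>>(x, w, y, cols, 1e-6f);
__global__ void rmsnorm_forward(const float* __restrict__ x,
                                const float* __restrict__ weight,
                                float* __restrict__ y,
                                int cols, float eps)
{
    extern __shared__ float partial[];                 // one slot per thread
    const float* row_in  = x + (size_t)blockIdx.x * cols;
    float*       row_out = y + (size_t)blockIdx.x * cols;

    // Each thread accumulates the sum of squares over a strided slice of the row.
    float acc = 0.0f;
    for (int c = threadIdx.x; c < cols; c += blockDim.x) {
        float v = row_in[c];
        acc += v * v;
    }
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x assumed to be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    float inv_rms = rsqrtf(partial[0] / cols + eps);

    // Fused normalize-and-scale pass: no intermediate tensor is written out.
    for (int c = threadIdx.x; c < cols; c += blockDim.x) {
        row_out[c] = row_in[c] * inv_rms * weight[c];
    }
}
```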