
Hi there, I'm Umang πŸ‘‹

Systems Engineer | HPC & AI Infrastructure | NVIDIA RAPIDS Contributor

I bridge the gap between Python's flexibility and hardware's raw power. While others fine-tune models, I optimize the infrastructure they run on. My focus is squeezing every last FLOP out of GPUs by writing custom kernels in CUDA C++ and OpenAI Triton.


πŸ”₯ Recent Impact: NVIDIA RAPIDS (cuDF)

I recently identified a critical memory bottleneck in NVIDIA's cuDF library (Pandas on GPU). Standard string splitting operations were causing massive memory pressure.

  • Problem: split().get(n) was materializing millions of unnecessary tokens.
  • Solution: Engineered a custom fused "lazy extraction" CUDA kernel using CuPy and Cython bindings; the core idea is sketched after this list.
  • Result: ~160x speedup (137 ms β†’ 0.85 ms) on Tesla T4 GPUs with zero additional memory overhead.
  • Status: Proposed and contributed the fix to libcudf.
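
A minimal sketch of the idea, assuming an Arrow-style chars/offsets layout (the names and null handling here are illustrative assumptions, not the actual kernel contributed to libcudf): each thread walks its row's string once and records only the offsets of the requested token, so the full token list is never allocated.

```cuda
// Hypothetical sketch of "lazy extraction": find token n per row in one
// pass and emit only its offsets. Not the actual libcudf implementation.
__global__ void lazy_get_token(const char* chars,    // concatenated string data
                               const int* offsets,   // row i spans [offsets[i], offsets[i+1])
                               int num_rows,
                               char delim,
                               int n,                 // token index requested by get(n)
                               int* out_begin,        // start offset of token n, or -1
                               int* out_end)          // end offset of token n, or -1
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    int pos       = offsets[row];
    int end       = offsets[row + 1];
    int tok       = 0;
    int tok_begin = pos;

    // Single pass: stop as soon as token n is fully delimited.
    for (; pos < end; ++pos) {
        if (chars[pos] == delim) {
            if (tok == n) break;   // token n ends at this delimiter
            ++tok;
            tok_begin = pos + 1;   // next token starts after the delimiter
        }
    }

    if (tok == n) {                // found: emit offsets only, copy nothing
        out_begin[row] = tok_begin;
        out_end[row]   = pos;      // pos == end when token n is the last token
    } else {                       // fewer than n+1 tokens: null result
        out_begin[row] = -1;
        out_end[row]   = -1;
    }
}
```

Because the kernel emits only one offset pair per row, nothing proportional to the token count is ever allocated, which is where the memory savings behind the numbers above would come from.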

πŸ› οΈ The Arsenal (Tech Stack)

| Languages & Compute | AI & Deep Learning | Systems & Tools |
| --- | --- | --- |
| C++ | PyTorch | Linux |
| CUDA | Triton | Docker |
| Python | HuggingFace | Nsight |
| Shell | LLMs | Git |

🧠 Core Competencies (Under the Hood)

I don't just use libraries; I understand how they work down at the metal.

  • GPU Architecture: Memory Coalescing, Shared Memory Banking (avoiding conflicts), Warp Divergence, Tensor Cores, Occupancy tuning (two of these are illustrated in the sketch after this list).
  • Operating Systems: Virtual Memory & Paging, Concurrency (Mutex/Semaphores/Deadlocks), Process vs Thread memory models.
  • Model Optimization: Kernel Fusion, KV-Cache Management (PagedAttention), Quantization (INT4/INT8), FlashAttention logic.
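
To make two of those GPU-architecture points concrete, here is the textbook tiled-transpose pattern (a generic illustration, not code from any project listed here): reads and writes stay coalesced, and the +1 column of shared-memory padding keeps the strided accesses out of bank conflicts.

```cuda
// Tiled matrix transpose: coalesced global access + padded shared memory.
#define TILE 32

__global__ void transpose_tiled(const float* in, float* out,
                                int width, int height)
{
    // TILE+1 padding shifts each tile row into a different bank, so the
    // strided tile[threadIdx.x][...] reads below don't serialize.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced read: adjacent threads touch adjacent global addresses.
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Swap block coordinates so the write is coalesced too.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;

    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```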

πŸš€ Engineering Highlights

| Project | Description | Impact |
| --- | --- | --- |
| FastInfer | Custom CUDA inference backend for Llama-3.2 | 30.3 tokens/sec on a T4 GPU, breaking the 30 TPS barrier |
| Triton-Kernels | Re-implementation of RMSNorm & MatMul | Benchmarks Python-based GPU programming against raw CUDA (see the sketch below) |
| Hy-LoRA | Hybrid SVD-LoRA fine-tuning strategy | 51% reduction in trainable params with minimal accuracy loss |
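
For flavor, the raw-CUDA side of that RMSNorm comparison might look something like the sketch below: a minimal version assuming one 256-thread block per row and a power-of-two reduction, not the actual code in the Triton-Kernels repo.

```cuda
// Minimal RMSNorm forward: y[i] = x[i] / sqrt(mean(x^2) + eps) * w[i].
// Launch with one block per row, e.g.:
//   rmsnorm_kernel<<<num_rows, 256>>>(x, w, y, hidden, 1e-6f);
__global__ void rmsnorm_kernel(const float* x, const float* weight,
                               float* y, int hidden, float eps)
{
    const float* row_in  = x + (size_t)blockIdx.x * hidden;
    float*       row_out = y + (size_t)blockIdx.x * hidden;

    // Block-wide sum of squares (assumes blockDim.x == 256, power of two).
    __shared__ float partial[256];
    float acc = 0.0f;
    for (int i = threadIdx.x; i < hidden; i += blockDim.x)
        acc += row_in[i] * row_in[i];
    partial[threadIdx.x] = acc;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    float inv_rms = rsqrtf(partial[0] / hidden + eps);

    // Normalize and apply the learned per-channel weight.
    for (int i = threadIdx.x; i < hidden; i += blockDim.x)
        row_out[i] = row_in[i] * inv_rms * weight[i];
}
```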

Umang's Stats

Pinned

  1. Triton-Inference-Kernels

    Custom OpenAI Triton kernels for high-performance model inference. Accelerates models on NVIDIA GPUs by combining Triton's productivity with CUDA-level performance.


  2. gpu-systems-playgrund

    GPU systems playground with CUDA kernel experiments and performance profiling.


  3. Veritas-AI-Tracking-Misinformation-with-Autonomous-Agents

    Veritas AI: An autonomous agent crew that scrapes prediction markets to create a RAG-powered chatbot for tracking misinformation and public belief in real-time.


  4. Hy-LoRA-A-Hybrid-SVD-LoRA-Strategy-for-Efficient-LLM-Adaptation

    Achieve >60% LLM compression with near-baseline perplexity using a novel "Compress-then-Adapt" strategy.


  5. cudf

    Forked from rapidsai/cudf

    cuDF - GPU DataFrame Library


  6. cudf-lazy-string-poc

    Proof of Concept: Achieving ~160x speedup on cuDF string extraction (split().get()) using fused lazy CUDA kernels on Tesla T4.
