---
title: "Efficient and Scalable Distributed LLM Training: Hiding Communication Overhead"

event: Weekly Talk
event_url:

location: COM3-B1-15 - Meeting Rm 92
address:
  street:
  city:
  region:
  postcode:
  country: Singapore

summary:
abstract: "Training Large Language Models (LLMs) is often inefficient due to high communication overhead, resulting in sub-50% Model FLOPS Utilization (MFU). In this talk, I will discuss how to build a cost-efficient and scalable machine learning system, using DHelix as an example. Inspired by the DNA double-helix structure, DHelix improves efficiency through Strand Interleaving (SI), which overlaps forward and backward passes to maximize computation-communication concurrency. It integrates seamlessly with all parallelism strategies, including pipeline parallelism via a model folding design.

Experiments on Llama, GPT, and Phi MoE models across A40, A800, and H100 clusters demonstrate up to 58% MFU on A40 and 71% on A800, significantly outperforming state-of-the-art methods. I will explore DHelix’s design, optimization techniques, and its broader impact on distributed LLM training."

# Talk start and end times.
# End time can optionally be hidden by prefixing the line with `#`.
date: "2025-02-12T14:00:00Z"
date_end: "2025-02-12T15:00:00Z"
all_day: false

# Schedule page publish date (NOT talk date).
publishDate: "2017-01-01T00:00:00Z"

authors: [Chaoyi Ruan]
tags: [Weekly Talk]

# Is this a featured talk? (true/false)
featured: false

image:
  caption: 'Image credit: [**Unsplash**](https://unsplash.com/photos/bzdhc5b3Bxs)'
  focal_point: Right

url_code: ""
url_pdf: ""
url_slides: ""
url_video: ""

# Markdown Slides (optional).
# Associate this talk with Markdown slides.
# Simply enter your slide deck's filename without extension.
# E.g. `slides = "example-slides"` references `content/slides/example-slides.md`.
# Otherwise, set `slides = ""`.
slides:

# Projects (optional).
# Associate this post with one or more of your projects.
# Simply enter your project's folder or file name without extension.
# E.g. `projects = ["internal-project"]` references `content/project/deep-learning/index.md`.
# Otherwise, set `projects = []`.
projects:

# Slides can be added in a few ways:
#
# - **Create** slides using Wowchemy's [*Slides*](https://wowchemy.com/docs/managing-content/#create-slides) feature and link using `slides` parameter in the front matter of the talk file
# - **Upload** an existing slide deck to `static/` and link using `url_slides` parameter in the front matter of the talk file
# - **Embed** your slides (e.g. Google Slides) or presentation video on this page using [shortcodes](https://wowchemy.com/docs/writing-markdown-latex/).
#
# Further event details, including page elements such as image galleries, can be added to the body of this page.


---
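The core idea behind Strand Interleaving, as described in the abstract, is to run one strand's communication while the other strand computes, so neither the compute units nor the interconnect sit idle. The toy sketch below is not DHelix's implementation (which operates on real GPU kernels and collectives); it only illustrates the overlap pattern using two concurrent tasks, with `compute` and `communicate` as hypothetical stand-ins for a GPU kernel and a collective operation.

```python
# Illustrative sketch only: `compute` and `communicate` are hypothetical
# stand-ins for a GPU compute kernel and a communication collective
# (e.g. an all-gather); they are not part of DHelix's actual API.
from concurrent.futures import ThreadPoolExecutor
import time


def compute(strand: str, step: int) -> str:
    time.sleep(0.01)  # placeholder for compute work
    return f"{strand}-compute-{step}"


def communicate(strand: str, step: int) -> str:
    time.sleep(0.01)  # placeholder for a communication collective
    return f"{strand}-comm-{step}"


def interleaved(steps: int) -> list[str]:
    """Overlap strand A's communication with strand B's computation.

    Each step launches both tasks concurrently, so the communication
    latency is hidden behind the other strand's computation.
    """
    log = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for step in range(steps):
            fa = pool.submit(communicate, "A", step)  # A communicates...
            fb = pool.submit(compute, "B", step)      # ...while B computes
            log.append(fa.result())
            log.append(fb.result())
    return log


print(interleaved(2))
```

In a real system the interleaving schedule must also respect data dependencies between a strand's own compute and communication phases; this sketch ignores that and only shows the concurrency structure.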