
Commit a78bf5c

Revise Colab setup instructions for TPU: streamline steps for importing notebooks and connecting to local Jupyter Lab.
1 parent 6f47311 commit a78bf5c

File tree: 3 files changed (+57, -72 lines)

docs/guides.md

Lines changed: 1 addition & 0 deletions
@@ -23,4 +23,5 @@ guides/optimization.md
 guides/data_input_pipeline.md
 guides/checkpointing_solutions.md
 guides/monitoring_and_debugging.md
+guides/run_python_notebook.md
 ```

docs/tutorials/posttraining/how_to_run_colabs.md renamed to docs/guides/run_python_notebook.md

Lines changed: 55 additions & 70 deletions
@@ -1,13 +1,12 @@
-# Connect notebooks to TPUs
+# Run MaxText Python Notebooks on TPUs
 
-This guide provides comprehensive instructions for setting up Jupyter Lab on TPU and connecting it to Google Colab for running MaxText examples.
+This guide provides clear, step-by-step instructions for getting started with Python notebooks on the two most popular platforms: Google Colab and a local JupyterLab environment.
 
 ## 📑 Table of Contents
 
 - [Prerequisites](#prerequisites)
-- [Method 1: Google Colab with TPU (Recommended)](#method-1-google-colab-with-tpu-recommended)
+- [Method 1: Google Colab with TPU](#method-1-google-colab-with-tpu)
 - [Method 2: Local Jupyter Lab with TPU](#method-2-local-jupyter-lab-with-tpu)
-- [Method 3: Colab + Local Jupyter Lab Hybrid](#method-3-colab--local-jupyter-lab-hybrid)
 - [Available Examples](#available-examples)
 - [Common Pitfalls & Debugging](#common-pitfalls--debugging)
 - [Support & Resources](#support-and-resources)
@@ -17,77 +16,82 @@ This guide provides comprehensive instructions for setting up Jupyter Lab on TPU
 
 Before starting, make sure you have:
 
+- ✅ Basic familiarity with Jupyter, Python, and Git
+
+**For Method 2 (Local Jupyter Lab) only:**
 - ✅ A Google Cloud Platform (GCP) account with billing enabled
 - ✅ TPU quota available in your region (check under IAM & Admin → Quotas)
-- Basic familiarity with Jupyter, Python, and Git
-- ✅ gcloud CLI installed locally if you plan to use Method 2 or 3
+- `tpu.nodes.create` permission to create a TPU VM
+- ✅ gcloud CLI installed locally
 - ✅ Firewall rules open for port 8888 (Jupyter) if accessing directly
 
-## Method 1: Google Colab with TPU (Recommended)
+## Method 1: Google Colab with TPU
 
-This is the fastest way to run MaxText without managing infrastructure.
+This is the fastest way to run MaxText Python notebooks without managing infrastructure.
 
-### Step 1: Open Google Colab
+**⚠️ IMPORTANT NOTE ⚠️**
+The free tier of Google Colab provides access to `v5e-1 TPU`, but this access is not guaranteed and is subject to availability and usage limits.
 
-1. Go to [Google Colab](https://colab.research.google.com/)
-2. Sign in → New Notebook
+Before proceeding, please verify that the specific notebook you are running works reliably on the free-tier TPU resources. If you encounter frequent disconnections or resource limitations, you may need to:
 
-### Step 2: Enable TPU Runtime
+* Upgrade to a Colab Pro or Pro+ subscription for more stable and powerful TPU access.
 
-1. **Runtime** → **Change runtime type**
-2. Set **Hardware accelerator** → **TPU**
-3. Select TPU version:
-- **v5e-8** → recommended for most MaxText examples, but it's a paid option
-- **v5e-1** → free tier option (slower, but works for Qwen-0.6B demos)
-4. Click **Save**
+* Move to the local Jupyter Lab setup method with access to a powerful TPU machine.
 
-### Step 3: Upload & Prepare MaxText
+### Step 1: Choose an Example
+1.a. Visit the [MaxText examples directory](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/MaxText/examples) on GitHub.
 
-Upload notebooks or mount your GitHub repo
+1.b. Find the notebook you want to run (e.g., `sft_qwen3_demo.ipynb`) and copy its URL.
 
-> **Note:** In Colab, the repo root will usually be `/content/maxtext`
+### Step 2: Import into Colab
+2.a. Go to [Google Colab](https://colab.research.google.com/) and sign in.
 
-**Example:**
-```bash
-!git clone https://github.com/AI-Hypercomputer/maxtext.git
-%cd maxtext
-```
+2.b. Select **File** -> **Open Notebook**.
+
+2.c. Select the **GitHub** tab.
 
-### Step 4: Run Examples
+2.d. Paste the target `.ipynb` link you copied in step 1.b and press Enter.
 
-1. Open `src/MaxText/examples/`
-2. Try:
-- `sft_qwen3_demo.ipynb`
-- `sft_llama3_demo.ipynb`
-- `rl_llama3_demo.ipynb` (GRPO/GSPO training)
+### Step 3: Enable TPU Runtime
 
+3.a. **Runtime** → **Change runtime type**
 
-> **Tip:** If Colab disconnects, re-enable TPU and re-run setup cells. Save checkpoints to GCS or Drive.
+3.b. Select your desired **TPU** under **Hardware accelerator**
 
-> **Tip:** If Colab asks to restart session - do it and continue to run cells
+3.c. Click **Save**
+
+### Step 4: Run the Notebook
+Follow the instructions within the notebook cells to install dependencies and run the training/inference.
 
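The new Step 4 defers setup to the notebook cells themselves. For orientation, here is a minimal sketch of what such setup cells often look like, modeled on the example the old version of this page carried; the `pip install -r requirements.txt` step is an assumption, so defer to whatever the notebook you imported actually runs.

```bash
# Illustrative Colab setup cells only; each example notebook ships its own setup cells.
!git clone https://github.com/AI-Hypercomputer/maxtext.git   # clone MaxText into /content/maxtext
%cd maxtext                                                   # work from the repo root
!pip install -r requirements.txt                              # assumed dependency step; follow the notebook's own install cell
```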
 ## Method 2: Local Jupyter Lab with TPU
 
-This method gives you more control and is better for long training runs.
+You can run Python notebooks on a local JupyterLab environment, giving you full control over your computing resources.
 
 ### Step 1: Set Up TPU VM
 
 In Google Cloud Console:
 
-1. **Compute Engine** → **TPU** → **Create TPU Node**
-2. Example config:
+1.a. **Compute Engine** → **TPU** → **Create TPU**
+
+1.b. Example config:
 - **Name:** `maxtext-tpu-node`
-- **TPU type:** `v5e-8` (or `v6p-8` for newer hardware)
-- **Runtime Version:** `tpu-ubuntu-alpha-*` (matches your VM image)
+- **TPU type:** Choose your desired TPU type
+- **Runtime Version:** `tpu-ubuntu2204-base` (or other compatible runtime)
 
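The console flow above can also be scripted. Here is a hedged CLI sketch of the same example config, assuming a `v5litepod-8` accelerator type and the runtime version named above; swap in your own zone, accelerator type, and version.

```bash
# Illustrative gcloud equivalent of the console "Create TPU" config above
# (the zone and accelerator type below are placeholders/assumptions).
gcloud compute tpus tpu-vm create maxtext-tpu-node \
  --zone=YOUR_ZONE \
  --accelerator-type=v5litepod-8 \
  --version=tpu-ubuntu2204-base
```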
-### Step 2: Connect to TPU VM
+### Step 2: Connect with Port Forwarding
+Run the following command on your local machine:
+> **Note**: The `--` separator before the `-L` flag is required. This tunnels the remote port 8888 to your local machine securely.
 
 ```bash
-gcloud compute tpus tpu-vm ssh maxtext-tpu-node --zone=YOUR_ZONE
+gcloud compute tpus tpu-vm ssh maxtext-tpu-node --zone=YOUR_ZONE -- -L 8888:localhost:8888
 ```
 
+> **Note**: If you get a "bind: Address already in use" error, it means port 8888 is busy on your local computer. Change the first number to a different port, e.g., -L 9999:localhost:8888. You will then access Jupyter at localhost:9999.
+
 ### Step 3: Install Dependencies
 
+Run the following commands on your TPU-VM:
+
 ```bash
 sudo apt update && sudo apt upgrade -y
 sudo apt install python3-pip python3-dev git -y
@@ -100,23 +104,15 @@ pip3 install jupyterlab
 jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
 ```
 
-Copy the URL with token from terminal
+### Step 5: Access the Notebook
+5.a. Look at the terminal output for a URL that looks like: `http://127.0.0.1:8888/lab?token=...`
 
-### Step 5: Secure Access
+5.b. Copy that URL.
 
-#### Option A: SSH Tunnel (Recommended)
+5.c. Paste it into your **local computer's browser**.
+* **Important:** If you changed the port in Step 2 (e.g., to `9999`), you must manually replace `8888` in the URL with `9999`.
+* *Example:* `http://127.0.0.1:9999/lab?token=...`
 
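Once the notebook is open in your browser, a quick sanity check that the kernel actually sees the TPU is the snippet below; it assumes JAX is already installed on the VM, as the MaxText dependencies require.

```bash
# Run on the TPU VM, or paste the quoted Python into a notebook cell.
# It should list TPU devices rather than only CPU.
python3 -c "import jax; print(jax.devices())"
```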
-```bash
-gcloud compute tpus tpu-vm ssh maxtext-tpu-node --zone=YOUR_ZONE -- -L 8888:localhost:8888
-```
-
-Then open → `http://localhost:8888`
-
-
-## Method 3: Colab + Local Jupyter Lab Hybrid
-
-Set up Jupyter Lab as in step 2.
-Use the link for Jupyter Lab as a link for "Connect to a local runtime" in Collab - at the dropdown where you select the runtime.
 
 ## Available Examples
 
@@ -127,18 +123,7 @@ Use the link for Jupyter Lab as a link for "Connect to a local runtime" in Colla
 
 ### Reinforcement Learning (GRPO/GSPO) Training
 
-- **`rl_llama3_demo.ipynb`** → GRPO/GSPO training on math dataset (Colab/notebook)
-
-#### GRPO/GSPO Colab Usage
-
-For interactive GRPO or GSPO training in Google Colab or Jupyter:
-
-1. **Open** `src/MaxText/examples/rl_llama3_demo.ipynb`
-2. **Enable TPU runtime** (Runtime → Change runtime type → TPU)
-3. **Set `LOSS_ALGO`** to `"grpo"` for GRPO or `"gspo-token"` for GSPO
-4. **Run cells** to train Llama3.1-8B with GRPO or GSPO on GSM8K dataset
-
-> **Note:** GRPO (Group Relative Policy Optimization) optimizes each token, while GSPO (Group Sequence Policy Optimization) optimizes the whole sequence. The difference is controlled by the `loss_algo` parameter.
+- **`rl_llama3_demo.ipynb`** → GRPO/GSPO training on [OpenAI's GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k)
 
 #### GRPO/GSPO Python Script Usage - local runs
 
@@ -176,22 +161,22 @@ python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
 
 #### GRPO/GSPO Python Script Usage - cluster runs
 
-For running on clusters, please refer to `maxtext/docs/tutorials/grpo_with_pathways.md`
+For running on clusters, please refer to [Reinforcement Learning on multi-host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl_on_multi_host.html)
 
 
 ## Common Pitfalls & Debugging
 
 | Issue | Solution |
 |-------|----------|
-| ❌ TPU runtime mismatch | Check TPU runtime version matches VM image (`tpu-ubuntu-alpha-*`) |
+| ❌ TPU runtime mismatch | Check TPU runtime version matches VM image |
 | ❌ Colab disconnects | Save checkpoints to GCS or Drive regularly |
 | ❌ "RESOURCE_EXHAUSTED" errors | Use smaller batch size or v5e-8 instead of v5e-1 |
 | ❌ Firewall blocked | Ensure port 8888 open, or always use SSH tunneling |
 | ❌ Path confusion | In Colab use `/content/maxtext`; in TPU VM use `~/maxtext` |
 
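For the runtime-mismatch row above, one way to see what a TPU VM is actually running is to describe it from the CLI; a sketch reusing the node name from this guide:

```bash
# Shows the TPU VM's accelerator type, state, and runtime version as reported by the API.
gcloud compute tpus tpu-vm describe maxtext-tpu-node --zone=YOUR_ZONE
```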
 ## Support and Resources
 
-- 📘 [MaxText Documentation](https://github.com/AI-Hypercomputer/maxtext)
+- 📘 [MaxText Documentation](https://maxtext.readthedocs.io/)
 - 💻 [Google Colab](https://colab.research.google.com)
 - ✅ [Cloud TPU Docs](https://cloud.google.com/tpu/docs)
 - 🧩 [Jupyter Lab](https://jupyterlab.readthedocs.io)

docs/tutorials/post_training_index.md

Lines changed: 1 addition & 2 deletions
@@ -49,14 +49,13 @@ Pathways supercharges RL with:
 
 ## Getting started
 
-Start your Post-Training journey through quick experimentation with our [Google Colabs](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/how_to_run_colabs.html) or our Production level tutorials for [SFT](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/sft_on_multi_host.html) and [RL](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl_on_multi_host.html).
+Start your Post-Training journey through quick experimentation with [Python Notebooks](https://maxtext.readthedocs.io/en/latest/guides/run_python_notebook.html) or our Production level tutorials for [SFT](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/sft_on_multi_host.html) and [RL](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl_on_multi_host.html).
 
 ## More tutorials
 
 ```{toctree}
 :maxdepth: 1
 
-posttraining/how_to_run_colabs.md
 posttraining/sft.md
 posttraining/sft_on_multi_host.md
 posttraining/rl.md
