Commit b8a4cdf

Apply docstrfmt to ReST files

1 parent: d8a8329

19 files changed: +4731, -3848 lines changed

content/0-setup.rst (+58, -54)

@@ -6,123 +6,127 @@ Setup

Local installation
------------------

Since this lesson is taught using an HPC cluster, no software installation on your own
computer is needed.

Running on LUMI
---------------

Interactive job, 1 node, 1 GPU, 1 hour:

.. code-block:: console

    $ salloc -A project_465001310 -N 1 -t 1:00:00 -p standard-g --gpus-per-node=1
    $ srun <some-command>

Exit interactive allocation with ``exit``.

Interactive terminal session on a compute node:

.. code-block:: console

    $ srun --account=project_465001310 --partition=standard-g --nodes=1 --cpus-per-task=1 --ntasks-per-node=1 --gpus-per-node=1 --time=1:00:00 --pty bash
    $ <some-command>

Corresponding batch script ``submit.sh``:

.. code-block:: bash

    #!/bin/bash -l
    #SBATCH --account=project_465001310
    #SBATCH --job-name=example-job
    #SBATCH --output=examplejob.o%j
    #SBATCH --error=examplejob.e%j
    #SBATCH --partition=standard-g
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --time=1:00:00

    srun <some_command>

- Submit the job: ``sbatch submit.sh``
- Monitor your job: ``squeue --me``
- Kill job: ``scancel <JOB_ID>``

Running Julia on LUMI
~~~~~~~~~~~~~~~~~~~~~

In order to run Julia with ``AMDGPU.jl`` on LUMI, we use the following directory
structure and assume it is our working directory.

.. code-block:: console

    .
    ├── Project.toml  # Julia environment
    ├── script.jl     # Julia script
    └── submit.sh     # Slurm batch script

An example of a ``Project.toml`` project file:

.. code-block:: console

    [deps]
    AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e"

For the ``submit.sh`` batch script, add the following content to the batch script shown
above.

.. code-block:: bash

    #SBATCH --cpus-per-task=2
    #SBATCH --mem-per-cpu=1750

    module use /appl/local/csc/modulefiles

    module load julia
    module load julia-amdgpu

    julia --project=. -e 'using Pkg; Pkg.instantiate()'
    julia --project=. script.jl

An example of the ``script.jl`` code is provided below.

.. code-block:: julia

    using AMDGPU

    A = rand(2^9, 2^9)
    A_d = ROCArray(A)
    B_d = A_d * A_d

    println("----EOF----")

Running on Google Colab
-----------------------

Google Colaboratory, commonly referred to as "Colab", is a cloud-based Jupyter notebook
environment which runs in your web browser. Using it requires login with a Google
account.

This is how you can get access to NVIDIA GPUs on Colab:

- Visit https://colab.research.google.com/ and sign in to your Google account
- In the menu in front of you, click "New notebook" in the bottom right corner
- After the notebook loads, go to the "Runtime" menu at the top and select "Change
  runtime type"
- Select "GPU" under "Hardware accelerator" and choose an available type of NVIDIA GPU
  (e.g. T4)
- Click "Save". The runtime takes a few seconds to load - you can see the status in the
  top right corner
- After the runtime has loaded, you can type ``!nvidia-smi`` to see information about
  the GPU.
- You can now write Python code that runs on GPUs through e.g. the numba library, as
  sketched below.
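
The following is a minimal sketch of such a notebook cell, shown for illustration only
(it is not one of the lesson's exercises). It assumes the ``numba`` package, which is
typically available in Colab's default environment, and the GPU runtime selected above.

.. code-block:: python

    # Element-wise vector addition with a Numba CUDA kernel (illustrative sketch)
    import numpy as np
    from numba import cuda

    @cuda.jit
    def add_kernel(x, y, out):
        i = cuda.grid(1)        # global 1D thread index
        if i < x.size:          # guard against threads beyond the array length
            out[i] = x[i] + y[i]

    n = 1_000_000
    x = np.ones(n, dtype=np.float32)
    y = 2 * np.ones(n, dtype=np.float32)
    out = np.zeros(n, dtype=np.float32)

    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block
    # NumPy arrays are copied to the GPU and back automatically at kernel launch
    add_kernel[blocks, threads_per_block](x, y, out)

    print(out[:5])              # expected: [3. 3. 3. 3. 3.]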

Access to code examples
-----------------------

Some exercises in this lesson rely on source code that you should download and modify in
your own home directory on the cluster. All code examples are available in the same
GitHub repository as this lesson itself. To download it you should use Git:

.. code-block:: console

    $ git clone https://github.com/ENCCS/gpu-programming.git
    $ cd gpu-programming/content/examples/
    $ ls

content/1-gpu-history.rst (+73, -63)

@@ -1,131 +1,141 @@

.. _gpu-history:

Why GPUs?
=========

.. questions::

    - What is Moore's law?
    - What problem do GPUs solve?

.. objectives::

    - Explain the historical development of microprocessors and how GPUs enable
      continued scaling in computational power

.. instructor-note::

    - 15 min teaching
    - 0 min exercises

Moore's law
-----------

Moore's law states that the number of transistors in a dense integrated circuit doubles
about every two years. More transistors mean smaller individual elements, so higher core
frequencies can be achieved. However, power consumption scales with the third power of
frequency, so the growth in core frequency has slowed down significantly. Higher
performance of a single node therefore has to rely on a more complicated structure, and
can still be achieved with SIMD (single instruction, multiple data), branch prediction,
etc.

.. figure:: img/history/microprocessor-trend-data.png
    :align: center

    The evolution of microprocessors. The number of transistors per chip doubles roughly
    every 2 years. However, this can no longer be exploited through higher core
    frequencies due to power consumption limits. Before 2000, the increase in
    single-core clock frequency was the major source of performance gains. The mid-2000s
    mark a transition towards multi-core processors.

Increasing performance has been sustained with two main strategies over the years:

- Increase single-processor performance;
- More recently, increase the number of physical cores.

Computing in parallel
---------------------

The underlying idea of parallel computing is to split a computational problem into
smaller subtasks. Many subtasks can then be solved *simultaneously* by multiple
processing units.

.. figure:: img/history/compp.png
    :align: center

    Computing in parallel.

How a problem is split into smaller subtasks strongly depends on the problem. There are
various paradigms and programming approaches to do this.
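
As a plain CPU-side sketch of this idea (an illustration only, not part of the lesson
material), the hypothetical snippet below splits a large sum into chunks and evaluates
the chunks simultaneously in separate Python processes:

.. code-block:: python

    # Split one large sum into independent subtasks and solve them in parallel
    from multiprocessing import Pool

    def partial_sum(bounds):
        lo, hi = bounds
        return sum(i * i for i in range(lo, hi))

    if __name__ == "__main__":
        n = 10_000_000
        n_workers = 4                       # number of processing units to use
        step = n // n_workers
        chunks = [(k * step, n if k == n_workers - 1 else (k + 1) * step)
                  for k in range(n_workers)]
        with Pool(n_workers) as pool:       # subtasks run simultaneously
            total = sum(pool.map(partial_sum, chunks))
        print(total)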

Graphics processing units
-------------------------

Graphics processing units (GPUs) have been the most common accelerators during the last
few years; the term GPU is sometimes used interchangeably with the term *accelerator*.
GPUs were initially developed for the highly parallel task of graphics processing. But
over the years, they have been used more and more in HPC.

GPUs are specialized parallel hardware for floating point operations. They are basically
co-processors (helpers) for traditional CPUs: the CPU still controls the workflow but
delegates highly parallel tasks to the GPU. GPUs are based on highly parallel
architectures, which allows taking advantage of the increasing number of transistors.

Using GPUs allows one to achieve extreme performance per node. As a result, a single
GPU-equipped workstation can outperform small CPU-based clusters for some types of
computational tasks. The drawback is that major rewrites of programs are usually
required, with an accompanying change in the programming paradigm.

.. callout:: Host vs device

    GPU-enabled systems require a heterogeneous programming model that involves both
    CPU and GPU, where the CPU and its memory are referred to as the host,
    and the GPU and its memory as the device.

.. figure:: img/history/CPU_and_GPU_separated.png
    :align: center

    Figure adapted from the Carpentry `GPU Programming lesson
    <https://carpentries-incubator.github.io/lesson-gpu-programming/>`__.
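
As a minimal sketch of the host/device model (an illustration only, assuming an NVIDIA
GPU and the ``numba`` package), the snippet below copies an array from host memory to
device memory and back:

.. code-block:: python

    # Host (CPU) and device (GPU) memories are separate; data must be moved between them
    import numpy as np
    from numba import cuda

    a_host = np.arange(10, dtype=np.float32)   # array in host memory

    a_device = cuda.to_device(a_host)          # copy host -> device
    # ... kernels launched from the host would operate on a_device here ...
    a_back = a_device.copy_to_host()           # copy device -> host

    print(np.allclose(a_host, a_back))         # True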

A look at the Top-500 list
--------------------------

The `TOP500 project <https://www.top500.org/>`__ ranks and details the 500 most powerful
non-distributed computer systems in the world. The project was started in 1993 and
publishes an updated list of the supercomputers twice a year. The snapshot below shows
the top-5 HPC systems as of June 2024, where the columns show:

- **Cores** - Number of processors
- **Rmax** - Maximal LINPACK performance achieved
- **Rpeak** - Theoretical peak performance
- **Power** - Power consumption

.. figure:: img/history/top-5.png
    :align: center

    Snapshot from the `TOP500 list from June, 2024
    <https://www.top500.org/lists/top500/2024/06/>`__.

All systems in the top-5 positions contain GPUs from AMD, Intel, or NVIDIA, except for
Fugaku, which instead relies on custom-built Arm A64FX CPUs.

Why GPUs?
---------

- **Speed**: GPU computing can significantly accelerate many types of scientific
  workloads.
- **Improved energy efficiency**: Compared to CPUs, GPUs can perform more calculations
  per watt of power consumed, which can result in significant energy savings. This is
  indeed evident from the `Green500 list
  <https://www.top500.org/lists/green500/2024/06/>`__.
- **Cost-effectiveness**: GPUs can be more cost-effective than traditional CPU-based
  systems for certain workloads.

Limitations and drawbacks
-------------------------

- **Only for certain workloads**: Not all workloads can be efficiently parallelized and
  accelerated on GPUs. Certain types of workloads, such as those with irregular data
  access patterns or high branching behavior, may not see significant performance
  improvements on GPUs.
- **Steeper learning curve**: Depending on the GPU programming API that you choose, GPU
  computing could require specialized skills in GPU programming and knowledge of GPU
  architecture, leading to a steeper learning curve compared to CPU programming.
  Fortunately, if you study this training material closely you will become productive
  with GPU programming quickly!

.. keypoints::

    - GPUs are accelerators for some types of tasks
    - Highly parallelizable compute-intensive tasks are suitable for GPUs
    - New programming skills are needed to use GPUs efficiently
