84 changes: 42 additions & 42 deletions content/1.01_GPUIntroduction.rst
@@ -6,73 +6,73 @@ Introduction to GPU
Moore's law
-----------

The number of transistors in a dense integrated circuit doubles about every two years.
More transistors means smaller size of a single element, so higher core frequency can be achieved.
However, power consumption scales as frequency in third power, so the growth in the core frequency has slowed down significantly.
Higher performance of a single node has to rely on its more complicated structure and still can be achieved with SIMD, branch prediction, etc.
The number of transistors in a dense integrated circuit doubles every two years.
More transistors mean a smaller size of a single element so that a higher core frequency can be achieved.
However, power consumption scales as the cube of the frequency, so the growth in the core frequency has slowed down significantly.
Higher performance of a single node therefore has to rely on a more complicated structure and can still be achieved with SIMD, branch prediction, etc.

.. figure:: Figures/Introduction/microprocessor-trend-data.png
:align: center

The evolution of microprocessors.
The number of transistors per chip increase every 2 years or so.
However it can no longer be explored by the core frequency due to power consumption limits.
Before 2000, the increase in the single core clock frequency was the major source of the increase in the performance.
Mid 2000 mark a transition towards multi-core processors.
The number of transistors per chip increases every two years or so.
However, this can no longer be exploited through a higher core frequency due to power consumption limits.
Before 2000, the increase in the single-core clock frequency was the major source of the performance increase.
The mid-2000s marked a transition towards multi-core processors.

Achieving performance has been based on two main strategies over the years:

- Increase the single processor performance:
- Increase the single processor performance.

- More recently, increase the number of physical cores.
- More recently, increase the number of physical cores.

Graphics processing units
-------------------------

The Graphics processing units (GPU) have been the most common accelerators during the last few years, the term GPU sometimes is used interchangeably with the term accelerator.
GPUs were initially developed for highly-parallel task of graphic processing.
Graphics processing units (GPUs) have been the most common accelerators during the last few years; the term GPU is sometimes used interchangeably with the term accelerator.
GPUs were initially developed for the highly parallel task of graphics processing.
Over the years, they have been used more and more in HPC.
GPUs are a specialized parallel hardware for floating point operations.
GPUs are co-processors for traditional CPUs: CPU still controls the work flow, delegating highly-parallel tasks to the GPU.
Based on highly parallel architectures, which allows to take advantage of the increasing number of transistors.
GPUs are specialized parallel hardware for floating point operations.
GPUs are co-processors for traditional CPUs: the CPU still controls the workflow, delegating highly parallel tasks to the GPU.
They are based on highly parallel architectures, which allow them to take advantage of the increasing number of transistors.

Using GPUs allows one to achieve very high performance per node.
As a result, the single GPU-equipped workstation can outperform small CPU-based cluster for some type of computational tasks.
The drawback is: usually major rewrites of programs is required.
As a result, a single GPU-equipped workstation can outperform a small CPU-based cluster for some types of computational tasks.
The drawback is that major rewrites of programs are usually required.

.. figure:: Figures/CUDA/CPUAndGPU.png
:align: center

A comparison of the CPU and GPU architecture.
CPU (left) has complex core structure and pack several cores on a single chip.
The CPU (left) has a complex core structure and packs several cores on a single chip.
GPU cores are very simple in comparison; they also share data and control between each other.
This allows to pack more cores on a single chip, thus achieving very high compute density.
This allows packing more cores on a single chip, thus achieving a very high compute density.

One of the most important features that allows the accelerators to reach this high performance is their scalability.
Computational cores on accelerators are usually grouped into multiprocessors.
The multiprocessors share the data and logical elements.
This alows to achieve a very high density of a compute elements on a GPU.
This allows us to achieve a very high density of compute elements on a GPU.
This also allows for better scaling: more multiprocessors mean more raw performance, which is easy to achieve as more transistors become available.


An accelerator is a separate circuit board with its own processor, memory, power management, etc.
It is connected to the motherboard with CPUs via PCIe bus.
Having its own memory means that the data has to be copied to and from it.
It is connected to the motherboard with CPUs via a PCIe bus.
Having its own memory means that data has to be copied to and from it.
The CPU acts as the main processor, controlling the execution workflow.
It copies the data from its own memory to the GPU memory, executes the program and copies the results back.
GPUs runs tens of thousands of threads simultaneously on thousands of cores and does not do much of the data management.
It copies the data from its own memory to the GPU memory, executes the program, and copies the results back.
GPUs run tens of thousands of threads simultaneously on thousands of cores and do not do much of the data management.
With many cores trying to access the memory simultaneously and with little cache available, the accelerator can run out of memory very quickly.
This makes the data management and its access pattern is essential on the GPU.
This makes data management and memory access patterns essential on the GPU.
Accelerators like to be oversubscribed with threads, because they can switch between threads very quickly.
This allows to hide the memory operations: while some threads wait, others can compute.
This allows us to hide the memory operations: while some threads wait, others can compute.
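
A minimal sketch of this workflow in CUDA is shown below; the ``scale`` kernel, the array size, and the launch configuration are illustrative choices rather than part of the lesson. The host allocates GPU memory, copies the input over, launches the kernel, and copies the result back.

.. code-block:: cuda

   #include <cuda_runtime.h>
   #include <stdlib.h>

   // Each thread scales one element of the array.
   __global__ void scale(float *data, float factor, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
           data[i] *= factor;
   }

   int main()
   {
       const int n = 1 << 20;
       const size_t size = n * sizeof(float);

       float *h_data = (float *)malloc(size);   // host (CPU) memory
       for (int i = 0; i < n; i++)
           h_data[i] = 1.0f;

       float *d_data;                           // device (GPU) memory
       cudaMalloc(&d_data, size);
       cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);   // CPU -> GPU

       const int threads = 256;
       const int blocks = (n + threads - 1) / threads;
       scale<<<blocks, threads>>>(d_data, 2.0f, n);                // run on the GPU

       cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);   // GPU -> CPU

       cudaFree(d_data);
       free(h_data);
       return 0;
   }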


Exposing parallelism
--------------------

The are two types of parallelism tha can be explored.
The data parallelism is when the data can be distributed across computational units that can run in parallel.
They than process the data applying the same or very simular operation to diffenet data elements.
There are two types of parallelism that can be exploited.
Data parallelism is when the data can be distributed across computational units that can run in parallel.
They then process the data, applying the same or a very similar operation to different data elements.
A common example is applying a blur filter to an image --- the same function is applied to all the pixels on the image.
This parallelism is natural for the GPU, where the same instruction set is executed in multiple threads.

@@ -81,20 +81,20 @@ This parallelism is natural for the GPU, where the same instruction set is execu
:scale: 40 %

Data parallelism and task parallelism.
The data parallelism is when the same operation applies to multiple data (e.g. multiple elements of an array are transformed).
The task parallelism implies that there are more than one independent task that, in principle, can be executed in parallel.
Data parallelism is when the same operation is applied to multiple data elements (e.g. multiple elements of an array are transformed).
Task parallelism implies that there is more than one independent task that, in principle, can be executed in parallel.

Data parallelism can usually be exploited by GPUs quite easily.
The most basic approach would be finding a loop over many data elements and converting it into a GPU kernel.
If the number of elements in the data set if fairly large (tens or hundred of thousands elements), the GPU should perform quite well.
If the number of elements in the data set is fairly large (tens or hundreds of thousands of elements), the GPU should perform quite well.
Although it would be odd to expect absolute maximum performance from such a naive approach, it is often the one to take.
Getting absolute maximum out of the data parallelism requires good understanding of how GPU works.
Getting the absolute maximum out of data parallelism requires a good understanding of how the GPU works.
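
As a sketch of such a loop-to-kernel conversion, the common SAXPY operation is used below as an illustrative example; the device arrays ``d_x`` and ``d_y`` are assumed to already reside in GPU memory.

.. code-block:: cuda

   // Serial CPU version: a loop over all data elements.
   //   for (int i = 0; i < n; i++)
   //       y[i] = a * x[i] + y[i];

   // The same computation as a CUDA kernel: the loop index becomes
   // the global thread index, with one thread per element.
   __global__ void saxpy(int n, float a, const float *x, float *y)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)                    // guard: the grid may launch extra threads
           y[i] = a * x[i] + y[i];
   }

   // Launched with enough blocks to cover all n elements, e.g.
   //   saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);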


Another type of parallelism is a task parallelism.
This is when an application consists of more than one task that requiring to perform different operations with (the same or) different data.
Another type of parallelism is task parallelism.
This is when an application consists of more than one task, each requiring different operations with (the same or) different data.
An example of task parallelism is cooking: slicing vegetables and grilling are very different tasks and can be done at the same time.
Note that the tasks can consume totally different resources, which also can be explored.
Note that the tasks can consume different resources, which can also be exploited.

Using GPUs
----------
@@ -105,13 +105,13 @@ From less to more difficult:

2. Use accelerated libraries

3. Directive based methods
3. Directive-based methods

- OpenMP

- OpenACC

4. Use lower level language
4. Use a lower-level language

- **CUDA**

@@ -127,10 +127,10 @@ Summary

- GPUs are highly parallel devices that can execute certain parts of the program in many parallel threads.

- In order to use the GPU efficiency, one has to split their task in many sub-tasks that can run simultaneously.
- To use the GPU efficiently, one has to split the task into many sub-tasks that can run simultaneously.

- Running your application asynchronously allows to overlap different tasks, including data transfers, GPU and CPU compute kernel.
- Running your application asynchronously allows you to overlap different tasks, including data transfers and GPU and CPU compute kernels (see the sketch after this list).

- Language extensions, such as CUDA, HIP, can give more performance, but harder to use.
- Language extensions, such as CUDA and HIP, can give more performance, but are harder to use.

- Directive based methods are easy to implement, but can not leverage all the GPU capabilities.
- Directive-based methods are easy to implement, but cannot leverage all the GPU capabilities.
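
A minimal sketch of such an overlap using two CUDA streams is given below; the ``scale`` kernel, the problem size, and the two-way split of the data are illustrative assumptions. Each stream copies its half of the data to the GPU, processes it, and copies it back, so the transfers of one half can overlap with the compute on the other, while the CPU stays free until the final synchronization.

.. code-block:: cuda

   #include <cuda_runtime.h>

   __global__ void scale(float *x, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
           x[i] *= 2.0f;
   }

   int main()
   {
       const int n = 1 << 22, half = n / 2;
       const size_t bytes = half * sizeof(float);

       float *h, *d;
       cudaMallocHost(&h, n * sizeof(float));   // pinned host memory: needed for async copies
       cudaMalloc(&d, n * sizeof(float));
       for (int i = 0; i < n; i++)
           h[i] = 1.0f;

       cudaStream_t s[2];
       for (int k = 0; k < 2; k++)
           cudaStreamCreate(&s[k]);

       // Each stream handles one half of the data: copy in, compute, copy out.
       for (int k = 0; k < 2; k++) {
           float *hp = h + k * half;
           float *dp = d + k * half;
           cudaMemcpyAsync(dp, hp, bytes, cudaMemcpyHostToDevice, s[k]);
           scale<<<(half + 255) / 256, 256, 0, s[k]>>>(dp, half);
           cudaMemcpyAsync(hp, dp, bytes, cudaMemcpyDeviceToHost, s[k]);
       }

       // The CPU could do its own work here before waiting for the GPU.
       cudaDeviceSynchronize();

       for (int k = 0; k < 2; k++)
           cudaStreamDestroy(s[k]);
       cudaFree(d);
       cudaFreeHost(h);
       return 0;
   }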
22 changes: 11 additions & 11 deletions content/index.rst
@@ -38,9 +38,9 @@ Intro
Who is the course for?
----------------------

This course is for students, researchers, engineers and programmers who would like to learn GPU programming with CUDA.
Some previous experience with C/C++ is required, no prior knowledge of CUDA is needed.
This course is for students, researchers, engineers, and programmers who would like to learn GPU programming with CUDA.
Some previous experience with C/C++ is required; no prior knowledge of CUDA is needed.

Tentative schedule
------------------

@@ -99,16 +99,16 @@ Tentative schedule
About the course
----------------

These course materials are developed for those who wants to leark GPU programming with CUDA from the beginning.
The course consists of lectures, type-along and hands-on sessions.
These course materials are developed for those who want to learn GPU programming with CUDA from the beginning.
The course consists of lectures, type-along, and hands-on sessions.

During the first day, we will cover the architecture of the GPU accelerators, basic usage of CUDA, and how to control data movement between CPUs and GPUs.
The second day focuses on more advanced topics, such as how to optimize computational kernels for efficient execution on GPU hardware and how to explore the task-based parallelism using streams and events.
We will also briefly go through profiling tools that can help one to identify the computational bottleneck of the applications.
During the first day, we will cover the architecture of the GPU accelerators, the basic usage of CUDA, and how to control data movement between CPUs and GPUs.
The second day focuses on more advanced topics, such as how to optimize computational kernels for efficient execution on GPU hardware and how to explore task-based parallelism using streams and events.
We will also briefly go through profiling tools that can help one identify the computational bottlenecks of applications.

After the course the participants should have the basic skills needed for using CUDA in new or existing applications.
After the course, the participants should have the basic skills needed for using CUDA in new or existing applications.

The participants are assumed to have knowledge of C programming language.
The participants are assumed to know the C programming language.
Since participants will be using HPC clusters to run the examples, fluent operation in a Linux/Unix environment is assumed.


@@ -120,7 +120,7 @@ See also
Credits
-------

The lesson file structure and browsing layout is inspired by and derived from
The lesson file structure and browsing layout are inspired by and derived from
`work <https://github.com/coderefinery/sphinx-lesson>`_ by `CodeRefinery
<https://coderefinery.org/>`_ licensed under the `MIT license
<http://opensource.org/licenses/mit-license.html>`_. We have copied and adapted