
Commit 954f7ce

Merge pull request #95 from MonashDeepNeuron/dev
Dev
2 parents b5fec9b + 5bc81b6 commit 954f7ce

7 files changed: +50 −44 lines changed

src/intro-to-parallel-comp/challenges.md (+1 −1)
@@ -4,7 +4,7 @@

 Make sure to clone a copy of **your** challenges repo onto M3, ideally in a personal folder on vf38_scratch.

-> Note: For every challenge you will be running the programs as SLURM jobs. This is so we don't overload the login nodes. A template [SLURM job script](./job.slurm) is provided at the root of this directory which you can use to submit your own jobs to SLURM by copying it to each challenges sub-directory and filling in the missing details. You may need more than one for some challenges. This template will put the would-be-printed output in a file named `slurm-<job-name>.out`.
+> Note: For every challenge you will be running the programs as SLURM jobs. This is so we don't overload the login nodes. A template [SLURM job script](https://github.com/MonashDeepNeuron/HPC-Training-Challenges/blob/main/challenges/distributed-computing/job.slurm) is provided at the root of this directory which you can use to submit your own jobs to SLURM by copying it to each challenges sub-directory and filling in the missing details. You may need more than one for some challenges. This template will put the would-be-printed output in a file named `slurm-<job-name>.out`.

 ## Task 1 - Single Cluster Job using OpenMP


src/intro-to-parallel-comp/intro-to-parallel-comp.md (+1 −1)
@@ -4,6 +4,6 @@ As introduced in chapter 5, parallel computing is all about running instructions

 ![query-parallelism](./imgs/query-parallelism.png)

-In this context, you can consider a query to be a job that carries out a series of steps on a particular dataset in order to achieve something e.g. a SORT query on a table. It's fairly straightforward to execute multiple queries at the same time using a parallel/distributed system but what if we want to parallelise and speed up the individual operations within a query?
+In this context, you can consider a query to be a job that carries out a series of steps on a particular input in order to achieve something, e.g. a SORT query on a table. It's fairly straightforward to execute multiple queries at the same time using a parallel/distributed system, but what if we want to parallelise and speed up the individual operations within a query?

 This is where things like synchronisation, data/workload distribution and aggregation need to be considered. In this chapter we will provide some theoretical context before learning how to implement parallelism using OpenMP & MPI.

src/intro-to-parallel-comp/locks.md (+2 −2)
@@ -1,13 +1,13 @@
 # Locks

 Earlier, we learnt about how to write concurrent programs, as well as a few constructs to achieve **synchronisation** in OpenMP. We know that:
-- `reduction construct` partitions shared data and uses barrier to achieve synchronisation
+- `reduction construct` partitions shared data and uses a barrier to achieve synchronisation
 - `atomic construct` utilises hardware ability to achieve thread-safe small memory read/write operations.

 What about `critical construct`? We said that it uses locks, but what are locks?

 > Note that the direct use of locks is **not recommended** (at least in OpenMP):
-> - It is very easy to cause deadlock or hard-to-debug livelock (more on these at the end of this sub-chapter).
+> - It is very easy to cause a deadlock or hard-to-debug livelock (more on these at the end of this sub-chapter).
 > - It can often cause very poor performance or worse.
 > - It generally indicates that the program design is wrong.
 >
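
For context on what the lock API being discussed here looks like in practice, below is a minimal, illustrative OpenMP sketch that protects a shared counter with an explicit `omp_lock_t`. The counter variable and the thread count of 4 are arbitrary choices for the example, not anything from the repo.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_lock_t lock;
    omp_init_lock(&lock);      // a lock must be initialised before use

    int counter = 0;           // shared data protected by the lock

    #pragma omp parallel num_threads(4)
    {
        omp_set_lock(&lock);   // only one thread may hold the lock at a time
        counter++;             // critical section
        omp_unset_lock(&lock); // release so other threads can enter
    }

    omp_destroy_lock(&lock);   // free the lock's resources
    printf("counter = %d\n", counter);
    return 0;
}
```

In most cases the `critical` or `atomic` constructs covered earlier achieve the same effect with far less room for deadlock.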

src/intro-to-parallel-comp/message-passing.md (+11 −6)
@@ -1,8 +1,8 @@
 # Message Passing

-As each processor has its own local memory with its own address space in distributed computing, we need a way to communicate between the processes and share data. Message passing is the mechanism of exchanging data across processes. Each process can communicate with one or more other processes by sending messages over a network.
+As each processor has its own local memory with its own address space in distributed computing, we need a way to implement communication between the distributed processes and allow data sharing. Message passing is the mechanism of exchanging data between processes. Each process can communicate with one or more other processes by sending messages over a network.

-The MPI (message passing interface) in OpenMPI is a communication protocol standard defining message passing between processors in distributed environments and are implemented by different groups with the main goals being high performance, scalability, and portability.
+The MPI (message passing interface) in OpenMPI is a communication protocol standard defining message passing between processors in distributed environments. The main goals of this protocol standard are high performance, scalability, and portability.

 OpenMPI is one implementation of the MPI standard. It consists of a set of headers and library functions that you call from your program, i.e. C, C++, Fortran etc.

@@ -125,9 +125,13 @@ int MPI_Finalize(void);

 ```

-Use man pages to find out more about each routine
+Terminology:
+- **World Size**: The total no. of processes involved in your distributed computing job.
+- **Rank**: A unique ID for a particular process.

-When sending a Process it packs up all of its necessary data into a buffer for the receiving process. These buffers are often referred to as envelopes since the data is being packed into a single message before transmission (similar to how letters are packed into envelopes before transmission to the post office)
+> Use OpenMPI man pages to find out more about each routine
+
+When sending data to a process, it packs up all of its necessary data into a buffer for the receiving process. These buffers are often referred to as envelopes since the data is being packed into a single message before transmission (similar to how letters are packed into envelopes before transmission to the post office)

 ### Elementary MPI Data types

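To make the World Size and Rank terminology concrete, here is a minimal, illustrative MPI sketch. It assumes the standard OpenMPI toolchain (`mpicc` to compile, `mpirun` to launch), and the payload value is arbitrary; it simply queries both values and passes one message from rank 0 to rank 1.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                      // start the MPI runtime

    int world_size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);  // total number of processes in this job
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);        // unique ID of this process

    if (world_size >= 2) {
        int payload = 42;                        // illustrative data only
        if (rank == 0) {
            // Rank 0 packs its data into a message (the "envelope") and sends it to rank 1
            MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Rank 1 of %d received %d from rank 0\n", world_size, payload);
        }
    }

    MPI_Finalize();                              // shut down the MPI runtime
    return 0;
}
```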
@@ -257,8 +261,9 @@ The command top or htop looks into a process. As you can see from the image belo
 - The command ```time``` checks the overall performance of the code
 - By running this command, you get real time, user time and system time.
 - Real is wall clock time - time from start to finish of the call. This includes the time of overhead
-- User is the amount of CPU time spent outside the kernel within the process
-- Sys is the amount of CPU time spent in the kernel within the process.
+- User is the amount of CPU time spent outside the kernel within the process
+- Sys is the amount of CPU time spent in the kernel within the process.
 - User time +Sys time will tell you how much actual CPU time your process used.

+
 ![time](imgs/time.png)

src/intro-to-parallel-comp/multithreading.md (+15 −21)
@@ -1,32 +1,32 @@
 # Multithreading

-We have all looked at the theory of threads and concurrent programming in the Operating System chapter. Now, we will shift our focus to OpenMP and its application for executing multithreaded operations in a declarative programming style.
+Hopefully by now you are all familiar with multi-threading and how parallel computing works. We'll now go through how to implement parallel computing using OpenMP in order to speed up the execution of our C programs.

 ## OpenMP

-OpenMP is an Application Program Interface (API) that is used to explicitly direct multi-threaded, shared memory parallelism in C/C++ programs. It is not intrusive on the original serial code in that the OpenMP instructions are made in pragmas interpreted by the compiler.
-
-> Further features of OpenMP will be introduced in conjunction with the concepts discussed in later sub-chapters.
+OpenMP is an Application Program Interface (API) that is used to implement multi-threaded, shared memory parallelism in C/C++ programs. It's designed to be a very minimal add-on to serial C code when it comes to implementation. All you have to do is use the `#pragma` (C preprocessor directive) mechanism to wrap the parallel regions of your code.

 ### Fork-Join Parallel Execution Model

-OpenMP uses the `fork-join model` of parallel execution.
+OpenMP uses the *fork-join model* of parallel execution.

-* **FORK**: All OpenMP programs begin with a `single master thread` which executes sequentially until a `parallel region` is encountered, when it creates a team of parallel threads.
+* **FORK**: All OpenMP programs begin with a *single master thread* which executes sequentially until a *parallel region* is encountered. After that, it spawns a *team of threads* to carry out the multi-threaded parallel computing.

-The OpenMP runtime library maintains a pool of threads that can be added to the threads team in parallel regions. When a thread encounters a parallel construct and needs to create a team of more than one thread, the thread will check the pool and grab idle threads from the pool, making them part of the team.
+The OpenMP runtime library maintains a pool of potential OS threads that can be added to the thread team during parallel region execution. When a thread encounters a parallel construct (pragma directive) and needs to create a team of more than one thread, the thread will check the pool and grab idle threads from the pool, making them part of the team.

-* **JOIN**: Once the team threads complete the parallel region, they `synchronise` and return to the pool, leaving only the master thread that executes sequentially.
+This speeds up the process of thread spawning by using a *warm start* mechanism to minimise the overhead associated with the kernel scheduler context switching needed to conduct thread spawning.

-![Fork - Join Model](./imgs/fork-join.png)
+> If you're unclear how kernel scheduler context switching works, revisit the operating systems chapter and explore/look up the topics introduced there.
+
+* **JOIN**: Once the team of threads complete the parallel region, they **synchronise** and return to the pool, leaving only the master thread that executes sequentially.

-> We will look a bit more into what is synchronisation as well as synchronisation techniques in the next sub-chapter.
+![Fork - Join Model](./imgs/fork-join.png)

 ### Imperative vs Declarative

-Imperative programming specifies and directs the control flow of the program. On the other hand, declarative programming specifies the expected result and core logic without directing the program's control flow.
+Imperative programming specifies and directs the control flow of the program. On the other hand, declarative programming specifies the expected result and core logic without directing the program's control flow, i.e. you tell the computer what to do instead of *how to do it*.

-OpenMP follows a declarative programming style. Instead of manually creating, managing, synchronizing, and terminating threads, we can achieve the desired outcome by simply declaring it using pragma.
+OpenMP follows a declarative programming style. Instead of manually creating, managing, synchronising, and terminating threads, we can achieve the desired outcome by simply declaring pragma directives in our code.

 ![Structure Overview](./imgs/program-structure.png)

@@ -52,24 +52,22 @@ int main() {

 ## Running on M3

-Here is a template script provided in the home directory in M3. Notice that we can dynamically change the number of threads using `export OMP_NUM_THREADS=12`
+Here is a template script provided in the home directory in M3. Notice that we can dynamically change the number of threads using `export OMP_NUM_THREADS=12`.
+
+> The `export` statement is a bash command you can type into a WSL/Linux terminal. It allows you to set environment variables in order to manage runtime configuration.

 ```bash
 #!/bin/bash
 # Usage: sbatch slurm-openmp-job-script
 # Prepared By: Kai Xi, Apr 2015
-
-
 # NOTE: To activate a SLURM option, remove the whitespace between the '#' and 'SBATCH'

 # To give your job a name, replace "MyJob" with an appropriate name
 # SBATCH --job-name=MyJob

-
 # To set a project account for credit charging,
 # SBATCH --account=pmosp

-
 # Request CPU resource for a openmp job, suppose it is a 12-thread job
 # SBATCH --ntasks=1
 # SBATCH --ntasks-per-node=1
@@ -81,24 +79,20 @@ Here is a template script provided in the home directory in M3. Notice that we c
 # Set your minimum acceptable walltime, format: day-hours:minutes:seconds
 # SBATCH --time=0-06:00:00

-
 # To receive an email when job completes or fails
 # SBATCH --mail-user=<You Email Address>
 # SBATCH --mail-type=END
 # SBATCH --mail-type=FAIL

-
 # Set the file for output (stdout)
 # SBATCH --output=MyJob-%j.out

 # Set the file for error log (stderr)
 # SBATCH --error=MyJob-%j.err

-
 # Use reserved node to run job when a node reservation is made for you already
 # SBATCH --reservation=reservation_name

-
 # Command to run a openmp job
 # Set OMP_NUM_THREADS to the same value as: --cpus-per-task=12
 export OMP_NUM_THREADS=12
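
As a quick illustration of the `#pragma` mechanism and fork-join model described in this file, here is a minimal OpenMP sketch. It is not taken from the repo; the team size simply follows whatever `OMP_NUM_THREADS` is set to, e.g. the `export OMP_NUM_THREADS=12` used in the script above.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    // FORK: the master thread spawns a team of threads for this parallel region
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();        // unique ID of the current thread
        int nthreads = omp_get_num_threads();  // size of the thread team
        printf("Hello from thread %d of %d\n", tid, nthreads);
    }
    // JOIN: the team synchronises here and only the master thread continues
    return 0;
}
```

Compile with `gcc -fopenmp` and submit it with the SLURM script above to run it on M3.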

src/intro-to-parallel-comp/parallel-algos.md (+1 −1)
@@ -37,4 +37,4 @@ This is how one of the simplest parallel algorithms - **parallel sum** works. Al

 ![parallel-sum](./imgs/parallel-sum-diagram.png)

-Besides the difference between serial & parallel regions another important concept to note here is **partitioning** aka. chunking. Often when you're parallelising your serial algorithm you will have to define local, parallel tasks that will execute on different parts of your dataset simultaneously in order to achieve a speedup. This can be anything from a sum operation in this case, to a local/serial sort or even as complex as the training of a CNN model on a particular batch of images.
+Besides the difference between serial & parallel regions another important concept to note here is **partitioning** aka. chunking. Often when you're parallelising your serial algorithm you will have to define local, parallel tasks that will execute on different parts of your input simultaneously in order to achieve a speedup. This can be anything from a sum operation in this case, to a local/serial sort or even as complex as the training of a CNN model on a particular batch of images.
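
A minimal sketch of how this parallel sum could be written with OpenMP's `reduction` clause; the array contents and size here are purely illustrative. OpenMP partitions the loop iterations into chunks, each thread sums its own chunk in a private variable, and the partial sums are combined when the threads join.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double data[N];
    for (int i = 0; i < N; i++) data[i] = 1.0;   // illustrative input

    double sum = 0.0;

    // Each thread sums its own partition of the array,
    // then the partial sums are reduced into `sum` at the end of the region.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += data[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}
```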
