Parallel Computing#

General thoughts#

Parallel computing is a programming method that harnesses the power of multiple processors (typically CPU cores) at once.

There are many types of parallelism, some of which are (from micro to macro)

  • Instruction level parallelism (e.g. SIMD)

  • Multi-threading (shared memory)

  • Multi-processing (shared system memory)

  • Distributed processing (typically no shared memory)

Important note before we start: at the center of every efficient parallel code is a fast serial code!

Why Go Parallel?#

An interesting video on the topic: “The Future of Microprocessors”, https://www.youtube.com/watch?v=zX4ZNfvw1cw (coincidentally from JuliaCon :P)

When to Go Parallel?#

  • If parts of your (optimized!) serial code aren’t fast enough.

    • There are costs: parallelization typically increases the code complexity

  • If your system has multiple execution units (CPU threads, GPU threads, …).

    • Important on supercomputers, but also on modern desktop computers and laptops

What Do I Have?#

using Hwloc
Hwloc.num_physical_cores()
24

Note that there may be more than one CPU thread per physical CPU core (e.g. hyperthreading).

Sys.CPU_THREADS
48
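We can also check how many threads the current Julia session itself was started with (the value depends on the -t / JULIA_NUM_THREADS setting, so no output is shown here):

Threads.nthreads()  # number of Julia threads in this session (e.g. julia -t 8 gives 8)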

What Does Maxwell Have?#

The Maxwell Infrastructure page summarises the hardware:

Compute Hardware

  • CPU+GPU nodes: 798

  • Total number of cores with hyperthreading: 61696

  • Total number of PHYSICAL cores: 30898

  • Theoretical CPU peak performance: 1074 TFlops

  • Total RAM: 420 TB

  • GPU nodes: 180

  • Total number of GPUs: 379

  • Theoretical GPU peak performance: 2330 TFlops

  • Total peak performance: 3404 TFlops

Infiniband Hardware

  • root switches: 6

  • top switches: 12

  • leaf switches: 42

  • IB cables (#): >1432

  • IB cables (length): >7.6 km

Storage

  • GPFS exfel: ~40 PB

  • GPFS petra3: ~20 PB

  • BeeGFS desy: 1.5 PB

  • BeeGFS cssb: 3.2 PB

There are two main kinds of nodes on Maxwell:

CPU             CPUs   Cores   HT Cores
Intel E5-2698   2x     ~20     ~160
AMD EPYC 7542   2x     64      256

Note that:

  • There are a few different types of Intel CPUs, with between 18 and 20 cores per CPU

  • Hyperthreaded cores for the EPYC nodes: 2 (physical CPUs) * 64 (cores/CPU) * 2 (threads/core) = 256 HT cores; similar for Intel

Even if you only use a single node, you have access to 128 CPU cores (64 per CPU). Hence, if you used only a single core, the node utilization would be below 1% (1/128 ≈ 0.8%).

Amdahl’s Law#

Naive strong scaling expectation: I have 4 cores, so give me my 4x speedup! However, that is generally not the case:

The overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used

More formally:

If \(p\) is the fraction of a code that can be parallelized, then the maximal theoretical speedup from parallelizing it on \(n\) cores is given by

\[ F(n) = \frac{1}{(1-p) + p/n} \]
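For example, with \(p = 95\%\) and \(n = 16\) cores the speedup is at most \(F(16) = 1/(0.05 + 0.95/16) \approx 9.1\), and no number of cores can push it beyond \(\lim_{n \to \infty} F(n) = 1/(1-p) = 20\).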

using Plots
F(p, n) = 1 / (1 - p + p / n)  # Amdahl's law: maximal speedup for parallel fraction p on n cores

pl = plot()
for p in reverse(sort(vcat(0.2:0.2:1, [0.9, 0.95])))
    plot!(pl, n -> F(p, n), 1:16, lab="$(Int(p*100))%", lw=2,
        legend=:topleft, xlab="number of cores", ylab="parallel speedup", frame=:box)
end
pl
[Figure: parallel speedup F(p, n) over the number of cores (1 to 16), one curve per parallel fraction p]

Parallel Computing in Julia#

Julia provides support for all types of parallelism mentioned above (in the same order): @simd (and packages such as SIMD.jl) for instruction-level parallelism, Base.Threads for multi-threading, the Distributed standard library for multi-processing, and Distributed or MPI.jl for distributed processing across machines.
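As a quick taste of the first two levels, here is a minimal sketch (the function names are our own, and multi-threading only pays off if Julia was started with more than one thread, e.g. julia -t 4):

using Base.Threads

# Instruction-level parallelism: @simd permits the compiler to vectorize the loop
function simd_sum(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end

# Multi-threading: loop iterations are distributed over the available Julia threads
function threaded_square!(y, x)
    @threads for i in eachindex(x, y)
        @inbounds y[i] = x[i]^2
    end
    return y
end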

With supercomputing in mind, we will start by focusing on multi-process parallelism which allows us to utilize multiple cores on the same or different nodes/machines (distributed computing).

But before we do, it’s instructive to take a closer look at tasks.

Tasks#

By default, Julia waits for every command to finish (“blocking”) and runs everything sequentially.

Tasks are a control flow feature that allows computations to be suspended and resumed in a flexible manner to implement cooperative multitasking. (This feature is sometimes called by other names, such as coroutines, green-, or lightweight threads.)

Tasks are managed by Julia and can be run in a concurrent fashion.

Concurrency means executing multiple tasks at the same time but not necessarily simultaneously.

An important use case is asynchronous I/O: since I/O is typically slow, other work can proceed while waiting for it. Examples are:

  • multiple user inputs (why not already process some of the input?)

  • data dumping to disk (Maybe it’s possible to continue a calculation?)

  • receiving calculations from worker processes

@async and @sync#

We can create and schedule a task for asynchronous execution with the @async macro.

What this means is that, for whatever falls into its scope, Julia will start a task and then proceed to whatever comes next in the script without waiting for the task to complete (“non-blocking”).

@time sleep(2);
  2.000862 seconds (66 allocations: 1.688 KiB)
@time @async sleep(2)
  0.012212 seconds (12.74 k allocations: 724.993 KiB, 53.87% compilation time)
Task (runnable) @0x00007f8ec4db12c0

Julia allows the script to proceed (and the @time macro to fully execute) without waiting for the task (in this case, sleeping for two seconds) to complete.

We can use the partner macro @sync to synchronize, that is, to wait for all encapsulated tasks (see ?@sync).

@time @sync @async sleep(2)
  2.018049 seconds (811 allocations: 46.063 KiB, 0.89% compilation time)
Task (done) @0x00007f8eafe41430

Of course, here it doesn’t make much sense to write @sync @async - we could simply drop it altogether. A better example is the following.

@time @sync begin
    @async sleep(2.0)
    @async sleep(2.0)
end
  2.005250 seconds (1.03 k allocations: 63.416 KiB, 0.21% compilation time)
Task (done) @0x00007f8eafe41e40
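Note that @sync also waits for tasks spawned inside a loop, so the following sketch should take about one second in total rather than four (output omitted; timings will vary):

@sync for i in 1:4
    @async sleep(1.0)
end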
A = rand(1000, 1000)
B = rand(1000, 1000)

@time t = @async A * B;
  0.000065 seconds (32 allocations: 2.148 KiB)
@time A * B;
  0.012130 seconds (2 allocations: 7.629 MiB)
wait(t)   # block until the task has finished
fetch(t)  # wait for the task and return its result
1000×1000 Matrix{Float64}:
 251.227  260.736  259.512  255.118  …  265.802  248.578  254.68   256.749
 239.056  249.929  245.461  245.267     246.204  235.374  238.392  246.51
 250.156  256.235  262.714  254.409     259.868  243.649  247.781  257.38
 238.703  249.144  251.91   247.845     252.284  233.832  236.001  251.144
 247.563  252.364  256.756  252.737     248.212  237.178  238.494  247.611
 251.03   262.196  263.404  258.77   …  258.135  245.969  250.906  262.044
 245.881  247.229  249.466  254.335     251.519  241.021  240.248  251.509
 239.177  254.725  251.597  247.423     254.867  239.108  243.191  247.311
 242.562  251.673  254.265  248.943     254.717  236.701  240.407  252.59
 237.121  249.528  252.028  247.476     248.637  238.063  239.744  244.09
 250.735  253.694  259.403  252.126  …  263.0    236.995  249.058  255.549
 241.743  245.02   251.59   245.144     250.504  241.986  242.218  254.669
 242.47   252.945  251.416  247.651     248.537  240.868  240.528  256.444
   ⋮                                 ⋱                             
 245.338  256.465  258.023  253.54      255.232  239.656  246.433  249.67
 243.681  253.031  256.69   255.451     255.013  239.314  242.998  253.036
 242.115  249.416  254.989  253.077  …  250.37   239.385  243.714  251.449
 245.733  257.466  260.173  254.909     262.988  241.957  245.714  256.698
 245.562  257.363  255.926  253.865     255.307  243.291  244.269  250.139
 255.507  261.683  262.275  259.922     266.299  246.437  252.156  258.814
 239.892  254.217  255.454  249.503     253.837  238.121  237.577  252.802
 245.049  252.732  256.281  249.84   …  255.621  241.694  248.631  250.756
 249.55   260.653  261.914  249.54      258.547  243.586  248.722  259.854
 247.194  261.258  258.801  252.731     256.4    245.783  249.013  254.395
 245.415  251.312  254.345  250.555     255.92   245.139  238.228  250.447
 252.399  259.1    263.069  256.273     257.411  239.395  241.122  257.062
function io_bound_task()
    sleep(5.0)
    return true
end
io_bound_task (generic function with 1 method)
@time my_io_bound_task = @async io_bound_task()

@time fetch(my_io_bound_task)
  0.000039 seconds (30 allocations: 2.070 KiB)
  4.986094 seconds (254 allocations: 13.031 KiB)
true
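Note the contrast with a CPU-bound task: tasks provide concurrency on a single thread, not parallelism, so wrapping a pure computation in @async gains nothing. A minimal sketch (cpu_bound_task is our own example name):

function cpu_bound_task()
    s = 0.0
    for i in 1:100_000_000  # pure computation, never yields to other tasks
        s += sin(i)
    end
    return s
end

t = @async cpu_bound_task()  # returns immediately, but...
fetch(t)                     # ...fetching still costs the full computation time

For true parallelism of compute-bound work we need multiple threads (e.g. Threads.@spawn) or multiple processes, which is where we are headed next.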