Parallel patterns

Design pattern / Programming pattern / Skeleton

General reusable solution to commonly recurring problems in software design / algorithm design.

Parallel patterns

Design patterns with good parallel implementations.


“Automatic parallelization” won’t work because the concepts in the parallel solutions are not always found in the sequential solutions.

Parallelization is problem and architecture dependent. There is no general strategy for parallelizing an algorithm.

Memory model

= When do data written by one processor become “visible” to other processors?

Different memory models for each implementation:

Shared memory programming model

implicit and explicit synchronization

UMA vs. NUMA - memory cost model and access time

Distributed memory programming model

Assumption: Data is distributed to processors in blocks before the processes start locally.

Cost of redistribution, communication, synchronization

Not all problems can be parallelized.

e.g. graph search problems (DFS, BFS, ...) are extremely difficult to parallelize.

Methodology to parallelize computations: the “4 steps of Foster” (PCAM)

  1. Partitioning

    Dividing the computation into independent tasks

  2. Communication

    Determining the communication between tasks

  3. Agglomeration/aggregation

    Combining tasks and communication into larger independent “chunks”

  4. Mapping

    Assigning tasks and communication to processes, threads, ...

Data distribution

Data distributions (in parallel programming models)

  • block-wise: block j contains elements [j·n/p, j·n/p + 1, ..., (j+1)·n/p − 1]

  • cyclic: round-robin distribution, element i is in block i % p

  • block-cyclic: blocks of size k distributed cyclically

  • irregular: blocks of possibly different sizes → mapping from global to block index needed

1) Block-wise

“Owner computes rule”: block j is associated with / owned by processor j.

Possibly better performance for NUMA and distributed memory systems (spatial locality).

2) Cyclic

Owner of element i can be determined in O(1) time and space.

3) Block-cyclic

Round-robin distribution of blocks of k elements.

Owner of element i can be determined in O(1) time and space.

4) Irregular

Owner of element i can be determined in O(1) time using an O(n)-sized lookup array.

Otherwise a compact data structure with fast lookup is needed → then no longer O(1).
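
As a minimal sketch in C (helper names hypothetical, n divisible by p assumed), the owner of a global element index i can be computed for the distributions above:

// Hypothetical helpers: owner of global index i for n elements on p processors.
int owner_block(int i, int n, int p)         { return i / (n / p); }   // block-wise
int owner_cyclic(int i, int p)               { return i % p; }         // cyclic
int owner_block_cyclic(int i, int k, int p)  { return (i / k) % p; }   // block-cyclic, block size k
int owner_irregular(int i, const int *owner) { return owner[i]; }      // irregular: O(n) lookup table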


Loop

Loop patterns

\small T_{\text{par}}(n,p) = \max_{0 \leq j < p} \sum_{0 \leq i < n/p} T(F(j \cdot n/p + i))

par (0 ≤ i < n) {
  a[i] = F(i,n,…);
}
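
A concrete shared-memory version of this pattern could look as follows (a sketch using OpenMP in C; F and map_loop are placeholder names, not from the original):

#include <omp.h>

double F(int i, int n) { return (double)i / n; }   // placeholder for the per-element work

void map_loop(double *a, int n) {
    // Independent iterations: a static schedule gives each thread one contiguous
    // block, matching Tpar(n,p) = max_j sum_i T(F(j*n/p + i)).
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = F(i, n);
}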

Data parallelism / data parallel pattern

Same operation applied to multiple elements (in data structure)

Map

Same operation applied to sequence.

SIMD parallelism / SIMD pattern

Single instruction (flow F) controls operation on many elements (in data structure)
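
A minimal sketch of the SIMD pattern in C, assuming an OpenMP-4-capable compiler (the saxpy kernel is only an illustrative example):

void saxpy(float *restrict y, const float *restrict x, float alpha, int n) {
    // One instruction flow, applied lane-wise to many elements per step.
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}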

Gather/Scatter

Gather/Scatter pattern

  1. Gather: collect elements of the data structure into consecutive compact blocks.

    Makes the input “denser” / more compact.

  2. Scatter: distribute the elements of consecutive compact blocks back into the data structure.

    The computation happens on the compact blocks between gather and scatter (see the sketch after this list).

  3. Reduce (as in MapReduce): combine the results into a single block.

    Insert the result into the previous input structure.
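
A minimal sketch of gather and scatter in C (the index array idx and the function names are hypothetical):

// Gather: pack the selected elements (given by the index array idx)
// into a compact, contiguous buffer.
void gather(double *dst, const double *src, const int *idx, int m) {
    for (int k = 0; k < m; k++)
        dst[k] = src[idx[k]];
}

// Scatter: write the compact results back to their original positions.
void scatter(double *dst, const double *src, const int *idx, int m) {
    for (int k = 0; k < m; k++)
        dst[idx[k]] = src[k];
}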

MapReduce

MapReduce pattern

Available data assigned to available processors

then reduced with reduction operation (here: sum).

for (i=0; i<n; i++) {
  a[i] = F(i,n,…);      // map
}
result = 0;
for (i=0; i<n; i++) {
  result += a[i];       // reduce (sum)
}
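
A shared-memory sketch of the same map + reduce with an OpenMP reduction clause (F again a placeholder, as in the loop pattern above):

#include <omp.h>

double F(int i, int n);   // placeholder for the per-element work

double map_reduce(double *a, int n) {
    double result = 0.0;
    // Map and reduce fused in one parallel loop; the reduction clause combines
    // the per-thread partial sums into result at the end of the loop.
    #pragma omp parallel for reduction(+:result)
    for (int i = 0; i < n; i++) {
        a[i] = F(i, n);    // map
        result += a[i];    // reduce (+)
    }
    return result;
}
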
  • Example: counting words in a document
    1. Distribute: assign text parts to the processors
    2. Map: each processor creates its own dictionary with (word, count) pairs
    3. Map: each processor sorts its dictionary lexicographically by word
    4. Reduce: all dictionaries are merged; for each word, all counts are added

    (requires load balancing)

Stencil

Stencil pattern (one dimensional)

Goal: iterating over the array, updating each element from its neighbors

Sequential: not data parallel, dependent iterations

for (i=1; i<n-1; i++) {
  a[i] = a[i-1] + a[i] + a[i+1];   // border elements i=0, i=n-1 need separate treatment
}

Parallel: Removing dependencies with extra space

for (i=n[j]; i<n[j+1]; i++) {
  tmp[i] = a[i-1] + a[i] + a[i+1];
}
barrier;
x = tmp; tmp = a; a = x;   // swap (more efficient than copying)

Synchronization: processor j needs access to a[n[j]-1] and a[n[j+1]] before they are updated by processors j-1 and j+1.
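
A possible shared-memory realization of one sweep with OpenMP, using the implicit barrier of the parallel loop and a pointer swap (function and variable names are illustrative; border elements are simply kept unchanged here):

#include <omp.h>

// One stencil sweep: read from a, write into tmp, then swap the pointers.
void stencil_sweep(double **a, double **tmp, int n) {
    double *in = *a, *out = *tmp;
    out[0]   = in[0];       // border elements kept unchanged (one possible border rule)
    out[n-1] = in[n-1];
    #pragma omp parallel for schedule(static)
    for (int i = 1; i < n - 1; i++)
        out[i] = in[i-1] + in[i] + in[i+1];
    // The implicit barrier at the end of the parallel loop replaces "barrier;" above;
    // swapping the pointers is cheaper than copying the array back.
    *a = out; *tmp = in;
}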

General stencil pattern

Given:

  • data structure A

    e.g. a 2D matrix

  • update rule P(A, i, j, …)

    e.g. a local neighborhood, template, stencil

    mostly O(1) per element → total work for one update O(size(A))

    a specification for the borders is needed

Pattern:

Update all elements using update rule.

Apply the stencil to all elements A(i, j, …).
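
For illustration, a sketch of a 5-point stencil update on a 2D matrix in C (one possible update rule; the borders are treated as fixed values and skipped):

// One update of all interior elements of an n x m matrix A into Anew,
// with the 5-point neighborhood as update rule P(A, i, j); borders are skipped.
void stencil2d(int n, int m, double A[n][m], double Anew[n][m]) {
    #pragma omp parallel for collapse(2)
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < m - 1; j++)
            Anew[i][j] = 0.25 * (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]);
}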

Stencil pattern parallelization through domain decomposition

Domain decomposition:

Dividing A evenly over the available processors,

each processor responsible for the updates on its own subdomain.

Can be shared or distributed memory.

Issues:

  • Borders of subdomains: writing into the subdomain of another processor → communication / synchronization with neighboring processors needed.

  • Keeping the updates synchronized.

  • Load balancing.

Data driven parallelization

Data distribution

Data distribution decides how we parallelize: “perform computations where the data is”

Assumption: locality matters because of the NUMA cost model

Partitioned Global Address Space (PGAS)

  1. Divide data structures across the processors (based on some distribution).
  2. “Owner computes” rule.
  3. If a processor accesses data owned by another processor: (implicit) communication.

Needs efficient ownership.

Example: Owner computes, compiler translates loops

for (i=n[j]; i<n[j+1]; i++) {   // loop over the local block of processor j
  a[i] = f(i);
}
for (i=0; i<n; i++) {           // equivalent global loop, guarded by the owner-computes rule
  if (owns(i,j)) a[i] = f(i);
}
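
A minimal sketch of what owns could look like for a block-wise distribution (assuming n divisible by p; n and p are kept as globals only to match the two-argument call above):

int n, p;   // problem size and number of processors (globals for this sketch)

// Block-wise distribution: processor j owns the contiguous block [j*n/p, (j+1)*n/p).
int owns(int i, int j) {
    return i / (n / p) == j;
}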

Pipeline pattern (MISD parallelism)

Linear pipeline (can be complex DAG)

Pipeline latency: after k-1 steps, the first output is produced (k = number of stages).

Pipeline throughput: in each step a new output block is produced; stage j takes time F_j(w_i) on block w_i.
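
Putting latency and throughput together, a rough cost sketch (assuming k stages that advance in synchronized steps of length max_j T(F_j), and m input blocks):

\small T_{\text{pipe}}(m) \approx \left( (k-1) + (m-1) \right) \cdot \max_j T(F_j)

so for m ≫ k roughly one output is produced every max_j T(F_j) time units; the slowest stage limits the throughput.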

Master-worker (Asymmetric work-distribution)

Master-worker

Worker processor p_i requests work from master p_0 and returns its solution (see the sketch below).


+ Pattern can potentially keep all p processors busy: S_p(n) = Θ(p).

+ Good if the work/size ratio of the subproblems is high and the workers can compute asynchronously.

+ Master solves the “termination detection problem”: no more subproblems → terminate.

− Master can become a sequential bottleneck.
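
A shared-memory sketch of the idea, where a shared atomic counter plays the role of the master's work pool (names and the OpenMP realization are illustrative, not the only option):

#include <omp.h>

double solve_subproblem(int task);   // placeholder for the per-task work

// Each worker repeatedly "requests" the next unassigned subproblem until none are left.
double master_worker(int num_tasks) {
    double total = 0.0;
    int next = 0;                     // next unassigned subproblem (shared work pool)
    #pragma omp parallel reduction(+:total)
    {
        while (1) {
            int task;
            #pragma omp atomic capture
            task = next++;            // worker requests work from the shared pool
            if (task >= num_tasks)    // termination detection: no more subproblems
                break;
            total += solve_subproblem(task);   // worker computes and returns its result
        }
    }
    return total;
}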

Load balancing: Master-worker variation

Work-pool is distributed and workers have local pools.

Communication between workers allowed:

  • work-dealing: overfull pools initiate a redistribution
  • work-stealing: processors with an empty work-pool steal/request work from others

Problem: Termination detection gets difficult.