Prefix-sums, reduction

Collective operation (pattern)

Reduction and prefix-sums are collective operations.

A set of processors cooperatively carrying out an operation on a data-set.

  • Other examples

    Broadcast: One processor has data that all the others receive

    Scatter: One processor has data of which each of the others receives its own block

    Gather: Blocks from all processors get collected by one processor

    Allgather: Blocks from all processors get collected by all processors (also called: broadcast-to-all)

    Alltoall: Each processor collects a block from every other processor
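
These patterns map one-to-one onto MPI's collectives; a minimal sketch, assuming int data and a hypothetical per-process block size COUNT (block holds COUNT elements, all and recv hold p*COUNT):

#include <mpi.h>

#define COUNT 4  // hypothetical block size per process

void collectives_demo(int *block, int *all, int *recv, int root, MPI_Comm comm) {
    MPI_Bcast(block, COUNT, MPI_INT, root, comm);                        // broadcast
    MPI_Scatter(all, COUNT, MPI_INT, block, COUNT, MPI_INT, root, comm); // scatter
    MPI_Gather(block, COUNT, MPI_INT, all, COUNT, MPI_INT, root, comm);  // gather
    MPI_Allgather(block, COUNT, MPI_INT, all, COUNT, MPI_INT, comm);     // allgather
    MPI_Alltoall(all, COUNT, MPI_INT, recv, COUNT, MPI_INT, comm);       // alltoall
}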

Definition: Reduction

Reduction problem

Given a sequence $x = (x_0, x_1, x_2, \dots, x_{n-1})$

elements can be: integers, real numbers, vectors, ...

and an associative operation $+$

with the algebraic property $x + (y + z) = (x + y) + z$,

compute

$$y = \sum_{0 \leq i < n} x_i = x_0 + x_1 + x_2 + \dots + x_{n-1}$$


  • Examples: Reduction as collective operation
    • Reduction-to-one

All processors participate. The result is stored in the "root" processor.

    • Reduction-to-all

      All processors participate. Result available to all.

    • Reduction-with-scatter

Operates on vectors; the result is stored in blocks distributed over the processors according to some rule.
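
In MPI terms these three variants correspond to MPI_Reduce, MPI_Allreduce and MPI_Reduce_scatter_block; a minimal sketch with summation as the operation (buffer sizes in the comments are assumptions):

#include <mpi.h>

#define COUNT 4  // hypothetical block size

void reduction_variants(int *x, int *y, int root, MPI_Comm comm) {
    // reduction-to-one: x, y hold COUNT elements; result only valid on root
    MPI_Reduce(x, y, COUNT, MPI_INT, MPI_SUM, root, comm);
    // reduction-to-all: result valid on every process
    MPI_Allreduce(x, y, COUNT, MPI_INT, MPI_SUM, comm);
    // reduction-with-scatter: x holds p*COUNT elements,
    // each process receives one COUNT-sized block of the result
    MPI_Reduce_scatter_block(x, y, COUNT, MPI_INT, MPI_SUM, comm);
}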

Definition: Prefix-sums

Prefix sums problem

Given a sequence $x$, compute all inclusive/exclusive prefix sums $(y_0, y_1, \dots, y_{n-1})$.

The formulas below apply for $i > 0$; for $i = 0$ the exclusive prefix sum needs a special definition (the empty sum, i.e., the neutral element of $+$).

exscan(i,n)

Exclusive prefix sum, computed by process $i$

(cannot be derived from the inclusive sum unless $+$ has an inverse operation).

$$y_i = \sum_{0 \leq j < i} x_j = x_0 + x_1 + x_2 + \dots + x_{i-1}$$

in $O(n/p + \log n)$

scan(i,n)

Inclusive prefix sum (including $x_i$), computed by process $i$

(same as the exclusive prefix sum plus $x_i$)

$$y_i = \sum_{0 \leq j \leq i} x_j = x_0 + x_1 + x_2 + \dots + x_i$$
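
For example, with ordinary addition on $x = (1, 2, 3, 4)$:

$$\mathrm{scan}: y = (1, 3, 6, 10), \qquad \mathrm{exscan}: y = (0, 1, 3, 6)$$

where the leading $0$ is the neutral element of $+$.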

Theorem: Speedup is at most $p/2$

For computing the prefix sums of an $n$-element input sequence, the following trade-off holds between:

$s$: size, the number of $+$ operations

$t$: depth, $T_\infty$

$$s + t \geq 2n - 2$$
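
One way to see how this trade-off caps the speedup (a sketch, using $T_p \geq t$, $T_p \geq s/p$ and $\text{Tseq}(n) = n-1$):

$$T_p \geq \frac{s}{p} \geq \frac{2n-2-t}{p} \geq \frac{2n-2-T_p}{p} \;\Rightarrow\; (p+1)\,T_p \geq 2(n-1) \;\Rightarrow\; S_p = \frac{n-1}{T_p} \leq \frac{p+1}{2} \approx \frac{p}{2}$$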

Sequential solution

Solution to both reduction and prefix sums problem.

This is the best possible, since the output $y_{n-1}$ depends on every $x_i$.

Work $\text{Wseq}(n) \in O(n)$ total operations

Fastest possible parallel $T_\infty(n) \in O(n)$

Applications of $+$: $n-1$

Time complexity $\text{Tseq}(n) \in \Theta(n)$


  • implementation (not the best)
    for (i=1; i<n; i++) {
        x[i] = x[i-1] + x[i];
    }

    $2(n-1)$ memory reads

    $\text{Tseq}(n) = n-1 = O(n)$ summations

  • implementation - improved

    1 read, 1 write per iteration

    the compiler's optimizer can improve performance significantly

    register int sum = x[0];  // compiler may keep sum in a register for faster access
    for (i=1; i<n; i++) {
        sum += x[i];  // one read
        x[i] = sum;   // one write
    }
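
  • implementation - exclusive prefix sums

    The same pattern gives the exclusive prefix sums; a minimal sketch, assuming 0 is the neutral element of $+$:

    int sum = 0;               // empty prefix for i = 0
    for (i=0; i<n; i++) {
        int next = sum + x[i]; // inclusive sum up to i
        x[i] = sum;            // store the exclusive sum instead
        sum = next;
    }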

Parallel solution (intuitively)

Solution to both reduction and prefix sums problem.

What we ideally want:

Work $\text{Wpar}(n) \in O(\text{Wseq}(n)) = O(n)$

Fastest possible parallel $T_\infty(n) \in O(\log n)$

Applications of $+$: close to $n-1$

Time complexity $\text{Tpar}(n) \in O(n/p + T_\infty(n))$ for a large range of $p$

On most architectures a depth of $\Omega(\log n)$ is reasonable for a work-optimal solution.

We want to make use of the associativity of the operator:

There are 3 parallel solutions to inclusive prefix sums:

  1. recursive: fast, work-optimal
  2. iterative: fast, work-optimal
  3. doubling: faster, not work-optimal but still useful

1) Recursive

Recursive solution

Sum pairwise, then recurse on smaller problems.

Scan(x,n) {
    if (n==1) {
        return;
    }
    // sum pairwise into the auxiliary array y of size n/2
    for (i=0; i<n/2; i++) {
        y[i] = x[2*i] + x[2*i+1];
    }
    // solve recursively in y -> implicit (with fork-join) or explicit barrier
    Scan(y,n/2);
    // take back: combine the recursive prefix sums with the even-indexed elements
    x[1] = y[0];
    for (i=1; i<n/2; i++) {
        x[2*i] = y[i-1] + x[2*i];
        x[2*i+1] = y[i];
    }
    if (odd(n)) {
        x[n-1] = y[n/2-1] + x[n-1];
    }
}
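
A runnable C version of the same sketch, with the auxiliary array allocated explicitly and the two data-parallel loops expressed with OpenMP (an assumption; the pseudocode above leaves the parallelization model open):

#include <stdlib.h>

// Recursive inclusive prefix sums over int with + (a sketch).
void Scan(int *x, int n) {
    if (n == 1) return;
    int *y = malloc((n/2) * sizeof(int)); // auxiliary array
    #pragma omp parallel for              // pairwise sums: one parallel step
    for (int i = 0; i < n/2; i++)
        y[i] = x[2*i] + x[2*i+1];
    Scan(y, n/2);                         // recurse on the half-sized problem
    x[1] = y[0];
    #pragma omp parallel for              // combine back: one parallel step
    for (int i = 1; i < n/2; i++) {
        x[2*i]   = y[i-1] + x[2*i];
        x[2*i+1] = y[i];
    }
    if (n % 2 == 1) x[n-1] = y[n/2-1] + x[n-1];
    free(y);
}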

Summary

Work $W(n) \in O(n)$

Time $\text{Tpar}(p,n) = W(n)/p + T_\infty(n) = O(n/p + \log n)$

Parallelism $\text{Tseq}(n)/T_\infty(n) = n/\log n$ processors

Parallel steps $T_\infty(n) = 2 \log n$ (recursive calls)

2 synchronizations per recursive call

$+$ operations: $2n \in O(n)$


Advantages

  • The smaller $y$ array might fit in cache
  • Pairwise summing has good spatial locality

Drawbacks

  • Space: an extra $n/2$-sized array in each recursive call ($n$ in total)
  • About $2n$ operations of $+$ (compared to sequential: $n-1$)
  • $2 \log_2(n)$ parallel steps

2) Iterative

Eliminates recursion and extra space.

Can only compute the sums for arrays with $n = 2^k$ elements,

in $\log_2(n)$ rounds $i = 0, 1, \dots, \log_2(n)-1$:

In round $i$, the $+$ operation is done for every $2^{i+1}$-th element.

This way:

$$x[2^k - 1] = \sum_{0 \leq i < 2^k} x_i$$

Lemma

Reduction can be carried out in $r = \log_2(n)$ synchronized rounds (if $n$ is a power of two).

The total number of $+$ operations is $n/2 + n/4 + n/8 + \dots < n$ (exactly $n-1$).

Geometric series: $\sum_{0 \leq i \leq n} a \cdot r^i = a \cdot \frac{1 - r^{n+1}}{1 - r}$
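
Plugging in: the per-round costs form a geometric series with first term $n/2$ and ratio $1/2$, so for $n = 2^r$

$$\sum_{1 \leq i \leq \log_2 n} \frac{n}{2^i} = n \left(1 - \frac{1}{n}\right) = n - 1$$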

Iterative solution

// up-phase: reduction over strides 1, 2, 4, ...
for (k=1; k<n; k=kk) {
    kk = k<<1; // double the stride
    par (i=kk-1; i<n; i+=kk) {
        x[i] = x[i-k] + x[i];
    }
    barrier;
}
// down-phase: halve the stride, fill in the remaining prefixes
for (k=k>>1; k>1; k=kk) {
    kk = k>>1; // halve the stride
    par (i=k-1; i<n-kk; i+=k) {
        x[i+kk] = x[i] + x[i+kk];
    }
    barrier;
}
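
A runnable C rendering of the same scheme, with par/barrier expressed as OpenMP loops (an assumption; the implicit barrier ending each parallel for provides the synchronization), assuming $n$ is a power of two:

// In-place inclusive prefix sums of x[0..n-1], n a power of two.
void scan_updown(int *x, int n) {
    int k, kk;
    for (k = 1; k < n; k = kk) {       // up-phase
        kk = k << 1;
        #pragma omp parallel for
        for (int i = kk - 1; i < n; i += kk)
            x[i] = x[i-k] + x[i];
    }
    for (k = k >> 1; k > 1; k = kk) {  // down-phase
        kk = k >> 1;
        #pragma omp parallel for
        for (int i = k - 1; i < n - kk; i += k)
            x[i+kk] = x[i] + x[i+kk];
    }
}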

Parallelization of inner loop

The inner for-loop can be parallelized (not the outer one).

In round $k$, the inner loop performs $n/2^{k+1}$ applications of $+$, all of which can be done concurrently.

Total work $\approx 2n \in O(\text{Tseq}(n))$

This is a factor of about $2$ more than the sequential $n-1$ work.

Invariants

  • Invariant for Up-Phase

    $x[i] = X[i-2^k+1] + \dots + X[i]$

    for $i = j \cdot 2^k - 1$ with $j = 1, 2, 3, \dots$

    and $k = 0, 1, \dots, \lfloor \log_2(n) \rfloor$

  • Invariant for Down-Phase

    $x[i] = X[0] + \dots + X[i]$

    for $i = j \cdot 2^k - 1$ with $j = 1, 2, 3, \dots$

    and $k = \lfloor \log_2(n) \rfloor, \lfloor \log_2(n) \rfloor - 1, \dots$

    (where $X$ denotes the original input values)

Distributed programming models

Synchronization can get very expensive for large $n$ relative to $p$ in distributed-memory models:

$2 \cdot \log_2(n)$ communication rounds, round $k$ with $n/2^k$ concurrent communication operations,

$2n$ operations in total.

Since often $n \gg p$, this is too expensive.
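
The usual remedy is to apply the parallel algorithm only to $p$ per-block sums: each process scans its own $n/p$ elements, and one exscan over the block totals supplies the offsets. A minimal MPI sketch, assuming int data, summation, and at least one element per process:

#include <mpi.h>

// Inclusive prefix sums of a block-distributed array.
void block_scan(int *x, int n_local, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    for (int i = 1; i < n_local; i++)  // local inclusive prefix sums: O(n/p)
        x[i] += x[i-1];
    int offset = 0;                    // sum of all preceding blocks
    MPI_Exscan(&x[n_local-1], &offset, 1, MPI_INT, MPI_SUM, comm); // O(log p) rounds
    if (rank > 0)                      // offset is undefined on rank 0
        for (int i = 0; i < n_local; i++)
            x[i] += offset;
}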

Summary

Work $\text{W}(n) \approx 2n \in O(\text{Tseq}(n))$

Time $2 \cdot \lfloor \log_2(n) \rfloor$ rounds, round $k$ taking $O(n/2^{k+1})$ operations; $O(n/p + \log n)$ with $p$ processors

Linear speedup $S_p(n) \in O(p/2)$ - effectively half the processors are lost

$+$ operations: about $2n \in O(n)$


Advantages

  • In-place (no extra space required: input and output are the same array)
  • Work-optimal, simple, parallelizable loops
  • No recursive overhead

Disadvantages

  • Less cache-friendly than the recursive solution: less spatial locality with increasing step size
  • $2 \cdot \lfloor \log_2(n) \rfloor$ rounds
  • About $2n$ operations with $+$

3) Doubling

Faster but not work-optimal solution.

In each round every processor computes; we are no longer using only every $k$-th processor as in the previous solution.

Invariance

$x[i] = x[i-k] + x[i]$

with $k = 2^{k'}$ in round $k'$,

and $\lceil \log_2 n \rceil$ rounds needed in total.

int *y = (int*)malloc(n*sizeof(int));
int *t;
for (k=1; k<n; k=k*2) {
    par (i=0; i<k; i++) y[i] = x[i];        // first k elements are already final
    par (i=k; i<n; i++) y[i] = x[i-k]+x[i]; // add the element k positions back
    barrier;
    t = x; x = y; y = t; // swap x, y
}

Invariant

Before each iteration with stride $k$:

$$\forall i : x[i] = \sum_{\max(0,\, i-k+1) \leq j \leq i} X[j]$$
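
A runnable C rendering of the doubling scheme (assumptions: OpenMP for the parallel loop; the buffer swap is resolved so the result ends up back in the caller's array):

#include <stdlib.h>
#include <string.h>

// Doubling scan: ceil(log2 n) rounds, O(n log n) additions, not work-optimal.
void scan_doubling(int *orig, int n) {
    int *x = orig;
    int *y = malloc(n * sizeof(int));
    for (int k = 1; k < n; k *= 2) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = (i < k) ? x[i] : x[i-k] + x[i]; // first k elements are final
        int *t = x; x = y; y = t;                  // swap buffers after the round
    }
    if (x != orig) {  // odd number of rounds: result sits in the temporary buffer
        memcpy(orig, x, n * sizeof(int));
        free(x);
    } else {
        free(y);
    }
}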

Summary

Advantages

  • Only $\lceil \log_2 p \rceil$ rounds (and synchronization / communication steps)
  • Simple, parallelizable loops
  • No recursive overhead

Drawbacks

  • Not work-optimal
  • Less spatial locality, less cache-friendly with increasing step size $k$
  • Extra array of size $n$ needed to eliminate dependencies