
Applications of prefix-sums and reduction

Applications of reduction

Cutoff computation

done = allreduce(localdone, AND);

localdone = $\small \forall i: x[i] < \varepsilon$ (convergence check)

all $\small i$ are local

Here AND is the associative operation, and the result is distributed to all processes (reduction-to-all).

// Parallelizable part
do {
    localdone = true;
    for (i=0; i<n; i++) {
        x[i] = f(i);
        if (!(x[i] < eps)) localdone = false; // local convergence check
    }
    // global convergence check (reduction-to-all)
    done = allreduce(localdone, AND);
} while (!done)
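
As a concrete instance, the convergence test can be expressed with MPI_Allreduce using the logical AND operation MPI_LAND. This is only a sketch: the update function f, the tolerance eps, and the local array size nlocal are illustrative placeholders, not part of the original pseudocode.

// Sketch of the iteration with MPI; f(), eps and nlocal are placeholders.
#include <mpi.h>

extern double f(int i); // assumed local update function

void iterate(double *x, int nlocal, double eps, MPI_Comm comm) {
    int done = 0;
    do {
        int localdone = 1;
        // parallelizable part: each process updates its local elements
        for (int i = 0; i < nlocal; i++) {
            x[i] = f(i);
            if (!(x[i] < eps)) localdone = 0; // local convergence check
        }
        // reduction-to-all with AND: every process learns whether all converged
        MPI_Allreduce(&localdone, &done, 1, MPI_INT, MPI_LAND, comm);
    } while (!done);
}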

Applications of prefix-sums

Array compaction: Load balancing

Sequential solution

Input arrays: a, active

for (i=0; i<n; i++) {
    if (active[i]) {
        a[i] = f(b[i]+c[i]);
    }
}

Work: $\small O(n + |\texttt{active}| \cdot f)$

where $\small |\texttt{active}|$ is the number of indices set to true

Solution: Without array compaction

There are no dependencies between loop iterations → Data parallel computation.

par (0 <= j < p) {
    // i: j*(n/p), ..., (j+1)*(n/p)-1
    for (i=j*(n/p); i<(j+1)*(n/p); i++) {
        if (active[i]) {
            a[i] = f(b[i]+c[i]);
        }
    }
}

Work: $\small O(n) + O(|\texttt{active}| \cdot f)$

Time: $\small O(|\texttt{active}| \cdot f)$ (worst case: all active indices fall into one processor's block)

Static assignment of work to processors → load imbalance: processors whose blocks contain mostly active[i] == false indices do little work.

Solution: With array compaction

Lower the work to $\small O(|\texttt{active}| \cdot f)$ by iterating only over the active indices:

Split into blocks of size $\small |\texttt{active}|/p$

par (i=0; i<n; i++) {
    index[i] = active[i] ? 1 : 0;
}
Exscan(index,n); // exclusive prefix computation
m = index[n-1] + (active[n-1] ? 1 : 0); // m = |active| -> used below
par (i=0; i<n; i++) {
    if (active[i]) {
        compact[index[i]] = i;
    }
}
par (j=0; j<m; j++) {
    i = compact[j];
    a[i] = f(b[i]+c[i]);
}
  • Example
                0     1      2      3     4      -> n = 5
    active  = [true, true, false, false, true]   -> m = 3
    index   = [ 1,    1,     0,     0,    1  ]
    ↓ Exscan
    index   = [ 0,    1,     2,     2,    2  ]

    Now index[i] gives, for each active index i, its position in the compact array.

Work: $\small O(n) + O(|\texttt{active}| \cdot f)$

Time: $\small O(n/p) + \text{Exscan}(p,n) + O((|\texttt{active}| \cdot f)/p)$

where $\small \text{Exscan}(p,n) = O(n/p + \log n)$
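
The following is a sequential C sketch of the compaction scheme above: the par loops are written as ordinary loops and Exscan as an in-place sequential exclusive prefix sum; f, b and c are the placeholders from the pseudocode.

// Sequential sketch of array compaction via an exclusive prefix sum.
// In the parallel setting the three loops are "par" loops and exscan()
// is the parallel Exscan primitive.
#include <stdlib.h>

void exscan(int *x, int n) {              // exclusive prefix sum, in place
    int sum = 0;
    for (int i = 0; i < n; i++) { int t = x[i]; x[i] = sum; sum += t; }
}

void compact_apply(double *a, const double *b, const double *c,
                   const int *active, int n, double (*f)(double)) {
    int *index   = malloc(n * sizeof *index);
    int *compact = malloc(n * sizeof *compact);
    for (int i = 0; i < n; i++) index[i] = active[i] ? 1 : 0;
    exscan(index, n);
    int m = index[n-1] + (active[n-1] ? 1 : 0);    // m = |active|
    for (int i = 0; i < n; i++)
        if (active[i]) compact[index[i]] = i;      // gather the active indices
    for (int j = 0; j < m; j++) {                  // balanced loop over m items
        int i = compact[j];
        a[i] = f(b[i] + c[i]);
    }
    free(index); free(compact);
}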

Array compaction: Partitioning for QuickSort

Quicksort algorithm

Quicksort(a, n)

  1. Select pivot a[k]
  1. Partition a into

    a[0,...,n1-1]: elements < pivot

    a[n1,...,n2-1]: elements = pivot

    a[n2,...,n-1]: elements > pivot

  1. In parallel (see the sketch below):

    Quicksort(a, n1)

    Quicksort(a+n2, n-n2)
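
A recursion skeleton for these steps, written as a sequential C sketch; partition3() is only a placeholder name for the prefix-sums partitioning developed below, and in the parallel version the two recursive calls run concurrently.

// Sketch of the Quicksort recursion; partition3() is a hypothetical helper
// standing for the prefix-sums based three-way partitioning shown below.
void quicksort(double *a, int n) {
    if (n <= 1) return;
    int n1, n2;
    partition3(a, n, &n1, &n2);   // a[0..n1-1] < pivot, a[n1..n2-1] == pivot
    quicksort(a, n1);             // in the parallel version: run in parallel ...
    quicksort(a + n2, n - n2);    // ... with this call
}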

Parallel partitioning algorithm

par (i=0; i<n; i++) {
    index[i] = (a[i]<a[k]) ? 1 : 0;
}
Exscan(index,n); // exclusive prefix computation
n1 = index[n-1] + ((a[n-1]<a[k]) ? 1 : 0); // number of smaller elements
par (i=0; i<n; i++) {
    if (a[i]<a[k]) {
        aa[index[i]] = a[i]; // store each element smaller than a[k]
    }
}
// same for the elements equal to and larger than the pivot
...
// copy back
par (i=0; i<n1; i++) {
    a[i] = aa[i];
}
// same for the elements equal to and larger than the pivot
...
  • Example: getting 3 partitions

    find smaller

             0  1  2  3  4       n = 5 (indices)
    a =     [1, 4, 5, 3, 2]      k = 3, pivot a[k] = 3
    index = [0, 1, 1, 1, 1]      (after Exscan for elements < a[k])
    if (a[i] < a[k])             // i in {0, 4}
        aa[index[i]] = a[i];     // aa[0] = 1, aa[1] = 2
    aa = [1, 2]

    find equal

             0  1  2  3  4       n = 5 (indices)
    a =     [1, 4, 5, 3, 2]      k = 3
    index = [0, 0, 0, 0, 1]      (after Exscan for elements = a[k])
    if (a[i] == a[k])            // i in {3}
        ab[index[i]] = a[i];     // ab[0] = 3
    ab = [3]

    find larger

             0  1  2  3  4       n = 5 (indices)
    a =     [1, 4, 5, 3, 2]      k = 3
    index = [0, 0, 1, 2, 2]      (after Exscan for elements > a[k])
    if (a[i] > a[k])             // i in {1, 2}
        ac[index[i]] = a[i];     // ac[0] = 4, ac[1] = 5
    ac = [4, 5]
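
The same three-way partitioning as a sequential C sketch: one indicator array and one exclusive prefix sum per bucket. exscan() is the helper from the compaction sketch above, and the pivot choice a[n/2] is illustrative only.

// Sequential sketch of the prefix-sums three-way partitioning.
// The "par" loops become ordinary loops.
#include <stdlib.h>

void exscan(int *x, int n);   // exclusive prefix sum, as in the compaction sketch

void partition3(double *a, int n, int *pn1, int *pn2) {
    double pivot = a[n/2];                      // pivot choice is illustrative
    int *index = malloc(n * sizeof *index);
    double *aa = malloc(n * sizeof *aa);        // elements <  pivot
    double *ab = malloc(n * sizeof *ab);        // elements == pivot
    double *ac = malloc(n * sizeof *ac);        // elements >  pivot

    for (int i = 0; i < n; i++) index[i] = (a[i] < pivot) ? 1 : 0;
    exscan(index, n);
    int n1 = index[n-1] + ((a[n-1] < pivot) ? 1 : 0);
    for (int i = 0; i < n; i++)
        if (a[i] < pivot) aa[index[i]] = a[i];

    for (int i = 0; i < n; i++) index[i] = (a[i] == pivot) ? 1 : 0;
    exscan(index, n);
    int ne = index[n-1] + ((a[n-1] == pivot) ? 1 : 0);
    for (int i = 0; i < n; i++)
        if (a[i] == pivot) ab[index[i]] = a[i];

    for (int i = 0; i < n; i++) index[i] = (a[i] > pivot) ? 1 : 0;
    exscan(index, n);
    for (int i = 0; i < n; i++)
        if (a[i] > pivot) ac[index[i]] = a[i];

    // copy back: smaller, equal, larger
    for (int i = 0; i < n1; i++)              a[i]           = aa[i];
    for (int i = 0; i < ne; i++)              a[n1 + i]      = ab[i];
    for (int i = 0; i < n - n1 - ne; i++)     a[n1 + ne + i] = ac[i];

    *pn1 = n1;
    *pn2 = n1 + ne;
    free(index); free(aa); free(ab); free(ac);
}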

Runtime analysis

Running time: $\small T_\infty$ (if the choice of pivot is good, i.e. at most $\small n/2$ elements in each of the two segments)

Sequential

$\small T_\infty(n) \leq T_\infty(n/2) + O(n)$

Solution: $\small T_\infty(n) \in O(n)$

Parallel

$\small T_\infty(n) \leq T_\infty(n/2) + O(\log n)$

Solution: $\small T_\infty(n) \in O(\log^2 n)$

Speedup

$\small O(n/\log n)$
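
A brief derivation, assuming the balanced pivot from above and sequential Quicksort time $\small O(n \log n)$:

$\small T_\infty(n) \leq T_\infty(n/2) + O(\log n) \leq \sum_{j=0}^{\log n} O(\log(n/2^j)) = O(\log^2 n)$

$\small \text{Speedup} = \frac{\text{Tseq}(n)}{T_\infty(n)} = \frac{O(n \log n)}{O(\log^2 n)} = O(n/\log n)$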

Blocking technique

Use when applicable (not always applicable).

Goal: reduce the number of communication / synchronization steps.

Blocking technique (alternative view)

  1. Use a work-optimal algorithm to shrink the problem

    Make blocks of size $\small \Theta(n/p)$ using $\small O(f(p,n))$ time steps per processor.

  1. Use a fast, possibly non-work-optimal algorithm on the shrunk problem

    Solve the $\small p$ subproblems in parallel.

  1. Unshrink, compute the final solution with a work-optimal algorithm

    Combine into the final solution using $\small O(g(p,n))$ time steps per processor.


Usually we can get from $\small O(n)$ to $\small O(n/\log n)$.

The resulting parallel algorithm is cost-optimal if $\small f(p,n) \in O(n/p)$ and $\small g(p,n) \in O(n/p)$.
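
For instance, assuming the fast middle step runs in $\small O(\log p)$ time (as in the prefix-sums instantiation below), the total cost is

$\small p \cdot T_{\text{par}}(p,n) = p \cdot \big( O(n/p) + O(\log p) + O(n/p) \big) = O(n + p \log p) \in O(n)$ for $\small p \in O(n/\log n)$.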

Blocking technique for prefix-sums algorithms (in depth)

  1. Each processor computes the prefix sums of its block locally, without synchronization.

    Each processor $\small i$ has a block of $\small n/p$ elements:

    $\small x[j]$ where $\small j = i \cdot n/p,~\dots,~(i+1)\cdot n/p - 1$

    $\small O(1)$ communication/synchronization steps

    $\small T = n/p$

    $\small W = n$

  1. Exscan(y, p) over the $\small p$ block sums

    takes $\small O(\log p)$ communication rounds / synchronization steps and $\small O(p)$ work.

    Does not need to be work-optimal.

    $\small T = O(n/p)$

    $\small W = O(n)$ with $\small p$ processors — i.e. $\small O(\log p)$ for $\small p \in O(n/\log n)$

  1. Processor $\small i$ adds the exclusive prefix sum $\small y[i]$ to all $\small x[j]$ in its block

    Completing the computation by local postprocessing of the blocks (all three phases are sketched below).

    $\small T = n/p$

    $\small W = n$
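
A sequential C sketch of the three phases. Each block is handled here by an ordinary loop; in the parallel version each block belongs to one processor and only the Exscan over y needs communication. For brevity the sketch assumes p divides n.

// Sequential sketch of the blocked (inclusive) prefix-sums algorithm.
#include <stdlib.h>

void blocked_prefix_sums(double *x, int n, int p) {
    double *y = malloc(p * sizeof *y);
    int b = n / p;                              // block size n/p

    // Phase 1: local prefix sums within each block, y[i] = block sum
    for (int i = 0; i < p; i++) {
        for (int j = i*b + 1; j < (i+1)*b; j++) x[j] += x[j-1];
        y[i] = x[(i+1)*b - 1];
    }

    // Phase 2: exclusive prefix sum over the p block sums (the only step
    // that needs communication/synchronization in the parallel version)
    double sum = 0;
    for (int i = 0; i < p; i++) { double t = y[i]; y[i] = sum; sum += t; }

    // Phase 3: each block adds its exclusive prefix y[i] to all its elements
    for (int i = 0; i < p; i++)
        for (int j = i*b; j < (i+1)*b; j++) x[j] += y[i];

    free(y);
}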

Invariant

Total work ($\small +$ operations) is $\small 2n + p \log p$.

This is at least twice that of $\small \text{Tseq}(n)$, since sequential prefix sums need only $\small n-1$ additions.

Invariant

Possibly better performance if the reduction is done first, with the prefix sums following afterwards.

  1. Prefix first: $\small 2n$ reads, $\small 2n$ writes per block
  1. Reduction first: $\small 2n$ reads, $\small n$ writes per block

But both need $\small \geq 2n-1$ operations with $\small +$.
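
For comparison, a sketch of the reduction-first variant under the same assumptions: phase 1 only computes the block sums (so it writes $\small p$ values instead of $\small n$), and the local prefix sums are computed together with the offset in phase 3.

// Sequential sketch of the reduction-first variant: phase 1 does a local
// reduction per block, phase 3 computes the local prefix sums starting
// from the block offset. Assumes p divides n.
#include <stdlib.h>

void blocked_prefix_sums_redfirst(double *x, int n, int p) {
    double *y = malloc(p * sizeof *y);
    int b = n / p;

    // Phase 1: local reduction (block sums only)
    for (int i = 0; i < p; i++) {
        double s = 0;
        for (int j = i*b; j < (i+1)*b; j++) s += x[j];
        y[i] = s;
    }

    // Phase 2: exclusive prefix sum over the p block sums
    double sum = 0;
    for (int i = 0; i < p; i++) { double t = y[i]; y[i] = sum; sum += t; }

    // Phase 3: local prefix sums, starting from the block offset y[i]
    for (int i = 0; i < p; i++) {
        double s = y[i];
        for (int j = i*b; j < (i+1)*b; j++) { s += x[j]; x[j] = s; }
    }

    free(y);
}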