
Applications of prefix-sums and reduction

Applications of reduction

Cutoff computation

done = allreduce(localdone, AND);

localdone = $\small \forall i: x[i] < \varepsilon$ (convergence check)

all $\small i$ are local

Here AND is the associative operation, and the result is distributed to all processes (reduction-to-all).

// Parallelizable part
do {
    localdone = true;
    for (i=0; i<n; i++) {
        x[i] = f(i);
        if (!(x[i] < eps)) localdone = false; // local convergence check
    }
    // global convergence check (reduction-to-all)
    done = allreduce(localdone, AND);
} while (!done)
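
As a concrete instance, the convergence test can be expressed with MPI_Allreduce using the logical AND operation MPI_LAND. This is only a sketch: the update function f, the tolerance eps, and the local array size nlocal are illustrative placeholders, not part of the original pseudocode.

// Sketch of the iteration with MPI; f(), eps and nlocal are placeholders.
#include <mpi.h>

extern double f(int i); // assumed local update function

void iterate(double *x, int nlocal, double eps, MPI_Comm comm) {
    int done = 0;
    do {
        int localdone = 1;
        // parallelizable part: each process updates its local elements
        for (int i = 0; i < nlocal; i++) {
            x[i] = f(i);
            if (!(x[i] < eps)) localdone = 0; // local convergence check
        }
        // reduction-to-all with AND: every process learns whether all converged
        MPI_Allreduce(&localdone, &done, 1, MPI_INT, MPI_LAND, comm);
    } while (!done);
}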

Applications of prefix-sums

Array compaction: Load balancing

Sequential solution

Input arrays: a, active

for (i=0; i<n; i++) {
    if (active[i]) {
        a[i] = f(b[i]+c[i]);
    }
}

Work: $\small O(n + |\texttt{active}| \cdot f)$

where $\small |\texttt{active}|$ is the number of indices set to true

Solution: Without array compaction

There are no dependencies between loop iterations → Data parallel computation.

par (0 <= j < p) {
    // i: j*(n/p), ..., (j+1)*(n/p)-1
    for (i=j*(n/p); i<(j+1)*(n/p); i++) {
        if (active[i]) {
            a[i] = f(b[i]+c[i]);
        }
    }
}

Work: $\small O(n) + O(|\texttt{active}| \cdot f)$

Time: $\small O(|\texttt{active}| \cdot f)$ (worst case: all active indices fall into one processor's block)

Static assignment of work to processors → load imbalance: processors whose blocks contain mostly active[i] == false indices do little work.

Solution: With array compaction

Lower the work to $\small O(|\texttt{active}| \cdot f)$ by iterating only over the active indices:

Split into blocks of size $\small |\texttt{active}|/p$

par (i=0; i<n; i++) {
    index[i] = active[i] ? 1 : 0;
}
Exscan(index,n); // exclusive prefix computation
m = index[n-1] + (active[n-1] ? 1 : 0); // m = |active| -> used below
par (i=0; i<n; i++) {
    if (active[i]) {
        compact[index[i]] = i;
    }
}
par (j=0; j<m; j++) {
    i = compact[j];
    a[i] = f(b[i]+c[i]);
}
  • Example
                0     1      2      3     4      -> n = 5
    active  = [true, true, false, false, true]   -> m = 3
    index   = [ 1,    1,     0,     0,    1  ]
    ↓ Exscan
    index   = [ 0,    1,     2,     2,    2  ]

    Now index[i] gives, for each active index i, its position in the compact array.

Work: $\small O(n) + O(|\texttt{active}| \cdot f)$

Time: $\small O(n/p) + \text{Exscan}(p,n) + O((|\texttt{active}| \cdot f)/p)$

where $\small \text{Exscan}(p,n) = O(n/p + \log n)$
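
The following is a sequential C sketch of the compaction scheme above: the par loops are written as ordinary loops and Exscan as an in-place sequential exclusive prefix sum; f, b and c are the placeholders from the pseudocode.

// Sequential sketch of array compaction via an exclusive prefix sum.
// In the parallel setting the three loops are "par" loops and exscan()
// is the parallel Exscan primitive.
#include <stdlib.h>

void exscan(int *x, int n) {              // exclusive prefix sum, in place
    int sum = 0;
    for (int i = 0; i < n; i++) { int t = x[i]; x[i] = sum; sum += t; }
}

void compact_apply(double *a, const double *b, const double *c,
                   const int *active, int n, double (*f)(double)) {
    int *index   = malloc(n * sizeof *index);
    int *compact = malloc(n * sizeof *compact);
    for (int i = 0; i < n; i++) index[i] = active[i] ? 1 : 0;
    exscan(index, n);
    int m = index[n-1] + (active[n-1] ? 1 : 0);    // m = |active|
    for (int i = 0; i < n; i++)
        if (active[i]) compact[index[i]] = i;      // gather the active indices
    for (int j = 0; j < m; j++) {                  // balanced loop over m items
        int i = compact[j];
        a[i] = f(b[i] + c[i]);
    }
    free(index); free(compact);
}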

Array compaction: Partitioning for QuickSort

Quicksort algorithm

Quicksort(a, n)

  1. Select pivot a[k]
  1. Partition a into

    a[0,...,n1-1]: elements < pivot

    a[n1,...,n2-1]: elements = pivot

    a[n2,...,n-1]: elements > pivot

  1. In parallel (see the sketch below):

    Quicksort(a, n1)

    Quicksort(a+n2, n-n2)
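
A recursion skeleton for these steps, written as a sequential C sketch; partition3() is only a placeholder name for the prefix-sums partitioning developed below, and in the parallel version the two recursive calls run concurrently.

// Sketch of the Quicksort recursion; partition3() is a hypothetical helper
// standing for the prefix-sums based three-way partitioning shown below.
void quicksort(double *a, int n) {
    if (n <= 1) return;
    int n1, n2;
    partition3(a, n, &n1, &n2);   // a[0..n1-1] < pivot, a[n1..n2-1] == pivot
    quicksort(a, n1);             // in the parallel version: run in parallel ...
    quicksort(a + n2, n - n2);    // ... with this call
}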

Parallel partitioning algorithm

par (i=0; i<n; i++) {
    index[i] = (a[i]<a[k]) ? 1 : 0;
}
Exscan(index,n); // exclusive prefix computation
n1 = index[n-1] + ((a[n-1]<a[k]) ? 1 : 0); // number of smaller elements
par (i=0; i<n; i++) {
    if (a[i]<a[k]) {
        aa[index[i]] = a[i]; // store each element smaller than a[k]
    }
}
// same for the elements equal to and larger than the pivot
...
// copy back
par (i=0; i<n1; i++) {
    a[i] = aa[i];
}
// same for the elements equal to and larger than the pivot
...
  • Example: getting 3 partitions

    find smaller

             0  1  2  3  4       n = 5 (indices)
    a =     [1, 4, 5, 3, 2]      k = 3, pivot a[k] = 3
    index = [0, 1, 1, 1, 1]      (after Exscan for elements < a[k])
    if (a[i] < a[k])             // i in {0, 4}
        aa[index[i]] = a[i];     // aa[0] = 1, aa[1] = 2
    aa = [1, 2]

    find equal

             0  1  2  3  4       n = 5 (indices)
    a =     [1, 4, 5, 3, 2]      k = 3
    index = [0, 0, 0, 0, 1]      (after Exscan for elements = a[k])
    if (a[i] == a[k])            // i in {3}
        ab[index[i]] = a[i];     // ab[0] = 3
    ab = [3]

    find larger

             0  1  2  3  4       n = 5 (indices)
    a =     [1, 4, 5, 3, 2]      k = 3
    index = [0, 0, 1, 2, 2]      (after Exscan for elements > a[k])
    if (a[i] > a[k])             // i in {1, 2}
        ac[index[i]] = a[i];     // ac[0] = 4, ac[1] = 5
    ac = [4, 5]
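
The same three-way partitioning as a sequential C sketch: one indicator array and one exclusive prefix sum per bucket. exscan() is the helper from the compaction sketch above, and the pivot choice a[n/2] is illustrative only.

// Sequential sketch of the prefix-sums three-way partitioning.
// The "par" loops become ordinary loops.
#include <stdlib.h>

void exscan(int *x, int n);   // exclusive prefix sum, as in the compaction sketch

void partition3(double *a, int n, int *pn1, int *pn2) {
    double pivot = a[n/2];                      // pivot choice is illustrative
    int *index = malloc(n * sizeof *index);
    double *aa = malloc(n * sizeof *aa);        // elements <  pivot
    double *ab = malloc(n * sizeof *ab);        // elements == pivot
    double *ac = malloc(n * sizeof *ac);        // elements >  pivot

    for (int i = 0; i < n; i++) index[i] = (a[i] < pivot) ? 1 : 0;
    exscan(index, n);
    int n1 = index[n-1] + ((a[n-1] < pivot) ? 1 : 0);
    for (int i = 0; i < n; i++)
        if (a[i] < pivot) aa[index[i]] = a[i];

    for (int i = 0; i < n; i++) index[i] = (a[i] == pivot) ? 1 : 0;
    exscan(index, n);
    int ne = index[n-1] + ((a[n-1] == pivot) ? 1 : 0);
    for (int i = 0; i < n; i++)
        if (a[i] == pivot) ab[index[i]] = a[i];

    for (int i = 0; i < n; i++) index[i] = (a[i] > pivot) ? 1 : 0;
    exscan(index, n);
    for (int i = 0; i < n; i++)
        if (a[i] > pivot) ac[index[i]] = a[i];

    // copy back: smaller, equal, larger
    for (int i = 0; i < n1; i++)              a[i]           = aa[i];
    for (int i = 0; i < ne; i++)              a[n1 + i]      = ab[i];
    for (int i = 0; i < n - n1 - ne; i++)     a[n1 + ne + i] = ac[i];

    *pn1 = n1;
    *pn2 = n1 + ne;
    free(index); free(aa); free(ab); free(ac);
}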

Runtime analysis

Running time: $\small T_\infty$ (if the choice of pivot is good, i.e. at most $\small n/2$ elements in each of the two segments)

Sequential

$\small T_\infty(n) \leq T_\infty(n/2) + O(n)$

Solution: $\small T_\infty(n) \in O(n)$

Parallel

$\small T_\infty(n) \leq T_\infty(n/2) + O(\log n)$

Solution: $\small T_\infty(n) \in O(\log^2 n)$

Speedup

$\small O(n/\log n)$
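
A brief derivation, assuming the balanced pivot from above and sequential Quicksort time $\small O(n \log n)$:

$\small T_\infty(n) \leq T_\infty(n/2) + O(\log n) \leq \sum_{j=0}^{\log n} O(\log(n/2^j)) = O(\log^2 n)$

$\small \text{Speedup} = \frac{\text{Tseq}(n)}{T_\infty(n)} = \frac{O(n \log n)}{O(\log^2 n)} = O(n/\log n)$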

Blocking technique

Use when applicable (not always applicable).

Goal: reduce the number of communication / synchronization steps.

Blocking technique (alternative view)

  1. Use a work-optimal algorithm to shrink the problem

    Make blocks of size $\small \Theta(n/p)$ using $\small O(f(p,n))$ time steps per processor.

  1. Use a fast, possibly non-work-optimal algorithm on the shrunk problem

    Solve the $\small p$ subproblems in parallel.

  1. Unshrink, compute the final solution with a work-optimal algorithm

    Combine into the final solution using $\small O(g(p,n))$ time steps per processor.


Usually we can get from $\small O(n)$ to $\small O(n/\log n)$.

The resulting parallel algorithm is cost-optimal if $\small f(p,n) \in O(n/p)$ and $\small g(p,n) \in O(n/p)$.
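
For instance, assuming the fast middle step runs in $\small O(\log p)$ time (as in the prefix-sums instantiation below), the total cost is

$\small p \cdot T_{\text{par}}(p,n) = p \cdot \big( O(n/p) + O(\log p) + O(n/p) \big) = O(n + p \log p) \in O(n)$ for $\small p \in O(n/\log n)$.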

Blocking technique for prefix-sums algorithms (in depth)

  1. Each processor computes the prefix sums of its block locally, without synchronization.

    Each processor $\small i$ has a block of $\small n/p$ elements:

    $\small x[j]$ where $\small j = i \cdot n/p,~\dots,~(i+1)\cdot n/p - 1$

    $\small O(1)$ communication/synchronization steps

    $\small T = n/p$

    $\small W = n$

  1. Exscan(y, p) over the $\small p$ block sums

    takes $\small O(\log p)$ communication rounds / synchronization steps and $\small O(p)$ work.

    Does not need to be work-optimal.

    $\small T = O(n/p)$

    $\small W = O(n)$ with $\small p$ processors — i.e. $\small O(\log p)$ for $\small p \in O(n/\log n)$

  1. Processor $\small i$ adds the exclusive prefix sum $\small y[i]$ to all $\small x[j]$ in its block

    Completing the computation by local postprocessing of the blocks (all three phases are sketched below).

    $\small T = n/p$

    $\small W = n$
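
A sequential C sketch of the three phases. Each block is handled here by an ordinary loop; in the parallel version each block belongs to one processor and only the Exscan over y needs communication. For brevity the sketch assumes p divides n.

// Sequential sketch of the blocked (inclusive) prefix-sums algorithm.
#include <stdlib.h>

void blocked_prefix_sums(double *x, int n, int p) {
    double *y = malloc(p * sizeof *y);
    int b = n / p;                              // block size n/p

    // Phase 1: local prefix sums within each block, y[i] = block sum
    for (int i = 0; i < p; i++) {
        for (int j = i*b + 1; j < (i+1)*b; j++) x[j] += x[j-1];
        y[i] = x[(i+1)*b - 1];
    }

    // Phase 2: exclusive prefix sum over the p block sums (the only step
    // that needs communication/synchronization in the parallel version)
    double sum = 0;
    for (int i = 0; i < p; i++) { double t = y[i]; y[i] = sum; sum += t; }

    // Phase 3: each block adds its exclusive prefix y[i] to all its elements
    for (int i = 0; i < p; i++)
        for (int j = i*b; j < (i+1)*b; j++) x[j] += y[i];

    free(y);
}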

Invariant

Total work ($\small +$ operations) is $\small 2n + p \log p$.

This is at least twice that of $\small \text{Tseq}(n)$, since sequential prefix sums need only $\small n-1$ additions.

Invariant

Possibly better performance if the reduction is done first, with the prefix sums following afterwards.

  1. Prefix first: $\small 2n$ reads, $\small 2n$ writes per block
  1. Reduction first: $\small 2n$ reads, $\small n$ writes per block

But both need $\small \geq 2n-1$ operations with $\small +$.
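
For comparison, a sketch of the reduction-first variant under the same assumptions: phase 1 only computes the block sums (so it writes $\small p$ values instead of $\small n$), and the local prefix sums are computed together with the offset in phase 3.

// Sequential sketch of the reduction-first variant: phase 1 does a local
// reduction per block, phase 3 computes the local prefix sums starting
// from the block offset. Assumes p divides n.
#include <stdlib.h>

void blocked_prefix_sums_redfirst(double *x, int n, int p) {
    double *y = malloc(p * sizeof *y);
    int b = n / p;

    // Phase 1: local reduction (block sums only)
    for (int i = 0; i < p; i++) {
        double s = 0;
        for (int j = i*b; j < (i+1)*b; j++) s += x[j];
        y[i] = s;
    }

    // Phase 2: exclusive prefix sum over the p block sums
    double sum = 0;
    for (int i = 0; i < p; i++) { double t = y[i]; y[i] = sum; sum += t; }

    // Phase 3: local prefix sums, starting from the block offset y[i]
    for (int i = 0; i < p; i++) {
        double s = y[i];
        for (int j = i*b; j < (i+1)*b; j++) { s += x[j]; x[j] = s; }
    }

    free(y);
}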