Prefix-sums, reduction

Collective operation (pattern)

Reduction and prefix-sums are collective operations.

A set of processors cooperatively carrying out an operation on a data-set.

  • Other examples

    Broadcast: One processor has data that all the others receive

    Scatter: One processor has data of which each of the others receives its own block

    Gather: Blocks from all processors get collected by one processor

    Allgather: Blocks from all processors get collected by all processors (also called: broadcast-to-all)

    Alltoall: Each processor collects a block from every other processor
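
These patterns map one-to-one onto MPI's collectives; a minimal sketch, assuming int data and a hypothetical per-process block size COUNT (block holds COUNT elements, all and recv hold p*COUNT):

#include <mpi.h>

#define COUNT 4  // hypothetical block size per process

void collectives_demo(int *block, int *all, int *recv, int root, MPI_Comm comm) {
    MPI_Bcast(block, COUNT, MPI_INT, root, comm);                        // broadcast
    MPI_Scatter(all, COUNT, MPI_INT, block, COUNT, MPI_INT, root, comm); // scatter
    MPI_Gather(block, COUNT, MPI_INT, all, COUNT, MPI_INT, root, comm);  // gather
    MPI_Allgather(block, COUNT, MPI_INT, all, COUNT, MPI_INT, comm);     // allgather
    MPI_Alltoall(all, COUNT, MPI_INT, recv, COUNT, MPI_INT, comm);       // alltoall
}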

Definition: Reduction

Reduction problem

Given a sequence $x = (x_0, x_1, x_2, \dots, x_{n-1})$

elements can be: integers, real numbers, vectors, ...

and an associative operation $+$

with the algebraic property $x + (y + z) = (x + y) + z$,

compute

$$y = \sum_{0 \leq i < n} x_i = x_0 + x_1 + x_2 + \dots + x_{n-1}$$


  • Examples: Reduction as collective operation
    • Reduction-to-one

All processors participate. The result is stored in the "root" processor.

    • Reduction-to-all

      All processors participate. Result available to all.

    • Reduction-with-scatter

Operates on vectors; the result is stored in blocks distributed over the processors according to some rule.
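
In MPI terms these three variants correspond to MPI_Reduce, MPI_Allreduce and MPI_Reduce_scatter_block; a minimal sketch with summation as the operation (buffer sizes in the comments are assumptions):

#include <mpi.h>

#define COUNT 4  // hypothetical block size

void reduction_variants(int *x, int *y, int root, MPI_Comm comm) {
    // reduction-to-one: x, y hold COUNT elements; result only valid on root
    MPI_Reduce(x, y, COUNT, MPI_INT, MPI_SUM, root, comm);
    // reduction-to-all: result valid on every process
    MPI_Allreduce(x, y, COUNT, MPI_INT, MPI_SUM, comm);
    // reduction-with-scatter: x holds p*COUNT elements,
    // each process receives one COUNT-sized block of the result
    MPI_Reduce_scatter_block(x, y, COUNT, MPI_INT, MPI_SUM, comm);
}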

Definition: Prefix-sums

Prefix sums problem

Given a sequence $x$, compute all inclusive/exclusive prefix sums $(y_0, y_1, \dots, y_{n-1})$.

The formulas below apply for $i > 0$; for $i = 0$ the exclusive prefix sum needs a special definition (the empty sum, i.e., the neutral element of $+$).

exscan(i,n)

Exclusive prefix sum, computed by process $i$

(cannot be derived from the inclusive sum unless $+$ has an inverse operation).

$$y_i = \sum_{0 \leq j < i} x_j = x_0 + x_1 + x_2 + \dots + x_{i-1}$$

in $O(n/p + \log n)$

scan(i,n)

Inclusive prefix sum (including $x_i$), computed by process $i$

(same as the exclusive prefix sum plus $x_i$)

$$y_i = \sum_{0 \leq j \leq i} x_j = x_0 + x_1 + x_2 + \dots + x_i$$
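
For example, with ordinary addition on $x = (1, 2, 3, 4)$:

$$\mathrm{scan}: y = (1, 3, 6, 10), \qquad \mathrm{exscan}: y = (0, 1, 3, 6)$$

where the leading $0$ is the neutral element of $+$.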

Theorem: Speedup is at most $p/2$

For computing the prefix sums of an $n$-element input sequence, the following trade-off holds between:

$s$: size, the number of $+$ operations

$t$: depth, $T_\infty$

$$s + t \geq 2n - 2$$
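
One way to see how this trade-off caps the speedup (a sketch, using $T_p \geq t$, $T_p \geq s/p$ and $\text{Tseq}(n) = n-1$):

$$T_p \geq \frac{s}{p} \geq \frac{2n-2-t}{p} \geq \frac{2n-2-T_p}{p} \;\Rightarrow\; (p+1)\,T_p \geq 2(n-1) \;\Rightarrow\; S_p = \frac{n-1}{T_p} \leq \frac{p+1}{2} \approx \frac{p}{2}$$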

Sequential solution

Solution to both reduction and prefix sums problem.

This is the best possible, since the output $y_{n-1}$ depends on every $x_i$.

Work $\text{Wseq}(n) \in O(n)$ total operations

Fastest possible parallel $T_\infty(n) \in O(n)$

Applications of $+$: $n-1$

Time complexity $\text{Tseq}(n) \in \Theta(n)$


  • implementation (not the best)
    for (i=1; i<n; i++) {
        x[i] = x[i-1] + x[i];
    }

    $2(n-1)$ memory reads

    $\text{Tseq}(n) = n-1 = O(n)$ summations

  • implementation - improved

    1 read, 1 write per iteration

    the compiler's optimizer can improve performance significantly

    register int sum = x[0];  // compiler may keep sum in a register for faster access
    for (i=1; i<n; i++) {
        sum += x[i];  // one read
        x[i] = sum;   // one write
    }
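
  • implementation - exclusive prefix sums

    The same pattern gives the exclusive prefix sums; a minimal sketch, assuming 0 is the neutral element of $+$:

    int sum = 0;               // empty prefix for i = 0
    for (i=0; i<n; i++) {
        int next = sum + x[i]; // inclusive sum up to i
        x[i] = sum;            // store the exclusive sum instead
        sum = next;
    }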

Parallel solution (intuitively)

Solution to both reduction and prefix sums problem.

What we ideally want:

Work $\text{Wpar}(n) \in O(\text{Wseq}(n)) = O(n)$

Fastest possible parallel $T_\infty(n) \in O(\log n)$

Applications of $+$: close to $n-1$

Time complexity $\text{Tpar}(n) \in O(n/p + T_\infty(n))$ for a large range of $p$

On most architectures a depth of $\Omega(\log n)$ is reasonable for a work-optimal solution.

We want to make use of the associativity of the operator:

There are 3 parallel solutions to inclusive prefix sums:

  1. recursive: fast, work-optimal
  2. iterative: fast, work-optimal
  3. doubling: faster, not work-optimal but still useful

1) Recursive

Recursive solution

Sum pairwise, then recurse on smaller problems.

Scan(x,n) {
    if (n==1) {
        return;
    }
    // sum pairwise into the auxiliary array y of size n/2
    for (i=0; i<n/2; i++) {
        y[i] = x[2*i] + x[2*i+1];
    }
    // solve recursively in y -> implicit (with fork-join) or explicit barrier
    Scan(y,n/2);
    // take back: combine the recursive prefix sums with the even-indexed elements
    x[1] = y[0];
    for (i=1; i<n/2; i++) {
        x[2*i] = y[i-1] + x[2*i];
        x[2*i+1] = y[i];
    }
    if (odd(n)) {
        x[n-1] = y[n/2-1] + x[n-1];
    }
}
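
A runnable C version of the same sketch, with the auxiliary array allocated explicitly and the two data-parallel loops expressed with OpenMP (an assumption; the pseudocode above leaves the parallelization model open):

#include <stdlib.h>

// Recursive inclusive prefix sums over int with + (a sketch).
void Scan(int *x, int n) {
    if (n == 1) return;
    int *y = malloc((n/2) * sizeof(int)); // auxiliary array
    #pragma omp parallel for              // pairwise sums: one parallel step
    for (int i = 0; i < n/2; i++)
        y[i] = x[2*i] + x[2*i+1];
    Scan(y, n/2);                         // recurse on the half-sized problem
    x[1] = y[0];
    #pragma omp parallel for              // combine back: one parallel step
    for (int i = 1; i < n/2; i++) {
        x[2*i]   = y[i-1] + x[2*i];
        x[2*i+1] = y[i];
    }
    if (n % 2 == 1) x[n-1] = y[n/2-1] + x[n-1];
    free(y);
}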

Summary

Work $W(n) \in O(n)$

Time $\text{Tpar}(p,n) = W(n)/p + T_\infty(n) = O(n/p + \log n)$

Parallelism $\text{Tseq}(n)/T_\infty(n) = n/\log n$ processors

Parallel steps $T_\infty(n) = 2 \log n$ (recursive calls)

2 synchronizations per recursive call

$+$ operations: $2n \in O(n)$


Advantages

  • The smaller $y$ array might fit in cache
  • Pairwise summing has good spatial locality

Drawbacks

  • Space: an extra $n/2$-sized array in each recursive call ($n$ in total)
  • About $2n$ operations of $+$ (compared to sequential: $n-1$)
  • $2 \log_2(n)$ parallel steps

2) Iterative

Eliminates recursion and extra space.

Can only compute the sums for arrays with $n = 2^k$ elements,

in $\log_2(n)$ rounds $i = 0, 1, \dots, \log_2(n)-1$:

In round $i$, the $+$ operation is done for every $2^{i+1}$-th element.

This way:

$$x[2^k - 1] = \sum_{0 \leq i < 2^k} x_i$$

Lemma

Reduction can be carried out in $r = \log_2(n)$ synchronized rounds (if $n$ is a power of two).

The total number of $+$ operations is $n/2 + n/4 + n/8 + \dots < n$ (exactly $n-1$).

Geometric series: $\sum_{0 \leq i \leq n} a \cdot r^i = a \cdot \frac{1 - r^{n+1}}{1 - r}$
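
Plugging in: the per-round costs form a geometric series with first term $n/2$ and ratio $1/2$, so for $n = 2^r$

$$\sum_{1 \leq i \leq \log_2 n} \frac{n}{2^i} = n \left(1 - \frac{1}{n}\right) = n - 1$$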

Iterative solution

// up-phase: reduction over strides 1, 2, 4, ...
for (k=1; k<n; k=kk) {
    kk = k<<1; // double the stride
    par (i=kk-1; i<n; i+=kk) {
        x[i] = x[i-k] + x[i];
    }
    barrier;
}
// down-phase: halve the stride, fill in the remaining prefixes
for (k=k>>1; k>1; k=kk) {
    kk = k>>1; // halve the stride
    par (i=k-1; i<n-kk; i+=k) {
        x[i+kk] = x[i] + x[i+kk];
    }
    barrier;
}
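
A runnable C rendering of the same scheme, with par/barrier expressed as OpenMP loops (an assumption; the implicit barrier ending each parallel for provides the synchronization), assuming $n$ is a power of two:

// In-place inclusive prefix sums of x[0..n-1], n a power of two.
void scan_updown(int *x, int n) {
    int k, kk;
    for (k = 1; k < n; k = kk) {       // up-phase
        kk = k << 1;
        #pragma omp parallel for
        for (int i = kk - 1; i < n; i += kk)
            x[i] = x[i-k] + x[i];
    }
    for (k = k >> 1; k > 1; k = kk) {  // down-phase
        kk = k >> 1;
        #pragma omp parallel for
        for (int i = k - 1; i < n - kk; i += k)
            x[i+kk] = x[i] + x[i+kk];
    }
}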

Parallelization of inner loop

The inner for-loop can be parallelized (not the outer one).

In round $k$, the inner loop performs $n/2^{k+1}$ applications of $+$, all of which can be done concurrently.

Total work $\approx 2n \in O(\text{Tseq}(n))$

This is a factor of about $2$ more than the sequential $n-1$ work.

Invariants

  • Invariant for Up-Phase

    $x[i] = X[i-2^k+1] + \dots + X[i]$

    for $i = j \cdot 2^k - 1$ with $j = 1, 2, 3, \dots$

    and $k = 0, 1, \dots, \lfloor \log_2(n) \rfloor$

  • Invariant for Down-Phase

    $x[i] = X[0] + \dots + X[i]$

    for $i = j \cdot 2^k - 1$ with $j = 1, 2, 3, \dots$

    and $k = \lfloor \log_2(n) \rfloor, \lfloor \log_2(n) \rfloor - 1, \dots$

    (where $X$ denotes the original input values)

Distributed programming models

Synchronization can get very expensive for large $n$ relative to $p$ in distributed-memory models:

$2 \cdot \log_2(n)$ communication rounds, round $k$ with $n/2^k$ concurrent communication operations,

$2n$ operations in total.

Since often $n \gg p$, this is too expensive.
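
The usual remedy is to apply the parallel algorithm only to $p$ per-block sums: each process scans its own $n/p$ elements, and one exscan over the block totals supplies the offsets. A minimal MPI sketch, assuming int data, summation, and at least one element per process:

#include <mpi.h>

// Inclusive prefix sums of a block-distributed array.
void block_scan(int *x, int n_local, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    for (int i = 1; i < n_local; i++)  // local inclusive prefix sums: O(n/p)
        x[i] += x[i-1];
    int offset = 0;                    // sum of all preceding blocks
    MPI_Exscan(&x[n_local-1], &offset, 1, MPI_INT, MPI_SUM, comm); // O(log p) rounds
    if (rank > 0)                      // offset is undefined on rank 0
        for (int i = 0; i < n_local; i++)
            x[i] += offset;
}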

Summary

Work $\text{W}(n) \approx 2n \in O(\text{Tseq}(n))$

Time $2 \cdot \lfloor \log_2(n) \rfloor$ rounds, round $k$ taking $O(n/2^{k+1})$ operations; $O(n/p + \log n)$ with $p$ processors

Linear speedup $S_p(n) \in O(p/2)$ - effectively half the processors are lost

$+$ operations: about $2n \in O(n)$


Advantages

  • In-place (no extra space required: input and output are the same array)
  • Work-optimal, simple, parallelizable loops
  • No recursive overhead

Disadvantages

  • Less cache-friendly than the recursive solution: less spatial locality with increasing step size
  • $2 \cdot \lfloor \log_2(n) \rfloor$ rounds
  • About $2n$ operations with $+$

3) Doubling

Faster but not work-optimal solution.

In each round every processor computes; we are no longer using only every $k$-th processor as in the previous solution.

Invariance

$x[i] = x[i-k] + x[i]$

with $k = 2^{k'}$ in round $k'$,

and $\lceil \log_2 n \rceil$ rounds needed in total.

int *y = (int*)malloc(n*sizeof(int));
int *t;
for (k=1; k<n; k=k*2) {
    par (i=0; i<k; i++) y[i] = x[i];        // first k elements are already final
    par (i=k; i<n; i++) y[i] = x[i-k]+x[i]; // add the element k positions back
    barrier;
    t = x; x = y; y = t; // swap x, y
}

Invariant

Before each iteration with stride $k$:

$$\forall i : x[i] = \sum_{\max(0,\, i-k+1) \leq j \leq i} X[j]$$
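
A runnable C rendering of the doubling scheme (assumptions: OpenMP for the parallel loop; the buffer swap is resolved so the result ends up back in the caller's array):

#include <stdlib.h>
#include <string.h>

// Doubling scan: ceil(log2 n) rounds, O(n log n) additions, not work-optimal.
void scan_doubling(int *orig, int n) {
    int *x = orig;
    int *y = malloc(n * sizeof(int));
    for (int k = 1; k < n; k *= 2) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = (i < k) ? x[i] : x[i-k] + x[i]; // first k elements are final
        int *t = x; x = y; y = t;                  // swap buffers after the round
    }
    if (x != orig) {  // odd number of rounds: result sits in the temporary buffer
        memcpy(orig, x, n * sizeof(int));
        free(x);
    } else {
        free(y);
    }
}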

Summary

Advantages

  • Only $\lceil \log_2 p \rceil$ rounds (and synchronization / communication steps)
  • Simple, parallelizable loops
  • No recursive overhead

Drawbacks

  • Not work-optimal
  • Less spatial locality, less cache-friendly with increasing step size $k$
  • Extra array of size $n$ needed to eliminate dependencies