Applications of prefix-sums and reduction
Applications of reduction
Cutoff computation
done = allreduce(localdone, AND);

Here localdone is the result of the local convergence check (all checks are local), AND is the associative operation, and the result is distributed to all processes (reduction-to-all).
// Parallelizable part
do {
  for (i=0; i<n; i++) {
    x[i] = f(i);
  }
  localdone = ...;                  // local convergence check
  done = allreduce(localdone, AND);
} while (!done);
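As a concrete illustration, here is a minimal MPI sketch of this pattern in C. The names update_block and locally_converged are hypothetical placeholders for the application-specific computation and check; MPI_Allreduce with MPI_LAND performs the reduction-to-all of the local flags.

#include <mpi.h>

// Hypothetical application hooks (placeholders, not part of the notes):
void update_block(double *x, int nlocal);            // the parallelizable part
int  locally_converged(const double *x, int nlocal); // local convergence check

void iterate_until_converged(double *x, int nlocal, MPI_Comm comm) {
  int done = 0;
  do {
    update_block(x, nlocal);
    int localdone = locally_converged(x, nlocal);
    // Reduction-to-all: logical AND over all ranks, result on every rank
    MPI_Allreduce(&localdone, &done, 1, MPI_INT, MPI_LAND, comm);
  } while (!done);
}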
Applications of prefix-sums
Array compaction: Load balancing
Sequential solution
Input arrays: a, active
for (i=0; i<n; i++) {
if (active[i]) {
a[i] = f(b[i]+c[i]);
}
}
Work: O(n) for the loop, plus |active| evaluations of f, where |active| is the number of indices set to true.
Solution: Without array compaction
There are no dependencies between loop iterations → Data parallel computation.
par (0 <= j < p) {
// i: j*(n/p) ---> (j+1)*(n/p)-1
for (i=j*(n/p); i<(j+1)*(n/p); i++) {
if (active[i]) {
a[i] = f(b[i]+c[i]);
}
}
}
Work: O(n)
Time: O(n/p) per processor, but a single block may contain anywhere from 0 to n/p evaluations of f.
Static assignment of work to processors → load imbalance: processors whose blocks consist mostly of indices with active[i] == false do little work.
Solution: With array compaction
Lower the work to O(|active|) evaluations of f by iterating only over the active indices. The par loops below are again executed in blocks of size n/p.
par (i=0; i<n; i++) {
  index[i] = active[i] ? 1 : 0;
}
Exscan(index, n); // exclusive prefix computation
m = index[n-1] + (active[n-1] ? 1 : 0); // m = |active| -> used below
par (i=0; i<n; i++) {
  if (active[i]) {
    compact[index[i]] = i;
  }
}
par (j=0; j<m; j++) {
i = compact[j];
a[i] = f(b[i]+c[i]);
}
Example
indices:  0     1     2      3      4       (n = 5)
active = [true, true, false, false, true]   (m = 3)
index  = [1,    1,    0,     0,     1]
              ↓ Exscan
index  = [0,    1,    2,     2,     2]

Every active index i is now copied to position index[i] of the compact array: compact = [0, 1, 4].
Work: O(n + m)
Time: O(n/p + log p), where m = |active| ≤ n
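For reference, a compilable sequential C version of the compaction steps above — a sketch only: the par loops become ordinary loops, Exscan is a running sum, and the function name compact_indices is chosen here, not taken from the notes.

#include <stdbool.h>

// Returns m = |active|; fills compact[0..m-1] with the active indices.
int compact_indices(const bool active[], int index[], int compact[], int n) {
  if (n == 0) return 0;
  for (int i = 0; i < n; i++)            // par loop: 0/1 flags
    index[i] = active[i] ? 1 : 0;
  int sum = 0;                           // Exscan, done sequentially here
  for (int i = 0; i < n; i++) {
    int t = index[i];
    index[i] = sum;
    sum += t;
  }
  int m = index[n-1] + (active[n-1] ? 1 : 0); // m = |active|
  for (int i = 0; i < n; i++)            // par loop: scatter active indices
    if (active[i])
      compact[index[i]] = i;
  return m;
}

The final data-parallel loop then runs over j = 0..m-1 and applies f only at the compacted indices.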
Array compaction: Partitioning for QuickSort
Quicksort algorithm
Quicksort(a, n):
- Select a pivot a[k]
- Partition a into three segments:
  elements < pivot | elements = pivot | elements > pivot
- In parallel: recurse on the first and third segments.
Parallel partitioning algorithm
par (i=0; i<n; i++) {
  index[i] = (a[i]<a[k]) ? 1 : 0;
}
Exscan(index, n); // exclusive prefix computation
n1 = index[n-1] + ((a[n-1]<a[k]) ? 1 : 0); // number of smaller elements
par (i=0; i<n; i++) {
  if (a[i]<a[k]) {
    aa[index[i]] = a[i]; // store each element smaller than a[k]
  }
}
// same for the elements equal to and larger than the pivot
...
// copy back
par (i=0; i<n1; i++) {
  a[i] = aa[i];
}
// same for the elements equal to and larger than the pivot
...
Example: getting 3 partitions
find smaller
indices: 0  1  2  3  4    (n = 5)
a     = [1, 4, 5, 3, 2]   pivot a[k] = a[3] = 3
index = [0, 1, 1, 1, 1]   (after Exscan of the a[i] < a[k] flags)

if (a[i] < a[k])        // holds for i in {0, 4}
  aa[index[i]] = a[i];  // aa[0] = 1, aa[1] = 2

aa = [1, 2]
find equal
indices: 0  1  2  3  4    (n = 5)
a     = [1, 4, 5, 3, 2]   pivot a[k] = a[3] = 3
index = [0, 0, 0, 0, 1]   (after Exscan of the a[i] == a[k] flags)

if (a[i] == a[k])       // holds for i in {3}
  ab[index[i]] = a[i];  // ab[0] = 3

ab = [3]
find larger
indices: 0  1  2  3  4    (n = 5)
a     = [1, 4, 5, 3, 2]   pivot a[k] = a[3] = 3
index = [0, 0, 1, 2, 2]   (after Exscan of the a[i] > a[k] flags)

if (a[i] > a[k])        // holds for i in {1, 2}
  ac[index[i]] = a[i];  // ac[0] = 4, ac[1] = 5

ac = [4, 5]
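Putting the three passes together, here is a sequential C sketch of the whole three-way partitioning. The helper exscan_flags and the function partition3 are names chosen here for illustration; each for loop corresponds to a par loop above.

// Exclusive prefix sums over the 0/1 flags of a comparison with the pivot:
// cmp < 0 selects a[i] < pivot, cmp == 0 selects a[i] == pivot,
// cmp > 0 selects a[i] > pivot. Returns the number of selected elements.
static int exscan_flags(const int a[], int n, int pivot, int cmp, int index[]) {
  int sum = 0;
  for (int i = 0; i < n; i++) {
    int flag = (cmp < 0) ? (a[i] < pivot)
             : (cmp > 0) ? (a[i] > pivot)
                         : (a[i] == pivot);
    index[i] = sum;
    sum += flag;
  }
  return sum;
}

// Three-way partition of a[0..n-1] around pivot a[k];
// index[] and tmp[] are caller-provided arrays of length n.
void partition3(int a[], int n, int k, int index[], int tmp[]) {
  int pivot = a[k];
  int n1 = exscan_flags(a, n, pivot, -1, index); // smaller -> tmp[0..n1-1]
  for (int i = 0; i < n; i++)
    if (a[i] < pivot) tmp[index[i]] = a[i];
  int n2 = exscan_flags(a, n, pivot, 0, index);  // equal -> tmp[n1..n1+n2-1]
  for (int i = 0; i < n; i++)
    if (a[i] == pivot) tmp[n1 + index[i]] = a[i];
  exscan_flags(a, n, pivot, 1, index);           // larger -> tmp[n1+n2..n-1]
  for (int i = 0; i < n; i++)
    if (a[i] > pivot) tmp[n1 + n2 + index[i]] = a[i];
  for (int i = 0; i < n; i++)                    // copy back
    a[i] = tmp[i];
}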
Runtime analysis
Running time (if the choice of pivot is good, i.e., each of the two outer segments receives at most a constant fraction of the n elements):
Sequential solution: O(n log n)
Parallel solution: O(log^2 n) with n processors — O(log n) recursion levels, each dominated by an O(log n) Exscan
Speedup: O(n / log n)
Blocking technique
Use when applicable (the technique does not always apply).
Goal: reduce the number of communication / synchronization steps.
Blocking technique (general view)
- Divide the problem into p subproblems of size n/p that can be solved sequentially or in parallel.
- Assign one subproblem to each processor.
  Steps 1 and 2 should be fast, e.g., O(1).
- Solve the subproblems using a sequential algorithm.
  Should be perfectly parallelizable, i.e., O(n/p) time.
- Use a parallel algorithm to combine the subproblem results.
  Should be fast, e.g., O(log p); its total cost must be less than the O(n/p) local work.
- Apply the combined result to the subproblem solutions.
  Should be perfectly parallelizable, i.e., O(n/p) time.
Blocking technique (alternative view)
- Use a work-optimal algorithm to shrink the problem:
  make p blocks of size n/p, using O(n/p) time steps per processor.
- Use a fast, possibly non-work-optimal algorithm on the shrunken problem:
  solve the problem of size p in parallel with p processors.
- Unshrink: compute the final solution with a work-optimal algorithm,
  combining into the final solution using O(n/p) time steps per processor.

The fast algorithm thus runs on p instead of n elements; for prefix sums this means going from O(log n) to O(log p) time for the non-work-optimal part.
The resulting parallel algorithm is cost-optimal if the O(n/p) local work dominates, i.e., if n/p = Ω(log p), equivalently n = Ω(p log p).
Blocking technique for prefix-sums algorithms (in depth)
- Processors compute their block of the array locally, without synchronization.
  Each processor j holds a block of n/p elements and computes the block sum
  y[j] = x[j*(n/p)] + ... + x[(j+1)*(n/p)-1], where 0 <= j < p.
- Exscan(y, p) takes O(log p) communication rounds / synchronization steps and O(p log p) work.
  It does not need to be work-optimal: it runs on only p values with p processors, i.e., in O(log p) time.
- Processor j adds the exclusive prefix sum y[j] to all elements of its block,
  completing the computation by local postprocessing of the blocks.
Invariant
Total work (number of + operations) is about 2n.
This is at least twice that of the sequential algorithm, which uses n - 1 operations.
Possibly better performance if the reduction is done first, with the prefix sums following after:
- Prefix sums first: 2n/p reads, 2n/p writes per block
- Reduction first: 2n/p reads, n/p writes per block
But both variants take O(n/p) + operations per block, i.e., time O(n/p + log p) with p processors.
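As a concrete instance of the blocking technique, here is a C/OpenMP sketch of the blocked prefix sums in the reduction-first variant, assuming the caller provides y with one entry per thread. The Exscan over the p block sums is done sequentially here, where a truly parallel version would take O(log p) rounds.

#include <omp.h>

// Blocked inclusive prefix sums, reduction-first variant:
// step 1: local block sums, step 2: Exscan over the p block sums,
// step 3: local prefix sums with the block offset added.
// y[] is a caller-provided array with one entry per thread.
void blocked_prefix_sums(double x[], int n, double y[]) {
  #pragma omp parallel
  {
    int p  = omp_get_num_threads();
    int j  = omp_get_thread_num();
    int lo = (int)((long long)j * n / p);
    int hi = (int)((long long)(j + 1) * n / p);

    double sum = 0.0;                   // step 1: n/p reads, 1 write
    for (int i = lo; i < hi; i++) sum += x[i];
    y[j] = sum;
    #pragma omp barrier

    #pragma omp single                  // step 2: Exscan over the p block sums
    {                                   // (sequential here; O(log p) in parallel)
      double s = 0.0;
      for (int t = 0; t < p; t++) { double v = y[t]; y[t] = s; s += v; }
    }                                   // implicit barrier after single

    double off = y[j];                  // step 3: n/p reads, n/p writes
    for (int i = lo; i < hi; i++) { off += x[i]; x[i] = off; }
  }
}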