Quicksort
Quicksort bottleneck: Partitioning
Idea:
- All processes get the pivot element and partition locally (see the sketch after this list).
- Exclusive scan to compute the indices of elements smaller than, equal to, and larger than the pivot
- Redistribution to a new distributed array
- Split the set of processes into those with small and those with large elements, recurse
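As a concrete illustration of the first two idea bullets, here is a minimal C sketch of the local three-way partition; the names local_partition, block, n_local and pivot are illustrative assumptions, not taken from the original.

```c
#include <stdlib.h>

/* Reorder the local block into the groups  smaller | equal | larger
 * relative to the pivot and return the first two group sizes.
 * (Illustrative sketch; not tuned for performance.) */
static void local_partition(int *block, int n_local, int pivot,
                            int *n_smaller, int *n_equal)
{
    int *tmp = malloc((size_t)n_local * sizeof(int));
    int s = 0, e = 0, l = 0;

    for (int i = 0; i < n_local; i++)          /* smaller group first */
        if (block[i] < pivot) tmp[s++] = block[i];
    *n_smaller = s;
    for (int i = 0; i < n_local; i++)          /* then the equal group */
        if (block[i] == pivot) tmp[s + e++] = block[i];
    *n_equal = e;
    for (int i = 0; i < n_local; i++)          /* finally the larger group */
        if (block[i] > pivot) tmp[s + e + l++] = block[i];

    for (int i = 0; i < n_local; i++) block[i] = tmp[i];
    free(tmp);
}
```

An exclusive scan (e.g. MPI_Exscan) over the resulting group sizes then gives every process the global start indices of its groups, which is what the redistribution step needs.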
Initial data distribution
Each process gets a block of size n/p.
Intermediate result
Each process has sorted its block, but its size is in general no longer n/p.
Result
Each process has sorted its block of size n/p.
Algorithm 1) Using the hypercube algorithm
- Choose pivot, distribute to all processes
Process 0 chooses the pivot, which is then distributed to all processes with Bcast (alternatively: each process chooses a pivot candidate and the median is computed after an Allgather).
- Local partition
Each process finds the elements smaller than, equal to, and larger than the pivot element.
- Global partition: pairwise (hypercube) exchange
- Even numbered process i sends its "larger" group to process i+1 and receives the "smaller" group from it.
- Odd numbered process i receives the "larger" group from i-1 and sends back its own "smaller" group.
- Afterwards the even numbered processes hold the "smaller" elements and the odd numbered processes the "larger" elements.
- Split communicator, recurse
Then MPI_Comm_split with color rank % 2 is used:
- Even numbered processes work on the elements smaller than the pivot.
- Odd numbered processes work on the elements larger than the pivot.
After log₂ p steps each process is in a communicator by itself (a minimal MPI sketch of the whole algorithm follows below).
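A minimal MPI sketch of these four steps in C, assuming a power-of-two number of processes; the function name hypercube_qsort, the first-element pivot choice and the buffer handling are my assumptions for brevity, not prescribed by the slides.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparison for ints */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Recursive hypercube quicksort on communicator `comm`.
 * `*block` is this process's heap-allocated block of `*n_local` ints;
 * both may be replaced by the call. Assumes a power-of-two number of
 * processes. (Illustrative sketch.) */
static void hypercube_qsort(int **block, int *n_local, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (size == 1) {                              /* alone: sort locally */
        qsort(*block, (size_t)*n_local, sizeof(int), cmp_int);
        return;
    }

    /* 1) process 0 chooses the pivot and broadcasts it
          (simplistic choice: its first local element) */
    int pivot = (*n_local > 0) ? (*block)[0] : 0;
    MPI_Bcast(&pivot, 1, MPI_INT, 0, comm);

    /* 2) local partition: elements <= pivot to the front */
    int n_small = 0;
    for (int i = 0; i < *n_local; i++)
        if ((*block)[i] <= pivot) {
            int t = (*block)[i];
            (*block)[i] = (*block)[n_small];
            (*block)[n_small++] = t;
        }
    int n_large = *n_local - n_small;

    /* 3) pairwise exchange: even rank i pairs with odd rank i+1;
          even keeps "smaller", odd keeps "larger" */
    int even      = (rank % 2 == 0);
    int partner   = even ? rank + 1 : rank - 1;
    int *send_buf = even ? *block + n_small : *block;
    int send_cnt  = even ? n_large : n_small;
    int keep_cnt  = *n_local - send_cnt;
    int *kept     = even ? *block : *block + n_small;

    int recv_cnt = 0;                             /* exchange sizes first */
    MPI_Sendrecv(&send_cnt, 1, MPI_INT, partner, 0,
                 &recv_cnt, 1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);

    int *new_block = malloc(((size_t)keep_cnt + recv_cnt) * sizeof(int));
    if (keep_cnt > 0) memcpy(new_block, kept, (size_t)keep_cnt * sizeof(int));

    MPI_Sendrecv(send_buf, send_cnt, MPI_INT, partner, 1,
                 new_block + keep_cnt, recv_cnt, MPI_INT, partner, 1,
                 comm, MPI_STATUS_IGNORE);

    free(*block);
    *block   = new_block;
    *n_local = keep_cnt + recv_cnt;

    /* 4) split communicator by rank parity and recurse */
    MPI_Comm sub;
    MPI_Comm_split(comm, rank % 2, rank, &sub);
    hypercube_qsort(block, n_local, sub);
    MPI_Comm_free(&sub);
}
```

A call such as hypercube_qsort(&block, &n_local, MPI_COMM_WORLD) leaves every process with a sorted block; the block sizes depend entirely on the pivot quality, which is exactly the load-balance drawback listed next.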
Drawbacks:
- Only useful if the number of processes is a power of two
- The load balance might be arbitrarily bad: with a bad pivot, one process could end up doing all the work
Analysis: assuming the exact median is found as pivot on every level, each process keeps a block of size n/p, and the parallel time is T_par = O((n/p) log n + log² p) (pivot broadcasts, partitioning and exchange over log₂ p levels, plus the final local sort; see the sketch below).
Speedup: Θ(p), e.g. for n/p ∈ Ω(log p).
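A worked version of this estimate, as a sketch under the exact-median assumption (the cost breakdown below is the standard one for this scheme, not copied from the original):

```latex
% log2(p) recursion levels, each with a pivot broadcast O(log p),
% local partitioning O(n/p) and a pairwise exchange O(n/p);
% afterwards a local sort of the remaining n/p elements.
\[
  T_{\mathrm{par}}(n,p)
    = \mathcal{O}\!\left(\log^2 p + \frac{n}{p}\log p + \frac{n}{p}\log\frac{n}{p}\right)
    = \mathcal{O}\!\left(\frac{n}{p}\log n + \log^2 p\right)
\]
% Speedup relative to sequential quicksort, T_seq(n) = O(n log n):
\[
  S(n,p) = \frac{T_{\mathrm{seq}}(n)}{T_{\mathrm{par}}(n,p)}
         = \Theta(p)
    \quad\text{if } \frac{n}{p} \in \Omega(\log p).
\]
```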
Algorithm 2) Using compaction
- Choose pivot, distribute to all processes
- Local partition
- Global partition: compaction instead of the pairwise (hypercube) exchange (see the sketch after this list)
- Process i computes the total size of the "smaller" groups on all lower-ranked processes with MPI_Exscan; the result is the global start index of its own "smaller" elements.
- The elements of the "smaller" group get sent to process ⌊start index / (n/p)⌋ and possibly also to the following process(es). But before that the sending processes have to inform their receivers via Alltoall how many elements are getting sent.
- Redistribution through Alltoallv or send and receive; repeat for the "larger" group.
- Split communicator, recurse
Processes with rank < middle now hold the "smaller" elements, the remaining processes the "larger" ones. Then MPI_Comm_split with color rank < middle ? 0 : 1 is used, where middle is the first rank that holds "larger" elements. After O(log p) steps each process is in a communicator by itself: sort locally.
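A hedged C sketch of the compaction step for the "smaller" group; the function name compact_smaller, the parameter B for the block size n/p and the buffer handling are assumptions of mine, not from the original.

```c
#include <mpi.h>
#include <stdlib.h>

/* Compaction of the "smaller" group via prefix sums (sketch).
 * `small` holds this process's n_small elements below the pivot,
 * B is the block size n/p (assumed to divide n evenly).
 * The received elements are returned via recv_buf / n_recv. */
static void compact_smaller(int *small, int n_small, int B, MPI_Comm comm,
                            int **recv_buf, int *n_recv)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* global start index of this process's smaller elements:
       exclusive prefix sum over the group sizes */
    int offset = 0;
    MPI_Exscan(&n_small, &offset, 1, MPI_INT, MPI_SUM, comm);
    if (rank == 0) offset = 0;     /* Exscan leaves rank 0 undefined */

    /* element with global index g goes to process g / B */
    int *sendcounts = calloc((size_t)p, sizeof(int));
    for (int j = 0; j < n_small; j++)
        sendcounts[(offset + j) / B]++;

    /* tell every process how much it will receive from us */
    int *recvcounts = calloc((size_t)p, sizeof(int));
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

    /* displacements for the Alltoallv */
    int *sdispl = calloc((size_t)p, sizeof(int));
    int *rdispl = calloc((size_t)p, sizeof(int));
    for (int i = 1; i < p; i++) {
        sdispl[i] = sdispl[i - 1] + sendcounts[i - 1];
        rdispl[i] = rdispl[i - 1] + recvcounts[i - 1];
    }
    *n_recv   = rdispl[p - 1] + recvcounts[p - 1];
    *recv_buf = malloc((size_t)(*n_recv) * sizeof(int));

    MPI_Alltoallv(small, sendcounts, sdispl, MPI_INT,
                  *recv_buf, recvcounts, rdispl, MPI_INT, comm);

    free(sendcounts); free(recvcounts); free(sdispl); free(rdispl);
}
```

The "larger" group is redistributed analogously towards the upper end of the rank range; afterwards MPI_Comm_split(comm, rank < middle ? 0 : 1, rank, &sub), with middle being the first rank that received "larger" elements, splits the processes for the recursion.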
Analysis: same as above but better load balance