Quicksort
Quicksort bottleneck: Partitioning
Idea:
- All processes get pivot element and partition locally.
- Exclusive scan to compute indices of elements smaller, equal, larger than pivot
- Redistribution to new distributed array
- Split set of processes into those with small and large elements, recurse
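A minimal sketch of this recursion in MPI C, assuming double-valued elements; choose_pivot, partition_local, redistribute and split_color are hypothetical helper names standing in for the steps detailed in the two algorithms below.

```c
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helpers standing in for the steps described below. */
double choose_pivot(double *block, int n, MPI_Comm comm);                          /* pivot selection + Bcast     */
void   partition_local(double *block, int n, double pivot, int *n_sm, int *n_lg);  /* local partition             */
void   redistribute(double **block, int *n, int n_sm, int n_lg, MPI_Comm comm);    /* move groups to "their" half */
int    split_color(int rank, int size);  /* rank % 2 (Algorithm 1) or rank < middle (Algorithm 2) */
int    cmp_double(const void *x, const void *y);

/* Recursive driver: partition globally, split the communicator, recurse. */
void par_quicksort(double **block, int *n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (size == 1) {                                   /* alone in the communicator: sort locally */
        qsort(*block, *n, sizeof(double), cmp_double);
        return;
    }

    double pivot = choose_pivot(*block, *n, comm);     /* same pivot on all processes  */
    int n_sm, n_lg;
    partition_local(*block, *n, pivot, &n_sm, &n_lg);  /* split the block at the pivot */
    redistribute(block, n, n_sm, n_lg, comm);          /* exclusive scan + exchange    */

    MPI_Comm subcomm;                                  /* processes with small elements vs. large ones */
    MPI_Comm_split(comm, split_color(rank, size), rank, &subcomm);
    par_quicksort(block, n, subcomm);
    MPI_Comm_free(&subcomm);
}
```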
Initial data distribution: each process gets a block of size n/p.
Intermediate result: each process has sorted its block, but its size is in general not n/p.
Result: each process has sorted its block of size n/p.
Algorithm 1) Using hypercube algorithm
- Choose pivot, distribute to all processes: process 0 chooses the pivot, then it is distributed to all via Bcast (alternatively: each process chooses a pivot candidate and the median is computed after an Allgather).
- Local partition: each process finds the elements smaller than, equal to, and larger than the pivot element.
- Global partition: pairwise (hypercube) exchange (see the code sketch after this list)
  - Even numbered process i sends its "larger" group to process i+1 and receives the "smaller" group from it.
  - Odd numbered process i receives the "larger" group from process i-1 and sends back its own "smaller" group.
- Split communicator, recurse: MPI_Comm_split with color rank % 2 is used:
  - Even numbered processes work on the elements smaller than the pivot.
  - Odd numbered processes work on the elements larger than the pivot.
  After log2 p steps each process is in a communicator by itself.
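The four steps, sketched in MPI C under the assumption of double-valued elements. The helper names (bcast_pivot, partition3, pairwise_exchange, split_even_odd) and the placeholder pivot choice are illustrative, not fixed by the slides; reassembling the new local block from the kept group, the "equal" group and the received elements is left to the caller.

```c
#include <mpi.h>
#include <stdlib.h>

static int cmp_double(const void *x, const void *y)
{
    double a = *(const double *)x, b = *(const double *)y;
    return (a > b) - (a < b);
}

/* Step 1: process 0 picks a pivot and broadcasts it; the alternative (every
   process contributes a candidate, Allgather, take the median) is shown below. */
static double bcast_pivot(const double *block, int n, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    double pivot = 0.0;
    if (rank == 0)
        pivot = block[n / 2];                 /* placeholder choice on rank 0 */
    MPI_Bcast(&pivot, 1, MPI_DOUBLE, 0, comm);
    return pivot;
}

static double median_of_candidates_pivot(const double *block, int n, MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);
    double cand = block[n / 2];
    double *all = malloc(size * sizeof(double));
    MPI_Allgather(&cand, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, comm);
    qsort(all, size, sizeof(double), cmp_double);
    double pivot = all[size / 2];
    free(all);
    return pivot;
}

/* Step 2: three-way local partition into "smaller", "equal", "larger" buffers
   (not in place, for clarity; the caller provides three buffers of length n). */
static void partition3(const double *block, int n, double pivot,
                       double *sm, int *n_sm, double *eq, int *n_eq,
                       double *lg, int *n_lg)
{
    *n_sm = *n_eq = *n_lg = 0;
    for (int k = 0; k < n; k++) {
        if      (block[k] < pivot) sm[(*n_sm)++] = block[k];
        else if (block[k] > pivot) lg[(*n_lg)++] = block[k];
        else                       eq[(*n_eq)++] = block[k];
    }
}

/* Step 3: pairwise exchange between ranks 2k and 2k+1. The even rank ships its
   "larger" group and receives the partner's "smaller" group; counts are
   exchanged first so the receive buffer can be sized. */
static double *pairwise_exchange(int rank, MPI_Comm comm,
                                 double *sm, int n_sm, double *lg, int n_lg,
                                 int *recv_n)
{
    int even    = (rank % 2 == 0);
    int partner = even ? rank + 1 : rank - 1;
    double *send_buf = even ? lg : sm;        /* even sends "larger", odd sends "smaller" */
    int     send_n   = even ? n_lg : n_sm;

    MPI_Sendrecv(&send_n, 1, MPI_INT, partner, 0,
                 recv_n,  1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
    double *recv_buf = malloc((*recv_n > 0 ? *recv_n : 1) * sizeof(double));
    MPI_Sendrecv(send_buf, send_n,  MPI_DOUBLE, partner, 1,
                 recv_buf, *recv_n, MPI_DOUBLE, partner, 1, comm, MPI_STATUS_IGNORE);
    return recv_buf;                          /* new local block = kept group + recv_buf */
}

/* Step 4: split by color rank % 2 and recurse on the half we belong to;
   using the old rank as key keeps the relative process order. */
static MPI_Comm split_even_odd(int rank, MPI_Comm comm)
{
    MPI_Comm subcomm;
    MPI_Comm_split(comm, rank % 2, rank, &subcomm);
    return subcomm;
}
```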
Drawbacks:
- Only useful if the number of processes is a power of two.
- Load balance might be arbitrarily bad: with an unlucky pivot, one process could end up doing all the work.
Analysis: assuming the exact median pivot is found in every step.
Speedup: Θ(p) for sufficiently large n (a rough cost sketch follows).
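A rough cost sketch under this assumption (exact median pivots, so the recursion depth is log2 p and blocks stay balanced at n/p); the communication terms are simplified and constants are ignored.

```latex
% log2(p) partition rounds (pivot broadcast + local partition + exchange),
% followed by one local sort of the final block of size n/p:
T_{\mathrm{par}}(n,p) \;\approx\;
   \log_2 p \cdot \Bigl( \mathcal{O}\bigl(\tfrac{n}{p}\bigr) + \mathcal{O}(\log p) \Bigr)
   \;+\; \mathcal{O}\Bigl(\tfrac{n}{p}\,\log\tfrac{n}{p}\Bigr)
   \;=\; \mathcal{O}\Bigl(\tfrac{n}{p}\,\log n + \log^2 p\Bigr)

% compared with T_{\mathrm{seq}}(n) = \mathcal{O}(n \log n):
S(p) \;=\; \frac{T_{\mathrm{seq}}(n)}{T_{\mathrm{par}}(n,p)} \;=\; \Theta(p)
   \quad \text{for } n \in \Omega(p \log p).
```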
Algorithm 2) Using compaction
- Choose pivot, distribute to all processes
- Local partition
- Global partition: compaction (instead of the pairwise hypercube exchange); see the code sketch after this list.
  - Process i computes the total size of the "smaller" groups on all preceding processes with MPI_Exscan; the result is the global start offset of its own "smaller" group.
  - Elements in the "smaller" group get sent to the process whose block contains that offset, and possibly also to the following process (if the group straddles a block boundary). But before that, the sending processes have to inform their receivers via Alltoall how many elements are getting sent.
  - Redistribution through Alltoallv or send and receive.
  - Repeat for the "larger" group.
- Split communicator, recurse: MPI_Comm_split with color rank < middle ? 0 : 1 is used.
  After about log2 p steps each process is in a communicator by itself: sort locally.
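A sketch of the compaction-based redistribution for the "smaller" group and of the subsequent split, again in MPI C with double elements. B (assumed to be the block size n/p), middle, and the helper names are illustrative assumptions; the "larger" group is handled the same way, shifted by the global number of smaller elements.

```c
#include <mpi.h>
#include <stdlib.h>

/* Pack the "smaller" groups, block by block of size B, at the front of the
   conceptual global array: MPI_Exscan gives each process the global start
   offset of its group, MPI_Alltoall announces the per-receiver counts, and
   MPI_Alltoallv moves the elements. Returns the received elements. */
static double *compact_smaller(double *sm, int n_sm, int B, MPI_Comm comm, int *recv_total)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    long my_n = n_sm, offset = 0;
    MPI_Exscan(&my_n, &offset, 1, MPI_LONG, MPI_SUM, comm);
    if (rank == 0) offset = 0;               /* MPI_Exscan leaves rank 0's result undefined */

    /* My group occupies global positions [offset, offset + n_sm) and therefore
       goes to process offset / B, and possibly also to the following one(s). */
    int *send_counts = calloc(size, sizeof(int));
    int *send_displs = calloc(size, sizeof(int));
    long pos = offset;
    int  done = 0;
    while (done < n_sm) {
        int  dest = (int)(pos / B);
        long room = (long)(dest + 1) * B - pos;              /* space left in dest's block */
        int  cnt  = (n_sm - done < room) ? (n_sm - done) : (int)room;
        send_counts[dest] = cnt;
        send_displs[dest] = done;
        done += cnt;
        pos  += cnt;
    }

    /* Receivers first learn how much each sender contributes ... */
    int *recv_counts = calloc(size, sizeof(int));
    int *recv_displs = calloc(size, sizeof(int));
    MPI_Alltoall(send_counts, 1, MPI_INT, recv_counts, 1, MPI_INT, comm);
    *recv_total = 0;
    for (int r = 0; r < size; r++) { recv_displs[r] = *recv_total; *recv_total += recv_counts[r]; }

    /* ... then the elements are redistributed in one collective call. */
    double *recv_buf = malloc((*recv_total > 0 ? *recv_total : 1) * sizeof(double));
    MPI_Alltoallv(sm,       send_counts, send_displs, MPI_DOUBLE,
                  recv_buf, recv_counts, recv_displs, MPI_DOUBLE, comm);

    free(send_counts); free(send_displs); free(recv_counts); free(recv_displs);
    return recv_buf;
}

/* Split step: ranks below `middle` continue with the "smaller" part, the rest
   with the "larger" part; how `middle` is chosen (e.g. proportional to the
   global number of smaller elements) is not fixed here. */
static MPI_Comm split_at_middle(int rank, int middle, MPI_Comm comm)
{
    MPI_Comm subcomm;
    MPI_Comm_split(comm, rank < middle ? 0 : 1, rank, &subcomm);
    return subcomm;   /* recurse until MPI_Comm_size(subcomm) == 1, then sort locally */
}
```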
Analysis: same as above, but with better load balance.