
Quicksort

Quicksort bottleneck: Partitioning

Idea:

  1. All processes get the pivot element and partition locally (see the sketch after this list).
  1. Exclusive scan to compute the indices of the elements smaller than, equal to, and larger than the pivot
  1. Redistribution to a new distributed array

    Split set of processes into those with small and large elements, recurse
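
A minimal sketch of the local partition step, assuming int keys; the function name partition_local and the caller-provided output buffers are illustrative, not taken from the source:

```c
/* Sketch: split this process's block into the elements smaller than,
 * equal to, and larger than the pivot (three-way partition). */
void partition_local(const int *block, int n, int pivot,
                     int *smaller, int *n_smaller,
                     int *equal,   int *n_equal,
                     int *larger,  int *n_larger)
{
    /* Each output array must provide room for up to n elements. */
    *n_smaller = *n_equal = *n_larger = 0;
    for (int i = 0; i < n; ++i) {
        if      (block[i] < pivot) smaller[(*n_smaller)++] = block[i];
        else if (block[i] > pivot) larger[(*n_larger)++]   = block[i];
        else                       equal[(*n_equal)++]     = block[i];
    }
}
```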

Initial data distribution

Each process gets a block of size $n/p$

Intermediate result

Each process has sorted its block, but its size is not $n/p$

Result

Each process has sorted its block of size $n/p$

Algorithm 1) Using hypercube algorithm

  1. Choose pivot, distribute to all processes

    Process 0 chooses the pivot, then it is distributed to all processes with MPI_Bcast.

    (alternatively: each process chooses a pivot candidate and the median is computed with MPI_Allgather)

  1. Local partition

    Each process finds the elements smaller than, equal to, and larger than the pivot element.

  1. Global partition: pairwise (hypercube) exchange
    • Even-numbered process i sends its “larger” group to process i+1 and receives the “smaller” group from it.
    • Odd-numbered process i receives the “larger” group from process i-1 and sends back its own “smaller” group.
  1. Split communicator, recurse

    Then MPI_Comm_split with color rank%2 is used:

    • Even-numbered processes work on the elements smaller than the pivot.
    • Odd-numbered processes work on the elements larger than the pivot.

    After $\log_2(p)$ steps each process is in a communicator by itself (one recursion level is sketched below).
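
A sketch of one recursion level of Algorithm 1 in C with MPI, assuming int keys, a power-of-two number of processes, and that elements equal to the pivot stay in the “smaller” group; quicksort_level and the memory handling are illustrative, not the definitive implementation:

```c
/* Sketch of one recursion level of the hypercube variant (Algorithm 1).
 * Assumes int keys, a power-of-two number of processes, and that elements
 * equal to the pivot stay in the "smaller" group. Names are illustrative. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void quicksort_level(int **data, int *n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (size == 1) return;                 /* alone in the communicator: done */

    /* Step 1: process 0 chooses a pivot and broadcasts it. */
    int pivot = (*n > 0) ? (*data)[*n / 2] : 0;
    MPI_Bcast(&pivot, 1, MPI_INT, 0, comm);

    /* Step 2: local partition into "small" (<= pivot) and "large" (> pivot). */
    int *small = malloc((*n + 1) * sizeof(int)), n_small = 0;
    int *large = malloc((*n + 1) * sizeof(int)), n_large = 0;
    for (int i = 0; i < *n; ++i) {
        if ((*data)[i] <= pivot) small[n_small++] = (*data)[i];
        else                     large[n_large++] = (*data)[i];
    }

    /* Step 3: pairwise exchange -- even rank i keeps "small" and sends "large"
     * to rank i+1; odd rank i keeps "large" and sends "small" to rank i-1. */
    int even    = (rank % 2 == 0);
    int partner = even ? rank + 1 : rank - 1;
    int *keep   = even ? small   : large;
    int n_keep  = even ? n_small : n_large;
    int *give   = even ? large   : small;
    int n_give  = even ? n_large : n_small;

    int n_recv = 0;
    MPI_Sendrecv(&n_give, 1, MPI_INT, partner, 0,
                 &n_recv, 1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
    int *recv = malloc((n_recv + 1) * sizeof(int));
    MPI_Sendrecv(give, n_give, MPI_INT, partner, 1,
                 recv, n_recv, MPI_INT, partner, 1, comm, MPI_STATUS_IGNORE);

    /* New local block = kept part + received part. */
    free(*data);
    *n    = n_keep + n_recv;
    *data = malloc((*n + 1) * sizeof(int));
    memcpy(*data,          keep, n_keep * sizeof(int));
    memcpy(*data + n_keep, recv, n_recv * sizeof(int));
    free(small); free(large); free(recv);

    /* Step 4: split the communicator (color rank % 2) and recurse. */
    MPI_Comm half;
    MPI_Comm_split(comm, rank % 2, rank, &half);
    quicksort_level(data, n, half);
    MPI_Comm_free(&half);
}
```

MPI_Sendrecv pairs the send with the matching receive, so the two neighbours cannot deadlock; after MPI_Comm_split the ranks inside each half are renumbered consecutively, so the same even/odd pairing applies on the next level.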

Drawbacks:

  • Only useful if the number of processes is a power of two
  • Load balance can be arbitrarily bad: with a poor pivot, one process could end up doing all the work

Analysis: assuming the exact median is chosen as pivot

$T(n,p) = O(\log^2(p)) + O(n/p \cdot \log(n))$

Speedup: $O(p)$ for $n \gg p$.
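
One way to see where the two terms come from (a sketch, assuming a perfect median pivot in every round): each of the $\log_2(p)$ levels costs $O(\log p)$ for the pivot broadcast plus $O(n/p)$ for the local partition and exchange, the concluding local sort of $n/p$ elements costs $O(n/p \cdot \log(n/p))$, and $\log p + \log(n/p) = \log n$:

$$
T(n,p) \;=\; \sum_{k=1}^{\log_2 p}\Bigl(O(\log p) + O\!\bigl(\tfrac{n}{p}\bigr)\Bigr) + O\!\bigl(\tfrac{n}{p}\log\tfrac{n}{p}\bigr)
\;=\; O(\log^2 p) + O\!\bigl(\tfrac{n}{p}\bigl(\log p + \log\tfrac{n}{p}\bigr)\bigr)
\;=\; O(\log^2 p) + O\!\bigl(\tfrac{n}{p}\log n\bigr)
$$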

Algorithm 2) Using compaction

  1. Choose pivot, distribute to all processes …
  1. Local partition …
  1. Global partition: compaction by exclusive scan
    • Process i computes the number of “smaller” elements on all preceding processes with MPI_Exscan; the result is $m_i$.
    • Elements in the “smaller” group get sent to process $m_i/(n/p)$ and possibly also $m_i/(n/p)+1$.

      But before that, the sending processes have to inform their receivers via MPI_Alltoall how many elements will be sent.

    • Redistribution via MPI_Alltoallv or individual sends and receives.

    Repeat for the “larger” group.

  1. Split communicator, recurse …

    Then MPI_Comm_split with color rank < middle ? 0 : 1 is used, where middle is the rank at which the “larger” part starts.

    After $\log_2(p)$ steps each process is in a communicator by itself: sort locally (the compaction of the “smaller” group is sketched below).
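
A sketch of the compaction step for the “smaller” group (the “larger” group is handled symmetrically onto the remaining processes), assuming int keys and a block size of exactly $n/p$ per process; compact_smaller and its parameters are illustrative assumptions:

```c
/* Sketch of the compaction-based redistribution (Algorithm 2) for the
 * "smaller" group. Assumes int keys and a block size of exactly n/p per
 * process. All names are illustrative. */
#include <mpi.h>
#include <stdlib.h>

void compact_smaller(int *small, int n_small, int block, /* block = n/p */
                     int **out, int *n_out, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* m_i = number of "smaller" elements on all preceding processes. */
    int m_i = 0;
    MPI_Exscan(&n_small, &m_i, 1, MPI_INT, MPI_SUM, comm);
    if (rank == 0) m_i = 0;   /* MPI_Exscan leaves the result on rank 0 undefined */

    /* Element j of the local "smaller" group gets global position m_i + j,
     * so it goes to process (m_i + j) / block. */
    int *sendcounts = calloc(p, sizeof(int));
    for (int j = 0; j < n_small; ++j)
        sendcounts[(m_i + j) / block]++;

    /* Tell every receiver how many elements to expect from us. */
    int *recvcounts = calloc(p, sizeof(int));
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

    /* Displacements for MPI_Alltoallv. */
    int *sdispls = calloc(p, sizeof(int)), *rdispls = calloc(p, sizeof(int));
    for (int i = 1; i < p; ++i) {
        sdispls[i] = sdispls[i - 1] + sendcounts[i - 1];
        rdispls[i] = rdispls[i - 1] + recvcounts[i - 1];
    }
    *n_out = rdispls[p - 1] + recvcounts[p - 1];
    *out   = malloc((*n_out + 1) * sizeof(int));

    MPI_Alltoallv(small, sendcounts, sdispls, MPI_INT,
                  *out,  recvcounts, rdispls, MPI_INT, comm);

    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
}
```

Since the destination ranks $(m_i + j)/(n/p)$ are non-decreasing in $j$, the contiguous “smaller” buffer can be handed to MPI_Alltoallv as-is, and each process sends to at most two consecutive receivers.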

Analysis: same as above but better load balance