Bucket-sort

Sequential algorithm

Explanation:https://www.youtube.com/watch?v=ELrhrrCjDOA

Given: $\small n$ integers in range $\small [0, R-1]$ stored in array $\small A$ .

Count number of elements of each “key” (magnitude)

for (i=0; i<R; i++) bucket[i] = 0; // initialize range
for (i=0; i<n; i++) bucket[A[i]]++; // count occurences of each value in range

Compute start index of each bucket with exclusive prefix-sums

Implementation

for (i=1; i<R; i++) bucket[i] += bucket[i-1]; // add previous value to this one
// shift all entries to right
for (i=R-1; i>0; i--) bucket[i] = bucket[i-1];
bucket[0] = 0;

Exscan(bucket);

Put sorted $\small A$ into $\small B$
```
for (i=0; i<n; i++) {
int tmp = bucket[A[i]]; // value in prefixed bucket
B[tmp] = A[i];
}
```
Now $\small B$ contains the sorted list.

Finally: copy $\small B$ back to $\small A$ if necessary

Analysis

This implementation is stable (relative order of elements is preserved).

Parallel algorithm

Works just as above, but we do the counting and Exscan part distributedly.

$\small n$ integers distributed evenly across $\small p$ processes

$\small m = n/p$ per process

Find occurences of elements and store them in B(previously calledbucket)

Call Allreduce

can be optimized by usingMPI_BCast

MPI_Allreduce(
B, // local array with number of occurences
AllB, // distributed array with number of occurences (of all processes)
R, // size equal to R (range)
MPI_INT, MPI_SUM, // addition
comm
);

Now vector AllB contains the sizes of all buckets over all processes (stored in all processes).

Call Exscan(B)(distributedly)
```
MPI_Exscan(
B, // local array with number of occurences
RelB, // output
R, // size equal to R (range)
MPI_INT, MPI_SUM, // addition
comm
);
```
RelB is the relative position of every ranks elements in their buckets.
It is different in every process and is calculated distributedly.

Call Exscan(AllB)(locally)
```
Exscan(AllB)
```
j' = number of local elements with key A[j] before j
Local element A[j] needs to go to position AllB[A[j]]+RelB[A[j]]+j' (for A[j]>0 ).
💡
Now we have the required data to sort the total array and only have to redistribute the data.

Compute number of elements to be sent to each process: sendelts[i] with $\small i\in [0;~p-1]$

Call Alltoall
all → all (make data of all processes available to eachother)
This opeartion transposes the data.
```
MPI_Alltoall(
sendelts,	1, MPI_INT, // send
recvelts,	1, MPI_INT, // receive
comm
);
```

Call Alltoallv

Transpose again

MPI_Alltoallv(
A,sendelts,sdispls,MPI_INT, // send
C,recvelts, // receive
…,comm
);

Finally: Reorder elements from C back to A

If Analysis

If the implementation is stable then it is “Radixsort”