Algorithms are defined in specified models, then implemented (for a specific parallel computer).
Basics
I(n)
input of size
n
P(I)
problem
P
taking input
I
P(n) stands for P(I(n))
p
dedicated parallel processors (we also pay for idle time)
all start at the same time and end when the last processor finishes
Seq
sequential algorithm for specified problem and input
Par
parallel algorithm for specified problem and input
Time
Tseq(n)
time for
1
processor to solve
P(n)
using
Seq
best known worst-case time-complexity (or experimentally measured value)
Tpar(p,n)
time for
p
processors to solve
P(n)
using
Par
= max_{0≤i<p} Ti(n) (alternatively, min can be used)
Work
Wseq(n)=Tseq(n)
total number of executed instructions
Wpar(n) = ∑_{0≤i<p} Ti(n)
by all processors (not including idle / waiting times)
Work of
P
the work of the best possible algorithm.
Cost
total time in which the processors are reserved (have to be paid for)
C(n)=p⋅Tpar(p,n)
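These quantities can be computed directly from measured per-processor times; a minimal sketch (the measured times t[] and all names are hypothetical):
// Hedged sketch: computing Tpar, Wpar and cost C(n) from measured per-processor times.
// t[i] is a hypothetical measured time of processor i; all values are illustrative.
#include <stdio.h>

int main(void) {
    double t[] = {4.0, 3.5, 4.2, 3.9};          // example per-processor times Ti(n)
    int p = sizeof t / sizeof t[0];
    double Tpar = 0.0, Wpar = 0.0;
    for (int i = 0; i < p; i++) {
        if (t[i] > Tpar) Tpar = t[i];            // Tpar(p,n) = max_i Ti(n)
        Wpar += t[i];                            // Wpar(n)   = sum_i Ti(n)
    }
    double cost = p * Tpar;                      // C(n) = p * Tpar(p,n)
    printf("Tpar=%.2f  Wpar=%.2f  cost=%.2f  idle=%.2f\n",
           Tpar, Wpar, cost, cost - Wpar);       // cost - Wpar = total idle time
    return 0;
}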
Absolute speedup
Absolute speed-up
Sp(n) = Tseq(n) / Tpar(p,n)
Usually
n
is kept fixed and
p
gets varied.
Upper boundary of processors
Implicitly there is an upper boundary on the number of processors, beyond which it does not make sense to add additional processors.
Sp(n) = c1⋅p for some c1 < 1
Sp(n) = c2⋅√p for some c2 < 1
Limit of absolute speedup
Linear speedup
Sp(n)∈O(p)
cannot be exceeded.
Tseq(n)≤p⋅Tpar(p,n)
Perfect speed-up is rare and hard to achieve.
Sometimes we experimentally observe super-linear speedup, but this is because of randomization, non-determinism, or because the algorithms are not comparable (it can also happen because of the larger combined caches of multiple processors).
Proof through contradiction
Sequential simulation construction
(Wrongly) assume that we can simulate a parallel algorithm sequentially by either:
executing one step of each process in round-robin fashion, for at most C(n) steps in total (see the sketch below), or
executing all steps of a single process until a communication/synchronization point, then the steps of the next process, and so on.
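A minimal sketch of the round-robin variant, with a purely illustrative toy notion of a "process" that is advanced one step at a time:
// Hedged sketch of the round-robin simulation: a single sequential loop advances each of
// the p "processes" by one step per round, so the total number of executed steps is at
// most C(n) = p * Tpar(p,n). The toy process type and step counts are illustrative only.
#include <stdio.h>

typedef struct { int pc, steps; } process_t;     // pc = number of steps already executed

static int  is_done(const process_t *pr) { return pr->pc >= pr->steps; }
static void step(process_t *pr)          { pr->pc++; /* execute one instruction */ }

int main(void) {
    process_t procs[] = {{0, 3}, {0, 5}, {0, 2}, {0, 5}};   // per-process step counts Ti(n)
    int p = sizeof procs / sizeof procs[0], total = 0, running = p;
    while (running > 0) {                 // one sweep = one parallel time step
        running = 0;
        for (int i = 0; i < p; i++)
            if (!is_done(&procs[i])) { step(&procs[i]); running++; total++; }
    }
    printf("executed %d steps sequentially (C(n) = p * max Ti = %d)\n", total, p * 5);
    return 0;
}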
Proof through contradiction
Assume the sequential simulation takes time Tsim(n) to simulate the parallel algorithm with running time Tpar(p,n).
We know that:
p⋅Tpar(p,n)≥Tsim(n)
The simulation cannot take more time than p⋅(max_{0≤i<p} Ti(n)).
Now assume (for contradiction) that Sp(n) > p for some n:
Sp(n) = Tseq(n)/Tpar(p,n) > p
implies Tseq(n) > p⋅Tpar(p,n)
implies Tseq(n) > p⋅Tpar(p,n) ≥ Tsim(n) (see above)
implies Tseq(n) > Tsim(n)
Contradiction:
Tseq(n) > Tsim(n)
contradicts the assumption that Tseq(n) is the best known/possible sequential time, since the simulation is itself a sequential algorithm.
Therefore:
Sp(n) = Tseq(n)/Tpar(p,n) ≤ p
Tseq(n)≤p⋅Tpar(p,n)
Cost-optimal, Work-optimal
Cost-optimal parallel algorithm
C(n)=p⋅Tpar(p,n)∈O(Tseq(n))
has
linear speedup.
Proof
Cost-optimal algorithm with
p⋅Tpar(p,n)=c⋅Tseq(n)∈O(Tseq(n))
implies Tpar(p,n) = c⋅Tseq(n)/p
implies Sp(n) = Tseq(n)/Tpar(p,n) = p/c
The constant
c
accounts for load imbalance and overheads.
The smaller
c
is, the closer we are to the perfect speed-up.
Example 1
For Tseq(n) ∈ O(n)
any Tpar(p,n) ∈ O(n/p) is cost-optimal.
Proof:
Tpar(p,n)∈O(n/p)
implies p⋅Tpar(p,n) ∈ p⋅O(n/p)
implies p⋅Tpar(p,n) ∈ p⋅O(n/p) ≤ p⋅(c⋅n/p) = c⋅n ∈ O(n)
for some constant c
implies p⋅Tpar(p,n) ∈ O(n)
Example 2
For Tseq(n) ∈ O(n)
the parallel time Tpar(p,n) ∈ O(n/(p/log p)) = O((n log p)/p) is not cost-optimal.
Proof:
Tpar(p,n) ∈ O(n/(p/log p)) = O((n log p)/p)
implies p⋅Tpar(p,n) ∈ p⋅O((n log p)/p)
implies p⋅Tpar(p,n) ∈ p⋅O((n log p)/p) ≤ p⋅(c⋅(n log p)/p) = c⋅n⋅log p ∉ O(n)
for some constant c
implies p⋅Tpar(p,n) ∉ O(n)
Work-optimal parallel algorithm
Wpar(n)∈O(Wseq(n))=O(Tseq(n))
has potential for
linear speedup.
Proof
Let's say there is a work-optimal algorithm with:
Tpar(p,n) = max_{0≤i<p} Ti(n)
∑_{0≤i<p} Ti(n) = Tseq(n)
We schedule
Par
on a smaller number of processors
p′
in such a way that no processor has any “idle time”
(pretty difficult to do this in practice)
.
Then:
∑_{0≤i<p′} Ti(n) = p′⋅Tpar(p′,n) ∈ O(Tseq(n))
This way the algorithm is cost-optimal and therefore has linear speedup.
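A hedged sketch of one way to approximate such a schedule: greedily assign each task to the currently least-loaded of the p′ processors, so the makespan (and hence the cost p′⋅Tpar(p′,n)) stays close to the total work. The task times are hypothetical:
// Hedged sketch: greedy list scheduling of per-task times onto p_prime processors.
// This only approximates the "no idle time" schedule from the proof; task[] is made up.
#include <stdio.h>

int main(void) {
    double task[] = {3, 1, 4, 1, 5, 9, 2, 6};    // hypothetical per-task times Ti(n)
    int ntasks = sizeof task / sizeof task[0];
    int p_prime = 3;
    double load[3] = {0};
    for (int t = 0; t < ntasks; t++) {
        int least = 0;                            // find currently least-loaded processor
        for (int i = 1; i < p_prime; i++)
            if (load[i] < load[least]) least = i;
        load[least] += task[t];                   // assign task t to it
    }
    double makespan = 0, work = 0;
    for (int i = 0; i < p_prime; i++) {
        if (load[i] > makespan) makespan = load[i];
        work += load[i];
    }
    // cost p'*makespan stays within a small constant factor of the total work sum(Ti)
    printf("work=%.1f  makespan=%.1f  cost=%.1f\n", work, makespan, p_prime * makespan);
    return 0;
}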
Break even
= smallest number of processors p required for the parallel algorithm to be faster than the sequential algorithm.
Example
Given DumbSort(n) with T(n) ∈ O(n²) that can be perfectly parallelized:
Tpar(p,n) ∈ O(n²/p)
Tseq(n) ∈ Θ(n log n)
Sp(n) = Tseq(n)/Tpar(p,n) = O(n log n)/O(n²/p) = O(p⋅(log n)/n)
Linear speedup for fixed n.
Not work-optimal: speedup decreasing with n.
Break even
Number of processors for parallel algorithm to be faster than sequential algorithm:
Tpar(p,n) < Tseq(n) ⇔ n²/p < n log n ⇔ n/p < log n ⇔ p > n/log n
Tpar(p,n) < Tseq(n) ⇔ p > n/log n
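A small numeric check of this break-even bound (log base 2 is assumed; a different base only changes the constant):
// Hedged numeric check of the DumbSort break-even point p > n / log n.
#include <stdio.h>
#include <math.h>

int main(void) {
    for (int n = 1 << 10; n <= 1 << 20; n <<= 5) {
        double break_even = n / log2((double)n);   // processors needed to beat Tseq
        printf("n=%8d  break-even p > %.0f\n", n, break_even);
    }
    return 0;
}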
Relative speedup
Relative speedup
SRel_p(n) = Tpar(1,n) / Tpar(p,n)
Expresses scalability = how well the processors get utilized.
Relative speedup is good if
SRel_p(n) ∈ Θ(p)
Easier to achieve than absolute speedup.
Fastest running time / fastest possible parallel
= smallest possible running-time of
Par
with arbitrarily many processors
T∞(n)=Tpar(p′,n)
where
∀p:Tpar(p′,n)≤Tpar(p,n)
∀n,p:T∞(n)≤Tpar(p,n)
Used in the definition of parallelism (below).
Relative speedup and work-optimality
If work-optimal, then absolute AND relative speedup are the same asymptotically:
Tpar(1,n)=O(Tseq(n))
Parallelism
Parallelism
= largest number of processors that can still give linear relative speedup (adding more processors is pointless):
Parallelism = Tpar(1,n) / T∞(n)
SRel_p(n) ≤ Parallelism < p′
SRel_p(n) < p′
Example
CRCW, Max finding algorithm
par (0<=i<n) b[i] = true; // set all to true
par (0<=i<n, 0<=j<n)
  if (a[i] < a[j]) b[i] = false; // set those where a larger one exists to false
par (0<=i<n)
if (b[i]) x = a[i];
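A hedged shared-memory sketch of the same idea using OpenMP instead of the CRCW PRAM pseudocode above (it assumes distinct values in a[]; all names are illustrative):
// Hedged sketch: a shared-memory (OpenMP) approximation of the CRCW pseudocode above.
// Assumes distinct values in a[]; function and variable names are illustrative only.
#include <stdbool.h>
#include <stdlib.h>

int max_crcw_style(const int a[], int n) {
    bool *b = malloc(n * sizeof(bool));
    int x = a[0];
    #pragma omp parallel for
    for (int i = 0; i < n; i++) b[i] = true;        // set all candidates to true
    #pragma omp parallel for
    for (int i = 0; i < n; i++)                      // O(n^2) work, O(n^2/p) time
        for (int j = 0; j < n; j++)
            if (a[i] < a[j]) b[i] = false;           // a larger element exists
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        if (b[i]) x = a[i];                          // only the maximum remains marked
    free(b);
    return x;
}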
Wpar(p,n) ∈ O(n²), Tpar(p,n) ∈ O(n²/p)
Tseq(n) = Wseq(n) ∈ O(n) (best known sequential algorithm)
Absolute speedup
Not work-optimal:
Wpar(n) ∉ O(Tseq(n))
Sp(n) = Tseq(n)/Tpar(p,n) = O(n)/O(n²/p) = O(p/n)
Linear absolute speedup for fixed n.
Absolute speedup decreasing with n.
Break even
Number of processors for parallel algorithm to be faster than sequential algorithm:
Tpar(p,n) < Tseq(n) ⇔ n²/p < n ⇔ p > n
Tpar(p,n) < Tseq(n) ⇔ p > n
This is bad! We need as many processors as the problem size.
Relative speedup
SRel_p(n) = Tpar(1,n)/Tpar(p,n) = O(n²)/O(n²/p) = O(p)
Linear relative speedup.
Parallelism
T∞(n)=O(1)
Tpar(1,n)/T∞(n) = O(n²)/O(1) = O(n²)
Great parallelism.
Conclusion: This (terrible) parallel algorithm has linear relative speedup for p up to n² processors.
Overhead
Parallelization Overhead
= Work that does not have to be done by a sequential algorithm, such as coordination (communication, synchronization) and redundant computation.
When there is no overhead we can parallelize perfectly:
Tpar(p,n)=Tseq(n)/p
Each Processor has Task
Ti(n)
and Overhead
Oi(n)
in its work:
Wpar(p,n) = ∑_{0≤i<p} (Ti(n) + Oi(n))
Therefore
Tpar(1,n)≥Tseq(n)
Communication overhead model
Toverhead(p, ni) = α(p) + β⋅ni for processor i:
α(p)
latency, dependent on p
β⋅ni
cost per data item that needs to be exchanged by processor i
for synchronization operations it's ni = 0
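A minimal sketch of evaluating this linear overhead model; the concrete α(p) and β below are made-up values for illustration only:
// Hedged sketch: evaluating the overhead model Toverhead(p, ni) = alpha(p) + beta*ni.
// The alpha(p) form and the constants are purely illustrative.
#include <stdio.h>
#include <math.h>

static double alpha(int p)  { return 5.0 * log2((double)p); }  // latency, grows with p
static const double beta = 0.01;                                // cost per data item

static double t_overhead(int p, double ni) {
    return alpha(p) + beta * ni;      // ni = 0 for pure synchronization operations
}

int main(void) {
    printf("sync only : %.2f\n", t_overhead(64, 0));
    printf("1e6 items : %.2f\n", t_overhead(64, 1e6));
    return 0;
}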
Cost-optimal
If cost-optimal with
p⋅Tpar(p,n) = k⋅Tseq(n)
the absolute speedup is still linear (but not perfect):
Sp(n) = (1/k)⋅p
Load Balancing
Load imbalance
Difference between max and min work + overhead:
∣max_i(Ti(n)+Oi(n)) − min_i(Ti(n)+Oi(n))∣
Load balancing
Achieving an even amount of work for all processors
i,j
:
Ti(n) ≈ Tj(n)
Static, oblivious
Splitting the problem into p pieces, regardless of the input (except its size); see the sketch after this list.
Static, problem dependent, adaptive
Splitting the problem into p pieces, using the structure of the input.
Dynamic
Dynamically (during program execution) readjusting the work assigned to processors.
This means some overhead.
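For static, oblivious splitting, a minimal sketch of the usual block partition of an index range [0, n) into p nearly equal pieces (only n and p are used, as the definition requires; the names are illustrative):
// Hedged sketch: static, oblivious partitioning of n items into p nearly equal blocks.
// Processor i gets the half-open index range [start, end); only n and p are used.
#include <stdio.h>

static void block_range(int n, int p, int i, int *start, int *end) {
    *start = (int)((long long)i * n / p);
    *end   = (int)((long long)(i + 1) * n / p);
}

int main(void) {
    int n = 10, p = 4;
    for (int i = 0; i < p; i++) {
        int s, e;
        block_range(n, p, i, &s, &e);
        printf("processor %d: [%d, %d)\n", i, s, e);   // e.g. [0,2) [2,5) [5,7) [7,10)
    }
    return 0;
}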
Amdahl’s law: Sequential bottleneck
With static, oblivious load balancing:
A) Seq perfectly parallelizable:
Tpar(p,n) ∈ O(Tseq(n)/p)
B) Seq has a fraction s(n) that can't be parallelized (depends on n):
Tseq(n) = s(n)⋅Tseq(n) + (1−s(n))⋅Tseq(n), where the second part can be parallelized
Tpar(p,n) = s(n)⋅Tseq(n) + ((1−s(n))/p)⋅Tseq(n)
C) Seq has a fraction s that can't be parallelized (constant and independent of n):
Tseq(n) = (s+r)⋅Tseq(n)
Tpar(p,n) ≥ s⋅Tseq(n) + (r/p)⋅Tseq(n)
Max speedup gets very limited and the constant fraction becomes a “sequential overhead”.
Amdahl’s Law (parallel version): Sequential bottleneck / overhead
Let the program of Seq contain two fractions, independent of n:
“perfectly parallelizable” fraction
r
“purely sequential” fraction
s=(1−r)(can’t be parallelized at all)
The maximum achievable speedup (independent of n):
Sp(n) = 1/s
Proof
Tseq(n)=(s+r)⋅Tseq(n)
Tpar(p,n) = s⋅Tseq(n) + (r/p)⋅Tseq(n)
Sp(n) = Tseq(n)/Tpar(p,n) = 1/(s + r/p)
lim_{p→∞} Sp(n) = 1/s
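A small numeric check of this bound; the sequential fraction s = 0.1 below is just an example value:
// Hedged numeric check of Amdahl's law: Sp = 1 / (s + r/p) approaches 1/s as p grows.
#include <stdio.h>

int main(void) {
    double s = 0.1, r = 1.0 - s;                 // example sequential fraction
    for (int p = 1; p <= 4096; p *= 8) {
        double speedup = 1.0 / (s + r / p);
        printf("p=%5d  Sp=%6.2f  (limit 1/s = %.1f)\n", p, speedup, 1.0 / s);
    }
    return 0;
}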
We want to avoid Amdahl's law at all cost.
Sequential parts that could be a constant fraction: I/O, initialization of global data structures, shared data structures, everything that takes O(n) time and work.
Examples
Example: I/O and problem splitting
Total time: 10 ns
Processor 0: read input for some computation (? ns)
Split problem into n/p parts, send part i to processor i (? ns)
All processors i: solve part i (9 ns)
All processors i: send partial solution back to processor 0 (? ns)
Constant sequential fraction in 3 out of 4 steps limits the speedup.
s = 1 − (9 ns / 10 ns) = 0.1
lim_{p→∞} Sp(n) = 1/s = 10
When measuring parallelism we therefore often don't measure the I/O and problem-splitting part.
Example: malloc
// Sequential initialization: O(1)
x = (int*)malloc(n*sizeof(int));
...
// Parallelizable part
do {
    for (i=0; i<n; i++) {
        x[i] = f(i);
    }
    // check for convergence
    done = …;
} while (!done)
K
iterations before convergence.
Assumptions: convergence check and
f(i)∈O(1)
Tseq(n) = 1 + K (sequential part) + K⋅n (for-loop)
Tpar(p,n)=1+K+K⋅n/p
Sequential fraction:
s = (1−r) = 1 − (K⋅n)/(1+K+K⋅n) ≈ 1/(1+n)
Speedup:
lim_{p→∞, n>p, n→∞} Sp(n) = 1/s → ∞, so the speedup is only limited by p, since p is our upper boundary
Example: calloc
// Sequential initialization: O(n)
x = (int*)calloc(n, sizeof(int));
...
// Parallelizable part
do {
    for (i=0; i<n; i++) {
        x[i] = f(i);
    }
    // check for convergence
    done = …;
} while (!done)
K
iterations before convergence.
Assumptions: convergence check and
f(i)∈O(1)
Tseq(n) = n + K (sequential part) + K⋅n (for-loop)
Tpar(p,n)=n+K+K⋅n/p
Sequential fraction:
s = (1−r) = 1 − (K⋅n)/(n+K+K⋅n) ≈ 1/(1+K)
Speedup:
lim_{p→∞} Sp(n) = 1/s ≈ 1+K
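One way to avoid this bottleneck would be to parallelize the O(n) initialization itself; a hedged OpenMP sketch under the same assumptions (f and all names are illustrative), which brings the example back to the malloc-like case with only an O(1) sequential part:
// Hedged sketch: parallelizing the O(n) zero-initialization, so that only the O(1)
// allocation remains sequential, as in the malloc variant above. Names are illustrative.
#include <stdlib.h>

static int f(int i) { return i * i; }         // placeholder per-element computation

void compute(int n) {
    int *x = malloc(n * sizeof(int));         // O(1) sequential part
    #pragma omp parallel for
    for (int i = 0; i < n; i++) x[i] = 0;     // initialization now done in O(n/p) time
    int done = 0;
    do {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) x[i] = f(i);
        done = 1;                             // convergence check omitted in this sketch
    } while (!done);
    free(x);
}

int main(void) { compute(1000); return 0; }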
Scaled speedup
The strictly non-parallelizable part usually depends on n, as in s(n), and decreases with rising n.
Scaled speedup
Let the program of Seq contain two fractions dependent on n:
“perfectly parallelizable” fraction
T(n)
“purely sequential” fraction
t(n)=(1−T(n))
This part decreases with rising
n
.
lim_{n→∞} t(n) = 0
We can therefore reach linear speedup by increasing n; the faster t(n)/T(n) converges, the faster the speed-up converges.
t(n) should increase more slowly with n than T(n).
We get
Tseq(n)=t(n)+T(n)
Tpar(p,n)=t(n)+T(n)/p
lim_{n→∞} Sp(n) = p
Proof
We know that
lim_{n→∞} t(n) = 0
lim_{n→∞} t(n)/T(n) = 0
Therefore
Sp(n) = Tseq(n)/Tpar(p,n) = (t(n)+T(n)) / (t(n)+T(n)/p) = (t(n)/T(n)+1) / (t(n)/T(n)+1/p)
lim_{n→∞} Sp(n) = (0+1)/(0+1/p) = 1/(1/p) = p
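A small numeric check of this limit, with the purely illustrative choice t(n) = √n and T(n) = n (so t(n)/T(n) → 0):
// Hedged numeric check of scaled speedup: Sp(n) = (t(n)+T(n)) / (t(n)+T(n)/p) -> p
// as n grows, here with the made-up choice t(n) = sqrt(n), T(n) = n.
#include <stdio.h>
#include <math.h>

int main(void) {
    int p = 100;
    for (double n = 1e2; n <= 1e10; n *= 1e2) {
        double t = sqrt(n), T = n;
        double sp = (t + T) / (t + T / p);
        printf("n=%.0e  Sp=%6.2f  (p = %d)\n", n, sp, p);
    }
    return 0;
}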
If the speedup is not linear
Tpar(p,n)=t(n)+T(n)/f(p)
with
f(p)<p
Assuming that the parallel algorithm is work-optimal, we can also allow the non-parallelizable part to depend on p, i.e. using t(n,p) instead of t(n):