Limits of parallel algorithms

Definition

Algorithms are defined in specified models and then implemented for a specific parallel computer.

Basics

$I(n)$ — input of size $n$

$P(I)$ — problem $P$ on input $I$

$P(n)$ — shorthand for $P(I(n))$

$p$ — number of dedicated parallel processors (idle time is also paid for)

All processors start at the same time; the computation ends when the last processor finishes.

$\text{Seq}$ — sequential algorithm for the specified problem and input

$\text{Par}$ — parallel algorithm for the specified problem and input

Time

$\text{Tseq}(n)$ — time for $1$ processor to solve $P(n)$ using $\text{Seq}$

best known worst-case time complexity (or an experimentally measured value)

$\text{Tpar}(p,n)$ — time for $p$ processors to solve $P(n)$ using $\text{Par}$

$\text{Tpar}(p,n) = \max_{0\leq i<p} T_i(n)$ (alternatively, $\min$ can be used)

Work

$\text{Wseq}(n) = \text{Tseq}(n)$ — total number of executed instructions

$\text{Wpar}(n) = \sum_{0\leq i <p} T_i(n)$ — instructions executed by all processors (not including idle/waiting times)

The work of $P$ is the work of the best possible algorithm.

Cost

total time in which the processors are reserved (have to be paid for)

$C(n) = p\cdot \text{Tpar}(p,n)$
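
A minimal sketch of these definitions in code, using made-up per-processor times $T_i(n)$ (the numbers are illustrative, not from the source): it computes $\text{Tpar} = \max_i T_i$, $\text{Wpar} = \sum_i T_i$ and $C = p \cdot \text{Tpar}$; the gap $C - \text{Wpar}$ is the paid idle time.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical per-processor running times T_i(n), arbitrary units. */
        double T[] = {4.0, 6.0, 5.0, 9.0};
        int p = sizeof T / sizeof T[0];

        double tpar = 0.0, wpar = 0.0;
        for (int i = 0; i < p; i++) {
            if (T[i] > tpar) tpar = T[i];   /* Tpar(p,n) = max_i T_i(n) */
            wpar += T[i];                   /* Wpar(n)   = sum_i T_i(n) */
        }
        double cost = p * tpar;             /* C(n) = p * Tpar(p,n)     */

        printf("Tpar = %.1f, Wpar = %.1f, Cost = %.1f\n", tpar, wpar, cost);
        return 0;
    }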

Absolute speedup

Absolute speed-up

Usually $n$ is kept fixed and $p$ is varied.

$S_p(n) = \frac{\text{Tseq}(n)}{\text{Tpar}(p,n)}$
  • Upper boundary of processors

    Implicitly there is an upper bound on the number of processors beyond which adding more processors does not make sense.

    1. $S_p(n) = c_1\cdot p$

      for some $c_1 < 1$

    1. $S_p(n) = c_2\cdot \sqrt p$

      for some $c_2 < 1$

Limit of absolute speedup

Linear speedup $S_p(n) \in O(p)$ cannot be exceeded:

$\text{Tseq}(n) \leq p \cdot \text{Tpar}(p,n)$

Perfect speedup is rare and hardly achievable.

Sometimes we experimentally observe super-linear speedup, but this is due to randomization, non-determinism, or because the algorithms are not comparable (it can also be caused by the combined cache capacity of multiple processors).

  • Proof through contradiction

    Sequential simulation construction

    We can simulate a parallel algorithm sequentially by either:

    1. executing one step of each process in a round-robin fashion, for at most $C(n)$ steps, or
    1. executing all steps of a single process until a communication/synchronization point, then the steps of the next process, and so on.

    Proof through contradiction

    Let $\text{Tsim}(n)$ be the time of this sequential simulation of $\text{Par}$.

    We know that:

    $p \cdot \text{Tpar}(p,n) \geq \text{Tsim}(n)$

    The simulation cannot take more time than $p\cdot \max_{0\leq i<p} T_i(n)$.

    Now assume (for contradiction) that $S_p(n) > p$ for some $n$:

    $S_p(n) = \frac{\text{Tseq}(n)}{\text{Tpar}(p,n)} > p$

    implies $\text{Tseq}(n) > p \cdot \text{Tpar}(p,n)$

    implies $\text{Tseq}(n) > p \cdot \text{Tpar}(p,n) \geq \text{Tsim}(n)$ (see above)

    implies $\text{Tseq}(n) > \text{Tsim}(n)$

    Contradiction:

    $\text{Tseq}(n) > \text{Tsim}(n)$ would contradict the assumption that $\text{Tseq}(n)$ is the best known/possible sequential time, since the simulation is itself a sequential algorithm.


    Therefore:

    $S_p(n) = \frac{\text{Tseq}(n)}{\text{Tpar}(p,n)} \leq p$

    $\text{Tseq}(n) \leq p \cdot \text{Tpar}(p,n)$

Cost-optimal, Work-optimal

Cost-optimal parallel algorithm

$C(n) = p \cdot \text{Tpar}(p,n) \in O(\text{Tseq}(n))$

A cost-optimal algorithm has linear speedup.

  • Proof

    A cost-optimal algorithm satisfies

    $p\cdot\text{Tpar}(p,n) = c\cdot\text{Tseq}(n) \in O(\text{Tseq}(n))$

    implies $\text{Tpar}(p,n) = c\cdot\text{Tseq}(n)/p$

    implies $S_p(n) = \text{Tseq}(n)/\text{Tpar}(p,n) = p/c$

    The constant $c$ accounts for load imbalance and overheads.

    The smaller $c$ is, the closer we are to perfect speedup.

  • Example 1

    For $\text{Tseq}(n) \in O(n)$,

    any $\text{Tpar}(p,n) \in O(n/p)$ is cost-optimal.

    Proof:

    $\text{Tpar}(p,n) \in O(n/p)$

    implies $p \cdot \text{Tpar}(p,n) \in p \cdot O(n/p)$

    implies $p \cdot \text{Tpar}(p,n) \in p \cdot O(n/p) \leq p \cdot (c\cdot n/p) = c \cdot n \in O(n)$

    for some constant $c$

    implies $p \cdot \text{Tpar}(p,n) \in O(n)$

  • Example 2

    For $\text{Tseq}(n) \in O(n)$,

    the parallel time $\text{Tpar}(p,n) \in O(n/\sqrt{p})$ is not cost-optimal.

    Proof:

    $\text{Tpar}(p,n) \in O(n/\sqrt{p})$

    implies $p \cdot \text{Tpar}(p,n) \in p \cdot O(n/\sqrt{p})$

    implies $p \cdot \text{Tpar}(p,n) \in p \cdot O(n/\sqrt{p}) \leq p \cdot (c\cdot n/\sqrt p) = c \cdot n \cdot \sqrt{p} \notin O(n)$

    for some constant $c$

    implies $p \cdot \text{Tpar}(p,n) \notin O(n)$

  • Example 3

    For $\text{Tseq}(n) \in O(n)$,

    the parallel time $\text{Tpar}(p,n) \in O(n/(p/\log p)) = O((n \log p)/p)$ is not cost-optimal.

    Proof:

    $\text{Tpar}(p,n) \in O(n/(p/\log p)) = O((n \log p)/p)$

    implies $p \cdot \text{Tpar}(p,n) \in p \cdot O((n \log p)/p)$

    implies $p \cdot \text{Tpar}(p,n) \in p \cdot O((n \log p)/p) \leq p \cdot (c \cdot (n \log p)/p) = c \cdot n \cdot \log p \notin O(n)$

    for some constant $c$

    implies $p \cdot \text{Tpar}(p,n) \notin O(n)$
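
A concrete, hedged illustration of Example 1: a data-parallel array sum has $\text{Tseq}(n) \in O(n)$, and a straightforward OpenMP reduction does $O(n/p)$ work per thread plus a small cost for combining the $p$ partial sums, so the cost $p \cdot \text{Tpar}(p,n)$ stays within a constant factor of $O(n)$ as long as $p$ is small relative to $n$. A minimal sketch, assuming OpenMP is available (compile with `-fopenmp`):

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* Parallel sum: each thread sums a block of roughly n/p elements;
     * the partial sums are combined by the reduction clause. */
    double parallel_sum(const double *a, long n) {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)
        for (long i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    int main(void) {
        long n = 1L << 20;
        double *a = malloc(n * sizeof *a);
        for (long i = 0; i < n; i++) a[i] = 1.0;

        double t0 = omp_get_wtime();
        double s = parallel_sum(a, n);
        double t1 = omp_get_wtime();

        printf("sum = %.0f, Tpar = %.6f s with %d threads\n",
               s, t1 - t0, omp_get_max_threads());
        free(a);
        return 0;
    }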

Work-optimal parallel algorithm

$\text{Wpar}(n) \in O(\text{Wseq}(n)) = O(\text{Tseq}(n))$

has potential for linear speedup.

  • Proof

    Let's say there is a work-optimal algorithm with:

    $\text{Tpar}(p,n) = \max_{0\leq i<p} T_i(n)$

    $\sum_{0\leq i <p} T_i(n) = \text{Tseq}(n)$

    We schedule $\text{Par}$ on a smaller number of processors $p'$ in such a way that no processor has any "idle time" (pretty difficult to do in practice; see the scheduling sketch after this proof).

    Then:

    $\sum_{0\leq i <p'}T_i(n) = p' \cdot \text{Tpar}(p',n) \in O(\text{Tseq}(n))$

    This way the algorithm is cost-optimal and therefore has linear speedup.
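
A minimal sketch of the kind of static scheduling assumed above: $n$ independent work items are split into $p'$ contiguous blocks of nearly equal size, so no processor idles as long as the items cost roughly the same. The helper name `block_range` is illustrative, not from the source.

    #include <stdio.h>

    /* Block bounds for processor i when n items are split over p processors:
     * processor i handles indices [lo, hi); block sizes differ by at most 1. */
    static void block_range(long n, int p, int i, long *lo, long *hi) {
        *lo = (i * n) / p;
        *hi = ((i + 1) * n) / p;
    }

    int main(void) {
        long n = 10, lo, hi;
        int p = 3;
        for (int i = 0; i < p; i++) {
            block_range(n, p, i, &lo, &hi);
            printf("processor %d: items [%ld, %ld)\n", i, lo, hi);
        }
        return 0;
    }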

Break even

= the number of processors $p$ required for the parallel algorithm to be faster than the sequential algorithm.

  • Example

    Given $\text{DumbSort}(n)$ with running time $T(n)$ that can be perfectly parallelized:

    $\text{Tpar}(p, n) \in O(n^2/p)$

    $\text{Tseq}(n) \in \Theta(n \log n)$

    $S_p(n) = \text{Tseq}(n)/\text{Tpar}(p,n) = O(n \log n)/O(n^2/p) = O(p (\log n)/n)$

    • Linear speedup for fixed $n$.
    • Not work-optimal: speedup decreases with $n$.

    Break even

    Number of processors needed for the parallel algorithm to be faster than the sequential algorithm (a numeric sketch follows this example):

    $\text{Tpar}(p,n) < \text{Tseq}(n) \Leftrightarrow n^2/p < n \log n \Leftrightarrow n/p < \log n \Leftrightarrow p > n/\log n$

    $\text{Tpar}(p,n) < \text{Tseq}(n) \Leftrightarrow p > n/\log n$
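
A small numeric sketch of the break-even bound above, assuming a base-2 logarithm and ignoring the constants hidden in the $O$-notation (compile with `-lm`):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* Break-even processor count p > n / log n for DumbSort versus an
         * O(n log n) sequential sort; constant factors are ignored. */
        long sizes[] = {1000, 100000, 10000000};
        for (int k = 0; k < 3; k++) {
            long n = sizes[k];
            double p_break = n / log2((double)n);
            printf("n = %8ld: need p > %.0f processors to break even\n",
                   n, p_break);
        }
        return 0;
    }

For $n = 10^7$ this already requires more than $4\cdot 10^5$ processors, which is why DumbSort is a poor parallel algorithm despite its linear speedup for fixed $n$.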

Relative speedup

Relative speedup

Expresses scalability, i.e., how well the processors are utilized.

$SRel_p(n) = \frac{\text{Tpar}(1,n)}{\text{Tpar}(p,n)}$

Relative speedup is good if $SRel_p(n) \in \Theta(p)$.

Easier to achieve than absolute speedup.

Fastest running time / fastest possible parallel time

= smallest possible running time of $\text{Par}$ with arbitrarily many processors

$T_\infty(n) = \text{Tpar}(p',n)$

where $\forall p: \text{Tpar}(p',n) \leq \text{Tpar}(p,n)$

$\forall n,p: T_\infty(n) \leq \text{Tpar}(p,n)$

Used to define parallelism:

$SRel_p(n) = \frac{\text{Tpar}(1,n)}{\text{Tpar}(p,n)} \leq \frac{\text{Tpar}(1,n)}{T_\infty(n)} = \text{Parallelism}$

Relative speedup and work-optimality

If work-optimal, then absolute AND relative speedup are the same asymptotically:

$\text{Tpar}(1,n) = O(\text{Tseq}(n))$

Parallelism

Parallelism

= largest number of processors that can still give linear relative speedup (adding more processors is pointless)

$SRel_p(n) \leq \text{Parallelism} < p'$

$SRel_p(n) < p'$

$\text{Parallelism} = \frac{\text{Tpar}(1,n)}{T_\infty(n)}$
  • Example

    CRCW max-finding algorithm

    par (0 <= i < n)
        b[i] = true;                    // mark every element as a candidate
    par (0 <= i < n, 0 <= j < n)
        if (a[i] < a[j]) b[i] = false;  // a[i] is not the max if a larger element exists
    par (0 <= i < n)
        if (b[i]) x = a[i];             // the surviving candidate is the maximum

    $\text{Wpar}(p,n) \in O(n^2)$

    $\text{Tseq}(n) = \text{Wseq}(n) \in O(n)$ (best known sequential algorithm)

    Absolute speedup

    Not work-optimal: $\text{Wpar}(n) \notin O(\text{Tseq}(n))$

    $S_p(n) = \text{Tseq}(n)/\text{Tpar}(p,n) = O(n)/O(n^2/p) = O(p/n)$

    • Linear absolute speedup for fixed $n$.
    • Absolute speedup decreasing with $n$.

    Break even

    Number of processors needed for the parallel algorithm to be faster than the sequential algorithm:

    $\text{Tpar}(p,n) < \text{Tseq}(n) \Leftrightarrow n^2/p < n \Leftrightarrow p > n$

    $\text{Tpar}(p,n) < \text{Tseq}(n) \Leftrightarrow p > n$

    This is bad! We need as many processors as the problem size.


    Relative speedup

    $SRel_p(n) = \text{Tpar}(1,n)/\text{Tpar}(p,n) = O(n^2)/O(n^2/p) = O(p)$

    Linear relative speedup.

    Parallelism

    $T_\infty(n) = O(1)$

    $\frac{\text{Tpar}(1,n)}{T_\infty(n)} = O(n^2)/O(1) = O(n^2)$

    Great parallelism.

    Conclusion: This (terrible) parallel algorithm has

    linear relative speedup for $p$ up to $n^2$ processors.

Overhead

Parallelization Overhead

= work that a sequential algorithm does not have to do, such as coordination (communication, synchronization) and redundant computation.

When there is no overhead, we can parallelize perfectly: $\text{Tpar}(p,n) = \text{Tseq}(n)/p$


Each processor $i$ has task time $T_i(n)$ and overhead $O_i(n)$ in its work:

$\text{Wpar}(p,n) = \sum_{0\leq i <p} \left(T_i(n) + O_i(n)\right)$

Therefore

$\text{Tpar}(1,n) \geq \text{Tseq}(n)$

Communication overhead model

$\text{Toverhead}(p, n_i) = \alpha(p) + \beta \cdot n_i$ for processor $i$:

$\alpha(p)$ — latency, dependent on $p$

$\beta \cdot n_i$ — cost per data item that needs to be exchanged by processor $i$

for synchronization operations, $n_i = 0$
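
A minimal sketch of this overhead model, under the made-up assumption that the latency term grows logarithmically with $p$, e.g. $\alpha(p) = \alpha_0 \lceil \log_2 p \rceil$ as for a tree-structured exchange; the constants are hypothetical:

    #include <stdio.h>
    #include <math.h>

    #define ALPHA0 1.0e-6   /* hypothetical per-message latency (seconds)    */
    #define BETA   1.0e-9   /* hypothetical cost per exchanged data item (s) */

    /* Toverhead(p, n_i) = alpha(p) + beta * n_i, with alpha(p) assumed to
     * grow as ceil(log2 p), e.g. for a tree-structured communication step. */
    double t_overhead(int p, long n_i) {
        double alpha_p = ALPHA0 * ceil(log2((double)p));
        return alpha_p + BETA * (double)n_i;
    }

    int main(void) {
        printf("synchronization only (n_i = 0), p = 64: %.3g s\n",
               t_overhead(64, 0));
        printf("1e6 exchanged items,            p = 64: %.3g s\n",
               t_overhead(64, 1000000));
        return 0;
    }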

Cost-optimal

If cost-optimal

$p\cdot \text{Tpar}(p,n) = k \cdot \text{Tseq}(n)$

the absolute speedup is still linear (but not perfect):

$S_p(n) = \frac{1}{k} \cdot p$

Load Balancing

Load imbalance

Difference between the maximum and the minimum work plus overhead:

$|\max_i (T_i(n) + O_i(n)) \;-\; \min_i (T_i(n) + O_i(n))|$

Load balancing

Achieving an even amount of work for all processors $i, j$:

$\text{Tpar}(i,n) \approx \text{Tpar}(j,n)$

Static, oblivious

Splitting the problem into $p$ pieces, regardless of the input (except its size).

Static, problem-dependent, adaptive

Splitting the problem into $p$ pieces, using the structure of the input.

Dynamic

Dynamically (during program execution) readjusting the work assigned to processors.

This incurs some overhead.
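
A minimal sketch contrasting static and dynamic assignment with OpenMP, assuming a deliberately irregular per-item workload (the `work` function is made up): with static block scheduling the thread holding the last block does the most work, while dynamic scheduling rebalances at run time at the price of some scheduling overhead.

    #include <stdio.h>
    #include <omp.h>

    /* Hypothetical irregular workload: item i costs roughly i inner iterations. */
    static double work(long i) {
        double x = 0.0;
        for (long k = 0; k < i; k++) x += 1e-9 * (double)k;
        return x;
    }

    int main(void) {
        const long n = 20000;
        double s = 0.0, t0, t1;

        /* Static, oblivious: contiguous blocks of ~n/p items per thread;
         * the thread with the last block is overloaded for this workload. */
        t0 = omp_get_wtime();
        #pragma omp parallel for schedule(static) reduction(+:s)
        for (long i = 0; i < n; i++) s += work(i);
        t1 = omp_get_wtime();
        printf("static : %.3f s (sum %.3f)\n", t1 - t0, s);

        /* Dynamic: idle threads grab the next chunk at run time;
         * better balance, but extra scheduling overhead. */
        s = 0.0;
        t0 = omp_get_wtime();
        #pragma omp parallel for schedule(dynamic, 64) reduction(+:s)
        for (long i = 0; i < n; i++) s += work(i);
        t1 = omp_get_wtime();
        printf("dynamic: %.3f s (sum %.3f)\n", t1 - t0, s);
        return 0;
    }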

Amdahl’s law: Sequential bottleneck

With static, oblivious load balancing:

A) $\text{Seq}$ is perfectly parallelizable:

$\text{Tpar}(p,n) \in O(\text{Tseq}(n)/p)$

B) $\text{Seq}$ has a fraction $s(n)$, depending on $n$, that cannot be parallelized:

$\text{Tseq}(n) = s(n) \cdot \text{Tseq}(n) + \overbrace{(1-s(n)) \cdot \text{Tseq}(n)}^{\text{can be parallelized}}$

$\text{Tpar}(p,n) = s(n) \cdot \text{Tseq}(n) + \frac{1-s(n)}{p} \cdot \text{Tseq}(n)$

C) $\text{Seq}$ has a constant fraction $s$, independent of $n$, that cannot be parallelized:

$\text{Tseq}(n) = (s+r)\cdot \text{Tseq}(n)$

$\text{Tpar}(p,n) \geq s \cdot \text{Tseq}(n) + \frac{r}{p} \cdot \text{Tseq}(n)$

The maximum speedup becomes very limited and the constant fraction acts as a "sequential overhead".

Amdahl’s Law (parallel version): Sequential bottleneck / overhead

Let the program $\text{Seq}$ contain two fractions, independent of $n$:

  • "perfectly parallelizable" fraction $r$
  • "purely sequential" fraction $s = (1-r)$ (cannot be parallelized at all)

The maximum achievable speedup (independent of $n$):

$S_p(n) = 1/s$

  • Proof

    $\text{Tseq}(n) = (s+r)\cdot \text{Tseq}(n)$

    $\text{Tpar}(p,n) = s \cdot \text{Tseq}(n) + \frac{r}{p} \cdot \text{Tseq}(n)$

    $S_p(n) = \text{Tseq}(n) / \text{Tpar}(p,n) = 1/(s+r/p)$

    $S_p(n) = 1/(s+r/p)$

    $\lim_{p \to \infty} S_p(n) = 1/s$
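
A minimal sketch of this bound in code, with a hypothetical sequential fraction of 5 % (the numbers are illustrative):

    #include <stdio.h>

    /* Amdahl's law: speedup with sequential fraction s on p processors. */
    double amdahl_speedup(double s, int p) {
        double r = 1.0 - s;              /* perfectly parallelizable fraction */
        return 1.0 / (s + r / p);
    }

    int main(void) {
        double s = 0.05;                 /* hypothetical: 5% purely sequential */
        int procs[] = {1, 4, 16, 64, 1024};
        for (int k = 0; k < 5; k++)
            printf("p = %5d: S_p = %6.2f (limit 1/s = %.1f)\n",
                   procs[k], amdahl_speedup(s, procs[k]), 1.0 / s);
        return 0;
    }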


We want to avoid Amdahl's law at all costs.

Typical sources: I/O, initialization of global data structures, shared data structures; everything that takes $O(n)$ time and work.

Examples

Sequential parts that could constitute a constant fraction:

  • Example: malloc
    // Sequential initialization: O(1)
    x = (int*)malloc(n*sizeof(int));
    ...
    // Parallelizable part
    do {
        for (i = 0; i < n; i++) {
            x[i] = f(i);
        }
        // check for convergence
        done = …;
    } while (!done);

    $K$ iterations before convergence.

    Assumptions: the convergence check and $\mathrm{f}(i)$ are in $O(1)$.

    $\text{Tseq}(n) = \overbrace{1 + K}^{\text{sequential part}} + \overbrace{K\cdot n}^{\text{for-loop}}$

    $\text{Tpar}(p,n) = 1 + K + K\cdot n/p$

    Sequential fraction:

    $s = (1-r) = 1 - \frac{K\cdot n}{1+K+K\cdot n/p} \approx 1/(1+n)$

    Speedup:

    $\lim_{p \to \infty,\ n > p,\ n \to \infty} S_p(n) = 1/s = p$ (not $\infty$)

    since $p$ is our upper bound

    👉
    A constant sequential part (not a constant fraction) does not limit the speedup.
  • Example: calloc
    // Sequential initialization: O(n)
    x = (int*)calloc(n, sizeof(int));
    ...
    // Parallelizable part
    do {
        for (i = 0; i < n; i++) {
            x[i] = f(i);
        }
        // check for convergence
        done = …;
    } while (!done);

    $K$ iterations before convergence.

    Assumptions: the convergence check and $\mathrm{f}(i)$ are in $O(1)$.

    $\text{Tseq}(n) = \overbrace{n + K}^{\text{sequential part}} + \overbrace{K\cdot n}^{\text{for-loop}}$

    $\text{Tpar}(p,n) = n + K + K\cdot n/p$

    Sequential fraction:

    $s = (1-r) = 1 - \frac{K\cdot n}{n+K+K\cdot n/p} \approx 1/(1+K)$

    Speedup:

    $\lim_{p \to \infty} S_p(n) = 1/s = 1+K$

    👉
    A sequential part that is a constant fraction of the total (here the $O(n)$ initialization) limits the speedup.

Scaled speedup

The strictly non-parallelizable part usually depends on $n$, as in $s(n)$, and decreases with rising $n$.

Scaled speedup

Let the program $\text{Seq}$ contain two fractions, dependent on $n$:

  • "perfectly parallelizable" fraction $T(n)$
  • "purely sequential" fraction $t(n) = (1-T(n))$

    This part decreases with rising $n$: $\lim_{n\to\infty} t(n) = 0$

We can therefore reach linear speedup by increasing $n$,

and the faster $t(n)/T(n)$ converges, the faster the speedup converges.

$t(n)$ should increase more slowly with $n$ than $T(n)$.

We get

$\text{Tseq}(n) = t(n) + T(n)$

$\text{Tpar}(p,n) = t(n) + T(n)/p$

$\lim_{n\to\infty} S_p(n) = p$

  • Proof

    We know that

    $\lim_{n \to \infty} t(n) = 0$

    $\lim_{n\to \infty} \frac{t(n)}{T(n)} = 0$

    Therefore

    $S_p(n) = \frac{\text{Tseq}(n)}{\text{Tpar}(p,n)} = \frac{t(n) + T(n)}{t(n) + T(n)/p} = \frac{t(n)/T(n) + 1}{t(n)/T(n) + 1/p}$

    $\lim_{n\to \infty} S_p(n) = \frac{0 + 1}{0 + 1/p} = \frac{1}{1/p} = p$

    $\lim_{n\to \infty} S_p(n) = p$


  • If the speedup is not linear

    $\text{Tpar}(p,n) = t(n) + T(n)/f(p)$

    with $f(p) < p$

  • Using $t(n,p)$ instead of $t(n)$

    Assuming the parallel algorithm is work-optimal, we can allow the non-parallelizable part to depend on $p$:

    $t(n,p) \in O(T(n))$ for fixed $p$

    $\text{Tseq}(n) \in O(T(n))$

    $\text{Tpar}(p,n) \in O(t(n, p) + T(n)/p)$

Smallest possible parallel time

$T_\infty(n) = t(n)$

Parallelism

$\text{Parallelism} = \frac{\text{Tpar}(1,n)}{T_\infty(n)} = \frac{t(n) + T(n)/1}{t(n)} = 1 + T(n)/t(n)$

A small $t(n)$ relative to $T(n)$ means large parallelism.

Cost

$C(n) = p\cdot \text{Tpar}(p,n) = p \cdot t(n) + T(n)$

Work

$W \in O(p \cdot \text{Tpar}(p,n))$

Gustafson-Barsis’ law (special case)

Gustafson-Barsis’ law

Assume

$T(n) = p \cdot t(n)$

then

$S_p(n) = \frac{\text{Tseq}(n)}{\text{Tpar}(p,n)} = \frac{t(n) + T(n)}{t(n) + T(n)/p} = \frac{t(n) + p\cdot t(n)}{t(n) + p\cdot t(n)/p} = \frac{t(n)\cdot(p+1)}{t(n)\cdot 2} = (p+1)/2$

$S_p(n) = (p+1)/2$
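
A minimal sketch of the scaled-speedup formula $S_p = \frac{t+T}{t+T/p}$: the special case $T = p\cdot t$ reproduces $(p+1)/2$, and a much larger parallel part approaches linear speedup. The values for $t$ are made up.

    #include <stdio.h>

    /* Scaled speedup for sequential part t and parallelizable part T on p
     * processors: S_p = (t + T) / (t + T/p). */
    double scaled_speedup(double t, double T, int p) {
        return (t + T) / (t + T / p);
    }

    int main(void) {
        int p = 16;
        double t = 1.0;                  /* arbitrary unit of sequential work */
        /* Gustafson-Barsis special case: parallel part grows as T = p * t.  */
        printf("T = p*t    : S_%d = %.2f (expected (p+1)/2 = %.1f)\n",
               p, scaled_speedup(t, p * t, p), (p + 1) / 2.0);
        /* Larger problems (T >> p*t) approach linear speedup S_p -> p.      */
        printf("T = 100*p*t: S_%d = %.2f\n",
               p, scaled_speedup(t, 100.0 * p * t, p));
        return 0;
    }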

Efficiency

Efficiency of a parallel algorithm

Ratio of perfect parallel time / actual parallel time

Ratio of sequential cost / parallel cost

$E(p,n) = \frac{\text{Tseq}(n)/p}{\text{Tpar}(p,n)} = S_p(n)/p = \frac{\text{Tseq}(n)}{p \cdot \text{Tpar}(p,n)}$


  • $E(p,n) \leq 1$

    because $S_p(n) = \frac{\text{Tseq}(n)}{\text{Tpar}(p,n)} \leq p$

  • $E(p,n) = c$ (constant)

    has linear speedup (because cost-optimal algorithms have constant efficiency)
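
A made-up numeric example: with $\text{Tseq}(n) = 100\,\text{ms}$, $p = 8$ and a measured $\text{Tpar}(8,n) = 20\,\text{ms}$, we get $S_8(n) = 100/20 = 5$ and $E(8,n) = 5/8 = 0.625$, i.e., on average the processors do useful work during 62.5 % of the reserved time.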

Scalability

If an algorithm does not have constant efficiency and its speedup depends on $n$,

we can aim to maintain some constant efficiency $e$ by

increasing the problem size $n$ with the number of processors $p$,

using the iso-efficiency function.

Weak scalability

A parallel algorithm scales weakly relative to a sequential algorithm

if there is a slowly growing iso-efficiency function $f$ such that:

$n \in \Omega(f(p)) \implies E_p(n) = e$


The growth of $f$ is given by the input-size scaling function $g$.

  • The average work per processor must be kept constant while increasing $n$:

    $w = \text{Tseq}(n)/p$

  • $\text{Tpar}(p,n)$ must remain constant

$g(p) = \text{Tseq}^{-1}(p\cdot w)$

$f(p) \in O(g(p))$


Exception to this rule:

If the sequential time is more than linear, then constant efficiency requires $n$ to grow faster than stated here.

In that case, constant work is maintained with decreasing efficiency.

Strong scalability

An algorithm is strongly scalable if $f(p) = O(1)$.

This is almost never the case: all algorithms are at best weakly scalable.

At least as much work is required as there are processors.

But often, constants and lower-order terms can safely be ignored; then the algorithm is strongly scalable for some range of $n$ and $p$.

Examples: Calculating the iso-efficiency function

  • Example 1

    What we know:

    • $\text{Tpar}(p,n) \in O(\frac{n^2}{p} + (\log n)^2)$
    • $\text{Tseq}(n) \in O(n^2)$
    • $\text{Wpar}(n) = \sum_{0\leq i <p} T_i(n) \in O(\text{Tseq}(n))$ because work-optimal

    We want a constant efficiency $e$ as $E(p,n)$:

    $e = \frac{\text{Tseq}(n)}{p \cdot \text{Tpar}(p,n)} = \frac{n^2}{p \cdot (\frac{n^2}{p} + (\log n)^2)} = \frac{n^2}{n^2 + p\cdot (\log n)^2}$

    $e = \frac{n^2}{n^2 + p\cdot (\log n)^2}$

    $e \cdot (n^2 + p\cdot (\log n)^2) = n^2$

    $e \cdot n^2 + e \cdot p\cdot (\log n)^2 = n^2$

    $e \cdot p\cdot (\log n)^2 = n^2 - e \cdot n^2$

    $e \cdot p\cdot (\log n)^2 = n^2 \cdot (1 - e)$

    $\cdots$ ($O$-constants normalized to 1)

    $\frac{n}{\log n} = \sqrt{e/(1-e)} \cdot \sqrt{p}$

    This means we can maintain efficiency of at least $e$ if:

    $\frac{n}{\log n} \geq \sqrt{e/(1-e)} \cdot \sqrt{p}$

  • Example 2

    What we know:

    • $\text{Tpar}(p,n) \in O(\frac{n^2}{p} + (\log p)^2)$
    • $\text{Tseq}(n) \in O(n^2)$
    • $\text{Wpar}(n) = \sum_{0\leq i <p} T_i(n) \in O(\text{Tseq}(n))$ because work-optimal

    We want a constant efficiency $e$ as $E(p,n)$:

    $e = \frac{\text{Tseq}(n)}{p \cdot \text{Tpar}(p,n)} = \frac{n^2}{p \cdot (\frac{n^2}{p} + (\log p)^2)} = \frac{n^2}{n^2 + p\cdot (\log p)^2}$

    $e = \frac{n^2}{n^2 + p\cdot (\log p)^2}$

    $e \cdot (n^2 + p\cdot (\log p)^2) = n^2$

    $e \cdot n^2 + e \cdot p\cdot (\log p)^2 = n^2$

    $e \cdot p\cdot (\log p)^2 = n^2 - e \cdot n^2$

    $e \cdot p\cdot (\log p)^2 = n^2 \cdot (1 - e)$

    $\cdots$ ($O$-constants normalized to 1)

    $n = \sqrt{e/(1-e)} \cdot \sqrt{p} \cdot \log p$

    This means we can maintain efficiency of at least $e$ if:

    $n \geq \sqrt{e/(1-e)} \cdot \sqrt{p} \cdot \log p$

  • Comparison of examples

    Both kinds of algorithms occur frequently; neither is easier to handle than the other.

    Example 1)

    $\text{Tpar}(p,n) \in O(\frac{n^2}{p} + (\log n)^2)$

    $\frac{n}{\log n} \geq \sqrt{e/(1-e)} \cdot \sqrt{p}$

    parallel overhead = a function of the problem size

    Example 2)

    $\text{Tpar}(p,n) \in O(\frac{n^2}{p} + (\log p)^2)$

    $n \geq \sqrt{e/(1-e)} \cdot \sqrt{p} \cdot \log p$

    parallel overhead = a function of the number of processors, caused by parallelization alone (a numeric sketch follows below)
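
A minimal numeric sketch of the iso-efficiency bound from Example 2, with $O$-constants normalized to 1 and a base-2 logarithm assumed: for a target efficiency $e$ it prints the smallest problem size $n$ that still maintains it.

    #include <stdio.h>
    #include <math.h>

    /* Smallest n maintaining efficiency e for p processors (Example 2):
     * n >= sqrt(e/(1-e)) * sqrt(p) * log2(p), O-constants normalized to 1. */
    double iso_n(double e, int p) {
        return sqrt(e / (1.0 - e)) * sqrt((double)p) * log2((double)p);
    }

    int main(void) {
        double e = 0.8;                  /* target efficiency (made up) */
        int procs[] = {16, 256, 4096};
        for (int k = 0; k < 3; k++)
            printf("p = %5d: keep n >= %.0f\n", procs[k], iso_n(e, procs[k]));
        return 0;
    }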

Scalability Analysis

https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial##LimitsCosts

How well does the parallel algorithm perform empirically against its sequential counterpart as we scale up?

Scalability analysis is both theoretical and practical.

Empirical speedup

= absolute speedup

Assumes that $\text{Tseq}(n)$ can be measured.

Not the case for very large $n, p$.

Weak scaling analysis

Goal: running a larger problem in the same amount of time.

  • $p$ is increased
  • $n$ is increased
  • Average work per processor stays fixed

The total problem size $n$ is proportional to the number of processors used.

Weakly scalable if $\text{Tpar}(p,n)$ remains constant as $p$ and $n$ are scaled up together.

Strong scaling analysis

Goal: running the same problem size faster.

  • $n$ is kept fixed
  • $p$ is increased

Strongly scalable up to $\text{Parallelism}$ cores if $\text{Tpar}(p,n)$ decreases proportionally to $p$.

This means linear speedup $S_p(n) \in \Theta(p)$, independent of (sufficiently large) $n$.
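
A minimal sketch of a strong scaling experiment with OpenMP: $n$ stays fixed, the thread count is swept, and the relative speedup $\text{Tpar}(1,n)/\text{Tpar}(p,n)$ is reported. The kernel (an array sum) and the sizes are made up; measured numbers will vary by machine.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* Fixed-size kernel: sum of n doubles, parallelized with a reduction. */
    static double kernel(const double *a, long n) {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)
        for (long i = 0; i < n; i++) s += a[i];
        return s;
    }

    int main(void) {
        const long n = 1L << 24;              /* n stays fixed (strong scaling) */
        double *a = malloc(n * sizeof *a);
        for (long i = 0; i < n; i++) a[i] = 1.0;

        int maxp = omp_get_max_threads();
        double t1 = 0.0;
        for (int p = 1; p <= maxp; p *= 2) {
            omp_set_num_threads(p);
            double t0 = omp_get_wtime();
            double s = kernel(a, n);
            double tp = omp_get_wtime() - t0;
            if (p == 1) t1 = tp;
            /* Strong scaling holds while Tpar(p,n) ~ Tpar(1,n)/p,
             * i.e. while the relative speedup t1/tp stays close to p. */
            printf("p = %2d: Tpar = %.4f s, SRel = %.2f (sum %.0f)\n",
                   p, tp, t1 / tp, s);
        }
        free(a);
        return 0;
    }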