Work as a DAG

Work

Work to solve problem $P$ with input of size $n$:

can be described as a set of smaller tasks

executed in some order, constrained by dependencies.

Work

$\text{Wseq}(n) = \text{Tseq}(n)$: total number of executed instructions

$W(n)$ or $\text{Wpar}(n) = \sum_{0 \leq i < p} T_i(n)$: total time spent by all processors (not including idle / waiting times)

$\text{Wpar}(n) = \sum_{0 \leq i < k} W_i(n) \geq \text{Tseq}(n)$

$\text{Wpar}(n) \geq \text{Tseq}(n)$

If this were *not* the case, executing all tasks of the parallel algorithm one after another would give a sequential algorithm faster than $\text{Tseq}(n)$ → contradiction.

Work optimality

$\text{Wpar}(p,n) \in O(\text{Wseq}(n))$

This should hold for a well-designed parallel algorithm.

Fastest running time / fastest possible parallel

$T_\infty(n)$

= smallest possible running time of $\text{Par}$ with arbitrarily many processors

= smallest number of parallel steps in which the work can be done

Work-time relationship of parallel algorithms

The algorithm should be work-optimal.

Assuming that the load is balanced and the work is assigned evenly to the processors:

$\text{Tpar}(p,n) \in O(\text{Wpar}(p,n)/p + T_\infty(n))$

Work as DAG (Directed Acyclic Graph)

Computation can be structured as a tree / directed acyclic graph (DAG).

Task $t_i$

a sequential computation (strand)

requires input data (from other tasks), or else it cannot be executed

produces output data (to other tasks)

Work as DAG

Work is represented by a directed acyclic graph $G = (V, E)$

$V$: vertices, the task set $\{t_1, t_2, \dots, t_n\}$

the root is called the start node

a leaf is called a final node

nodes with no (remaining) incoming dependencies are called ready nodes

$E$: data dependencies $t_i \to t_j$

means that $t_j$ cannot be executed before $t_i$ has completed (dependency).

Transferring the output of one task to the input of another incurs a communication cost.

$t_j$ is dependent on $t_i$ if there is a path from $t_i$ to $t_j$.


Node: task $t_i$ / strand / work package

Input nodes: no incoming edges

Output nodes: no outgoing edges

Dependence: when there exists a path between the two tasks

Parallel execution: tasks are scheduled to processors with respect to the dependencies

$\text{Tp}(n)$ or $\text{Tpar}(p,n)$: time for $p$ processors with a given schedule

$W_i(n)$: work of $t_i$, i.e. its number of sequential operations

$W(n) = \sum_{0 \leq i < k} W_i(n)$: total work, sum over all nodes (ignoring communication costs)

Depth / Span / Fastest possible parallel

Execution with unlimited number of processors.

Number of operations in the heaviest path from an input to an output node:

$T_\infty(n) = \sum_{i \in \text{heaviest path}} W_i(n)$

in a good DAG, $T_\infty(n)$ should not be a constant fraction of $W(n)$ (it should be much smaller)

Therefore:

$\text{Tpar}(p,n) \geq T_\infty(n)$
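Work and depth can be computed directly from a DAG. A minimal sketch (task names, weights, and edges are made-up example values): $W(n)$ is the sum over all nodes, $T_\infty(n)$ the heaviest path.

```python
# Sketch: work W(n) and span T_inf(n) of a small task DAG.
# Task names, weights, and edges are made-up example values.
from functools import lru_cache

work_of = {"a": 1, "b": 3, "c": 2, "d": 1}                    # W_i per task
succs = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}    # edges t_i -> t_j

W = sum(work_of.values())          # total work: sum over all nodes

@lru_cache(maxsize=None)
def span_from(t):
    # weight of the heaviest path starting at task t
    return work_of[t] + max((span_from(s) for s in succs[t]), default=0)

T_inf = max(span_from(t) for t in work_of)   # heaviest input-to-output path
# here W == 7 and T_inf == 5 (path a -> b -> d)
```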

Lower bounds

Work law: $\text{Tpar}(p,n) \geq \text{Wpar}(n)/p \geq \text{Tseq}(n)/p$

Depth law: $\text{Tpar}(p,n) \geq T_\infty(n)$

Max speedup: $\text{Wpar}(n)/T_\infty(n)$, also called parallelism

$S_p(n) = \text{Tseq}(n)/\text{Tpar}(p,n) \leq \text{Tseq}(n)/T_\infty(n) \leq \text{Wpar}(n)/T_\infty(n)$
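A quick numeric sanity check of these bounds (all numbers are made up): with $\text{Wpar} = 1000$ and $T_\infty = 10$ the parallelism is $100$, and the achievable speedup is capped by both $p$ and the parallelism.

```python
# Made-up numbers illustrating the speedup bounds.
Tseq, Wpar, T_inf = 1000, 1000, 10      # work-optimal case: Wpar == Tseq

parallelism = Wpar / T_inf              # max possible speedup = 100.0

for p in (4, 64, 1024):
    Tpar_best = max(Wpar / p, T_inf)    # best time allowed by work + depth laws
    speedup = Tseq / Tpar_best
    assert speedup <= min(p, parallelism)
```

Adding processors beyond the parallelism (here, beyond $p = 100$) cannot improve the speedup any further.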

Independence

when there is no path from $t_i$ to $t_j$ and vice versa.

independent tasks can be executed in parallel.

Execution (= topological order)

Repeat until all tasks are executed:

  1. Pick a task that is ready (no incoming edges / dependencies)
  2. Execute the task on some processor
  3. Remove the edges to all of its children (successor tasks may become available “ready nodes”)

    “Ready nodes” can be executed on available processors.

  4. Remove the task
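The loop above is exactly a topological-order traversal (Kahn's algorithm). A minimal sketch on a made-up DAG, where "executing" a task just records it:

```python
# Execute a made-up DAG in topological order (Kahn's algorithm).
from collections import deque

succs = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
indeg = {t: 0 for t in succs}            # number of unfinished predecessors
for t in succs:
    for child in succs[t]:
        indeg[child] += 1

ready = deque(t for t in succs if indeg[t] == 0)   # initial ready nodes
order = []
while ready:
    t = ready.popleft()          # 1. pick a ready task
    order.append(t)              # 2. "execute" it
    for child in succs[t]:       # 3. remove edges to its children
        indeg[child] -= 1
        if indeg[child] == 0:    #    children with no remaining deps become ready
            ready.append(child)
# order == ["a", "b", "c", "d"], a valid topological order
```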

Best possible execution time

$W(n)/p$

therefore $\text{Tpar}(p,n) \geq \text{Wpar}(n)/p \geq \text{Tseq}(n)/p$

DAG Design guidelines

“Work-depth analysis” gives more informative upper bounds on speedup than naive Amdahl's law:

  • Look at the “critical path” instead of just the “sequential fraction”.

    Design for a light critical path $T_\infty$ compared to the total work $W$.

  • Find ways to specify good DAGs by:

    keeping $\text{Wpar} = \text{Tseq}$ (work optimality)

    using a small $T_\infty$

  • Find an efficient way to schedule DAG nodes:

    number of parallel steps $\approx \max(W(n)/p,\ T_\infty)$

Scheduling Work-DAGs

Scheduling

= finding a ready task $t_i$ to assign to a processor $k \in [0,\ p-1]$ at some start time $s_i$.

respect dependencies: don't start a task before all of its predecessors have completed

respect processors: assign only one task to a processor at a time

Scheduling problem (NP-hard)

= finding the optimal schedule for some optimization criterion,

e.g. minimum completion time of the last task

Scheduling on a single processor

Execute in topological order

Special cases

Fork-join parallelism (OpenMP programming model)

sequential control flow

fork: spawning new parallel tasks

join: joining back to the sequential control flow
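The fork-join pattern can be sketched as follows. OpenMP itself targets C/C++/Fortran; this mimics the same structure with a Python thread pool purely as an illustration (the `square` task and the input data are made up).

```python
# Fork-join sketch: sequential flow, fork parallel tasks, join back.
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

data = [1, 2, 3, 4]

# ... sequential control flow ...
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(square, x) for x in data]  # fork: spawn parallel tasks
    results = [f.result() for f in futures]           # join: wait for all of them
# ... sequential control flow resumes; results == [1, 4, 9, 16]
```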

Linear pipeline

$d$: pipeline depth (number of stages)

$m$: number of iterations
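Assuming unit-time stages and one processor per stage, the pipeline's timing can be worked out directly ($d$ and $m$ are made-up values):

```python
# Timing of a linear pipeline with unit-time stages (made-up d and m).
d, m = 4, 10

W = d * m                 # every iteration passes through every stage
# With one processor per stage, iteration i leaves the last stage at step i + d,
# so all m iterations are finished after:
T_pipeline = m + d - 1    # 13 steps here

speedup = W / T_pipeline  # approaches d as m grows
assert speedup < d
```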

Greedy DAG scheduling

Greedy DAG scheduler

Goal: assign as many ready tasks as possible to the available processors.

“Complete step”: at least $p$ tasks are ready or being executed (otherwise the step is called incomplete).

Theorem

The greedy DAG scheduler executes a DAG with work $W$ and depth $T_\infty$ in:

$\text{Tpar}(p,n) \leq W/p + T_\infty$ time steps

Based on the other laws:

$\text{Tpar}(p,n) \leq W/p + T_\infty \leq 2 \cdot \max(W/p,\ T_\infty)$

  • Work law: $\text{Tpar}(p,n) \geq W/p$
  • Depth law: $\text{Tpar}(p,n) \geq T_\infty(n)$

The achieved $\text{Tpar}(p,n)$ is therefore always within a factor of $2$ of the optimal schedule (with minimal completion time as the optimization criterion).
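The theorem can be checked by simulating a greedy schedule on a small unit-work DAG (the example DAG and $p$ are made up):

```python
# Simulate greedy scheduling of a unit-work DAG on p processors and
# check Tpar <= W/p + T_inf (example DAG and p are made up).
from functools import lru_cache

succs = {0: [2, 3], 1: [3], 2: [4], 3: [4], 4: []}
p = 2

indeg = {t: 0 for t in succs}
for t in succs:
    for c in succs[t]:
        indeg[c] += 1

W = len(succs)                       # unit work per task

@lru_cache(maxsize=None)
def span(t):                         # longest path (in nodes) starting at t
    return 1 + max((span(c) for c in succs[t]), default=0)
T_inf = max(span(t) for t in succs)

ready = [t for t in succs if indeg[t] == 0]
steps = 0
while ready:
    batch, ready = ready[:p], ready[p:]   # greedy: run up to p ready tasks
    steps += 1
    for t in batch:
        for c in succs[t]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)

assert steps <= W / p + T_inf        # greedy theorem: here 3 <= 5.5
```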