Loops with independent iterations
Loops
Examples of data-independent loops
Example 1: Element-wise vector addition
for (i=0; i<n; i++) {
  a[i] = b[i] + c[i];
}
Example 2: Matrix-Vector multiplication
Matrix A (m × n) multiplied with vector v (length n) into output vector u (length m): u = A·v.
for (i=0; i<m; i++) {        // row i of A
  u[i] = F(i,n,A);           // n: total cols of A / rows of v
}

int F(int i, int n, int **A) {
  int d = 0;
  for (int j=0; j<n; j++) {  // dot product of row i of A with v
    d += A[i][j]*v[j];       // v: global input vector
  }
  return d;
}
Row-wise parallelization:
Work optimal: the parallel version performs the same total work as the sequential loop, O(m·n) operations in total.
Load balancing:
Perfectly load balanced if all blocks have the same size (m/p rows per processor) and the work per row is the same for all rows.
Load balancing issue if that is not the case.
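A minimal sketch of the row-wise division (the function name and the assumption that p divides m evenly are mine, not from the notes); processor j computes one contiguous block of rows, and v is the global input vector used inside F as above:

// Processor j of p computes rows j*(m/p) .. (j+1)*(m/p)-1.
void matvec_rows(int j, int p, int m, int n, int **A, int *u) {
  for (int i = j*(m/p); i < (j+1)*(m/p); i++) {
    u[i] = F(i, n, A);   // rows are mutually independent
  }
}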
Data independence
Data independence: the result of an iteration does not depend on the result of any other iteration.
for (i=0; i<n; i++) {
a[i] = F(i,n,…);
}
The function F must only depend on i, n and constants, and must have no side effects.
Trivially parallelizable:
Divide the iteration space into blocks of roughly n/p iterations for p processors.
Parallelization
The division of the work / iteration space and assignment to processors can be:
explicit
less explicit, automatic
implicit, fully automatic
Terms like “loop” and “iteration” imply an order, but the point of a parallel loop is that the computations can be carried out in any order.
1) Explicit parallelization
Explicit, static mapping of computation to processor.
Data assumed to be available to each processor.
Each processor gets assigned a specific block of iterations n[j] ≤ i < n[j+1] explicitly.
for (i=n[j]; i<n[j+1]; i++) { // n[j] = j*(n/p)
a[i] = F(i,n,…);
}
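The boundary formula n[j] = j*(n/p) assumes that p divides n. A minimal sketch (my own variant, not from the notes) of boundaries that spread the remainder n%p over the first blocks, so block sizes differ by at most one:

// Start index of block j when n is not necessarily divisible by p:
// the first n%p blocks receive one extra iteration.
int block_start(int j, int n, int p) {
  return j*(n/p) + (j < n%p ? j : n%p);
}
// Processor j then iterates over block_start(j,n,p) <= i < block_start(j+1,n,p).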
2) Less explicit, automatic parallelization
Iteration space marked as parallelizable with a programming model / interface.
Compiler assumes independence. (Dependency analysis is undecidable)
Possibly dynamic load balancing / rescheduling.
par (i=0; i<n; i++) { // alternatively: parallel for, forall
a[i] = F(i,n,…);
}
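As a concrete instance (OpenMP is one such programming model; the construct below is standard OpenMP, though not used in the notes), the addition loop from Example 1 can be marked parallel:

// The programmer asserts that the iterations are independent; the
// OpenMP runtime divides the iteration space among threads
// (dynamic load balancing is possible via schedule(dynamic)).
#pragma omp parallel for
for (int i = 0; i < n; i++) {
  a[i] = b[i] + c[i];
}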
3) Implicit, fully automatic parallelization
This is only possible in special cases.
Automatic pattern parallelization by compiler.
for (i=0; i<n; i++) {
a[i] = F(i,n,…);
}
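As a concrete illustration (a tooling remark added here, not part of the notes): GCC can be asked to attempt this on unchanged source with

gcc -O2 -ftree-parallelize-loops=4 program.c

The compiler must itself prove the iterations independent and otherwise falls back to sequential code, which is why this succeeds only in special cases.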
Dependencies
Dependency analysis is undecidable.
Bernstein’s sufficient independence conditions
Program fragments F1, F2 with:
Input I1, I2 - the variables read in F1, F2
Output O1, O2 - the variables written in F1, F2
Program order / program flow: F1 is executed before F2.
Dependency, based on “Bernstein’s sufficient independence conditions”:
Assuming F1 is executed before F2, the fragments are dependent and can’t be executed in parallel if any of the following applies (else they are independent and can be executed in parallel):
True / Flow dependency: O1 ∩ I2 ≠ ∅ - an output value of F1 gets read by F2.
Anti dependency: I1 ∩ O2 ≠ ∅ - F2 overwrites a value that F1 reads; only correct if the execution order doesn’t change.
Output dependency: O1 ∩ O2 ≠ ∅ - both fragments write the same variable; race condition if the fragments run concurrently.
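A small illustration (the statements and variable names are chosen for this example, not taken from the notes) of the three cases for two statements S1 and S2:

void bernstein_examples(int a, int b, int c, int d) {
  int x, y;
  /* True / flow dependency: O(S1) ∩ I(S2) = {x} */
  x = a + b;   /* S1 writes x */
  y = x * 2;   /* S2 reads x  */

  /* Anti dependency: I(S1) ∩ O(S2) = {a} */
  x = a + b;   /* S1 reads a      */
  a = y;       /* S2 overwrites a */

  /* Output dependency: O(S1) ∩ O(S2) = {x} */
  x = a + b;   /* S1 writes x      */
  x = c * d;   /* S2 also writes x */
}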
Loop carried dependencies
Dependencies carried between loop iterations.
Such loops cannot be parallelized correctly without additional care, e.g. dependency isolation, synchronization, ...
Loop carried flow dependency
The assignment in iteration i depends on the result of an earlier iteration i-k.
for (i=k; i<n; i++) {
  a[i] = a[i-k] + a[i];   // reads a[i-k], written by the earlier iteration i-k
}
Loop carried anti-dependency
Can possibly be checked by compiler.
for (i=0; i<n-k; i++) {
  a[i] = a[i] + a[i+k];   // reads a[i+k], which the later iteration i+k overwrites
}
Elimination of anti-dependency
Can sometimes be removed by introducing an additional array:
// original loop with anti-dependency
for (i=0; i<n-k; i++) {
  a[i] = a[i] + a[i+k];
}
// same computation writing into a new array b: no anti-dependency
for (i=0; i<n-k; i++) {
  b[i] = a[i] + a[i+k];
}
Array b then has to be copied back into a, or the two array pointers must be swapped (more efficient if the arrays are large).
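A minimal sketch of the swap variant, assuming a and b are pointer variables referring to two separately allocated arrays:

/* Swap the array pointers instead of copying n elements: the new a
 * holds the results, the old a can be reused as scratch space. O(1). */
int *tmp = a;
a = b;
b = tmp;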
Loop carried output dependency
Cannot be checked by the compiler (halting problem).
for (i=1; i<n; i++) {
  if (isprime(i)) b[k] = a[i];   // every prime i writes the same b[k]: output dependency
}