Loops of independent iteration

Loops

Examples of data independent loops

  • Example 1: Addition
    for (i=0; i<n; i++) {
        a[i] = b[i]+c[i];
    }

  • Example 2: Matrix-Vector multiplication

    $m \times n$ matrix $A$ multiplied with vector $v$ into the output vector $u$.

    int F(int i, int n, int **A, int *v);  // declared below

    for (i=0; i<m; i++) {      // row i of A
        u[i] = F(i, n, A, v);  // n = number of columns of A = length of v
    }

    int F(int i, int n, int **A, int *v) { // dot product of row i of A with v
        int d = 0;
        for (int j = 0; j < n; j++) {
            d += A[i][j]*v[j];
        }
        return d;
    }

    Row-wise parallelization:

    $\text{Tpar}(p,n) \in O(mn/p + n)$

    $\text{Tseq}(n) \in O(mn)$

    Work optimal:

    $\text{Wpar}(n) = \sum_i T(F(i)) \in O(mn) = O(\text{Tseq}(n))$

    Load balancing:

    Perfectly load balanced if all blocks have the same size and $T(F(i)) \approx T(F(j))$ for all $i, j$.

    Otherwise there is a load balancing issue.
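
    A minimal sketch of how one processor would execute its block of rows, assuming $p$ processors and the F defined above (the function name matvec_block and the parameter names are illustrative, not from the source):

    int F(int i, int n, int **A, int *v);   // the row dot product from above

    // Rows assigned to processor r (0 <= r < p): a contiguous block of about m/p rows.
    // The last processor takes the remainder; each row costs O(n), so O(mn/p) per processor.
    void matvec_block(int r, int p, int m, int n, int **A, int *v, int *u) {
        int lo = r * (m / p);
        int hi = (r == p - 1) ? m : (r + 1) * (m / p);
        for (int i = lo; i < hi; i++)
            u[i] = F(i, n, A, v);
    }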

Data independence

Data independence: The result of an iteration does not depend on the result of any other iteration.

for (i=0; i<n; i++) {
    a[i] = F(i,n,…);
}

Function F must:

only depend on n, i and constants.

have no side effects.

Trivially parallelizable:

Divide the iteration space $0 \leq i < n$ into $p$ blocks of size $\approx n/p$.
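
One possible way to compute the block boundaries when $p$ does not divide $n$ evenly, spreading the remainder over the first $n \bmod p$ blocks (a sketch; the function names are assumptions):

// Iteration block of processor j: block_lo(j) <= i < block_hi(j).
// Block sizes differ by at most one even when p does not divide n.
int block_lo(int j, int n, int p) { return j * (n / p) + (j < n % p ? j : n % p); }
int block_hi(int j, int n, int p) { return block_lo(j + 1, n, p); }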

Parallelization

The division of the work / iteration space and assignment to processors can be:

explicit

less explicit, automatic

implicit, fully automatic

Terms like “loop” and “iteration” imply an order, but the point of parallel iteration is that the computations can be carried out in any order.

1) Explicit parallelization

Explicit, static mapping of computation to processor.

Data assumed to be available to each processor.

Each processor is explicitly assigned a specific block of iterations n[j] ≤ i < n[j+1].

for (i=n[j]; i<n[j+1]; i++) { // n[j] = j*(n/p)
    a[i] = F(i,n,…);
}
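
A minimal sketch of such an explicit, static mapping with POSIX threads, applied to the addition loop from Example 1 (the thread count, globals, and helper names are assumptions, not from the source):

#include <pthread.h>

#define P 4                      // number of processors/threads (assumed)

static int n;                    // problem size (assumed to be set elsewhere)
static double *a, *b, *c;        // arrays (assumed to be allocated elsewhere)

typedef struct { int lo, hi; } block_t;

// Each thread processes exactly the block [lo, hi) it was explicitly assigned.
static void *worker(void *arg) {
    block_t *blk = (block_t *)arg;
    for (int i = blk->lo; i < blk->hi; i++)
        a[i] = b[i] + c[i];
    return NULL;
}

void parallel_add(void) {
    pthread_t tid[P];
    block_t blk[P];
    for (int j = 0; j < P; j++) {
        blk[j].lo = j * (n / P);                           // n[j] = j*(n/p)
        blk[j].hi = (j == P - 1) ? n : (j + 1) * (n / P);  // last block takes the remainder
        pthread_create(&tid[j], NULL, worker, &blk[j]);
    }
    for (int j = 0; j < P; j++)
        pthread_join(tid[j], NULL);
}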

2) Less explicit, automatic parallelization

Iteration space marked as parallelizable with a programming model / interface.

Compiler assumes independence. (Dependency analysis is undecidable)

Possibly dynamic load balancing / rescheduling.

par (i=0; i<n; i++) { // alternatively: parallel for, forall
    a[i] = F(i,n,…);
}
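
For example, with OpenMP the loop is marked with a pragma and the compiler/runtime divides the iteration space among threads (a sketch using the addition loop from Example 1; assumes OpenMP support in the compiler):

// schedule(static) assigns fixed blocks; schedule(dynamic) would enable dynamic load balancing.
void add_par(int n, double *a, const double *b, const double *c) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}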

3) Implicit, fully automatic parallelization

This is only possible in special cases.

Automatic pattern parallelization by compiler.

for (i=0; i<n; i++) {
    a[i] = F(i,n,…);
}

Dependencies

Dependency analysis is undecidable.

Bernstein’s sufficient independence conditions

$P$ - Program fragment

$I$ - Input: variables read in $P$

$O$ - Output: variables written in $P$

$P_i \rightarrow P_j$ - Program order / program flow

Dependency

based on “Bernstein’s sufficient independence conditions”.

Assuming $P_i \rightarrow P_j$, the fragments $P_i, P_j$ are dependent and can’t be executed in parallel if any of the following applies:

(else they are independent and can be executed in parallel)

$O_i \cap I_j \ne \emptyset$ - True / flow dependency - an output value gets read

$O_j \cap I_i \ne \emptyset$ - Anti dependency - $P_j$ overwrites a value that $P_i$ reads, so the execution order must not change

$O_i \cap O_j \ne \emptyset$ - Output dependency - race condition if the fragments run concurrently
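
A tiny illustration of the three cases (the variable names are made up), with program order $P_1 \rightarrow P_2$ within each pair of statements:

void bernstein_examples(int a, int b) {
    int x, y, z;

    // True / flow dependency: O_1 ∩ I_2 = {x}; P_2 must see P_1's write
    x = a + b;   // P_1 writes x
    y = x * 2;   // P_2 reads x

    // Anti dependency: O_2 ∩ I_1 = {a}; reordering would let P_1 read the new a
    y = a + 1;   // P_1 reads a
    a = 5;       // P_2 writes a

    // Output dependency: O_1 ∩ O_2 = {z}; the final value depends on the order
    z = a;       // P_1 writes z
    z = b;       // P_2 writes z

    (void)x; (void)y; (void)z;   // silence unused-variable warnings
}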

Loop carried dependencies


Dependencies carried between loop iterations.

Such loops cannot be parallelized correctly without additional care, e.g. dependency isolation, synchronization, ...

Loop carried flow dependency

The assignment in iteration i depends on the result of the earlier iteration i-k.

for (i=k; i<n; i++) {
    a[i] = a[i-k] + a[i];   // reads a[i-k], which iteration i-k has already written
}

Loop carried anti-dependency

Can possibly be checked by compiler.

for (i=0; i<n-k; i++) {
    a[i] = a[i] + a[i+k];   // reads a[i+k], which iteration i+k will later overwrite
}
  • Elimination of anti-dependency

    Can sometimes be removed by introducing additional arrays

    for (i=0; i<n-k; i++) {
        a[i] = a[i] + a[i+k];
    }

    ↓

    for (i=0; i<n-k; i++) {
        b[i] = a[i] + a[i+k];   // writes go to b, so no iteration overwrites a value another one reads
    }

    Array b then has to be copied back into a, or the two arrays must be swapped (more efficient if the arrays are large).
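
    A minimal sketch of the swap variant, assuming a and b are accessed through pointers and the step is applied repeatedly (the function name is illustrative):

    // One step: write into *b, then swap the pointers - O(1) instead of an O(n) copy.
    void step_and_swap(double **a, double **b, int n, int k) {
        for (int i = 0; i < n - k; i++)
            (*b)[i] = (*a)[i] + (*a)[i + k];
        for (int i = n - k; i < n; i++)    // keep the last k elements consistent after the swap
            (*b)[i] = (*a)[i];
        double *tmp = *a; *a = *b; *b = tmp;
    }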

Loop carried output dependency

Cannot be checked by the compiler (halting problem).

for (i=1; i<n; i++) {
    if (isprime(i)) b[k] = a[i];   // every prime i writes to the same b[k]: which write survives depends on the order
}