pthreads

Definition

Process

Program in execution

Thread

Used for thread parallel programming.

Smallest independent stream of instructions that can be scheduled by the OS.

Each thread has its own stack (local variables), registers, and thread-local storage.

If scheduled by the OS (an OS thread), it lives inside an OS process and shares the process's global state (file pointers, global variables, static variables, heap storage).

POSIX threads, pthreads

POSIX = Portable Operating System Interface for uniX

Low level interface for thread programming in C/Unix

Has no cost model, no memory model, ...
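A minimal fork-join sketch with the core pthreads calls (pthread_create, pthread_join are the real API; the hello function and its argument are just for illustration):

#include <pthread.h>
#include <stdio.h>

// thread code is an ordinary function with this fixed signature
void *hello(void *arg) {
  printf("hello from thread %ld\n", (long)arg);
  return NULL;
}

int main(void) {
  pthread_t tid;
  pthread_create(&tid, NULL, hello, (void*)1L); // fork: spawn a new thread
  pthread_join(tid, NULL);                      // join: wait for it to terminate
  return 0;
}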

Programming model

(p)threads programming model

  1. Fork-join parallelism

    A thread can spawn (start) any number of new threads and wait for them to terminate.

  2. Master thread

    There is initially only one active main thread.

    Code for threads is given as functions. All threads execute within the same program (SPMD) but can execute different functions.

  3. Peer threads

    A thread can wait for another “peer thread” to terminate.

  4. Parallelism

    Threads are scheduled by the underlying system and may or may not run simultaneously on different processors.

  5. Synchronization between threads

    There is no implicit synchronization between threads - they progress independently and asynchronously.

  6. Threads share process-global data
  7. Coordination mechanisms

    for protecting access to shared variables (locks, condition variables).

    All concurrent updates must be protected; otherwise the program is illegal and its outcome undefined.

  8. …

Pragmatics for parallel computing

Threads can be scheduled and bound to specific cores (non-standard).
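Binding is not part of POSIX; on Linux, a sketch with the GNU extension pthread_setaffinity_np could look like this:

#define _GNU_SOURCE // pthread_setaffinity_np is a GNU extension
#include <pthread.h>
#include <sched.h>

// pin the calling thread to one core (Linux-only sketch)
void bind_to_core(int core) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}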

Symmetric Multi-Processor SMP

💡
Don’t confuse symmetric multi-processor SMP with symmetric multi-processing SMP.
  • Single OS
  • all cores treated equally
  • threads not bound to cores
  • there can be more threads than cores

Explicit parallelization of data parallel loop

Performance of spawning can be increased by spawning threads recursively (tree-like) instead of one by one in a loop.

start = rank*n/p;
end = (rank+1)*n/p;
for (i=start; i<end; i++) {
  a[i] = f(i);
}

rank is the virtual thread id and is set by the spawning thread.

  • implementation (in depth)

    Using the argument struct

    typedef struct {
      int *array;
      int n;
      int rank;
      int size;
    } array_t;

    void *loopblock(void *what) {
      array_t *ix = (array_t*)what;
      int *a = ix->array;
      int i;
      int n = ix->n;
      int p = ix->size; // no. of threads
      int start = ix->rank*(n/p);
      int end = start+(n/p);
      for (i=start; i<end; i++) {
        a[i] = f(i);
      }
      return NULL;
    }
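    A hypothetical driver for loopblock (the names P, N, A, threads, args are assumptions, not from the lecture); note that each thread needs its own argument struct:

    pthread_t threads[P];
    array_t args[P];
    for (int r = 0; r < P; r++) {
      // each thread gets its own args struct (sharing one would race on rank)
      args[r] = (array_t){ .array = A, .n = N, .rank = r, .size = P };
      pthread_create(&threads[r], NULL, loopblock, &args[r]);
    }
    for (int r = 0; r < P; r++)
      pthread_join(threads[r], NULL); // wait until all blocks are done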

  • example: matrix-vector product

    $\small \vec y = X \cdot \vec a$

    Sequential solution

    for (i=0; i<n; i++) {
      y[i] = 0;
      for (j=0; j<m; j++) {
        y[i] += x[i][j]*a[j];
      }
    }

    Parallel solution

    • using cache ineffectively

      Thread with id rank handles every p‘th row of matrix x

      for (i=rank; i<n; i+=p) {
        y[i] = 0;
        for (j=0; j<m; j++) {
          y[i] += x[i][j]*a[j];
        }
      }

      But this way the cache is used ineffectively: adjacent elements of y are written by different threads, so threads read and write the same cache lines at the same time (false sharing).

    Better: block-wise distribution uses the cache effectively.

    for (i=rank*n/p; i<(rank+1)*n/p; i++) {
      y[i] = 0;
      for (j=0; j<m; j++) {
        y[i] += x[i][j]*a[j];
      }
    }

Synchronization

Synchronization / coordination constructs (from concurrent computing) prevent malicious non-determinism (ie. race conditions).

Allow solving algorithms under weak memory consistency assumptions.

Race condition

Outcome of parallel program execution depends on the relative timing of the updates by processors/threads (in an undesired way).

Data race

Several threads (or cores, processes, …) compete for updating a shared variable.

Outcome depends on relative timing.

Deadlock

Two or more threads are in a situation where they depend on each other and cannot progress.

A deadlock spreads to all threads that depend on the deadlocked ones, and eventually the program cannot terminate.

  • example
    pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;
    …
    // thread 0
    pthread_mutex_lock(&lock1);
    pthread_mutex_lock(&lock2);
    …
    // thread 1
    pthread_mutex_lock(&lock2);
    pthread_mutex_lock(&lock1);
    …

Shared resources

Simple variables, data structures, devices, ... accessible by multiple threads

Critical section

Code that manipulates shared resources which must not be manipulated concurrently by other entities (threads, processes, ...).

Mutual exclusion property

There is at most one thread in the critical section at any given time.

Safety

pthreads memory model for “safe” programs

Threads are not allowed to update shared variables outside of critical sections.

(Lock constructs used to ensure consistent memory views)

Thread safe functions

Function that can be executed concurrently by any number of threads and will always produce the correct result.

  • Non-thread-safe functions have side effects:
    • Don’t protect (write) access to shared (ie. static) variables
    • Keep state over successive invocations
    • Return pointers to static variables
    • Call thread-unsafe functions

      these unsafe functions can also be system functions:

      ie. rand() , random() keep internal state in static variables

      solution: using random_r() with an explicit state, locking, ...

      problem: serialization can lead to slowdown, deadlocks, ...

  • Example: concurrent read and thread-safety

    Where x is the input.

    a = f(x); // thread 0
    b = f(x); // thread 1
    c = f(y); // thread 2

    Conclusion: There is a concurrent read of x.

    This program is only safe for concurrent execution if f has no side effects (is thread-safe).

Wait free vs. lock free

Wait-freedom ⇒ lock-freedom

(but lock-freedom ⇏ wait-freedom)

Both require hardware support (ie. an atomic counter).

Wait-free algorithm on concurrent data

Each thread can always make progress, no matter what the other threads are doing.

If a thread tries to execute the operation, progress is guaranteed and does not depend on other threads or their attempts to execute the same operation.

Lock-free algorithm on concurrent data

If several threads are attempting to execute the operation, at least one of the threads makes progress on the operation.

Synchronization constructs

Slow implementation

These constructs were developed for small numbers of concurrent tasks, not scalable HPC (many threads on many cores).

Most of them have sequential parts that limit speedup (Amdahl’s Law).

Locks

Locks

= shared object between threads that guarantees mutual exclusion

called “mutex” in pthreads

States of locks

2 possible states:

  • acquired (locked)
  • released (unlocked)

Only one thread at a time can acquire the lock, and it must release the lock after use.

If another thread tries to acquire an already locked lock, it is blocked and must wait until the lock’s release.

Properties of locks

Deadlock freedom: Some thread will always succeed in acquiring the lock.

Starvation freedom: If a thread is trying to acquire the lock, it will eventually succeed.

Fairness: First come, first served - the thread that started acquiring first gets the lock first.

Bounded waiting: Bounded time / number of steps after which a thread must succeed in acquiring the lock it is trying to acquire.


By definition, locks are not lock-free, not wait-free.

A thread that acquired the lock can keep it indefinitely while no other thread makes progress.

pthreads rule

Whatever thread $\small T_1$ can view in memory when releasing the lock, thread $\small T_2$ can view when acquiring the lock.

Properties of pthread locks

Not fair, not starvation-free, no bounded waiting - pthread locks are deadlock-free but not starvation-free.

Threads can starve each other by never allowing the others to enter the critical section.

There is no guarantee that a thread will eventually acquire the lock it wants.

Lock pattern

// initially
flag = 1; x = … ;

// thread 0
lock(&lock);
if (flag) {
  a = x;
  flag = 0;
}
unlock(&lock);

// thread 1
lock(&lock);
if (flag) {
  a = x;
  flag = 0;
}
unlock(&lock);

// thread 2
lock(&lock);
if (flag) {
  a = x;
  flag = 0;
}
unlock(&lock);

Example: using locks

  • unsafe program: race condition

    initial value: x = 0;

    // thread 0
    if (flag) {
      a = x;
      flag = 0;
    }

    // thread 1
    if (flag) {
      a = x;
      flag = 0;
    }

    // thread 2
    flag = 1;
    x = 27;

    Conclusion: race condition - What happens depends on relative timing.

    • Either a is some unintended value of x (before it was initialized) or 27.
    • Both threads might enter the critical section.

  • safe program (implements pattern)

    what is the intended value of a for thread0, thread1?

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    x = 0;

    // thread 0
    lock(&lock);
    if (flag) {
      a = x;
      flag = 0;
    }
    unlock(&lock);

    // thread 1
    lock(&lock);
    if (flag) {
      a = x;
      flag = 0;
    }
    unlock(&lock);

    // thread 2
    lock(&lock);
    flag = 1;
    x = 27;
    unlock(&lock);

    Conclusion:

    Both read and write accesses to x now protected by lock (mutex).

    Mutual exclusion, consistent memory ensured by locking.

  • unsafe program: deadlock

    what is the intended value of a for thread0, thread1?

    pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;
    x = 0;

    // thread 0
    lock(&lock1); // executed
    lock(&lock2); // <-- needs unlock of lock2 from thread1 to progress
    if (…)
    unlock(&lock2);
    unlock(&lock1);

    // thread 1
    lock(&lock2); // executed
    lock(&lock1); // <-- needs unlock of lock1 from thread0 to progress
    if (…)
    unlock(&lock1);
    unlock(&lock2);

    // thread 2
    lock(&lock1);
    flag = 1;
    x = 27;
    unlock(&lock1);

    Conclusion: Deadlock

    A deadlock occurs if thread0 and thread1 try to progress “simultaneously”, each alternately executing one instruction.

Example: using global locks

  • List processing: locking the entire list

    delete(e) operation:

    1. lock(listlock)
    2. scan sequentially until element e is found
    3. unlink / free e from the list via its predecessor
    4. unlock(listlock)


    This is inefficient: a thread using the list blocks all other threads for $\small O(n)$ steps for a list of size $\small n$.

  • List processing: locking each element individually

    insert(e) operation:

    We want to insert element e between e1 and e2 .

    We will need to lock e1 and e2 for this.


    delete(e) operation:

    We want to delete element e between e1 and e2 .

    We will need to lock e1 and e2 for this.


    This solution is expensive (needs space for a lock per element).

    Deadlocks can still occur - we don’t know how the code for every thread is written → locks are not composable.

    Assume the operations are written by 2 different programmers:

    • Insert: first lock e1 , then e2
    • Delete: first lock e2 , then e1

    Then executing both concurrently can produce a deadlock. Agreeing on one global lock order avoids this, as in the sketch below.
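    A sketch of such an agreed order, assuming a singly linked list with one mutex per element (elem_t and insert_after are hypothetical names, not from the lecture). Both programmers lock the predecessor first:

    #include <pthread.h>

    typedef struct elem {
      int key;
      struct elem *next;
      pthread_mutex_t lock; // one lock per element
    } elem_t;

    // insert n after e1 (assumes e1 has a successor, e.g. a tail sentinel);
    // agreed global order: predecessor first, then successor
    void insert_after(elem_t *e1, elem_t *n) {
      pthread_mutex_lock(&e1->lock);       // lock predecessor first
      pthread_mutex_lock(&e1->next->lock); // then its successor
      n->next = e1->next;
      e1->next = n;
      pthread_mutex_unlock(&n->next->lock);
      pthread_mutex_unlock(&e1->lock);
    }

    A delete written by the second programmer must lock in the same order (predecessor before successor) for the deadlock to be impossible.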

Example: using read/write locks

  • Reading an arbitrary value

    pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
    flag = false;

    // thread 0
    rdlock(&lock);
    if (flag) {
      a = x;
    }
    unlock(&lock);

    // thread 1
    rdlock(&lock);
    if (flag) {
      b = x;
    }
    unlock(&lock);

    // thread 2
    wrlock(&lock);
    x = c;
    flag = true;
    unlock(&lock);

    Conclusion

    Thread0, thread1 can both enter their critical section simultaneously (read lock).

    Thread2 is always alone in its critical section (write lock).

    Therefore thread0 and thread1 read either the old value of x or the new value c set by thread2.

  • Reading the newly assigned value (deadlock possible) → can be solved with condition variables

    pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
    flag = false;

    // thread 0
    rdlock(&lock);
    while (!flag) {} // active wait
    a = x;
    unlock(&lock);

    // thread 1
    rdlock(&lock);
    while (!flag) {} // active wait
    b = x;
    unlock(&lock);

    // thread 2
    wrlock(&lock);
    x = c;
    flag = true;
    unlock(&lock);

    Same as above, but the reading threads spin until the condition is met (the while loop).

    Problem

    Active waiting is inefficient.

    Deadlock is possible: a reader that acquires the read lock before thread2 has set flag spins forever while holding the lock, so thread2 can never acquire the write lock.

    Solution

    Condition variables

Using locks to avoid deadlocks

The best solution is avoiding locks altogether by replicating shared information.

  • Use one lock at a time
  • If not possible: always acquire and release in the same order (this is difficult to do with libraries, because we don’t know how they are implemented)

    Can be done by assigning a level to each lock, sorting locks by their IDs, or using “trylocks” (if a lock can’t be acquired, release all held locks, wait, try again) - a sketch follows.
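    A sketch of the trylock variant (lockA, lockB, acquire_both are hypothetical names); pthread_mutex_trylock is the real pthreads call, returning 0 on success instead of blocking:

    #include <pthread.h>
    #include <sched.h>

    pthread_mutex_t lockA = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t lockB = PTHREAD_MUTEX_INITIALIZER;

    void acquire_both(void) {
      for (;;) {
        pthread_mutex_lock(&lockA);
        if (pthread_mutex_trylock(&lockB) == 0)
          return; // got both locks; caller releases them after use
        pthread_mutex_unlock(&lockA); // back off: release everything
        sched_yield(); // wait, then try again
      }
    }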

Condition variables

See above for use-case example

Condition variables / Monitor

Allows thread to temporarily release lock and wait (suspend) for condition-signal with a condition variable .

Different from semaphore object (which is not part of standard pthreads)

Waiting for signal inside critical section on condition variable:

  1. Thread acquires lock and enters critical section.
  2. Thread suspends itself and relinquishes (temporarily releases) the lock - it stays suspended inside the critical section until the condition is met.
  3. Thread is resumed (woken up) later by a signal from some other signaling thread.

    When it resumes, it has re-acquired the lock.

Mutual exclusion

Mutual exclusion still guaranteed.

A suspended thread is only resumed once the signaling thread has left the critical section.

Risk of deadlock

if threads mutually wait on some condition but there are no signaling threads.

weird bug in pthreads

There can be unwanted (spurious) wakeups by the runtime system, without an explicit signal from another thread.

Good practice: rechecking whether wait-condition is fulfilled - even after waking up.

Standard pattern

Thread waiting for condition (signalled thread)

lock(&lock);
// critical section ::
while (!condition()) {
  pthread_cond_wait(&cond, &lock); // wait and release lock temporarily
}
// thread has been signaled, condition holds, lock is acquired again
// :: critical section
unlock(&lock);

Thread broadcasting signal (signaling thread) — can be done inside or outside the critical section

  • Inside

    lock(&lock);
    // critical section ::
    ... // establishing condition
    cond_signal(&cond); // signal if condition holds
    // :: critical section
    unlock(&lock);

    • Advantage: Condition is established inside the critical section → it should still hold when the suspended thread is resumed.
    • Disadvantage: Suspended thread will block on the mutex again - until the signaling thread leaves the section.
  • Outside

    lock(&lock);
    ... // establishing condition (in critical section)
    unlock(&lock);
    cond_signal(&cond); // signal

    • Advantage: Suspended thread can immediately continue without blocking.

Example

  • Readers-writers lock (with condition variables and locks) → unfair
    typedef struct {
      int readers; // active readers
      int writer;  // (boolean) active writer
      int waiting; // pending writers (writers in queue)
      pthread_cond_t read_ok;  // condition: reading allowed
      pthread_cond_t write_ok; // condition: writing allowed
      pthread_mutex_t gateway; // lock
    } rwlock_t;

    void rwlock_rlock(rwlock_t *rwlock) {
      pthread_mutex_lock(&rwlock->gateway); // acquire gateway lock
      while (rwlock->waiting > 0 || rwlock->writer) {
        // writers exist or wait - release gateway lock and wait for read_ok signal
        pthread_cond_wait(&rwlock->read_ok, &rwlock->gateway);
      }
      rwlock->readers++; // increment readers
      pthread_mutex_unlock(&rwlock->gateway); // release gateway lock
    }

    void rwlock_wlock(rwlock_t *rwlock) {
      pthread_mutex_lock(&rwlock->gateway); // acquire gateway lock
      rwlock->waiting++; // pending write
      while (rwlock->readers > 0 || rwlock->writer) {
        // readers or writer active - release gateway lock and wait for write_ok signal
        pthread_cond_wait(&rwlock->write_ok, &rwlock->gateway);
      }
      rwlock->waiting--; // remove writer from queue
      rwlock->writer = 1; // active writer == true
      pthread_mutex_unlock(&rwlock->gateway); // release gateway lock
    }

    void rwlock_unlock(rwlock_t *rwlock) {
      pthread_mutex_lock(&rwlock->gateway); // acquire gateway lock
      // remove active writer or reader
      if (rwlock->writer) {
        rwlock->writer = 0;
      } else {
        rwlock->readers--;
      }
      // send signal: resume threads waiting to acquire
      if (rwlock->readers == 0 && rwlock->waiting > 0) {
        pthread_cond_signal(&rwlock->write_ok); // no readers, waiting writer
      } else {
        pthread_cond_broadcast(&rwlock->read_ok); // readers allowed / no waiting writer
      }
      pthread_mutex_unlock(&rwlock->gateway); // release gateway lock
    }

    The signal in the release function could also be sent outside the critical section, but that would change the semantics (readers/waiting counts can be changed by other threads after the unlock).

    This way, we send the signal but keep the lock - therefore the signaled threads block again briefly when they wake up.

    Correctness can be proven using invariants.

    Problem: Unfairness

    Threads acquiring write can starve threads wanting to acquire read.

    • Newer writer can starve older writer
    • Newer reader can acquire lock before older reader or writer

  • Barrier synchronization (with condition variables and locks) → inefficient

    Each thread executing <barrier> must wait until all or some number of threads have also executed <barrier> .

    typedef struct {
      int tc; // thread count
      pthread_cond_t barrier_ok;
      pthread_mutex_t barwait;
    } barrier_t;

    void barrier(barrier_t *b, int total_tc) {
      pthread_mutex_lock(&b->barwait); // acquire barwait lock
      b->tc++;
      if (b->tc == total_tc) { // threshold reached
        b->tc = 0;
        pthread_cond_broadcast(&b->barrier_ok);
      } else { // wait for barrier_ok signal (releases and re-acquires barwait)
        pthread_cond_wait(&b->barrier_ok, &b->barwait);
      }
      pthread_mutex_unlock(&b->barwait); // release barwait lock
    }

    Problems

    • Not scalable: $\small O(p)$
    • Probably not safe on spurious (unexpected) wake ups

    Solution: Tree structured barrier with an extra flag
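    A sketch of the “extra flag” as a generation counter (barrier2_t is a hypothetical name). Waiters recheck the generation in a loop, which makes the wait safe against spurious wakeups and the barrier reusable; the tree structure would additionally remove the $\small O(p)$ bottleneck:

    #include <pthread.h>

    typedef struct {
      int tc;  // threads arrived in the current round
      int gen; // round number: the "extra flag"
      pthread_cond_t barrier_ok;
      pthread_mutex_t barwait;
    } barrier2_t;

    void barrier2(barrier2_t *b, int total_tc) {
      pthread_mutex_lock(&b->barwait);
      int my_gen = b->gen;
      if (++b->tc == total_tc) { // last thread opens the barrier
        b->tc = 0;
        b->gen++; // start the next round
        pthread_cond_broadcast(&b->barrier_ok);
      } else {
        while (b->gen == my_gen) // recheck: ignores spurious wakeups
          pthread_cond_wait(&b->barrier_ok, &b->barwait);
      }
      pthread_mutex_unlock(&b->barwait);
    }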

Barriers

Barriers

Each thread executing <barrier> must wait until all or some number of threads have also executed <barrier> .

Semaphores

Semaphore (by Dijkstra)

Concurrent counter object with two operations:

up(x): atomic increment → wakes a waiting thread if the counter was negative (signal, post, V)

down(x): atomic decrement → blocks the caller if the counter turns negative (wait, P)

If there is no thread in the queue, signals are not lost (contrary to condition variables) - they are remembered in the counter.
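POSIX semaphores live outside pthreads in <semaphore.h>; a minimal sketch (note that the POSIX counter blocks at zero rather than going negative):

#include <semaphore.h>

sem_t s;

void semaphore_demo(void) {
  sem_init(&s, 0, 1); // counter = 1; middle arg 0: shared between threads only
  sem_wait(&s);       // down / P: decrement, block while counter is 0
  // ... exclusive work ...
  sem_post(&s);       // up / V: increment, wake a waiting thread
  sem_destroy(&s);
}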

Atomic operation

Atomic operation

Update to a memory location or complex object that is atomic (cannot be interrupted - executed as one step):

  • Indivisible: either the whole result or nothing is seen by other threads
  • Time-bounded: completed in finite time (wait-free, independent of other threads)

Usually done by hardware:

Requires expensive hardware support and complex interaction with the memory system (ie. cache-line invalidation).

  • load/store: atomic loads and stores (can be ordered in the program)
  • read-modify-write: read, update, store
  • swap: swap two memory locations (possibly based on a condition, e.g. compare-and-swap)

Example: finding all primes in interval

  • Finding all primes in interval with atomic operation

    Find all primes between $\small 2$ and $\small 10^9$.

    for (i=2; i<1000000000; i++) {
    if (isPrime(i)) printf("Found %d\n",i);
    }

    Solution 1) Statically scheduled data parallel loop

    Each thread executes a block of size $\small 10^9/p$ → load imbalance: only a few processors perform the expensive isPrime() checks.

    Problem:

    • Almost no speedup.
    • Array compaction cannot help since we don’t know in advance where the primes are.

    Solution 2) Shared global counter with global variable

    int i = 0; // shared globally
    // thread code (function)
    while (i<1000000000) {
    int j = i; // thread gets next value of i
    i++; // illegal concurrent update -> race condition
    if (isPrime(j)) printf("Found %d\n",j);
    }

    Problem:

    Unsafe because multiple threads update i at the same time.

    • i++ is not atomic

      i++ actually translates to:

      tmp = i;
      tmp = tmp+1;
      i = tmp;

      This allows an unwanted interleaving where i is incremented only by 1 although two threads executed i++ :

      // thread 0      // thread 1
                       tmp = i;
      tmp = i;
      tmp = tmp+1;
                       tmp = tmp+1;
      i = tmp;
                       i = tmp;

    Solution 3) Shared global counter with mutex

    int i = 0; // shared global. init once
    pthread_mutex_t count_lock = …;
    // thread code (function)
    while (i<1000000000) {
      int j;
      pthread_mutex_lock(&count_lock); // critical section ::
      j = i;
      i++;
      if (isPrime(j)) printf("Found %d\n",j);
      pthread_mutex_unlock(&count_lock); // :: critical section
    }
    • Alternative: moving isPrime() out of the critical section

      int i = 0; // shared global. init once
      pthread_mutex_t count_lock = …;
      // thread code (function)
      while (i<1000000000) {
        int j;
        pthread_mutex_lock(&count_lock); // critical section ::
        j = i;
        i++; // thread could stop here arbitrarily long
        pthread_mutex_unlock(&count_lock); // :: critical section
        if (isPrime(j)) printf("Found %d\n",j);
      }

      Problem: thread can sleep arbitrarily and make no progress after acquiring the lock, while other threads just wait.

    Problem:

    No speedup → none of the other threads can make progress while one thread runs the isPrime() check inside the critical section

    Solution 4) Atomic increment

    int i = 0; // shared global
    // thread code (function)
    int done = 0;
    do {
      int j = fetch_and_inc(&i); // return i and increment, O(1)
      if (j >= 1000000000) done = 1; // >=: several threads may run past the bound
      else if (isPrime(j)) printf("Found %d\n",j);
    } while (!done);

    Wait-free algorithm
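    pthreads itself provides no fetch_and_inc; a sketch of it with C11 atomics (<stdatomic.h>), assuming a C11 compiler:

    #include <stdatomic.h>

    atomic_int i = 0; // shared global counter

    // returns the old value and increments atomically: O(1), wait-free
    int fetch_and_inc(atomic_int *counter) {
      return atomic_fetch_add(counter, 1);
    }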

Work-pools (master-worker pattern)

Useful implementation of higher-level parallel paradigm.

Master: maintains the pool of work

Workers: ask for work, execute it, return new results to the master (until done)

Centralized work-pool

Work-pool: pool of work nodes (function and arguments) distributed to the workers

Scalability bottleneck

Getting work = explicitly asking the master or accessing a shared data structure.

The threads compete for the work in the work-pool.

Implementation

Assume that work is executed in the order it was generated (use a deque data structure as work-pool).

Each thread repeats until termination:

  1. Acquire mutex; check queue; if non-empty, take work from the front (else wait on a condition variable).
  2. Execute the work.
  3. If new work was generated: acquire mutex, insert it at the end of the deque, wake up waiting threads.

Work task as linked list:

typedef struct work {
void (*routine)(void *args); // function
void *args; // args
struct work *next; // next node in list
} work_t;
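A sketch of this worker loop over the work_t list (pool_t, worker, pool_put are hypothetical names), implementing steps 1-3 with a mutex and a condition variable:

#include <pthread.h>
#include <stdlib.h>

typedef struct {
  work_t *head, *tail; // take from the front, insert at the end
  int done;            // termination flag (set + broadcast on shutdown)
  pthread_mutex_t lock;
  pthread_cond_t nonempty;
} pool_t;

void *worker(void *arg) {
  pool_t *pool = (pool_t*)arg;
  for (;;) {
    pthread_mutex_lock(&pool->lock);
    while (pool->head == NULL && !pool->done)
      pthread_cond_wait(&pool->nonempty, &pool->lock); // 1. wait for work
    if (pool->head == NULL) { // done flag set and queue empty: terminate
      pthread_mutex_unlock(&pool->lock);
      return NULL;
    }
    work_t *w = pool->head; // 1. take from the front
    pool->head = w->next;
    if (pool->head == NULL) pool->tail = NULL;
    pthread_mutex_unlock(&pool->lock);
    w->routine(w->args); // 2. execute (may generate new work via pool_put)
    free(w);
  }
}

void pool_put(pool_t *pool, work_t *w) { // 3. insert new work at the end
  w->next = NULL;
  pthread_mutex_lock(&pool->lock);
  if (pool->tail) pool->tail->next = w; else pool->head = w;
  pool->tail = w;
  pthread_cond_signal(&pool->nonempty); // wake up a waiting thread
  pthread_mutex_unlock(&pool->lock);
}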

Problems with centralized work pool

  • Centralized resource

    Bad for scalability

  • Locks

    Thread updating shared resource can lock out all other threads indefinitely

Solutions

  • Local task queues

    a thread primarily uses its local queue → work stealing from other threads’ queues when its own is empty

  • Lock-free/wait-free data structures

    Enable a thread to always either make progress by itself or to ensure that other threads are making progress.

    Make extensive use of strong atomic operations of the type “compare-and-swap” (CAS) - see the sketch below.
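    For illustration, a lock-free stack push with C11 compare-and-swap (a sketch, not from the lecture):

    #include <stdatomic.h>

    typedef struct node { struct node *next; /* payload */ } node_t;
    _Atomic(node_t*) top = NULL;

    void push(node_t *n) {
      node_t *old = atomic_load(&top);
      do {
        n->next = old; // link to the current top
        // CAS: install n only if top is still old; on failure old is
        // refreshed and we retry - some thread always succeeds (lock-free)
      } while (!atomic_compare_exchange_weak(&top, &old, n));
    }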

Decentralized work-pool

Each thread maintains a local pool of work and executes work from its own pool.

Work stealing: local pool empty → steal work from other pools

Work dealing: local pool full → distribute work to other pools

Randomized stealing

has good theoretical properties.

$\small \text{Tpar}(p,n) = T_\infty$

$\small \text{Wpar}(p,n) = O(T_\infty + n/p)$

Deterministic stealing

Stealing from top of non-local queue.

Can provide good locality (cache) properties.