pthreads

Definition

Process

Program in execution

Thread

Used for thread parallel programming.

Smallest independent stream of instructions that can be scheduled by the OS.

Each thread has its own stack (local variables), registers, and thread-local storage.

If scheduled by the OS (an OS thread), it lives inside an OS process and shares the process's global state (file pointers, global variables, static variables, heap storage).

POSIX threads, pthreads

POSIX = Portable Operating System Interface for uniX

Low level interface for thread programming in C/Unix

Has no cost model, no memory model, ...
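A minimal fork-join sketch with the core pthreads calls (pthread_create, pthread_join are the real API; the hello function and its argument are just for illustration):

#include <pthread.h>
#include <stdio.h>

// thread code is an ordinary function with this fixed signature
void *hello(void *arg) {
  printf("hello from thread %ld\n", (long)arg);
  return NULL;
}

int main(void) {
  pthread_t tid;
  pthread_create(&tid, NULL, hello, (void*)1L); // fork: spawn a new thread
  pthread_join(tid, NULL);                      // join: wait for it to terminate
  return 0;
}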

Programming model

(p)threads programming model

  1. Fork-join parallelism

    A thread can spawn (start) any number of new threads and wait for them to terminate.

  2. Master thread

    There is initially only one active main thread.

    Code for threads is given as functions. All threads execute within the same program (SPMD) but can execute different functions.

  3. Peer threads

    A thread can wait for another “peer thread” to terminate.

  4. Parallelism

    Threads are scheduled by the underlying system and may or may not run simultaneously on different processors.

  5. Synchronization between threads

    There is no implicit synchronization between threads - they progress independently and asynchronously.

  6. Threads share process-global data
  7. Coordination mechanisms

    for protecting access to shared variables (locks, condition variables).

    All concurrent updates must be protected; otherwise the program is illegal and its outcome undefined.

  8. …

Pragmatics for parallel computing

Threads can be scheduled and bound to specific cores (non-standard).
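Binding is not part of POSIX; on Linux, a sketch with the GNU extension pthread_setaffinity_np could look like this:

#define _GNU_SOURCE // pthread_setaffinity_np is a GNU extension
#include <pthread.h>
#include <sched.h>

// pin the calling thread to one core (Linux-only sketch)
void bind_to_core(int core) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}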

Symmetric Multi-Processor SMP

💡
Don’t confuse symmetric multi-processor SMP with symmetric multi-processing SMP.
  • Single OS
  • all cores treated equally
  • threads not bound to cores
  • there can be more threads than cores

Explicit parallelization of data parallel loop

Performance of spawning can be increased by spawning threads recursively (tree-like) instead of one by one in a loop.

start = rank*n/p;
end = (rank+1)*n/p;
for (i=start; i<end; i++) {
  a[i] = f(i);
}

rank is the virtual thread id and is set by the spawning thread.

  • implementation (in depth)

    Using the argument struct

    typedef struct {
      int *array;
      int n;
      int rank;
      int size;
    } array_t;

    void *loopblock(void *what) {
      array_t *ix = (array_t*)what;
      int *a = ix->array;
      int i;
      int n = ix->n;
      int p = ix->size; // no. of threads
      int start = ix->rank*(n/p);
      int end = start+(n/p);
      for (i=start; i<end; i++) {
        a[i] = f(i);
      }
      return NULL;
    }
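    A hypothetical driver for loopblock (the names P, N, A, threads, args are assumptions, not from the lecture); note that each thread needs its own argument struct:

    pthread_t threads[P];
    array_t args[P];
    for (int r = 0; r < P; r++) {
      // each thread gets its own args struct (sharing one would race on rank)
      args[r] = (array_t){ .array = A, .n = N, .rank = r, .size = P };
      pthread_create(&threads[r], NULL, loopblock, &args[r]);
    }
    for (int r = 0; r < P; r++)
      pthread_join(threads[r], NULL); // wait until all blocks are done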

  • example: matrix-vector product

    $\small \vec y = X \cdot \vec a$

    Sequential solution

    for (i=0; i<n; i++) {
      y[i] = 0;
      for (j=0; j<m; j++) {
        y[i] += x[i][j]*a[j];
      }
    }

    Parallel solution

    • using cache ineffectively

      Thread with id rank handles every p‘th row of matrix x

      for (i=rank; i<n; i+=p) {
        y[i] = 0;
        for (j=0; j<m; j++) {
          y[i] += x[i][j]*a[j];
        }
      }

      But this way the cache is used ineffectively: adjacent elements of y are written by different threads, so threads read and write the same cache lines at the same time (false sharing).

    Better: block-wise distribution uses the cache effectively.

    for (i=rank*n/p; i<(rank+1)*n/p; i++) {
      y[i] = 0;
      for (j=0; j<m; j++) {
        y[i] += x[i][j]*a[j];
      }
    }

Synchronization

Synchronization / coordination constructs (from concurrent computing) prevent malicious non-determinism (ie. race conditions).

Allow solving algorithms under weak memory consistency assumptions.

Race condition

Outcome of parallel program execution depends on the relative timing of the updates by processors/threads (in an undesired way).

Data race

Several threads (or cores, processes, …) compete for updating a shared variable.

Outcome depends on relative timing.

Deadlock

Two or more threads are in a situation where they depend on each other and cannot progress.

A deadlock spreads to all threads that depend on the deadlocked ones, and eventually the program cannot terminate.

  • example
    pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;
    …
    // thread 0
    pthread_mutex_lock(&lock1);
    pthread_mutex_lock(&lock2);
    …
    // thread 1
    pthread_mutex_lock(&lock2);
    pthread_mutex_lock(&lock1);
    …

Shared resources

Simple variables, data structures, devices, ... accessible by multiple threads

Critical section

Code that manipulates shared resources which must not be manipulated concurrently by other entities (threads, processes, ...).

Mutual exclusion property

There is at most one thread in the critical section at any given time.

Safety

pthreads memory model for “safe” programs

Threads are not allowed to update shared variables outside of critical sections.

(Lock constructs used to ensure consistent memory views)

Thread safe functions

Function that can be executed concurrently by any number of threads and will always produce the correct result.

  • Non-thread-safe functions have side effects:
    • Don’t protect (write) access to shared (ie. static) variables
    • Keep state over successive invocations
    • Return pointers to static variables
    • Call thread-unsafe functions

      these unsafe functions can also be system functions:

      ie. rand() , random() keep internal state in static variables

      solution: using random_r() with an explicit state, locking, ...

      problem: serialization can lead to slowdown, deadlocks, ...

  • Example: concurrent read and thread-safety

    Where x is the input.

    a = f(x); // thread 0
    b = f(x); // thread 1
    c = f(y); // thread 2

    Conclusion: There is a concurrent read of x.

    This program is only safe for concurrent execution if f has no side effects (is thread-safe).

Wait free vs. lock free

Wait-freedom ⇒ lock-freedom

(but lock-freedom ⇏ wait-freedom)

Both require hardware support (ie. an atomic counter).

Wait-free algorithm on concurrent data

Each thread can always make progress, no matter what the other threads are doing.

If a thread tries to execute the operation, progress is guaranteed and does not depend on other threads or their attempts to execute the same operation.

Lock-free algorithm on concurrent data

If several threads are attempting to execute the operation, at least one of the threads makes progress on the operation.

Synchronization constructs

Slow implementation

These constructs were developed for small numbers of concurrent tasks, not scalable HPC (many threads on many cores).

Most of them have sequential parts that limit speedup (Amdahl’s Law).

Locks

Locks

= shared object between threads that guarantees mutual exclusion

called “mutex” in pthreads

States of locks

2 possible states:

  • acquired (locked)
  • released (unlocked)

Only one thread at a time can acquire the lock, and it must release the lock after use.

If another thread tries to acquire an already locked lock, it is blocked and must wait until the lock’s release.

Properties of locks

Deadlock freedom: Some thread will always succeed in acquiring the lock.

Starvation freedom: If a thread is trying to acquire the lock, it will eventually succeed.

Fairness: First come, first served - the thread that started acquiring first gets the lock first.

Bounded waiting: Bounded time / number of steps after which a thread must succeed in acquiring the lock it is trying to acquire.


By definition, locks are not lock-free, not wait-free.

A thread that acquired the lock can keep it indefinitely while no other thread makes progress.

pthreads rule

Whatever thread $\small T_1$ can view in memory when releasing the lock, thread $\small T_2$ can view when acquiring the lock.

Properties of pthread locks

Not fair, not starvation-free, no bounded waiting - pthread locks are deadlock-free but not starvation-free.

Threads can starve each other by never allowing the others to enter the critical section.

There is no guarantee that a thread will eventually acquire the lock it wants.

Lock pattern

// initially
flag = 1; x = … ;

// thread 0
lock(&lock);
if (flag) {
  a = x;
  flag = 0;
}
unlock(&lock);

// thread 1
lock(&lock);
if (flag) {
  a = x;
  flag = 0;
}
unlock(&lock);

// thread 2
lock(&lock);
if (flag) {
  a = x;
  flag = 0;
}
unlock(&lock);

Example: using locks

  • unsafe program: race condition

    initial value: x = 0;

    // thread 0
    if (flag) {
      a = x;
      flag = 0;
    }

    // thread 1
    if (flag) {
      a = x;
      flag = 0;
    }

    // thread 2
    flag = 1;
    x = 27;

    Conclusion: race condition - What happens depends on relative timing.

    • Either a is some unintended value of x (before it was initialized) or 27.
    • Both threads might enter the critical section.

  • safe program (implements pattern)

    what is the intended value of a for thread0, thread1?

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    x = 0;

    // thread 0
    lock(&lock);
    if (flag) {
      a = x;
      flag = 0;
    }
    unlock(&lock);

    // thread 1
    lock(&lock);
    if (flag) {
      a = x;
      flag = 0;
    }
    unlock(&lock);

    // thread 2
    lock(&lock);
    flag = 1;
    x = 27;
    unlock(&lock);

    Conclusion:

    Both read and write accesses to x now protected by lock (mutex).

    Mutual exclusion, consistent memory ensured by locking.

  • unsafe program: deadlock

    what is the intended value of a for thread0, thread1?

    pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;
    x = 0;

    // thread 0
    lock(&lock1); // executed
    lock(&lock2); // <-- needs unlock of lock2 from thread1 to progress
    if (…)
    unlock(&lock2);
    unlock(&lock1);

    // thread 1
    lock(&lock2); // executed
    lock(&lock1); // <-- needs unlock of lock1 from thread0 to progress
    if (…)
    unlock(&lock1);
    unlock(&lock2);

    // thread 2
    lock(&lock1);
    flag = 1;
    x = 27;
    unlock(&lock1);

    Conclusion: Deadlock

    A deadlock occurs if thread0 and thread1 try to progress “simultaneously”, each alternately executing one instruction.

Example: using global locks

  • List processing: locking the entire list

    delete(e) operation:

    1. lock(listlock)
    2. scan sequentially until element e is found
    3. unlink / free e from the list via its predecessor
    4. unlock(listlock)


    This is inefficient: a thread using the list blocks all other threads for $\small O(n)$ steps for a list of size $\small n$.

  • List processing: locking each element individually

    insert(e) operation:

    We want to insert element e between e1 and e2 .

    We will need to lock e1 and e2 for this.


    delete(e) operation:

    We want to delete element e between e1 and e2 .

    We will need to lock e1 and e2 for this.


    This solution is expensive (needs space for a lock per element).

    Deadlocks can still occur - we don’t know how the code for every thread is written → locks are not composable.

    Assume the operations are written by 2 different programmers:

    • Insert: first lock e1 , then e2
    • Delete: first lock e2 , then e1

    Then executing both concurrently can produce a deadlock. Agreeing on one global lock order avoids this, as in the sketch below.
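    A sketch of such an agreed order, assuming a singly linked list with one mutex per element (elem_t and insert_after are hypothetical names, not from the lecture). Both programmers lock the predecessor first:

    #include <pthread.h>

    typedef struct elem {
      int key;
      struct elem *next;
      pthread_mutex_t lock; // one lock per element
    } elem_t;

    // insert n after e1 (assumes e1 has a successor, e.g. a tail sentinel);
    // agreed global order: predecessor first, then successor
    void insert_after(elem_t *e1, elem_t *n) {
      pthread_mutex_lock(&e1->lock);       // lock predecessor first
      pthread_mutex_lock(&e1->next->lock); // then its successor
      n->next = e1->next;
      e1->next = n;
      pthread_mutex_unlock(&n->next->lock);
      pthread_mutex_unlock(&e1->lock);
    }

    A delete written by the second programmer must lock in the same order (predecessor before successor) for the deadlock to be impossible.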

Example: using read/write locks

  • Reading an arbitrary value

    pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
    flag = false;

    // thread 0
    rdlock(&lock);
    if (flag) {
      a = x;
    }
    unlock(&lock);

    // thread 1
    rdlock(&lock);
    if (flag) {
      b = x;
    }
    unlock(&lock);

    // thread 2
    wrlock(&lock);
    x = c;
    flag = true;
    unlock(&lock);

    Conclusion

    Thread0, thread1 can both enter their critical section simultaneously (read lock).

    Thread2 is always alone in its critical section (write lock).

    Therefore thread0 and thread1 read either the old value of x or the new value c set by thread2.

  • Reading the newly assigned value (deadlock possible) → can be solved with condition variables

    pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
    flag = false;

    // thread 0
    rdlock(&lock);
    while (!flag) {} // active wait
    a = x;
    unlock(&lock);

    // thread 1
    rdlock(&lock);
    while (!flag) {} // active wait
    b = x;
    unlock(&lock);

    // thread 2
    wrlock(&lock);
    x = c;
    flag = true;
    unlock(&lock);

    Same as above, but the reading threads spin until the condition is met (the while loop).

    Problem

    Active waiting is inefficient.

    Deadlock is possible: a reader that acquires the read lock before thread2 has set flag spins forever while holding the lock, so thread2 can never acquire the write lock.

    Solution

    Condition variables

Using locks to avoid deadlocks

The best solution is avoiding locks altogether by replicating shared information.

  • Use one lock at a time
  • If not possible: always acquire and release in the same order (this is difficult to do with libraries, because we don’t know how they are implemented)

    Can be done by assigning a level to each lock, sorting locks by their IDs, or using “trylocks” (if a lock can’t be acquired, release all held locks, wait, try again) - a sketch follows.
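    A sketch of the trylock variant (lockA, lockB, acquire_both are hypothetical names); pthread_mutex_trylock is the real pthreads call, returning 0 on success instead of blocking:

    #include <pthread.h>
    #include <sched.h>

    pthread_mutex_t lockA = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t lockB = PTHREAD_MUTEX_INITIALIZER;

    void acquire_both(void) {
      for (;;) {
        pthread_mutex_lock(&lockA);
        if (pthread_mutex_trylock(&lockB) == 0)
          return; // got both locks; caller releases them after use
        pthread_mutex_unlock(&lockA); // back off: release everything
        sched_yield(); // wait, then try again
      }
    }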

Condition variables

See above for use-case example

Condition variables / Monitor

Allows thread to temporarily release lock and wait (suspend) for condition-signal with a condition variable .

Different from semaphore object (which is not part of standard pthreads)

Waiting for signal inside critical section on condition variable:

  1. Thread acquires lock and enters critical section.
  2. Thread suspends itself and relinquishes (temporarily releases) the lock - it stays suspended inside the critical section until the condition is met.
  3. Thread is resumed (woken up) later by a signal from some other signaling thread.

    When it resumes, it has re-acquired the lock.

Mutual exclusion

Mutual exclusion still guaranteed.

A suspended thread is only resumed once the signaling thread has left the critical section.

Risk of deadlock

if threads mutually wait on some condition but there are no signaling threads.

weird bug in pthreads

There can be unwanted (spurious) wakeups by the runtime system, without an explicit signal from another thread.

Good practice: rechecking whether wait-condition is fulfilled - even after waking up.

Standard pattern

Thread waiting for condition (signalled thread)

lock(&lock);
// critical section ::
while (!condition()) {
  pthread_cond_wait(&cond, &lock); // wait and release lock temporarily
}
// thread has been signaled, condition holds, lock is acquired again
// :: critical section
unlock(&lock);

Thread broadcasting signal (signaling thread) — can be done inside or outside the critical section

  • Inside

    lock(&lock);
    // critical section ::
    ... // establishing condition
    cond_signal(&cond); // signal if condition holds
    // :: critical section
    unlock(&lock);

    • Advantage: Condition is established inside the critical section → it should still hold when the suspended thread is resumed.
    • Disadvantage: Suspended thread will block on the mutex again - until the signaling thread leaves the section.
  • Outside

    lock(&lock);
    ... // establishing condition (in critical section)
    unlock(&lock);
    cond_signal(&cond); // signal

    • Advantage: Suspended thread can immediately continue without blocking.

Example

  • Readers-writers lock (with condition variables and locks) → unfair
    typedef struct {
      int readers; // active readers
      int writer;  // (boolean) active writer
      int waiting; // pending writers (writers in queue)
      pthread_cond_t read_ok;  // condition: reading allowed
      pthread_cond_t write_ok; // condition: writing allowed
      pthread_mutex_t gateway; // lock
    } rwlock_t;

    void rwlock_rlock(rwlock_t *rwlock) {
      pthread_mutex_lock(&rwlock->gateway); // acquire gateway lock
      while (rwlock->waiting > 0 || rwlock->writer) {
        // writers exist or wait - release gateway lock and wait for read_ok signal
        pthread_cond_wait(&rwlock->read_ok, &rwlock->gateway);
      }
      rwlock->readers++; // increment readers
      pthread_mutex_unlock(&rwlock->gateway); // release gateway lock
    }

    void rwlock_wlock(rwlock_t *rwlock) {
      pthread_mutex_lock(&rwlock->gateway); // acquire gateway lock
      rwlock->waiting++; // pending write
      while (rwlock->readers > 0 || rwlock->writer) {
        // readers or writer active - release gateway lock and wait for write_ok signal
        pthread_cond_wait(&rwlock->write_ok, &rwlock->gateway);
      }
      rwlock->waiting--; // remove writer from queue
      rwlock->writer = 1; // active writer == true
      pthread_mutex_unlock(&rwlock->gateway); // release gateway lock
    }

    void rwlock_unlock(rwlock_t *rwlock) {
      pthread_mutex_lock(&rwlock->gateway); // acquire gateway lock
      // remove active writer or reader
      if (rwlock->writer) {
        rwlock->writer = 0;
      } else {
        rwlock->readers--;
      }
      // send signal: resume threads waiting to acquire
      if (rwlock->readers == 0 && rwlock->waiting > 0) {
        pthread_cond_signal(&rwlock->write_ok); // no readers, waiting writer
      } else {
        pthread_cond_broadcast(&rwlock->read_ok); // readers allowed / no waiting writer
      }
      pthread_mutex_unlock(&rwlock->gateway); // release gateway lock
    }

    The signal in the release function could also be sent outside the critical section, but that would change the semantics (readers/waiting counts can be changed by other threads after the unlock).

    This way, we send the signal but keep the lock - therefore the signaled threads block again briefly when they wake up.

    Correctness can be proven using invariants.

    Problem: Unfairness

    Threads acquiring write can starve threads wanting to acquire read.

    • Newer writer can starve older writer
    • Newer reader can acquire lock before older reader or writer

  • Barrier synchronization (with condition variables and locks) → inefficient

    Each thread executing <barrier> must wait until all or some number of threads have also executed <barrier> .

    typedef struct {
      int tc; // thread count
      pthread_cond_t barrier_ok;
      pthread_mutex_t barwait;
    } barrier_t;

    void barrier(barrier_t *b, int total_tc) {
      pthread_mutex_lock(&b->barwait); // acquire barwait lock
      b->tc++;
      if (b->tc == total_tc) { // threshold reached
        b->tc = 0;
        pthread_cond_broadcast(&b->barrier_ok);
      } else { // wait for barrier_ok signal (releases and re-acquires barwait)
        pthread_cond_wait(&b->barrier_ok, &b->barwait);
      }
      pthread_mutex_unlock(&b->barwait); // release barwait lock
    }

    Problems

    • Not scalable: $\small O(p)$
    • Probably not safe on spurious (unexpected) wake ups

    Solution: Tree structured barrier with an extra flag
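    A sketch of the “extra flag” as a generation counter (barrier2_t is a hypothetical name). Waiters recheck the generation in a loop, which makes the wait safe against spurious wakeups and the barrier reusable; the tree structure would additionally remove the $\small O(p)$ bottleneck:

    #include <pthread.h>

    typedef struct {
      int tc;  // threads arrived in the current round
      int gen; // round number: the "extra flag"
      pthread_cond_t barrier_ok;
      pthread_mutex_t barwait;
    } barrier2_t;

    void barrier2(barrier2_t *b, int total_tc) {
      pthread_mutex_lock(&b->barwait);
      int my_gen = b->gen;
      if (++b->tc == total_tc) { // last thread opens the barrier
        b->tc = 0;
        b->gen++; // start the next round
        pthread_cond_broadcast(&b->barrier_ok);
      } else {
        while (b->gen == my_gen) // recheck: ignores spurious wakeups
          pthread_cond_wait(&b->barrier_ok, &b->barwait);
      }
      pthread_mutex_unlock(&b->barwait);
    }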

Barriers

Barriers

Each thread executing <barrier> must wait until all or some number of threads have also executed <barrier> .

Semaphores

Semaphore (by Dijkstra)

Concurrent counter object with two operations:

up(x): atomic increment → wakes a waiting thread if the counter was negative (signal, post, V)

down(x): atomic decrement → blocks the caller if the counter turns negative (wait, P)

If there is no thread in the queue, signals are not lost (contrary to condition variables) - they are remembered in the counter.
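POSIX semaphores live outside pthreads in <semaphore.h>; a minimal sketch (note that the POSIX counter blocks at zero rather than going negative):

#include <semaphore.h>

sem_t s;

void semaphore_demo(void) {
  sem_init(&s, 0, 1); // counter = 1; middle arg 0: shared between threads only
  sem_wait(&s);       // down / P: decrement, block while counter is 0
  // ... exclusive work ...
  sem_post(&s);       // up / V: increment, wake a waiting thread
  sem_destroy(&s);
}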

Atomic operation

Atomic operation

Update to a memory location or complex object that is atomic (cannot be interrupted - executed as one step):

  • Indivisible: either the whole result or nothing is seen by other threads
  • Time-bounded: completed in finite time (wait-free, independent of other threads)

Usually done by hardware:

Requires expensive hardware support and complex interaction with the memory system (ie. cache-line invalidation).

  • load/store: atomic loads and stores (can be ordered in the program)
  • read-modify-write: read, update, store
  • swap: swap two memory locations (possibly based on a condition, e.g. compare-and-swap)

Example: finding all primes in interval

  • Finding all primes in interval with atomic operation

    Find all primes between $\small 2$ and $\small 10^9$.

    for (i=2; i<1000000000; i++) {
    if (isPrime(i)) printf("Found %d\n",i);
    }

    Solution 1) Statically scheduled data parallel loop

    Each thread executes a block of size $\small 10^9/p$ → load imbalance: only a few processors perform the expensive isPrime() checks.

    Problem:

    • Almost no speedup.
    • Array compaction cannot help since we don’t know in advance where the primes are.

    Solution 2) Shared global counter with global variable

    int i = 0; // shared globally
    // thread code (function)
    while (i<1000000000) {
    int j = i; // thread gets next value of i
    i++; // illegal concurrent update -> race condition
    if (isPrime(j)) printf("Found %d\n",j);
    }

    Problem:

    Unsafe because multiple threads update i at the same time.

    • i++ is not atomic

      i++ actually translates to:

      tmp = i;
      tmp = tmp+1;
      i = tmp;

      This allows an unwanted interleaving where i is incremented only by 1 although two threads executed i++ :

      // thread 0      // thread 1
                       tmp = i;
      tmp = i;
      tmp = tmp+1;
                       tmp = tmp+1;
      i = tmp;
                       i = tmp;

    Solution 3) Shared global counter with mutex

    int i = 0; // shared global. init once
    pthread_mutex_t count_lock = …;
    // thread code (function)
    while (i<1000000000) {
      int j;
      pthread_mutex_lock(&count_lock); // critical section ::
      j = i;
      i++;
      if (isPrime(j)) printf("Found %d\n",j);
      pthread_mutex_unlock(&count_lock); // :: critical section
    }
    • Alternative: moving isPrime() out of the critical section

      int i = 0; // shared global. init once
      pthread_mutex_t count_lock = …;
      // thread code (function)
      while (i<1000000000) {
        int j;
        pthread_mutex_lock(&count_lock); // critical section ::
        j = i;
        i++; // thread could stop here arbitrarily long
        pthread_mutex_unlock(&count_lock); // :: critical section
        if (isPrime(j)) printf("Found %d\n",j);
      }

      Problem: thread can sleep arbitrarily and make no progress after acquiring the lock, while other threads just wait.

    Problem:

    No speedup → none of the other threads can make progress while one thread runs the isPrime() check inside the critical section

    Solution 4) Atomic increment

    int i = 0; // shared global
    // thread code (function)
    int done = 0;
    do {
      int j = fetch_and_inc(&i); // return i and increment, O(1)
      if (j >= 1000000000) done = 1; // >=: several threads may run past the bound
      else if (isPrime(j)) printf("Found %d\n",j);
    } while (!done);

    Wait-free algorithm
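    pthreads itself provides no fetch_and_inc; a sketch of it with C11 atomics (<stdatomic.h>), assuming a C11 compiler:

    #include <stdatomic.h>

    atomic_int i = 0; // shared global counter

    // returns the old value and increments atomically: O(1), wait-free
    int fetch_and_inc(atomic_int *counter) {
      return atomic_fetch_add(counter, 1);
    }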

Work-pools (master-worker pattern)

Useful implementation of higher-level parallel paradigm.

Master: maintains the pool of work

Workers: ask for work, execute it, return new results to the master (until done)

Centralized work-pool

Work-pool: pool of work nodes (function and arguments) distributed to the workers

Scalability bottleneck

Getting work = explicitly asking the master or accessing a shared data structure.

The threads compete for the work in the work-pool.

Implementation

Assume that work is executed in the order it was generated (use a deque data structure as work-pool).

Each thread repeats until termination:

  1. Acquire mutex; check queue; if non-empty, take work from the front (else wait on a condition variable).
  2. Execute the work.
  3. If new work was generated: acquire mutex, insert it at the end of the deque, wake up waiting threads.

Work task as linked list:

typedef struct work {
void (*routine)(void *args); // function
void *args; // args
struct work *next; // next node in list
} work_t;
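A sketch of this worker loop over the work_t list (pool_t, worker, pool_put are hypothetical names), implementing steps 1-3 with a mutex and a condition variable:

#include <pthread.h>
#include <stdlib.h>

typedef struct {
  work_t *head, *tail; // take from the front, insert at the end
  int done;            // termination flag (set + broadcast on shutdown)
  pthread_mutex_t lock;
  pthread_cond_t nonempty;
} pool_t;

void *worker(void *arg) {
  pool_t *pool = (pool_t*)arg;
  for (;;) {
    pthread_mutex_lock(&pool->lock);
    while (pool->head == NULL && !pool->done)
      pthread_cond_wait(&pool->nonempty, &pool->lock); // 1. wait for work
    if (pool->head == NULL) { // done flag set and queue empty: terminate
      pthread_mutex_unlock(&pool->lock);
      return NULL;
    }
    work_t *w = pool->head; // 1. take from the front
    pool->head = w->next;
    if (pool->head == NULL) pool->tail = NULL;
    pthread_mutex_unlock(&pool->lock);
    w->routine(w->args); // 2. execute (may generate new work via pool_put)
    free(w);
  }
}

void pool_put(pool_t *pool, work_t *w) { // 3. insert new work at the end
  w->next = NULL;
  pthread_mutex_lock(&pool->lock);
  if (pool->tail) pool->tail->next = w; else pool->head = w;
  pool->tail = w;
  pthread_cond_signal(&pool->nonempty); // wake up a waiting thread
  pthread_mutex_unlock(&pool->lock);
}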

Problems with centralized work pool

  • Centralized resource

    Bad for scalability

  • Locks

    Thread updating shared resource can lock out all other threads indefinitely

Solutions

  • Local task queues

    a thread primarily uses its local queue → work stealing from other threads’ queues when its own is empty

  • Lock-free/wait-free data structures

    Enable a thread to always either make progress by itself or to ensure that other threads are making progress.

    Make extensive use of strong atomic operations of the type “compare-and-swap” (CAS) - see the sketch below.
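    For illustration, a lock-free stack push with C11 compare-and-swap (a sketch, not from the lecture):

    #include <stdatomic.h>

    typedef struct node { struct node *next; /* payload */ } node_t;
    _Atomic(node_t*) top = NULL;

    void push(node_t *n) {
      node_t *old = atomic_load(&top);
      do {
        n->next = old; // link to the current top
        // CAS: install n only if top is still old; on failure old is
        // refreshed and we retry - some thread always succeeds (lock-free)
      } while (!atomic_compare_exchange_weak(&top, &old, n));
    }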

Decentralized work-pool

Each thread maintains a local pool of work and executes work from its own pool.

Work stealing: local pool empty → steal work from other pools

Work dealing: local pool full → distribute work to other pools

Randomized stealing

has good theoretical properties.

$\small \text{Tpar}(p,n) = T_\infty$

$\small \text{Wpar}(p,n) = O(T_\infty + n/p)$

Deterministic stealing

Stealing from top of non-local queue.

Can provide good locality (cache) properties.