Memory and Caching

Definition

Modern terminology

Core

CPU / Processing element PE (also still sometimes called processor)

Processor

several cores that can execute a program

  • Multi-core processor

    Processor with several cores physically on the same chip. Always shared-memory (typically with shared caches).

  • Many-core processor

    Usually the GPU is meant.

Multi-processor system

System with several processors, not necessarily on the same chip.

The processors can be multi-core and can have shared-memory or shared caches.

Programming model concepts

Thread, process program under execution

Task some piece of work to be executed by a process or thread

Shared-memory model

Shared memory models for architectures and machines

Naive shared memory model

Processors connected directly to memory (as in UMA or NUMA)

Read/write operations done independently (not PRAM, not lock-step, not synchronized)

Implicit assumption:

  • Outcome is an interleaving of the instructions.
  • Memory is consistent as defined by program order (read and write order).

Naive shared programming model

  1. Processors execute processes and the threads inside each process.

    Model does not specify which processor (core) executes it and there may be more threads than processors (oversubscription).

    A symmetric multiprocessing (SMP) OS binds threads to cores.

    Improved performance by pinning: explicitly binding a thread to a core (see the sketch after this list),

    e.g. for better utilization of private caches.

    Parallel computing (dedicated system): number of threads = number of cores, plus some pinning.

  2. Processes/threads are not synchronized.
  3. Processes/threads exchange data through shared memory.
  4. Memory may be NUMA, but the entire memory is directly visible to every thread.
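A minimal pinning sketch for Linux (an assumption, not from the notes; pthread_setaffinity_np is a GNU extension and the thread-to-core mapping shown is purely illustrative):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one specific core.
static int pin_to_core(int core)
{
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// e.g. in an OpenMP region: pin_to_core(omp_get_thread_num());
// (assumes number of threads <= number of cores)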

Cache

“Real” shared memory architecture

Cache (small and fast memory close to processor) and memory systems reduce memory bottleneck.

Modern processors use several different caches with different purposes and different levels.

Main memory GB space, >100 cycles per access

Cache KB to MB space, 1-20 cycles per access

Register ≤64 bit, 0-1 cycles

Cache as a problematic abstraction (illusion)

In single-processor architectures the cache helps sustain the RAM abstraction successfully, but in parallel (multi-processor) architectures the abstraction becomes problematic:

Cache coherency problem keeping the caches synchronized: the local copy in one cache gets updated, the copies in other caches become stale

Memory consistency problem different views in threads of what should be the same data

Cache Granularity

A cache consists of cache-lines (blocks) of memory; whole lines are transferred when reading from or writing to main memory.

Typical size: 64 bytes (8 doubles, 16 ints)

Locality

Caches make sense because of the following algorithm properties:

temporal locality memory locations that are used tend to be used again several times in succession

e.g. when reusing the same operand (address)

spatial locality memory locations that are used are close to each other

e.g. when accessing successive indices of an array
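A small illustration in C (assuming 64-byte cache-lines and double elements): the first loop exploits spatial locality, the strided loop touches a new cache-line on almost every access.

#include <stddef.h>

// Good spatial locality: consecutive addresses, one miss per cache-line (8 doubles).
double sum_contiguous(const double *a, size_t n) {
  double s = 0.0;
  for (size_t i = 0; i < n; i++) s += a[i];
  return s;
}

// Poor spatial locality: stride of 8 doubles = 64 bytes, roughly one miss per element touched.
double sum_strided(const double *a, size_t n) {
  double s = 0.0;
  for (size_t i = 0; i < n; i += 8) s += a[i];
  return s;
}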

Multiprocessor/multi-core caches

Shared memory, cache coherent NUMA (ccNUMA) through memory network.

Several cores share caches at different levels.

Mapping data from memory to cache

Assume data occupies as much space as a block or less.

  • fully associative data can be stored in any cache-line
  • set associative data can be stored in a small set of cache-lines
    • in depth

      Memory: M[0, ..., n-1]

      Cache: Lines [0, ..., m-1]

      where m ≪ n, with block size s.


      k-way set associative means that the cache lines are grouped in m/k sets — each set with k cache-lines.

      M[i] is cached in set (i/s) mod (m/k)


      Usually m=2^r and s=2^r' for some r,r'>0

      Mod and div then can be translated into mask and shift.

  • directly mapped data can be stored in only one specific cache-line
    • in depth

      Memory: M[0, ..., n-1]

      Cache: Lines [0, ..., m-1]

      where m ≪ n, with block size s.


      M[i] is cached in line (i/s) mod m

      M[0],…,M[s-1] go to line 0

      M[s],…,M[2s-1] to line 1

      M[m·s],…,M[m·s+s-1] again to line 0


      Usually m=2^r and s=2^r' for some r,r'>0

      Mod and div can then be translated into mask and shift (see the index-computation sketch after this list).
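A small sketch of these index computations, assuming m = 2^r lines, block size s = 2^r' elements and element-granular addresses i (the constants are just illustrative):

#include <stdint.h>

enum { LINE_BITS = 6, BLOCK_BITS = 3, K_WAY = 4 };  // assumed: 64 lines, 8 elements/block, 4-way

// Directly mapped: line = (i / s) mod m  ->  shift and mask.
static inline uint32_t direct_mapped_line(uint32_t i) {
  return (i >> BLOCK_BITS) & ((1u << LINE_BITS) - 1);
}

// k-way set associative: set = (i / s) mod (m / k).
static inline uint32_t set_assoc_set(uint32_t i) {
  uint32_t sets = (1u << LINE_BITS) / K_WAY;   // m/k sets
  return (i >> BLOCK_BITS) & (sets - 1);       // valid because m/k is a power of two
}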

Cache read/write-miss

A “cache-hit” is when the address is already in the cache and ready to be read.

Cache reading

Cache hit Read from cache

Read cache-miss Load block from memory into cache and read from cache

Cache writing

Cache hit Overwrite cache (and memory or wait until block is evicted)

a) Write-back cache

(typically also write-allocate)

block is written back to memory when it is evicted from the cache

b) Write-through cache

(typically also write-non-allocate)

each write to cache is immediately passed to memory


Write cache-miss Loading block from memory is optional

a) Write allocate

Load block from memory into cache, then overwrite

b) Write non-allocate

Overwrite memory directly

Types of misses

Cold cache

Cache hasn’t been in use and is therefore empty.

  • Cold miss every (read) memory access that is a cache-miss

Warm cache

Cache has been in use.

  • Capacity miss entire cache full - some lines must be evicted
  • Conflict miss the target line/set is full (not necessarily the whole cache) - a line from that set must be evicted

Hit-rate, miss-rate

The fraction of memory references that hit (respectively miss) the cache.
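A small worked example, assuming a cold cache, the 64-byte cache-lines from above, and a single sequential scan over an array of n doubles:

misses = n/8 (one per cache-line), miss rate = (n/8)/n = 1/8 = 12.5%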

Eviction (removal) from cache

When there is a capacity or conflict miss, the cache evicts a line to make room for the new one.

For directly mapped caches the mapping itself determines which line is evicted.

For associative caches:

LRU least recently used

LFU least frequently used

Random replacement

Cache impact on performance

Cache usually maintained in hardware (”free lunch”).

The cache impacts the sequential and parallel performance.

Example: Matrix-Matrix multiplication

Matrix-Matrix multiplication

C = A·B (associative, distributive, but not commutative)

C[i,j] = \sum_{0 \leq k < n} A[i,k] \cdot B[k,j]

for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    C[i][j] = 0;
    for (k=0; k<n; k++) {
      C[i][j] += A[i][k]*B[k][j];
    }
  }
}

Simple implementation based on mathematical definition (not the best).

Work O(n^3)

Complexity O(m \cdot \sqrt{m}) where m = n^2

Interchanging loops impacts the performance

The 3 loops can be interchanged: 3! = 6 possible variations.

  • worst performance: i as the innermost loop (jki, kji)
    for (k=0; k<n; k++) {
      for (j=0; j<n; j++) {
        for (i=0; i<n; i++) {
          C[i][j] += A[i][k]*B[k][j];
        }
      }
    }

    B[k][j] is independent of i .

    Each iteration is 1 load for A[i][k] and 1 store for C[i][j] :

    they both are cache misses (assuming cache can store 2 matrix rows at most).

  • medium performance: k as the innermost loop (ijk, jik)
    for (i=0; i<n; i++) {
      for (j=0; j<n; j++) {
        C[i][j] = 0;
        for (k=0; k<n; k++) {
          C[i][j] += A[i][k]*B[k][j];
        }
      }
    }

    C[i][j] is independent of k (can be kept in a register by the compiler).

    Each iteration is 2 loads for A[i][k] , B[k][j] and 1 store for C[i][j] :

    each load of B[k][j] is a cache miss because the matrix is traversed in column order (stride n).

  • best performance: j as the innermost loop (ikj, kij)
    // initialize C[i][j] = 0;
    for (i=0; i<n; i++) {
      for (k=0; k<n; k++) {
        for (j=0; j<n; j++) {
          C[i][j] += A[i][k]*B[k][j];
        }
      }
    }

    A[i][k] is independent of j (update in register by compiler).

    Each iteration is 1 load for B[k][j] and 1 store for C[i][j] :

    Both are accessed in row order and the miss-rate is given by the cache-line size.

    The spatial and temporal locality is exploited well.

This is because of how C stores matrices (row-major).

Scanning the matrix in row order reduces the number of main-memory-accesses.

We load blocks of 8-16 elements into the cache.

If a row is larger than the cache, reading the next row forces evictions.

Parallelization

The i, j loops are fully independent (data parallel).

The k loop should run in increasing order (unless commutativity of + is exploited)

par (0<=i<n, 0<=j<n) {
  C[i][j] = 0;
  for (k=0; k<n; k++) {
    C[i][j] += A[i][k]*B[k][j];
  }
}
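A minimal OpenMP sketch of this data-parallel loop nest (an assumption; the par construct above is pseudocode). collapse(2) parallelizes over both independent loops, and the per-element sum stays in a register:

#include <omp.h>

// C = A*B; the i and j iterations are independent, k runs sequentially per element.
void matmul(int n, double **A, double **B, double **C)
{
  #pragma omp parallel for collapse(2)
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      double sum = 0.0;               // accumulate in a register
      for (int k = 0; k < n; k++)
        sum += A[i][k] * B[k][j];
      C[i][j] = sum;
    }
  }
}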

Recursive solution

Alternative solution: recursive

Recursively splitting matrices in half in both dimensions.

Each recursion level has 8 submatrix multiplications, 4 submatrix additions:

C00 = A00 * B00 + A01 * B10
C01 = A00 * B01 + A01 * B11
C10 = A10 * B00 + A11 * B10
C11 = A10 * B01 + A11 * B11

Calculated using the master theorem

Work W(n) = 8 \cdot W(n/2) + \Theta(n^2) = \Theta(n^3)

(number of nodes in the matrix-multiplication DAG)

Time T(n) = T(n/2) + O(\log n) = \Theta(\log^2 n) (O(\log n) per level, recursion depth \log n)

(length of the longest path)

Cache behavior

Good: can be made cache-oblivious (good cache behavior independent of the actual cache size).

Can be made cache-aware: multiply smaller matrices of size k' × k'' chosen so that they fit in cache (based on the cache hierarchy).
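A minimal recursive sketch, assuming n is a power of two and row-major n×n matrices stored contiguously with leading dimension ld; the four submatrix additions are folded into accumulating updates of C, and a real implementation would stop at a cache-sized base case instead of 1×1:

// C += A*B for an n x n submatrix, inside matrices with leading dimension ld.
static void matmul_rec(int n, int ld, const double *A, const double *B, double *C)
{
  if (n == 1) {                      // base case (in practice: a small blocked kernel)
    C[0] += A[0] * B[0];
    return;
  }
  int h = n / 2;                     // split each matrix into four h x h quadrants
  const double *A00 = A, *A01 = A + h, *A10 = A + h*ld, *A11 = A + h*ld + h;
  const double *B00 = B, *B01 = B + h, *B10 = B + h*ld, *B11 = B + h*ld + h;
  double       *C00 = C, *C01 = C + h, *C10 = C + h*ld, *C11 = C + h*ld + h;

  // 8 recursive multiplications; e.g. C00 += A00*B00 + A01*B10
  matmul_rec(h, ld, A00, B00, C00);  matmul_rec(h, ld, A01, B10, C00);
  matmul_rec(h, ld, A00, B01, C01);  matmul_rec(h, ld, A01, B11, C01);
  matmul_rec(h, ld, A10, B00, C10);  matmul_rec(h, ld, A11, B10, C10);
  matmul_rec(h, ld, A10, B01, C11);  matmul_rec(h, ld, A11, B11, C11);
}
// Call: matmul_rec(n, n, A, B, C) with C initialized to zero.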

Cache coherence

Example: caches that are not coherent

Core 0 and core 1 have private caches and both have loaded address a.

  1. Core 0 writes to a
  2. Core 1 wants to read a, but its cached copy still holds the old value

Definition: cache coherence

Let the order of memory accesses to a specific address a be given by program order.

  1. If processor P writes to a and no other writes (by any processor) occur before its next read of a, then it must read the value it wrote to a.
  2. If P1 writes to a and no other writes (by any processor) occur until P2 reads a, then P2 must read the value P1 wrote to a.
  3. If P1 and P2 write to a at the same time, then only one of the two values is stored at a.

  • More formal definition
    1. If processor P writes to a at time t1 and reads a at t2 > t1, and there are no other writes (by P or any other processor) to a between t1 and t2, then P reads the value written at t1.
    2. If P1 writes to a at t1 and another processor P2 reads a at t2 > t1 and no processor writes to a between t1 and t2, then P2 reads the value P1 wrote at t1.

      But there must be sufficient time between t1 and t2 for the update to become visible (there is no absolute time and nothing is immediate).

    3. If P1 and P2 write to a at the same time, then either the value of P1 or the value of P2 is stored at a.

      "Same time" means "sufficiently close" in time - the writes must be "serialized".

ccNUMA-Systems

(most multi-core and symmetric multiprocessing SMP nodes)

Cache coherence at cache-line-level, maintained by hardware-protocols.

This negatively impacts the performance: bus/network traffic, protocol overhead, transistors and power.

Note: Not all systems are cache coherent (e.g. NEC vector computers).

False sharing

Sharing / false sharing

Cache coherence is maintained at cache-line-level.

Two threads write to two different addresses that happen to lie in the same cache-line - the coherence protocol must still update/invalidate the entire line.

Avoiding false sharing

Put simply: variables updated by different threads should lie in different cache-lines.

Solutions:

  • Use simple local variables → the compiler will typically place them in different cache lines
  • Pad data-structures (e.g. a padded array) → wastes a lot of memory (see the padding sketch below)
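A minimal padding sketch, assuming 64-byte cache-lines and hypothetical per-thread counters; the aligned attribute is GCC/Clang-specific:

#define CACHE_LINE 64                      // assumed cache-line size in bytes

// One counter per thread, each padded to occupy a full cache-line.
struct padded_counter {
  long value;
  char pad[CACHE_LINE - sizeof(long)];     // wastes memory, avoids false sharing
};

struct padded_counter counters[64] __attribute__((aligned(CACHE_LINE)));

// Thread t only updates counters[t].value -> no coherence traffic between threads.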

Example 1) unnecessary update

a, b are both in the same cache-line.

Core 0 updates a in a loop, core 1 updates b in a loop.

Although a, b are in different memory locations, each update causes cache-coherence activity (although it is unnecessary).

Variables a, b are called "falsely shared".

Example 2) storing sum of rows: row-major

Storing row sums at index 0 in the matrix: a[j][0] = \sum_i a[j][i]

Matrix is stored row-major in C.

Therefore the elements a[j][0] (one per row) are most likely in different cache lines.

The rows j and j+1 are not spatially close, so their updates do not interfere with each other (no false sharing).

// a[m][n] two-dimensional matrix, stored in row order
int *a = (int*)malloc(m*n*sizeof(int));
#pragma omp parallel for
for (int j=0; j<m; j++) {
  int *x = a+j*n;               // address of row j
  for (int i=1; i<n; i++) {     // sum all elements of the row into element [0]
    x[0] += x[i];
  }
}

Example 3) storing sum of rows: col-major

Same task as above (row sums), but now with false sharing.

Matrix is stored col-major in FORTRAN.

Therefore all result elements a[i][0] are most likely in the same cache line, which is updated p·n times.

// a[m][n] two-dimensional matrix, stored in column order (FORTRAN-style)
int *a = (int*)malloc(n*m*sizeof(int));
#pragma omp parallel               // assuming p = m
{
  int p = omp_get_num_threads();
  int t = omp_get_thread_num();    // thread id = row index
  int *x = a;
  for (int j=t+p, i=1; i<n; i++, j+=p) {
    x[t] += x[j];                  // x[0..p-1] lie in the same cache line -> false sharing
  }
}
  • implementation with variable

    Should have no false sharing (as in the original FORTRAN implementation).

    Even if the compiler does not keep sum in a register, the local variable lives on the thread's own stack, i.e. in a different cache-line; only the final store to x[t] touches the shared line.

    // a[m][n] two-dimensional matrix, stored in column order
    int *a = (int*)malloc(n*m*sizeof(int));
    #pragma omp parallel
    {
      int p = omp_get_num_threads();
      int t = omp_get_thread_num();   // thread id = row index
      int *x = a;
      register int sum = x[t];        // start from element (t,0), accumulate in a register
      for (int j=t+p, i=1; i<n; i++, j+=p) {
        sum += x[j];
      }
      x[t] = sum;                     // single write to the shared cache-line
    }

NUMA memory performance

Memory (”von Neumann”) bottleneck in NUMA

The memory access times are non-uniform:

Memory closer to a core is faster than memory on another CPU

  • other problems
    • Connection to memory goes through memory controllers (memory organized in banks)
    • Fewer memory controllers than cores, and the cores share the memory bandwidth

First touch

Applications should try to allocate memory close to the cores that use it.

“First-touch” OS support:

“The first core (thread) to touch a virtual memory page will cause the virtual page to be allocated in memory close to that core”

This also means that threads touching their data later are forced to use memory farther away (remote instead of local).
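A minimal first-touch sketch with OpenMP (assuming pinned threads and the same static schedule for initialization and computation), so each thread places the pages it will later work on:

#include <omp.h>

// First touch: each thread initializes the part of the array it will use later,
// so those pages are allocated in memory close to that thread's core.
void first_touch_init(double *a, long n)
{
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < n; i++)
    a[i] = 0.0;
}

void process(double *a, long n)
{
  #pragma omp parallel for schedule(static)   // same distribution as the initialization
  for (long i = 0; i < n; i++)
    a[i] = 2.0 * a[i];                        // mostly local memory accesses
}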

Latency hiding techniques

These techniques hide latency, but the memory bandwidth must still be sufficient to serve all requests.

Prefetching

Start loading operands well before use.

Multi-threading (via hardware or software)

Switching between virtual processors (hardware threads) when a thread requests data, and switching back when the data has arrived.

(requires explicitly parallel instruction computing, EPIC)

Memory hierarchy and latency

Registers 0 cycles

L1 cache 1 cycle

L2 cache 10 cycles

L3 cache 30 cycles

Main memory 100 cycles

Disk 100,000 cycles

Tape 10,000,000 cycles

Super-linear speedup

Previous claim: super-linear speedup is impossible (proof via the simulation argument).

Simulation argument: linear/perfect speedup is the best possible.

Simulation argument assumes that parallel and sequential processors behave similarly = same memory behavior.

This is not true for real systems with a deep memory hierarchy.

Definition

single processor input size n, has access to a deep memory hierarchy

multiple processors input size n/p, each has full access to main memory and cache

ref(n) number of memory references

Mseq(n) average time per reference for the single processor (deep hierarchy)

Mpar(p,n) average time per reference for a single processor in the parallel system (flat hierarchy)

If performance is defined by cost of memory references:

S_p(n) = \frac{\text{ref}(n) \cdot \text{Mseq}(n)}{(\text{ref}(n)/p) \cdot \text{Mpar}(p,n)} = p \cdot \frac{\text{Mseq}(n)}{\text{Mpar}(p,n)}

If the per-processor working set n/p fits into cache while n does not, then Mpar(p,n) < Mseq(n) and S_p(n) > p: super-linear speedup.
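A small worked example with assumed (illustrative) numbers:

Mseq(n) = 4 cycles per reference (data mostly in main memory), Mpar(p,n) = 1 cycle per reference (n/p fits in cache) ⇒ S_p(n) = p \cdot \frac{4}{1} = 4p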

Application performance and memory system

Bounds for applications (”roofline performance / estimate model”)

Memory bound time operating on data < time reading/writing data

Compute bound time operating on data > time reading/writing data

  • formal definition

    Given implementation A on a multi-core system:

    performance P = e.g. in operations per second (or another measure)

    operational intensity O = average number of operations per byte read/written

    memory bandwidth B


    A is memory bound if P/O > B

    A is compute bound if P/O ≤ B

  • example

    memory bound algorithms

    Merge, prefix-sums → few operations per byte

    less memory bound algorithms

    Matrix-matrix multiplication → O(n^3) operations on O(n^2) data


    Example: assume prefix sums performs

    O = 1/16 FLOP per byte

    (1 FLOP per 8-byte word read and written = 16 bytes of traffic)

    P = 2 GFLOP/s

    Then it is memory bound if the bandwidth is < 32 GByte/s (see the calculation below).
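    The required bandwidth follows directly from the definition (using the illustrative numbers above):

    P/O = \frac{2\ \text{GFLOP/s}}{1/16\ \text{FLOP per byte}} = 32\ \text{GByte/s}, so the implementation is memory bound whenever B < 32 GByte/s.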

If memory bound, the speedup is limited by the memory bandwidth (= how much faster can p cores read/write than a single core?)

Program order and memory consistency

Execution-order is not always program-order.

Sequential program order

Sequential consistency

Parallel program order

p processors execute some interleaving of the parallel program (SPMD or MIMD).

Possible: sequential consistency, relaxed consistency

  • example
    // thread 0
    a = 1;
    a = 2; // last assignment to a by thread 0
    b = 3; // last assignment to b by thread 0

    // thread 1
    b = 7; // last assignment to b by thread 1
    a = 1;
    a = 2; // last assignment to a by thread 1

    Possible outcomes:

    • a = 2 , b = 3
    • a = 2 , b = 7

    Impossible outcomes:

    • a = 1 , b = 3

Consistency models

Memory consistency (Problem)

What view do threads have of the memory?

In what order do writes to different locations in memory become visible to other cores?

  • example
    // core 0
    x = 0;
    // … some code
    x = 1;
    if (y==0) {
      // body
    }

    // core 1
    y = 0;
    // … some code
    y = 1;
    if (x==0) {
      // body
    }

    Question: can core0 and core1 both execute the body of the if-statement?

    Answer:

    No. Before core 0 can test y it has already set x to 1 (and symmetrically for core 1). So if core 0 enters its body it read y==0, which means core 1 had not yet executed y=1; core 1's later test of x then sees 1 and fails.

    If both variables are already flipped to 1 before the tests, then no body is executed → therefore this is not a good lock algorithm.

    Assuming that x , y are not in the cache of any core:

    The answer is only valid under the assumption that writes to memory are not delayed - otherwise we do not get an interleaving of the two programs.

Sequential consistency

Outcome of parallel program = execution of some interleaving of the memory accesses of all processors .

Memory operations of each processor:

  1. take effect immediately
  2. are performed in program-order


Easy to prove properties of programs or reason about correctness (e.g. with invariants).

Would have to be guaranteed by the hardware (caches, write buffers, ... made logically transparent).

  • Sequential consistency not guaranteed by modern multiprocessors because it reduces performance.

    In modern multiprocessors (e.g. x86, POWER, SPARC):

    • Caches may delay writes
    • Write buffers may delay and/or reorder writes
    • Memory network may reorder writes
    • Compiler may reorder updates

Relaxed consistency

loads = reads, stores = writes

May permit:

  • Reordering loads after loads
  • Reordering loads after stores
  • Reordering stores after loads
  • Reordering stores after stores


Weaker constraints on hardware.

More difficult to prove properties of programs or reason about correctness.

  • example

    Initial value: flag = false

    // waiting thread
    while (!flag) { }
    a = otherval;

    // writing thread
    otherval = 42;
    flag = true;

    Under sequential consistency:

    the outcome is a == 42

    Under relaxed consistency:

    Writes could be reordered and any old value could be stored in a .

    Compiler won’t attempt move assignment to a before loop but might remove the loop altogether. (Declare boolean flag as volatile ).


    This can be solved with a fence: fence(&flag) completes all writes executed before the fence and only then sets flag (a version with C11 atomics is sketched after this example).

    // waiting thread
    while (!flag) { }
    fence;
    a = otherval;

    // writing thread
    otherval = 42;
    fence(&flag);
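    For comparison, a sketch of the same pattern with C11 atomics (an assumption, not from the lecture); release/acquire ordering plays the role of the fences:

    #include <stdatomic.h>
    #include <stdbool.h>

    int otherval;
    atomic_bool flag = ATOMIC_VAR_INIT(false);

    // writing thread
    void producer(void) {
      otherval = 42;
      atomic_store_explicit(&flag, true, memory_order_release);  // earlier writes become visible first
    }

    // waiting thread
    void consumer(void) {
      while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                       // spin until flag is set
      int a = otherval;         // guaranteed to read 42
      (void)a;
    }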

Programming models for memory

Memory fences and memory barriers enforce pending operations to complete.

Fence = local operation for a core

Barrier = involves all processors

Memory fence

Programming model (memory model)

Synchronization/coordination construct to enforce a specific set of interleavings:

fence; flushes all write buffers.

  • example
    // core 0
    x = 0;
    fence; // so core 1 can enter
    // … some code
    x = 1;
    fence; // so core 1 can't enter
    if (y==0) {
      // body
    }

    // core 1
    y = 0;
    fence;
    // … some code
    y = 1;
    fence;
    if (x==0) {
      // body
    }