šŸ“Ž

Matrix-Matrix multiplication

seešŸ“Ž Matrix-Vector multiplication for detailed explaination

(abcdef)ā‹…(ghijkl)=(gh)ā‹…(ad)+(ij)ā‹…(be)+(kl)ā‹…(cf) \small \begin{pmatrix} a & b & c\\ d & e & f \end{pmatrix} \cdot \begin{pmatrix} g & h\\ i & j \\ k & l \end{pmatrix} = \begin{pmatrix} g & h\end{pmatrix} \cdot \begin{pmatrix} a \\ d\end{pmatrix} + \begin{pmatrix} i & j\end{pmatrix} \cdot \begin{pmatrix} b \\ e\end{pmatrix} + \begin{pmatrix} k & l\end{pmatrix} \cdot \begin{pmatrix} c \\ f\end{pmatrix} 

Sequential solution in O(n³)\small O(n³)

Assume A,B,C\small A,B,C are nƗn\small n\times n matrices.

and also n≫p,p∣n\small n \gg p, \quad p|n

MPI processes organised into pƗp\small \sqrt p \times \sqrt p matrices through MPI_Cart_create .

Each process (i,j)\small (i,j) contains submatrices A′,B′,C′\small A', B', C'

There are

p\small \sqrt p communicators for rows

p\small \sqrt p communicators for columns

Folklore: Blockwise algorithm

Analysis

Work: O(n3/p+n2/p)\small O(n^3 /p+n^2 /\sqrt p)

Communication: 2ā‹…(log⁔2(p)⋅α+β⋅(pāˆ’1)ā‹…n2/p) \small 2\cdot (\log_2( \sqrt p) \cdot α + β \cdot (\sqrt{p}-1) \cdot n^2 /p) 

Space: pā‹…n²\small \sqrt p \cdot n²

very inefficient - increases by factor p\small \sqrt p over sequential algorithm

Speedup: p\small p when p∈O(n2)\small p \in O(n^2)

q=p\small q = \sqrt p

Process (i,j)\small (i,j) computes Cij′=(Ai,0′,Ai,1′,…Ai,q′)ā‹…(B0,j′,B1,j′,…,Bq,j′) \small C'_{i_j} = \left(A'_{i, 0}, A'_{i, 1}, \ldots A'_{i, q}\right) \cdot \left(B'_{0, j}, B'_{1, j}, \ldots, B'_{q, j}\right) 

Algorithm

  1. Get data

    AllgatherAi,āˆ—\small A_{i,*} on rows

    AllgatherAāˆ—,j\small A_{*,j} on columns

  1. Locally compute (sequentially)

    Ci,j=āˆ‘1=0pāˆ’1Ai,1ā‹…B1,j \small C_{i, j}=\sum_{1=0}^{\sqrt{p}-1} A_{i, 1} \cdot B_{1, j} 

Scalable Universal Matrix-Multiplication Algorithm (SUMMA)

Analysis

Work: O(n3/p+n2/p)\small O(n^3 /p+n^2 /\sqrt p)

Communication: 2ā‹…pā‹…(log⁔2(p)⋅α+β⋅n2/p) \small 2\cdot \sqrt p \cdot (\log_2( \sqrt p) \cdot α + β \cdot n^2 /p) 

Space: 2ā‹…n²/p\small 2 \cdot n² / p

Speedup: p\small p when p∈O(n2)\small p \in O(n^2)

Idea: Pipelining Allgather operations

Ci,j\small C_{i, j} gets computed in p\small \sqrt p communication rounds.

Main algorithm

In round l∈[0;pāˆ’1]\small l \in [0;~\sqrt p -1] we

  1. broadcoast Ai,1\small A_{i,1} in the row communicator
  1. broadcoast Bi,j\small B_{i,j} in the column communicator
  1. Update C+=Aā‹…B\small C \texttt{ += } A \cdot B