Theory
w=M⋅v
w
(
w
1
w
2
)=M(acbd)⋅
v
(e
f
)
There are 2 ways to calculate this:https://www.youtube.com/watch?v=ebdfJwHM5vo
Method 1)
w[i]=∑0≤j<nM[i][j]⋅V[i]
(acbd)⋅(ef)=⎝⎛
(
ab
)
⋅(e
f
)
(
cd
)
⋅(e
f
)⎠⎞
Method 2)
w[j]=∑c0≤i<c1M[j][i]⋅v[i]+∑c1≤i<c2M[j][i]⋅v’[i]+⋯+∑cp−10≤i<cpM[j][i]⋅v[i]
(acbd)⋅(ef)=e⋅(ac)+f⋅(bd)
and the same also applies to matrices but slightly adapted
(adbec
f
)⋅(
g
i
k
h
j
l
)=
(
g
h
)
⋅(ad)+
(
i
j
)
⋅(be)+
(
k
l
)
⋅(c
f
)
The sequential algorithm is always in in
O(n2)
.
We now want to use collective operations to parallelize this.
Solution 1:
Allscatter
Based on method 1
Tpar(p,n)=O(n2/p+n)
for
n>
lo
gp
where:
O(n2/p)
for local multiplication
O(n+
lo
gp)
for
Allgather
Linear speedup for
p≤n
.
Parallel algorithm
-
Distribute
n/p
rows of
M
and
v
to processors
Now process
k∈[0;p−1]
has some rows of
M
and parts of
v
.
k⋅(n/p)≤j′<(k+1)⋅(n/p)
where
j′
is a subset of all rows
j∈[0;n−1]
.
Therefore
j:=j′+k⋅(n/p)
.
-
Use
Allgather
to get back
v
from pieces
v′
on each process.
Now process has some rows of
M
and enitre
v
.
-
Calculate
M′⋅v
locally
w’[j]=
∑
0≤i<nM’[j][i]⋅v[i]
where
j
is different for each process (and they iterate through
i
locally).
Solution 2:
Reduce_scatter
Based on method 2
Tpar(p,n)=O(n2/p+n)
for
n>
lo
gp
where:
O(n2/p)
for local multiplication
O(n+
lo
gp)
for
Reduce_scatter
Linear speedup for
p≤n
.
Parallel algorithm
-
Distribute
M
column-wise, and
v
row-wise.
Processor
k∈[0;p−1]
has columns
j∈[ci;ci+1]
of
M
.
where
ci=k⋅(n/p)
and
n/p
rows of
v
.
-
Calculate
M′⋅v′
locally
∑
c
x
≤i<c
x
+1M’[j][i]⋅v’[i]
-
Calculate sum of the result vectors of each process
w′[j]=
∑
c0≤i<c1M’[j][i]⋅v’[i]+
∑
c1≤i<c2M’[j][i]⋅v’[i]+⋯+
∑
c
p
−10≤i<cpM’[j][i]⋅v’[i]
MPI_Reduce_scatter_block(partial,result,n/p,MPI_FLOAT,MPI_SUM,comm);
for (i=0; i<p; i++) counts[i] = n/p;
MPI_Reduce_scatter(partial,result,counts,MPI_FLOAT,MPI_SUM,comm);