Homework 2: Parallel prefix
The purpose of this assignment is to give you exercise on parallel prefix.
MIT students are welcome to use our Beowulf cluster or any other cluster
at their disposal.
Take a look at Beowulf instructions for an introduction on using our class Beowulf.
Use of the PBS system is mandatory .
Outline
You are reminded to start on the homework early. Late homework will not be accepted, no matter what your reason is.
Instructions for submission:
For our sanity's sake, please follow the submission procedure:
- Create a directory called 'your user name'-hw2
For example, cly-hw2
- Create two subdirectories, prob1, prob2
- Put your files (source code, txts) in the appropriate directory
- tar your directory by: tar cf 'your user name'-hw2.tar 'your user name'-hw2
For example, tar cf cly-hw2.tar cly-hw2
- gzip it:
gzip cly-hw2.tar
- Put this file in the root of your home directory,
cp cly-hw2.tar.gz ~
- The file will be collected at 2/25 11:59pm EST. This corresponds to 2/26 4:59am GMT, which the server uses.
Part I: MPI Scan
Part II: Factorization of tridiagonal matrices
Part I: MPI Scan
MPI_Scan computes the "partial reduction" of data, using a user-specified operation. For example, on 4 processors, given
[1 2 3] [4 5 6] [7 8 9] [10 11 12]
MPI_Scan with MPI_Op sum will return
[1 3 6] [10 15 21] [28 36 45] [55 66 78]
The above is WRONG. Read clarifications.
This is clearly parallel prefix for any associate operation MPI_Op. However, in LAM-MPI (which we are using), MPI_Scan is implemented in a serial way (proc 0 process the data, pass result to proc 1, ....).
In this question you are asked to implement MPI_Scan with parallel prefix, and benchmark it against the LAM-MPI MPI_Scan.
Implement:
void MyScan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )
where
sendbuf - starting address of send buffer
count - number of elements in input buffer
datatype - datatype of elements on input buffer
op - operation
comm - communicaor
recvbuf - starting address of receive buffer (which contains the results)
Implement the above function, and benchmark it against MPI_Scan on 4 processors, using a distributed vector of length from [400 4000 40000 400000 4000000] (meaning you have 1k, 10k, .. on each proc).
Explain what you see.
Hints/Clarifications:
- You only have to consider the case of MPI_SUM operation.
- MPI_Scan actually returns, on the given example, [1 2 3] [5 7 9] [12 15 18] [22 26 30]. It is the parallel prefix on 1 integer on each node, done 3 times (for each element).
- That means for your benchmark, you should have each node add all its elements (into one single integer) before calling MPI_Scan.
- MPI_Scan (MPI version) has np-1 synchronization steps. In order for parallel prefix to win, you need to have less than that.
- Consider only MPI_INT for now
Part II: Factorization of tridiagonal matrices
In this question you will implement the tridiagonal factorization, discussed in lecture 4, slide 16.
Implement the following function:
TriFact(double *a, double *b, double *c, double *d, double *l)
where
a - input diagonal values
b - input superdiagonal values
c - input subdiagonal values
d - output diagonal values on the U part
l - output subdiagonal values on the L part
Note that in the input matrix there are N diagonal values and N-1 super/subdiagonal values. Therefore length(a) would be different from length(b) and length(c) on at least some processors. Usually it means either local(a) on the 1st or the last process would be longer. You get to decide.
Some initial Send/Recv might be necessary to make this work. This is allowed.
Benchmark your program with N=[400 4000 40000] on 4 processors.
Hints/Clarifications
Ron Choy
Last modified: Sun Feb 9 22:32:13 GMT 2003