Homework 2: Parallel prefix

The purpose of this assignment is to give you exercise on parallel prefix.

MIT students are welcome to use our Beowulf cluster or any other cluster at their disposal.

Take a look at Beowulf instructions for an introduction on using our class Beowulf.

Use of the PBS system is mandatory .


You are reminded to start on the homework early. Late homework will not be accepted, no matter what your reason is.

Instructions for submission:

For our sanity's sake, please follow the submission procedure:

Part I: MPI Scan
Part II: Factorization of tridiagonal matrices

Part I: MPI Scan

MPI_Scan computes the "partial reduction" of data, using a user-specified operation. For example, on 4 processors, given

[1 2 3] [4 5 6] [7 8 9] [10 11 12]

MPI_Scan with MPI_Op sum will return

[1 3 6] [10 15 21] [28 36 45] [55 66 78]

The above is WRONG. Read clarifications.

This is clearly parallel prefix for any associate operation MPI_Op. However, in LAM-MPI (which we are using), MPI_Scan is implemented in a serial way (proc 0 process the data, pass result to proc 1, ....).

In this question you are asked to implement MPI_Scan with parallel prefix, and benchmark it against the LAM-MPI MPI_Scan.


void MyScan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )


sendbuf - starting address of send buffer
count - number of elements in input buffer 
datatype - datatype of elements on input buffer
op - operation
comm - communicaor

recvbuf - starting address of receive buffer (which contains the results)

Implement the above function, and benchmark it against MPI_Scan on 4 processors, using a distributed vector of length from [400 4000 40000 400000 4000000] (meaning you have 1k, 10k, .. on each proc).

Explain what you see.


Part II: Factorization of tridiagonal matrices

In this question you will implement the tridiagonal factorization, discussed in lecture 4, slide 16.

Implement the following function:

TriFact(double *a, double *b, double *c, double *d, double *l)


a - input diagonal values
b - input superdiagonal values
c - input subdiagonal values

d - output diagonal values on the U part
l - output subdiagonal values on the L part

Note that in the input matrix there are N diagonal values and N-1 super/subdiagonal values. Therefore length(a) would be different from length(b) and length(c) on at least some processors. Usually it means either local(a) on the 1st or the last process would be longer. You get to decide.

Some initial Send/Recv might be necessary to make this work. This is allowed.

Benchmark your program with N=[400 4000 40000] on 4 processors.


Ron Choy
Last modified: Sun Feb 9 22:32:13 GMT 2003