CDA5125 Introduction to Parallel and Distributed Systems, Spring 2013
This course is supported in part by the Nvidia CUDA Teaching Center program
(see the Nvidia press release of May 11, 2011).
Syllabus, Example Programs
- Lecture 2 Review
- Dependence, loop transformations, and
optimizing single-thread performance
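As a concrete illustration of the lecture topic (my own sketch, not from the lecture notes): loop interchange is the simplest dependence-safe transformation for single-thread performance, turning strided accesses to a row-major array into unit-stride ones.

```c
#include <stddef.h>

#define N 256
static double a[N][N];

/* Fill with arbitrary values so the two traversals are comparable. */
void fill(void) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = (double)(i + 2 * j);
}

/* Column-major traversal of a row-major array: consecutive accesses
   are N doubles apart, so most accesses miss in cache. */
double sum_ji(void) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* After loop interchange: unit-stride, cache-friendly accesses.
   The interchange is legal because no dependence constrains the
   order in which the additions are performed. */
double sum_ij(void) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```

Both functions compute the same sum; only the memory access order (and therefore the cache behavior) differs.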
- Homework 1 (Due: Wednesday, Jan. 16, 11:59pm): Optimize the code in
to make it do the same thing but run at least twice as fast.
The class should form two groups (2-3 people per group) to do all homework assignments.
- Homework 2 (Due: Jan. 23, 11:59pm): Optimize the matrix multiply code
(naive_mm.c) for linprog.
You should also report the floating-point operation rate in MFLOPS
on linprog with two benchmarks: naive_mm and your optimized routine
using 1024x1024 matrices. Your code should meet the
following requirements: (1) it should produce correct results for all
matrix sizes (0 points otherwise; use -DCHECK to produce output for
correctness checking); (2) it should achieve a speedup of 4 for most reasonably large matrix sizes;
and (3) for a 1024x1024 matrix, the multiplication time should be less
than 2.1 seconds. The program should be compiled with the -O3 flag when testing performance.
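One common way to reach the speedup targets above is loop tiling combined with the i-k-j loop order; a hedged sketch (the function name and block size are my own choices, not from naive_mm.c):

```c
#include <string.h>

/* Tiled (blocked) matrix multiply: C = A * B, all n x n, row-major.
   BS is tuned so one tile of each matrix fits in cache; the i-k-j
   order makes the innermost accesses to B and C unit-stride. */
#define BS 64
void mm_blocked(int n, const double *A, const double *B, double *C) {
    memset(C, 0, (size_t)n * n * sizeof(double));
    for (int ii = 0; ii < n; ii += BS)
      for (int kk = 0; kk < n; kk += BS)
        for (int jj = 0; jj < n; jj += BS)
          for (int i = ii; i < ii + BS && i < n; i++)
            for (int k = kk; k < kk + BS && k < n; k++) {
              double aik = A[i*n + k];            /* reused across j */
              for (int j = jj; j < jj + BS && j < n; j++)
                  C[i*n + j] += aik * B[k*n + j]; /* unit stride */
            }
}
```

The bounds checks (`&& i < n` etc.) keep the routine correct for matrix sizes that are not multiples of BS, which requirement (1) demands.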
- Lecture 3 Review
- Streaming SIMD Extensions (SSE)
- Intel C++ Intrinsics Reference. Here is a local copy.
- Homework 3 (Due: Jan. 30, 11:59pm): Further optimize your improved matrix
multiply code from homework 2 using SSE programming. You should also report the
floating-point operation rate in MFLOPS on linprog with your optimized SSE routine
using 1024x1024 matrices. Your SSE matrix multiply routine should be
correct for all matrix sizes;
for 1024x1024 matrices, the multiplication time should be around 1 second.
The program should be compiled with the -O3 flag when testing performance.
You may need to use the following SSE2 intrinsics: _mm_loadl_pd(),
_mm_loadh_pd(), _mm_storel_pd(), _mm_storeh_pd(), _mm_add_pd(), _mm_mul_pd().
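A minimal sketch of how the listed intrinsics combine (the function name and strided access pattern are illustrative assumptions, not from the assignment): _mm_loadl_pd/_mm_loadh_pd gather two non-contiguous doubles into one vector, _mm_mul_pd/_mm_add_pd do the two-lane arithmetic, and _mm_storel_pd/_mm_storeh_pd scatter the result back.

```c
#include <emmintrin.h>  /* SSE2 */

/* y[0] += s * x[0]; y[stride] += s * x[stride], in one vector op.
   Gathering with loadl/loadh is how you vectorize along a
   non-unit-stride dimension (e.g. down a column of a row-major
   matrix, with stride = n). */
void scaled_col_add(const double *x, double *y, int stride, double s) {
    __m128d vs = _mm_set1_pd(s);
    __m128d vx = _mm_setzero_pd();
    vx = _mm_loadl_pd(vx, x);            /* lane 0 = x[0]      */
    vx = _mm_loadh_pd(vx, x + stride);   /* lane 1 = x[stride] */
    __m128d vy = _mm_setzero_pd();
    vy = _mm_loadl_pd(vy, y);
    vy = _mm_loadh_pd(vy, y + stride);
    vy = _mm_add_pd(vy, _mm_mul_pd(vs, vx));
    _mm_storel_pd(y, vy);                /* write lane 0 back  */
    _mm_storeh_pd(y + stride, vy);       /* write lane 1 back  */
}
```

When the data are contiguous and 16-byte aligned, prefer a single _mm_load_pd/_mm_store_pd pair over the loadl/loadh idiom.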
Announcement: Midterm to be held on March 6, covering Lectures 1 through 12.
Midterm reading list
- Lecture 9 Review
- Scalable Parallel Computers
- Homework 4 (Due: Feb. 16, 11:59pm): Matrix multiplication code
with Pthreads AND OpenMP. At the beginning of each program, write the
names of the group members in a comment line. You should also report the
floating-point operation rate on linprog with your threaded and OpenMP routines
using 2048x2048 matrices. Your Pthreads and OpenMP matrix multiply
routines should be correct for all matrix sizes; for 1024x1024 matrices, the
time should be less than 0.3 seconds on linprog.
- Programming distributed memory systems: Message Passing Interface (MPI)
- Homework 5 (Due: April 10, 11:59pm): Matrix multiplication code
with MPI. At the beginning of each program, write the
names of the group members in a comment line. Your MPI matrix multiply routine
must be correct for any matrix size and number of processes; each MPI process
should use O(Total_size / nprocs) memory space (no process should store a
whole matrix); the time to multiply two 8192x8192 matrices should
be less than 105 seconds (with any number of processes) on linprog.
You should start from the mpi_mm_driver.c
program, and do not modify anything that is given. You should also try to reuse
your code from the earlier homework assignments.
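The O(Total_size / nprocs) memory requirement usually comes down to a row-block decomposition: each rank owns a block of rows of A and C, and blocks of B circulate. A sketch of just the index arithmetic (a plain-C helper of my own; the actual communication, e.g. MPI_Scatterv / MPI_Bcast / MPI_Gatherv, belongs in your MPI code):

```c
/* Row-block decomposition: process `rank` of `nprocs` owns rows
   [*row0, *row0 + *cnt) of an n-row matrix. The first (n % nprocs)
   ranks get one extra row, so the blocks cover all n rows exactly
   once for any n and nprocs. */
void row_block(int n, int nprocs, int rank, int *row0, int *cnt) {
    int base = n / nprocs;
    int rem  = n % nprocs;
    *cnt  = base + (rank < rem ? 1 : 0);
    *row0 = rank * base + (rank < rem ? rank : rem);
}
```

Each rank then allocates only cnt * n doubles for its slices of A and C, keeping per-process memory at O(Total_size / nprocs).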
- Lecture 19 Review
- CUDA programming 2
- Homework 6 (Due: April 22, 11:59pm): Matrix multiplication code
with CUDA. At the beginning of each program, write the
names of the group members in a comment line. Your CUDA matrix multiply routine
must be correct for matrices of sizes 1024x1024, 2048x2048, and 8192x8192;
the time to multiply two 8192x8192 matrices (single precision)
should be less than 8 seconds on one of the GPU machines.
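A common starting point for meeting the time bound is a shared-memory tiled kernel; a hedged sketch (TILE and the names are my own assumptions; it requires n to be a multiple of TILE, which holds for the three required sizes):

```cuda
#define TILE 16

/* Each thread block computes one TILE x TILE tile of C = A * B
   (single precision, row-major, n a multiple of TILE). Tiles of A
   and B are staged in shared memory, so each global element is read
   n/TILE times instead of n times. */
__global__ void mm_tiled(int n, const float *A, const float *B, float *C)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; t++) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                 /* both tiles fully loaded  */
        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 /* done reading this tile   */
    }
    C[row * n + col] = acc;
}

/* Launch (host side):
   dim3 grid(n / TILE, n / TILE), block(TILE, TILE);
   mm_tiled<<<grid, block>>>(n, dA, dB, dC);                        */
```

Tuning TILE (and adding per-thread register blocking) is usually what closes the gap to the 8-second target.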
Final exam on May 3, 10:00-12:00, in the classroom. You can bring
one 8.5"x11" cheat sheet to the exam.
Review for the final exam
Project presentations to be given on Friday, April 26. Project reports are due
Tuesday, April 30, at midnight.