This course is supported in part by the Nvidia CUDA Teaching Center Program (see the Nvidia press release of May 11, 2011).

Syllabus, Example Programs

Lecture 1

Lecture 2

Lecture 3

- Lecture 2 Review
- Dependence, loop transformation, and optimizing single-thread performance
- Homework 1 (Due: Jan. 16, 11:59pm, Wednesday): Optimize the code in simple.c so that it does the same thing but runs at least twice as fast. The class should form two groups (2-3 people per group) to do all homework and projects.
- Homework 2 (Due: Jan. 23, 11:59pm): Optimize the matrix multiply code (naive_mm.c) for linprog. You should also report the floating-point operation rate in MFLOPS on linprog for two benchmarks: naive_mm and your optimized routine, with specific parameters. Your code should meet the following requirements: (1) it should produce correct results for all matrix sizes (0 points otherwise; use -DCHECK to produce output for checking); (2) it should achieve a speedup of 4 for most reasonably large matrix sizes; and (3) for a 1024x1024 matrix, the multiplication time should be less than 2.1 seconds. The program should be compiled with the -O3 flag when testing performance. A loop-tiling sketch appears after this list.
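Loop tiling (blocking) is the standard transformation for this kind of speedup: it keeps a small block of each matrix in cache while it is reused. Below is a minimal sketch; the routine name, the row-major layout, and the block size BS = 64 are illustrative assumptions, not part of the assignment.

```c
#include <string.h>

#define BS 64  /* block size; tune for the target cache (assumed value) */

/* Hypothetical tiled multiply: C = A * B for n x n row-major doubles.
   The i-k-j inner order makes the innermost loop stride-1 in B and C. */
void mm_tiled(int n, const double *A, const double *B, double *C)
{
    memset(C, 0, (size_t)n * n * sizeof(double));
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];   /* reused across all j */
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The bounds checks (`i < n`, etc.) keep the sketch correct when the matrix size is not a multiple of the block size.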

- Lecture 3 Review
- Streaming SIMD Extension
- Intel C++ Intrinsics Reference. Here is a local copy.
- Homework 3 (Due: Jan. 30, 11:59pm): Further optimize your improved matrix multiply code from Homework 2 using SSE programming. You should also report the floating-point operation rate in MFLOPS on linprog of your SSE-optimized routine using 1024x1024 matrices. Your SSE matrix multiply routine should be correct for all matrix sizes; for 1024x1024 matrices, the multiplication time should be around 1 second. The program should be compiled with the -O3 flag when testing performance. You may need to use the following SSE2 intrinsics: _mm_loadl_pd(), _mm_loadh_pd(), _mm_storel_pd(), _mm_storeh_pd(), _mm_add_pd(), _mm_mul_pd(). A short sketch using these intrinsics follows this list.
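As a sketch of how the listed intrinsics fit together, the fragment below updates two adjacent elements of C per iteration; _mm_loadl_pd/_mm_loadh_pd load a pair of doubles without an alignment requirement. The routine name and row-major layout are assumptions, and n is assumed even.

```c
#include <emmintrin.h>

/* Hypothetical SSE2 fragment: computes C += A * B for n x n row-major
   doubles, two columns of C at a time (n assumed even). */
void mm_sse2(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j += 2) {
            __m128d c = _mm_setzero_pd();
            c = _mm_loadl_pd(c, &C[i * n + j]);      /* low  = C[i][j]   */
            c = _mm_loadh_pd(c, &C[i * n + j + 1]);  /* high = C[i][j+1] */
            for (int k = 0; k < n; k++) {
                __m128d a = _mm_set1_pd(A[i * n + k]);  /* broadcast A[i][k] */
                __m128d b = _mm_setzero_pd();
                b = _mm_loadl_pd(b, &B[k * n + j]);
                b = _mm_loadh_pd(b, &B[k * n + j + 1]);
                c = _mm_add_pd(c, _mm_mul_pd(a, b));    /* c += a * b */
            }
            _mm_storel_pd(&C[i * n + j],     c);
            _mm_storeh_pd(&C[i * n + j + 1], c);
        }
    }
}
```

In practice you would combine this with the blocking from Homework 2; the point here is only the load/multiply/add/store pattern.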

Lecture 10

- Lecture 9 Review
- Scalable Parallel Computers
- Homework 4 (Due: Feb. 16, 11:59pm): Matrix multiplication code with pthreads AND OpenMP. At the beginning of each program, write the names of the group members in a comment line. You should also report the floating-point operation rate on linprog of your pthreads and OpenMP routines using 2048x2048 matrices. Your pthreads and OpenMP matrix multiply routines should be correct for all matrix sizes; for 1024x1024 matrices, the time should be less than 0.3 seconds on linprog. An OpenMP sketch follows this list.
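For the OpenMP half, parallelizing the outer row loop is usually enough, since each thread then writes disjoint rows of C. A minimal sketch, assuming row-major storage and a hypothetical routine name (compile with gcc -O3 -fopenmp):

```c
#include <string.h>

/* Hypothetical OpenMP routine: C = A * B for n x n row-major doubles.
   Each thread owns whole rows of C, so no synchronization is needed. */
void mm_omp(int n, const double *A, const double *B, double *C)
{
    memset(C, 0, (size_t)n * n * sizeof(double));
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[i * n + k];          /* hoisted A[i][k] */
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}
```

The pthreads version follows the same decomposition: give each thread a contiguous band of rows of C to compute.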

Lecture 13

Lecture 17

- Programming distributed memory systems: Message Passing Interface (MPI)
- Homework 5 (Due: April 10, 11:59pm): Matrix multiplication code with MPI. At the beginning of each program, write the names of the group members in a comment line. Your MPI matrix multiply routine must be correct for any matrix size and number of processes; each MPI process should use O(total_size / nprocs) memory (no process should store any whole matrix); the time to multiply two 8192x8192 matrices should be less than 105 seconds (with any number of processes) on linprog. You should start from the mpi_mm_driver.c program and not modify anything that is given. You should also try to reuse your sequential implementation. A ring-based sketch follows this list.
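One way to meet the memory bound is a 1-D ring algorithm: each process keeps a block of rows of A and C, and the row blocks of B circulate around the ring, so no process ever holds a whole matrix. Below is a standalone sketch (it does not use mpi_mm_driver.c); it assumes n is divisible by the number of processes, and all names are illustrative.

```c
#include <mpi.h>

/* Hypothetical ring sketch: Aloc and Cloc hold m = n/p rows of A and C on
   this rank; Bblk holds m rows of B and is rotated each step. Cloc is
   assumed zero-initialized by the caller. */
void mm_mpi_ring(int n, const double *Aloc, double *Bblk, double *Cloc)
{
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    int m = n / p;                       /* assumes p divides n */
    for (int step = 0; step < p; step++) {
        int q = (rank + step) % p;       /* global row block of B held now */
        for (int i = 0; i < m; i++)
            for (int k = 0; k < m; k++) {
                double a = Aloc[i * n + (q * m + k)];
                for (int j = 0; j < n; j++)
                    Cloc[i * n + j] += a * Bblk[k * n + j];
            }
        /* pass our B block to the left neighbor, receive from the right */
        MPI_Sendrecv_replace(Bblk, m * n, MPI_DOUBLE,
                             (rank - 1 + p) % p, 0,
                             (rank + 1) % p, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
```

The inner triple loop is where your tuned sequential kernel from the earlier homework can be reused.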

Lecture 20

- Lecture 19 Review
- CUDA programming 2
- Homework 6 (Due: April 22, 11:59pm): Matrix multiplication code with CUDA. At the beginning of each program, write the names of the group members in a comment line. Your CUDA matrix multiply routine must be correct for matrices of sizes 1024x1024, 2048x2048, and 8192x8192; the time to multiply two 8192x8192 single-precision matrices should be less than 8 seconds on one of the GPU machines. A tiled-kernel sketch follows this list.
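The usual starting point is a shared-memory tiled kernel, sketched below in CUDA C. The kernel name, the tile size of 16, and the n-divisible-by-TILE assumption are illustrative (all three required sizes are multiples of 16).

```cuda
#define TILE 16  /* tile width; assumed to divide n */

/* Hypothetical tiled kernel: C = A * B for n x n row-major floats.
   Each block computes one TILE x TILE tile of C, staging tiles of A
   and B through shared memory to cut global-memory traffic. */
__global__ void mm_kernel(int n, const float *A, const float *B, float *C)
{
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; t++) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                 /* tiles fully loaded */
        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 /* done reading before next load */
    }
    C[row * n + col] = acc;
}

/* Launch sketch:
   dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
   mm_kernel<<<grid, block>>>(n, dA, dB, dC);  */
```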

Final exam on May 3, 10:00-12:00, in the classroom. You can bring one 8.5"x11" cheat sheet to the exam. Review for the final exam

Project presentations will be given on Friday, April 26. The project report is due Tuesday, April 30, at midnight.