5570: Advanced Unix Programming

Programming assignment 1

Objectives

Understand the level of programming expertise expected in this course.
Use multiple files in your program, and use make to organize the compilation.
Use tools that improve programming productivity on Unix systems, such as debuggers.
Practice good program design, such as separating the interface from the implementation and making the program easy to test for correctness.
Use a few simple Unix system calls.

Deadlines

The program must be submitted to us by 6 pm, Friday, 13 Sep 2002.
If you want feedback from me, you must show me your program design and an outline of your algorithm by Friday, 6 Sep 2002.

Submission instructions

Use tar to archive your files into a file called hw1.tar and email it as an attachment to mao@cs.fsu.edu, with a cc to asriniva@cs.fsu.edu. Make sure that you do not include the executable, any object files, or core dump files!
We will run the following commands at the shell prompt to build and run your program:
tar xvf hw1.tar
make
query file1 file2 ...
[queries]
The name of your executable must be query. It takes one or more file names as its command line arguments, as described below. The user will then input a few "queries", which should be handled as described below.

Description

You will write a program that accepts queries from the user, and then outputs the name of the document that best matches that query, from a set of documents given as command line arguments to the program.

Details:

Your program will read the environment variable VOCABULARYLIST, which we will set to the name of a file that will contain a list of words, one word per line. (In order to test your program, you too will need to create such a file, and set this variable to the appropriate file name.) Your program will read this list of words and create a data structure that permits efficient operations on this list, as needed by the rest of the program. Let us call this data structure Vocabulary.
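One possible shape for Vocabulary, offered only as a sketch: keep the words in a sorted array and look them up by binary search, which costs O(log v) per lookup. The assignment does not prescribe this structure; a hash table or trie would also qualify. The names vocab_path, vocab_sort, and vocab_index are hypothetical, and reading the words from the file into the array is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: the vocabulary file name taken from the
   environment, or NULL if VOCABULARYLIST is not set. */
const char *vocab_path(void)
{
    return getenv("VOCABULARYLIST");
}

/* Comparison callback shared by qsort and bsearch; each element
   is a pointer to a NUL-terminated word. */
static int cmp_words(const void *a, const void *b)
{
    const char *const *wa = (const char *const *)a;
    const char *const *wb = (const char *const *)b;
    return strcmp(*wa, *wb);
}

/* Sort the word list once, after it has been read from the file. */
void vocab_sort(const char **words, size_t n)
{
    assert(words != NULL);
    qsort((void *)words, n, sizeof words[0], cmp_words);
}

/* Return the index of word in the sorted list, or -1 if the word
   is not in the vocabulary. */
long vocab_index(const char **words, size_t n, const char *word)
{
    const char **hit;

    assert(words != NULL && word != NULL);
    hit = (const char **)bsearch(&word, words, n, sizeof words[0], cmp_words);
    return hit == NULL ? -1L : (long)(hit - words);
}
```

With this layout, the index returned by vocab_index can serve directly as the column number of a word in the document matrix described next.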

Your program will read the files (documents) specified as command line arguments, and represent this set of documents as a matrix having d rows and v columns, where d = number of documents and v = number of words in the vocabulary. Each row of the matrix will represent a document, and each column will represent a word from the vocabulary list. The (i,j)th element of this matrix will be the frequency of word j (of the vocabulary) in document i. [Frequency = (Number of occurrences of the word in the document)/(Total number of words in the document). Note that the total number of words includes those not in the vocabulary.]
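As an illustration of the frequency definition, the sketch below fills one row of the matrix. It assumes the document has already been split into words and each word mapped to its vocabulary index, with -1 standing for a word outside the vocabulary. The function name fill_row and this intermediate representation are assumptions made for the example, not part of the assignment.

```c
#include <assert.h>
#include <stddef.h>

/* Fill row[0..v-1] with word frequencies for one document.
   ids[k] is the vocabulary index of the k-th word of the document,
   or -1 if that word is not in the vocabulary. total is the number
   of words in the document, including out-of-vocabulary words. */
void fill_row(double *row, size_t v, const long *ids, size_t total)
{
    size_t j, k;

    assert(row != NULL);
    assert(ids != NULL || total == 0);

    for (j = 0; j < v; j++)
        row[j] = 0.0;

    /* Count occurrences of each vocabulary word. */
    for (k = 0; k < total; k++) {
        if (ids[k] >= 0) {
            assert((size_t)ids[k] < v);
            row[ids[k]] += 1.0;
        }
    }

    /* Divide by the total word count; an empty document keeps
       an all-zero row (an assumption of this sketch). */
    if (total > 0) {
        for (j = 0; j < v; j++)
            row[j] /= (double)total;
    }
}
```

For example, a four-word document whose words map to indices 0, -1, 2, 0 with v = 3 yields the row (0.5, 0, 0.25): the out-of-vocabulary word adds to the denominator but to no column.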

The program will then read queries typed by the user, from stdin, handling one query at a time. Each query will be a list of words, possibly including words from outside the vocabulary, terminated by a newline. The program will determine the document that best matches a query by the following process. It will first create a vector of length v from the query by making the jth component of the vector equal to the number of occurrences of the jth word (of the vocabulary) in the query. (This will enable the user to emphasize certain words by typing them multiple times.) The program will then multiply the document set matrix by this vector, and choose the document corresponding to the largest component of the resulting vector. The name of this document will be printed to stdout.
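The matrix-vector product and the choice of the largest component can be sketched as follows. The sketch assumes the matrix is stored in row-major order in a single array of doubles and resolves ties in favor of the lowest-numbered document; neither choice is specified by the assignment, and the name best_document is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* m is a d x v matrix in row-major order, q is a query vector of
   length v. Returns the row index whose dot product with q is
   largest. On a tie, the lowest index wins (an assumption). */
size_t best_document(const double *m, size_t d, size_t v, const double *q)
{
    size_t i, j, best = 0;
    double best_score = 0.0;

    assert(m != NULL && q != NULL && d > 0);

    for (i = 0; i < d; i++) {
        double score = 0.0;

        /* Component i of the product: row i of m dotted with q. */
        for (j = 0; j < v; j++)
            score += m[i * v + j] * q[j];

        if (i == 0 || score > best_score) {
            best = i;
            best_score = score;
        }
    }
    return best;
}
```

The returned index would then be used to look up the corresponding file name among the command line arguments.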

This process is repeated until the program encounters end of file.
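A minimal sketch of the read-until-end-of-file loop, using fgets, is shown below. The fixed buffer size MAX_QUERY, the callback type query_handler, and the example handler count_words are all assumptions made for illustration. A line longer than the buffer would be split into several queries by this sketch, so a real program may prefer a growable buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_QUERY 4096  /* assumed upper bound on query line length */

/* Hypothetical callback type: called once per query line. */
typedef void (*query_handler)(char *query, void *ctx);

/* Read one query per line from in until end of file, strip the
   trailing newline, and pass each line to handle. Returns the
   number of lines processed. */
size_t for_each_query(FILE *in, query_handler handle, void *ctx)
{
    char line[MAX_QUERY];
    size_t n = 0;

    assert(in != NULL && handle != NULL);

    while (fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\n")] = '\0';
        handle(line, ctx);
        n++;
    }
    return n;
}

/* Example handler: adds the number of whitespace-separated words
   in the query to the size_t counter that ctx points to. */
void count_words(char *query, void *ctx)
{
    size_t *total = (size_t *)ctx;
    char *word;

    for (word = strtok(query, " \t"); word != NULL; word = strtok(NULL, " \t"))
        (*total)++;
}
```

In the actual program the handler would build the query vector and print the best-matching document name, and the loop would be called with stdin.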

Note:

Hints and reasons for using this procedure to select the best document will be provided later, in class. But meanwhile, please start working on the program design and then on the code.

Grading criteria

Your assignment will be graded by the criteria given below:

Correctness.
Efficiency of your algorithms. Implementation of the data structure to store the vocabulary list is particularly important.
Good programming practices, such as using header files to provide the interface, keeping only related functions in each file, using asserts to check the validity of the program's assumptions, providing a function that automatically tests the correctness of each program module, choosing reasonable variable names, writing comments that explain non-obvious aspects of the program, using guards in the header files to protect against multiple inclusions, facilitating use by C++ programmers by enabling C linkage, etc.
Program portability. Your program should work correctly on Linux and Solaris systems, and conform to ANSI specifications.
Makefile that has the correct dependencies, and recompiles only those files that need to be recompiled.
We will also ask you to demonstrate your program, and show your proficiency in using the debugger, and other productivity tools.
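Several of the practices above (an interface in a header file, include guards, and C linkage for C++ callers) can be seen together in one header. The following is only a sketch of what such a header might look like; the file name vocabulary.h, the opaque type, and every function name in it are hypothetical.

```c
/* vocabulary.h: hypothetical interface to the vocabulary module */
#ifndef VOCABULARY_H
#define VOCABULARY_H

#include <stddef.h>

/* Give the declarations C linkage when included from C++. */
#ifdef __cplusplus
extern "C" {
#endif

/* Opaque type: the layout is known only to vocabulary.c, which
   keeps the interface separate from the implementation. */
typedef struct Vocabulary Vocabulary;

/* Load the word list from the named file; NULL on failure. */
Vocabulary *vocabulary_load(const char *path);

/* Index of word in the vocabulary, or -1 if it is absent. */
long vocabulary_index(const Vocabulary *vocab, const char *word);

/* Number of words in the vocabulary. */
size_t vocabulary_size(const Vocabulary *vocab);

/* Release all storage owned by the vocabulary. */
void vocabulary_free(Vocabulary *vocab);

/* Self-test for this module; returns 0 if all checks pass. */
int vocabulary_self_test(void);

#ifdef __cplusplus
}
#endif

#endif /* VOCABULARY_H */
```

Because callers see only the opaque pointer, the underlying data structure can be changed later without recompiling anything other than the implementation file.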

After untarring the final test files listed below, you will obtain three directories, ex1, ex2, and ex3. Under each of these directories, you will find (i) a file called 'Vocab', giving the vocabulary list, (ii) files 'Doc[1-8]', which are eight documents, (iii) a file called 'Query', which lists queries, each followed by the document that best matched it, and (iv) a file called 'Matrix', which gives the document matrix, followed by the query vector for each query.

Final test files:

http://www.cs.fsu.edu/~asriniva/courses/aup02/hws/tests.hw1.tar

Last modified: 23 Sep 2002