5570: Advanced Unix Programming

Hints and clarifications on Programming assignment 1

Clarification

Word size

All words will be less than 100 characters long. A word is a sequence of characters delimited by white space (possibly several white-space characters in a row, in both documents and queries). You can treat upper-case and lower-case letters as different characters. In practice, I will probably filter the files so that they contain only lower-case letters (no punctuation, digits, etc.).
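One possible way to read such words is sketched below. It assumes the standard fscanf function with a width-limited %s conversion; the names MAX_WORD, next_word and count_words are made up for this illustration and are not required by the assignment.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_WORD 100   /* words are shorter than 100 characters */

/* Reads the next white-space-delimited word from fp into buf.
 * buf must hold at least MAX_WORD bytes.
 * Returns 1 if a word was read, 0 at end of file. */
int next_word(FILE *fp, char buf[MAX_WORD])
{
    assert(fp != NULL);
    /* %99s skips any run of leading white space (spaces, tabs,
     * newlines) and stores at most 99 characters plus the '\0'. */
    return fscanf(fp, "%99s", buf) == 1;
}

/* Counts the words remaining in the stream, much like wc -w. */
long count_words(FILE *fp)
{
    char word[MAX_WORD];
    long n = 0;

    while (next_word(fp, word))
        n++;
    return n;
}
```

Because %s treats any run of white space as a single delimiter, multiple blanks between words need no special handling.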

Vocabulary list

You may not assume anything about the number of words. However, the vocabulary file will typically contain around 100 words. Words will not be repeated. The file may or may not be sorted. Your data structure should work efficiently with a sorted file too. A plain binary search tree is therefore not efficient enough: when the words are inserted in sorted order, it degenerates into a chain that performs like a linked list. A balanced tree will be acceptable. You may use other data structures, such as a hash table.
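As an illustration of the hash table option, here is a minimal sketch of a chained hash table that maps each word to its position in the vocabulary list. Everything in it is an assumption made for the example: the names (struct vocab, vocab_add, vocab_lookup), the bucket count of 211, and the choice of hash function. Its behaviour does not depend on whether the file is sorted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211   /* illustrative prime, not from the assignment */

struct entry {
    char *word;          /* private copy of the vocabulary word */
    int index;           /* position of the word in the vocabulary */
    struct entry *next;  /* next entry in the same bucket */
};

/* Must start zeroed, e.g. struct vocab v = {0}; */
struct vocab {
    struct entry *bucket[NBUCKETS];
    int size;            /* number of words stored so far */
};

/* Multiply-by-33 string hash, reduced to a bucket number. */
static unsigned hash(const char *s)
{
    unsigned long h = 5381;

    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return (unsigned)(h % NBUCKETS);
}

/* Returns the index of word, or -1 if it is not in the vocabulary. */
int vocab_lookup(const struct vocab *v, const char *word)
{
    const struct entry *e;

    assert(v != NULL && word != NULL);
    for (e = v->bucket[hash(word)]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e->index;
    return -1;
}

/* Adds word and returns its index, or -1 if memory runs out.
 * Words are not repeated in the file, so duplicates are not checked. */
int vocab_add(struct vocab *v, const char *word)
{
    unsigned h;
    struct entry *e;

    assert(v != NULL && word != NULL);
    h = hash(word);
    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->word = malloc(strlen(word) + 1);
    if (e->word == NULL) {
        free(e);
        return -1;
    }
    strcpy(e->word, word);
    e->index = v->size++;
    e->next = v->bucket[h];   /* push onto the front of the chain */
    v->bucket[h] = e;
    return e->index;
}
```

With a fixed number of buckets the chains grow as the vocabulary grows, so a much larger vocabulary would call for resizing the table; that step is left out here.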

Documents

There is no limit on the size of a document. Words can, of course, be repeated, and will generally not be sorted. Note that the document will contain words outside the vocabulary too. You can ignore these; however, you need to include these in your count of the total number of words, since the frequency is defined as: Frequency = (Number of occurrences of the word, in the document)/(Total number of words in the document). Note that the total number of words includes those not in the vocabulary. You can use wc -w, if you wish to, to count the number of words in a file.
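The frequency definition can be sketched in C as follows. This is only an illustration under stated assumptions: the function name doc_frequencies is invented, and the linear search over a small array of vocabulary words is a stand-in for whatever lookup structure you actually choose.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fills freq[0..nvocab-1] with the frequency of each vocabulary word
 * in the document read from fp. Returns the total number of words
 * read, including words that are not in the vocabulary.
 * The linear search over vocab[] is a stand-in for a real lookup. */
long doc_frequencies(FILE *fp, const char *const vocab[], int nvocab,
                     double freq[])
{
    char word[100];          /* words are shorter than 100 characters */
    long total = 0;
    int i;

    assert(fp != NULL);
    for (i = 0; i < nvocab; i++)
        freq[i] = 0.0;

    while (fscanf(fp, "%99s", word) == 1) {
        total++;             /* counted even if outside the vocabulary */
        for (i = 0; i < nvocab; i++) {
            if (strcmp(word, vocab[i]) == 0) {
                freq[i] += 1.0;      /* occurrence count, for now */
                break;
            }
        }
    }

    if (total > 0)           /* an empty document keeps all zeros */
        for (i = 0; i < nvocab; i++)
            freq[i] /= (double)total;
    return total;
}
```

The document is read word by word and never held in memory, so its size does not matter.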

Query

You may assume the length of each query is less than 1000 characters. A query is terminated by a single new line, and will contain at least one word. There will be no newline between words of a single query. You can therefore assume that all the characters, until you encounter a newline, form a single query.
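A sketch of reading one query per line and turning it into a vector of counts is given below. It assumes the standard fgets and strtok functions; the names MAX_QUERY, read_query and query_vector are hypothetical, and the linear search over vocab[] again stands in for a real vocabulary lookup.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_QUERY 1000   /* a query is shorter than 1000 characters */

/* Reads one query line from fp into line and strips the newline.
 * Returns 1 on success, 0 at end of file. */
int read_query(FILE *fp, char line[MAX_QUERY + 1])
{
    assert(fp != NULL);
    if (fgets(line, MAX_QUERY + 1, fp) == NULL)
        return 0;                        /* end of file */
    line[strcspn(line, "\n")] = '\0';    /* drop the newline, if any */
    return 1;
}

/* Turns a query into a vector of counts, one entry per vocabulary
 * word. Words outside the vocabulary are ignored.
 * Note that strtok modifies line. */
void query_vector(char *line, const char *const vocab[], int nvocab,
                  int vec[])
{
    char *word;
    int i;

    for (i = 0; i < nvocab; i++)
        vec[i] = 0;

    for (word = strtok(line, " \t\r\f\v"); word != NULL;
         word = strtok(NULL, " \t\r\f\v")) {
        for (i = 0; i < nvocab; i++) {
            if (strcmp(word, vocab[i]) == 0) {
                vec[i]++;
                break;
            }
        }
    }
}
```

The buffer has room for 999 characters, the newline, and the terminating '\0', so a query within the stated limit always fits in a single fgets call.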

Hints

C linkage in C++ programs: Check the use of extern "C" in /usr/include/stdio.h on program. Then read about it, either in a C++ book, or on the web.
Program design: You are free to choose your design. I might have used the following modules: (i) vocabulary, (ii) document, (iii) documentset, (iv) matrices, and (v) query. Each of these would have a header file specifying an interface, and a C file providing an implementation. You might also have a "utility" module to provide miscellaneous facilities.

Example

Let us assume that I created a vocabulary list in a file vocab1 and set the environment variable VOCABULARYLIST to it. For example, I might do the following in bash:
- Let us assume the file vocab1 contains the following:
  red green blue
- I have document files called file1 and file2.
- file1 contains the following:
  i like red apples orange oranges and blue skies
- file2 contains the following:
  red apples look nicer than red cars
- The matrix for this set of documents will look as follows:
  0.111 0.000 0.111
  0.286 0.000 0.000
A sample session may run as follows:
  $ query file1 file2
  $ red apples blue blue sky
  $ file1
  $ red apple blue sky
  $ file2
  $ ^D
  $
The first query is converted to the vector: (1 0 2). Multiplying the matrix by this vector results in the vector (0.333 0.286), and therefore the first document has the largest value. The second query is converted to the vector: (1 0 1). Multiplying the matrix by this vector results in the vector (0.222 0.286), and therefore the second document has the largest value. The user then types Control-D (which I denote by ^D) to indicate End Of File, and the program terminates.
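The multiplication step in this example can be sketched as below. The function name best_document and the row-by-row storage of the matrix are assumptions made for the illustration, and so is the rule that a tie goes to the earlier document, since the example does not say how ties are broken.

```c
#include <assert.h>
#include <stddef.h>

/* Computes score = matrix * query, where matrix has ndocs rows and
 * nvocab columns stored row by row, and returns the index of the
 * row (document) with the largest score. Ties go to the earlier
 * document; that rule is an assumption of this sketch. */
int best_document(const double *matrix, int ndocs, int nvocab,
                  const int *query, double *score)
{
    int d, w, best = 0;

    assert(ndocs > 0);
    for (d = 0; d < ndocs; d++) {
        score[d] = 0.0;
        for (w = 0; w < nvocab; w++)
            score[d] += matrix[(size_t)d * nvocab + w] * query[w];
        if (score[d] > score[best])
            best = d;
    }
    return best;
}
```

With the matrix above and the query vector (1 0 2), the function returns 0 (file1); with (1 0 1) it returns 1 (file2), matching the session.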




Last modified: 9 Sep 2002