A document retrieval program
Due: 24 Sep 2009
Educational objectives: Experience implementing a self-organizing linked list and a simple vector class, solving problems using the above classes, and implementing and using templates.
Statement of work: (i) Implement a linked list class that self-organizes as specified below. (ii) Implement a templated vector class. (iii) Implement a simple document retrieval program, which is a modification of that in assignment 1, but uses the containers you implemented instead of STL containers.
Deliverables: Turn in a makefile and all header (*.h) and cpp (*.cpp) files that are needed to build your software. Turn in your development log too, which should be a plain ASCII text file called LOG.txt in your project directory. You will submit all of these as described in www.cs.fsu.edu/~asriniva/courses/DSFall09/HWinstructions.html.
Requirements:
- Create a subdirectory called proj2.
- You will need to have a makefile in this directory. In addition, all the header and cpp files needed to build your software must be present here, as well as the LOG.txt file.
- Your code should be well designed and object oriented. You should implement appropriate classes for the software. Your code should contain at least the following classes: (i) Dictionary, which uses your Vector class to store words in the standard dictionary, (ii) Document, which stores the name of a document, and uses your Vector class to store words in the document and their weights (defined later), and (iii) DocumentSet, which stores Document objects, using a self-organizing linked list.
- You should implement a templated Vector class in the file Vector.h (note the capitalization of the first letter in Vector). The following features must be implemented: (i) a default constructor that initializes an array of size 2, (ii) a destructor, (iii) void push_back(const T &e), (iv) the operator[], and (v) int size( ) const. You may implement additional features, if you wish to. If you do not implement a copy constructor and an assignment operator, then you should prevent their use by making them private.
- You should implement a doubly-linked list class that self-organizes as described below. Your implementation may be specific for this application, instead of being a generic templated class. You are free to choose the specific features you wish to implement, but they should be reasonable. For example, you will certainly need to implement a method that lets you push_back Document objects into the list.
- Your software's main task is as follows. A user will give it queries consisting of a set of words. For each query, the software should give the most relevant documents that contain all words in that query.
- The software is run by the user on the command line, as follows:
Retrieve Filename-List, where Filename-List is a list of file names of zero or more ASCII text documents.
- The software first analyzes each file given on the command line, in the order in which they appear. Details of the analysis are explained later. A Document object corresponding to each document is placed in the linked list in the order in which it is analyzed. For example, if the program is run as Retrieve A.txt B.txt C.txt, then the front of the list will contain A.txt, and the back of the list C.txt. The software then waits for a series of user actions, and responds to each user action as described below.
Possible user actions and required software response:
a Filename: Analyze the ASCII text file Filename, and place the corresponding Document object at the end of the self-organizing linked list. If this file has already been analyzed previously, then remove the Document object corresponding to the previous analysis, before analyzing it again. If the file does not exist, then output File Filename does not exist to standard output (not to standard error).
q Word-List: Word-List is a list of words (this is defined later) separated by one or more blanks. The software returns a list of all documents that contain all the words in Word-List. Each line of the output will first give the name of a document, followed by a blank, followed by the document's relevance, which is a floating point number defined later. The documents are output in the order in which they are encountered in the self-organizing linked list (that is, the document closest to the front first). If no document matches, then output No matching document to standard output. If the query contains any word that is not valid (this is defined later), then output Invalid query to standard output, and don't process this query further. This command also causes self-organization as follows. If a document matches a query, then the corresponding object is moved two spaces closer to the front of the self-organizing list. If fewer than two objects are ahead of it on the list, then it is moved to the front of the list.
x: Quit the program.
- Analysis of a document. In this process, the software will identify the set of valid words in the document. Each valid word will be given a floating point weight. The weight of a valid word is the ratio of the frequency of its occurrence to the total number of valid words in the document. You should store the weights of each valid word, of each document analyzed, in a suitable container.
A valid word is defined as follows. A potential word is defined as a sequence of adjacent characters in the input file, separated by any of the following delimiters: whitespace (blanks, tabs, and newlines) or any of the following: ! ( ) - : ; " , . ? /. (A delimiter cannot be a part of a potential word.) A word is defined as a potential word that contains only alphabetic characters or apostrophes. For example, consider the following line of text: It is easy to find words, and also potential words. "asd34" files's. The following are words: It is easy to find words and also potential words files's. The following is a potential word, but not a word: asd34. A valid word is a word that occurs in the standard dictionary /usr/share/dict/words on linprog. Note that the word should exactly match an entry in the dictionary, without even differences in capitalization. For example, if the dictionary contains the word abandon but not Abandon, then the latter is not a valid word.
- Relevance of a document. A document is relevant to a query only if it contains each word in the query. If a document contains each word in the query, then its relevance is defined as the sum of the weights, in this document, of all the words in the query.
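The word, weight, and relevance definitions above can be illustrated with a small sketch. It is not the reference solution. The names isDelimiter, isWord, analyze, relevance, and Weights are hypothetical, and the sketch uses std::map and std::set only for brevity (the assignment itself restricts which STL features you may use). It also assumes that "total number of valid words" counts repeated occurrences.

```cpp
#include <cassert>
#include <cctype>
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Whitespace plus the punctuation delimiters listed in the assignment.
bool isDelimiter(char c) {
    static const std::string punct = "!()-:;\",.?/";
    return std::isspace(static_cast<unsigned char>(c)) != 0 ||
           punct.find(c) != std::string::npos;
}

// A word contains only alphabetic characters or apostrophes.
bool isWord(const std::string &s) {
    if (s.empty()) return false;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (!std::isalpha(c) && c != '\'') return false;
    }
    return true;
}

// Hypothetical stand-in for the per-document weight container.
typedef std::map<std::string, double> Weights;

// Count each valid word (exact, case-sensitive dictionary match), then
// divide each count by the total number of valid-word occurrences.
Weights analyze(const std::string &text, const std::set<std::string> &dict) {
    Weights w;
    int total = 0;
    std::string cur;
    for (std::size_t i = 0; i <= text.size(); ++i) {
        if (i < text.size() && !isDelimiter(text[i])) {
            cur += text[i];
            continue;
        }
        // A delimiter or the end of the text closes the potential word.
        if (isWord(cur) && dict.count(cur) > 0) {
            w[cur] += 1.0;
            ++total;
        }
        cur.clear();
    }
    for (Weights::iterator it = w.begin(); it != w.end(); ++it)
        it->second /= total;
    return w;
}

// Sums the weights of the blank-separated query words.
// Returns false if the document lacks any query word.
bool relevance(const Weights &w, const std::string &query, double &result) {
    std::istringstream in(query);
    std::string word;
    result = 0.0;
    while (in >> word) {
        Weights::const_iterator it = w.find(word);
        if (it == w.end()) return false;
        result += it->second;
    }
    return true;
}

int main() {
    std::set<std::string> dict;
    const char *entries[] = {"the", "cat", "sat", "on", "mat"};
    for (int i = 0; i < 5; ++i) dict.insert(entries[i]);

    Weights w = analyze("The cat sat on the mat. asd34 cat!", dict);
    double r = 0.0;
    if (relevance(w, "cat mat", r)) std::cout << "relevance " << r << "\n";
    if (!relevance(w, "cat dog", r)) std::cout << "No matching document\n";
    return 0;
}
```

In this example the text has six valid-word occurrences (The is rejected because only the lowercase form is in the toy dictionary, and asd34 is not a word), so cat has weight 2/6 and mat has weight 1/6, and the query cat mat has relevance 2/6 + 1/6 = 0.5.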
Sample file and executable: A sample executable is available in the ~cop4530/fall09/solutions/proj2 directory on linprog. The first person to find an error in our executable will get a bonus point!
Bonus points (5):
You may get up to 5 additional points for significant extra work, such as implementing more features, or providing a GUI. Please obtain feedback from us prior to doing this. If you wish to get bonus points, then please submit your work as usual, but send an email to Ya Li. Ya Li will schedule a meeting with you, and you can demonstrate the special features of your software then.
Notes:
- Your program should not have any output other than those specified above.
- You should not use the STL list or vector classes. You may use the string and pair classes. Please get my permission, by email, before using any other STL feature.
- We will test your Vector class on an entirely different application. So it is important for this class to be generic and exactly as specified.
Last modified: 18 Sep 2009