You may not assume anything about the number of words, although the vocabulary file will typically contain around 100 words. Words will not be repeated. The file may or may not be sorted, and your data structure should work efficiently with a sorted file too. A plain binary search tree is therefore not suitable: on sorted input it degenerates into a linked list, with linear-time insertion and lookup. A balanced tree is acceptable. You may also use other data structures, such as a hash table.
There is no limit on the size of a document. Words can, of course,
be repeated, and will generally not be sorted. The document will also
contain words outside the vocabulary. You can ignore these when
counting occurrences, but you must include them in the total number of
words, since the frequency is defined as: Frequency = (Number of
occurrences of the word in the document)/(Total number of words in the
document). If you wish, you can use wc -w to count the number of words
in a file.
You may assume the length of each query is less than 1000 characters. A query is terminated by a single newline, and will contain at least one word. There will be no newline between the words of a single query. You can therefore assume that all the characters up to the next newline form a single query.
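Since each query occupies exactly one line, one way to read it is to fetch the whole line and then split it on whitespace. The sketch below does this with std::getline; the function name readQuery is hypothetical, and using std::string means the 1000-character bound is not actually needed.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads one query (one line) from `in` and splits it into words.
// Returns false when no further line can be read (for example at end of input).
bool readQuery(std::istream& in, std::vector<std::string>& words) {
    words.clear();
    std::string line;
    if (!std::getline(in, line)) {
        return false;
    }
    std::istringstream fields(line);
    std::string word;
    while (fields >> word) {
        words.push_back(word);
    }
    return true;
}
```

A caller would typically loop with while (readQuery(std::cin, words)) and stop when the user sends end-of-file.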
extern "C" in /usr/include/stdio.h on program. Then read about it, either in a C++ book, or on the web.
vocab1 and set the environment variable VOCABULARYLIST to it. For example, I might do the following in bash:
$ export VOCABULARYLIST=vocab1
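Inside the program, the variable can be read with std::getenv, which returns a null pointer when the variable is not set. The following is a minimal sketch; the function name vocabularyPath and the choice to return an empty string for an unset variable are assumptions, not requirements of the assignment.

```cpp
#include <cstdlib>
#include <string>

// Returns the value of VOCABULARYLIST, or an empty string if it is unset.
// (Returning an empty string is an illustrative choice; a real program
// might instead print an error message and exit.)
std::string vocabularyPath() {
    const char* value = std::getenv("VOCABULARYLIST");
    return value ? std::string(value) : std::string();
}
```

The returned name can then be passed to std::ifstream to open the vocabulary file.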
vocab1 contains the following:
red
green
blue
file1 and file2.
file1 contains the following:
i like red apples orange oranges
and blue skies
file2 contains the following:
red apples look nicer
than
red cars
0.111 0.000 0.111
0.286 0.000 0.000
$ query file1 file2
$ red apples blue blue sky
$ file1
$ red apple blue sky
$ file2
$ ^D
$
(1 0 2). Multiplying the matrix by this vector results in the vector (0.333 0.286), and therefore the first document has the largest value.
(1 0 1). Multiplying the matrix by this vector results in the vector (0.222 0.286), and therefore the second document has the largest value.
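The two worked queries can be reproduced with a short sketch that builds the query vector by counting vocabulary words and then picks the matrix row with the largest dot product. The names queryVector and bestDocument are hypothetical, the vocabulary is assumed to be a std::map from word to 0-based index, and resolving ties in favour of the earlier document is an assumption the handout does not address.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Counts how often each vocabulary word occurs in the query; words
// outside the vocabulary are ignored.
std::vector<double> queryVector(
        const std::vector<std::string>& words,
        const std::map<std::string, std::size_t>& vocab) {
    std::vector<double> q(vocab.size(), 0.0);
    for (const std::string& w : words) {
        auto it = vocab.find(w);
        if (it != vocab.end()) {
            q[it->second] += 1.0;
        }
    }
    return q;
}

// Multiplies the frequency matrix (one row per document) by the query
// vector and returns the index of the row with the largest product.
// Assumes at least one row and that every row is as long as `query`.
// On a tie the earlier document wins (an assumption).
std::size_t bestDocument(const std::vector<std::vector<double>>& freq,
                         const std::vector<double>& query) {
    std::size_t best = 0;
    double bestScore = -1.0;  // scores are never negative
    for (std::size_t d = 0; d < freq.size(); ++d) {
        double score = 0.0;
        for (std::size_t w = 0; w < query.size(); ++w) {
            score += freq[d][w] * query[w];
        }
        if (score > bestScore) {
            bestScore = score;
            best = d;
        }
    }
    return best;
}
```

With the matrix from the example, the query red apples blue blue sky yields index 0 (file1) and red apple blue sky yields index 1 (file2).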