>> Well, hello.
What we want to do first is talk about
WordBench, and we're going to talk
about WordBench as a technology
based on either a set or a table.
WordBench is really an analytical
tool designed for text files,
and the idea is to collect information
on the text files: looking for all
of the words in each text file,
with "word" given a working definition;
storing the unique words in a set
or a table along with the number of
times each word was found in the file
or files you have searched;
and being able to generate
brief reports to screen
and more detailed reports
to file on that analysis.
So let's look first at
the API for WordBench.
It's going to be four functions:
ReadText, which takes a file name
that is handed to it as an
fsu::String object;
WriteReport, which takes an output
file name as a string object,
along with a column width No. 1
and a column width No. 2, and I'll
explain those in a moment
when we run an example;
ShowSummary, which prints
a summary to screen;
and ClearData, which clears
all the data out of the
WordBench object.
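As a concrete sketch of that four-function interface, here is a minimal, hypothetical version in standard C++. The course code uses fsu::String and takes file names; std::string and streams are substituted here so the sketch stands alone and needs no files on disk.

```cpp
#include <cassert>
#include <iomanip>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Minimal sketch of the WordBench API (assumed shape, not the course code).
// ReadText and WriteReport take streams instead of file names.
class WordBench
{
public:
  void ReadText (std::istream& in)          // count whitespace-separated tokens
  {
    std::string s;
    while (in >> s) { ++frequency_[s]; ++count_; }
  }
  void WriteReport (std::ostream& out, int c1 = 10, int c2 = 10) const
  {
    for (const auto& entry : frequency_)    // c1, c2 are the column widths
      out << std::setw(c1) << entry.second
          << std::setw(c2) << entry.first << '\n';
  }
  void ShowSummary () const                 // brief report to screen
  {
    std::cout << "words read:      " << count_ << '\n'
              << "vocabulary size: " << frequency_.size() << '\n';
  }
  void ClearData ()                         // reset the object
  {
    frequency_.clear();
    count_ = 0;
  }
  unsigned long WordsRead () const { return count_; }
  unsigned long VocabSize () const { return frequency_.size(); }
private:
  std::map<std::string, unsigned long> frequency_;  // word -> count
  unsigned long count_ = 0;                         // total words read
};
```

The WordsRead and VocabSize accessors are additions for testing the sketch; they are not part of the four-function API described in the lecture.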
So let's first look at
the way WordBench operates.
[ Typing ]
So I will build the application.
And notice that I have some
text files; I'm going to use --
well, first I'm going to use
the test text file, test.text,
and let me actually show you
what that file looks like.
That has just a few strange cases that
come up: words with an apostrophe-s,
words with a dot in the middle, words
that end in a dot and then a space --
or rather, I should be using the
word "string" here, since they're
not necessarily words at this point --
single quotes around something,
things that begin with a backslash,
double quotes around something:
various specific instances
that end up being questioned
in the conversation about
what we are going to mean by a word.
And so we'll run these various
things through the test program.
And I'm going to take a pause here.
I'm just trying to figure out why I kept
getting that pop-up from the Filmmaker.
But anyway, so we now have
wb.x, and we can read text,
and I'm going to make
it that test.text file.
And, of course, there are
not many words in there,
and mostly they're unique words,
but I'm going to write that file --
write that to test.text.out, and then
we can go to another linprog window,
and we can see what that looks like.
And there's what the report looks like,
and you see why there are
two column-width parameters there.
So anyway, this shows you how some
of those things got processed.
One of the interesting things
is that a string of digits with a comma
in the middle got counted
as a word, and that's
because really it represents a number.
An apostrophe followed by a
letter counts as part of a word.
A dot followed by a letter or a
number also stays part of the word.
Of course, that could be
a decimal number there,
but it could also be something
like a file name, if you're writing
about computer programming or something.
So it would not be unthinkable to have
that end up being counted as a word;
whereas if that period is followed by
a space, that's the end of a sentence,
and so you would not count it
as part of the word and so on.
So with that demonstration in mind, I'm
going to remind myself of the interface,
so I'm going to clear the data.
So now I'm going to read
a file named "tiny.txt,"
and then read a file
called "small.txt" --
I guess the extension is dot t-x-t.
And that might not be your
idea of small, but it's smaller
than the one that's called big.
So let's do a ShowSummary,
and what you see here is
that we have two files currently read,
with those file names
in a comma-separated list;
the total count of the words that
have been read; and the current size
of the vocabulary. The size of the
vocabulary is defined to be the number
of unique words that
you have found in all
of the files listed under Current Files.
So let's do a Save again -- or
rather a Write -- and
I'll just call it x.x;
that's going to be a file
that we can look at.
And it looks like maybe I've managed
to make that pop-up go away.
So here's that file I just wrote,
and you can see that it's going
to be a significantly larger file, and
what you see is things that are words --
these are, of course,
appearing in alphabetical order,
and digits come before letters in
that order; so the first few things
in this file are going to
be some various numbers
that got pulled out of the text.
Then we start with the actual words,
and you'll see that what you have is
a count followed by a word:
the unique word "a" is in those
two text files 3,143 times;
"able" is in there 40
times; "abhorrent" is
in there only 2 times; "about" is 211,
but "abomination" is only 1, and so forth.
So that's the kind of report
you're going to generate.
The slight misalignment here
is due to the fact that some
of these words are actually longer
than the column width that happened
to get set in there, but
that's nothing to worry about.
If you got tired of that, you could
just set the column widths to be bigger.
So that is, in a nutshell, how
WordBench is supposed to operate.
What I want to do next is talk
about the underlying implementation.
So what you were witnessing
there is the API for WordBench,
and it really consists of just four
functions along with the boilerplate
of the constructor, destructor,
assignment operator, and so on.
And then the private section kind of
begins to tell you how this is
going to work.
So the first thing we
have is some typedefs.
We're going to define an fsu::Pair to be
our entry type, and that pair is going
to be a string object followed by an
unsigned long: that's the word,
the unique word, and the number
of times it appeared in the text.
And then we're going to use
LessThan as our predicate.
Less than for a pair, you
may recall, just simply looks
at the first coordinate of the
pair, so it's going to be defined
as alphabetical order
in the string objects.
Then we have different ways of
defining the set type,
and right now, when we compiled it, it
was using the UO vector implementation
of set. We can go back and recompile it
with a UO list implementation of set,
and we can come along later and define
it with a red-black left-leaning tree
as an implementation of set.
So obviously, you pick
just one of these;
all of these set implementations
have the same API --
the set API -- and so any
one of them should work.
We do have a little [inaudible] here:
some of these are M versions,
multi-versions, of set.
And of course, the program will still
function with a multiset instead of a set,
but because you're going
to be using Insert,
and Insert has quite different behavior
in a multiset, while it works,
it's not going to give
you the results you want:
it's going to insert a new pair every
time instead of updating the frequency.
Still, it's an interesting
exercise to see how that works.
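That set-versus-multiset distinction can be sketched with the standard containers standing in for the fsu versions. The comparator here plays the role the lecture assigns to the LessThan predicate for pairs: it looks only at the first coordinate, so entries are ordered alphabetically by the word alone.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// EntryType: (unique word, frequency), as in the set-based WordBench.
typedef std::pair<std::string, unsigned long> EntryType;

// Predicate that orders entries by the word alone, ignoring the count.
struct LessThanFirst
{
  bool operator() (const EntryType& a, const EntryType& b) const
  {
    return a.first < b.first;
  }
};

typedef std::set<EntryType, LessThanFirst>      SetType;   // unimodal set
typedef std::multiset<EntryType, LessThanFirst> MSetType;  // an "M version"
```

With the set, a second insert of the same word is rejected, so the entry stays unique; with the multiset, every would-be frequency update becomes a fresh insertion, which is exactly why the M versions "work" but give the wrong report.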
Now, there are only two actual
objects here in the private section;
all the rest of this
is just definitions.
You've got a word set,
which is a set of pairs,
and you have a list of infiles --
input files, that's what the name
means -- and that's an fsu::List
of fsu::String objects.
So those are the two
data members in the class.
So what I'm going to show you
next is that we can do WordBench
with a table instead of a set.
And here's the way that would work.
That's just simply using, in this
case, what we're going to call an
"ordered associative array."
So this is going to be WordBench 2
just to distinguish it from WordBench.
The behavior will be
identical to WordBench.
I guess I didn't even give you
the header -- oh,
yeah, here's the WordBench.
So the WordBench API is going to
be the same as it was before;
in fact, we're calling
this one WordBench 2.
You've got ReadText, WriteReport,
ShowSummary, and ClearData.
The implementation is a little bit
different, though, because we're going
to have separate definitions
of KeyType and DataType,
since we're now using a table
instead of a set of pairs.
So we're going to explicitly mention
the KeyType and the DataType.
Of course, the KeyType is
going to be our unique word.
The DataType is going to be a number
which is the frequency
count for that word.
And we're going to maintain separately
now a count of the number of words
that get read out of all the files.
It's a little more difficult to
extract that from the associative
array object.
I'll explain why in just a few minutes.
So we're going to have an
ordered associative array called
frequency_ ("frequency underscore"),
parameterized by KeyType and DataType;
so it's a table, or associative array,
instead of a set that we're going to use.
We still have the list of file names.
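Sketched in standard C++, with names assumed from the narration and std::map playing the role of the ordered associative array, the WordBench 2 private data might look like:

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <string>

typedef std::string   KeyType;    // the unique word
typedef unsigned long DataType;   // frequency count for that word

// The data members described above (assumed names, illustrative only).
struct WordBench2Members
{
  std::map<KeyType, DataType> frequency_;  // table: word -> count
  std::list<std::string>      infiles_;    // names of files read so far
  std::size_t                 count_ = 0;  // total words read, kept separately
};
```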
We still have our cleanup function.
We really haven't gone into that, and
there's a lot of discussion of that
that will be on the discussion boards.
But cleanup is essentially what
defines what we mean by a word.
So you read a raw string out of
the file, you pass it to cleanup,
and what you get back is, by definition,
a word, and that will go into the set
or the associative array
as the case may be.
And so cleanup is essentially our
functional definition of what we mean
by a word that can be
extracted from a string object.
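As a purely illustrative sketch -- the real rules are the subject of the assignment and the discussion boards -- a cleanup function along these lines trims non-alphanumeric characters from both ends of the raw token and lowercases it, which automatically keeps interior apostrophes, dots, and commas:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Illustrative only: trim non-alphanumerics from both ends, then lowercase.
// Interior characters survive, so tokens like don't, file.txt, and 1,000
// stay intact, while a token like "quoted." loses its surrounding marks.
std::string Cleanup (std::string s)
{
  std::size_t b = 0, e = s.length();
  while (b < e && !std::isalnum(static_cast<unsigned char>(s[b])))   ++b;
  while (e > b && !std::isalnum(static_cast<unsigned char>(s[e-1]))) --e;
  s = s.substr(b, e - b);
  for (char& c : s)
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  return s;
}
```

A token consisting entirely of punctuation comes back as the empty string, which is why the read loop has to check for a zero-length result before counting.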
So the two different ways
of implementing WordBench
are subtly different,
but there really is not much
difference in what you have
to do to make these things work.
The key to using the
associative array looks like this --
and this is what you do
for your Homework 4.
You're reading along, you read a string
object out of a file, and you clean
that string object up.
If the string object length is now
nonzero -- you may have cleaned it
up to nothing; it might have been
nothing but garbage characters --
then you use the associative
array bracket operator.
Remember, frequency_ is
our associative array:
we apply the bracket operator to the
string key, and of course that returns
the data associated with that string
key, so we increment that frequency
and also increment the number
of words -- the raw word count.
The key to understanding this is
that use of the bracket operator,
and we're going to go into
that in the talk I give
on associative arrays
which will be coming up.
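That update step can be sketched like this, with std::map's operator[] standing in for the fsu associative array's bracket operator (the lookup-or-insert semantics are the same; CountWord is an assumed name for illustration):

```cpp
#include <cassert>
#include <map>
#include <string>

// One cleaned token arrives; count it.  frequency[word] inserts the key
// with a zero count if it is absent and returns a reference to the data,
// so ++frequency[word] is the whole frequency update.
void CountWord (std::map<std::string, unsigned long>& frequency,
                unsigned long& count, const std::string& word)
{
  if (word.length() != 0)   // cleanup may have reduced the token to nothing
  {
    ++frequency[word];      // bracket operator: lookup-or-insert, increment
    ++count;                // raw word count, maintained separately
  }
}
```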