COP4342 - Fall 2016

Assignment #7: Email message filter

Simple email messages have a straightforward format: a set of headers, then a blank line, and then a body.

Headers generally follow the form /^[-a-zA-Z0-9_]+: .*$/, followed by zero or more continuation lines that start with whitespace (usually a tab), as in /^\t.*/.

For instance, here's an email message:

$ cat /tmp/test/testfile1
Received: from mail.cs.fsu.edu (mail.cs.fsu.edu [128.186.120.4])
	by newmail.cs.fsu.edu (Postfix) with ESMTP id 06476175D4C
Received: by mail.cs.fsu.edu (Postfix)
	id 95D01F2DC4; Sat,  7 Jun 2008 03:54:40 -0400 (EDT)
Delivered-To: langley
Message-ID: <484A3E21.4090704@fsu.edu>
Date: Sat, 07 Jun 2008 03:52:01 -0400
From: Tom Kitterman
To: nolenet,
	OTC Help Desk Staff
Subject: [Nolenet] Mailman listserv website down
X-fsucs-MailScanner-SpamCheck: not spam, SpamAssassin (cached, score=-2.599,
	required 5, autolearn=not spam, BAYES_00 -2.60)
X-Spam-Status: No

Hi,
There's something wrong with the mailman listserv website on lists.fsu.edu.
This happened  when we moved it to the new hardware.  It's almost
4AM and I've run out of ideas on how to fix it at the moment so I'm 
going home
to get some sleep and try again tomorrow.

So for now that website is non-functional.  The mailman list software is 
processing
messages so this should mostly affect list owners.  Until we get it 
fixed list owners should
open a ticket through the help desk in the normal manner for any 
critical issues.

Sorry for the inconvenience.

Tom K.
_______________________________________________
https://lists.fsu.edu/mailman/listinfo/nolenet

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.
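
As a rough illustration of the header format described above, the following minimal Perl sketch classifies individual lines. The helper name classify_line and its return labels are hypothetical and not part of the assignment. It assumes each line has already had its newline removed, and it uses /^\s+\S/ to accept any leading whitespace on a continuation line rather than only a tab.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the assignment): classify one
# newline-stripped line as it would appear in the header section.
sub classify_line {
    my ($line) = @_;
    return 'blank'        if $line =~ /^$/;                    # separates headers from body
    return 'header'       if $line =~ /^[-a-zA-Z0-9_]+: .*$/;  # e.g. "To: nolenet,"
    return 'continuation' if $line =~ /^\s+\S/;                # leading whitespace, usually a tab
    return 'other';
}

# A few lines taken from the example message above.
my @sample = (
    "To: nolenet,",
    "\tOTC Help Desk Staff",
    "Subject: [Nolenet] Mailman listserv website down",
    "",
    "Hi,",
);
print classify_line($_), "\n" for @sample;
# prints: header, continuation, header, blank, other (one per line)
```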

On 45.56.74.139 you can find a test source directory, /usr/local/filter-source, which contains the following test email files:

/usr/local/filter-source$ ls -li
total 1072
123537 -rw-r--r-- 1 root root  34240 Nov 22 15:46 file1
124500 -rw-r--r-- 1 root root  20085 Nov 22 15:45 file2
123099 -rw-r--r-- 1 root root  72006 Nov 22 15:45 file3
124499 -rw-r--r-- 1 root root 821938 Nov 22 15:45 file4
123543 -rw-r--r-- 1 root root  30543 Nov 22 15:45 file5
124533 -rw-r--r-- 1 root root  20260 Nov 22 16:57 file6
123541 -rw-r--r-- 1 root root  63112 Nov 22 15:45 file7
124536 -rw-r--r-- 1 root root  20605 Nov 22 16:57 file8

Your task is to write a program which accepts two options specified by -s and -d. The first option should let you specify a source directory like -s SOURCEDIRECTORY, and the second option should let you specify a destination directory like -d DESTINATIONDIRECTORY. Your program will then open the source directory specified by -s and examine all of the files at the first level (you don't have to recurse into any subdirectories that you find) to see if each file has a subject header that indicates spam.
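
One possible way to handle the options and the directory scan is sketched below, using the core Getopt::Std module. The helper names parse_options and top_level_files are hypothetical, and the sketch passes a fixed argument list for demonstration; a real program would pass @ARGV instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: parse -s and -d from an argument list.
# Returns (source, destination), or an empty list if either is missing.
sub parse_options {
    local @ARGV = @_;               # getopts() works on @ARGV
    my %opts;
    getopts('s:d:', \%opts) or return;
    return unless defined $opts{s} && defined $opts{d};
    return ($opts{s}, $opts{d});
}

# Hypothetical helper: names of the plain files at the first level of $dir.
# Subdirectories are skipped, so nothing is recursed into.
sub top_level_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!\n";
    my @files = sort grep { -f "$dir/$_" } readdir($dh);
    closedir($dh);
    return @files;
}

# Demonstration with a fixed argument list (assumption: use @ARGV in practice).
my ($src, $dst) = parse_options('-s', '/tmp/test', '-d', 'filter-results/')
    or die "Usage: $0 -s SOURCEDIRECTORY -d DESTINATIONDIRECTORY\n";
print "source=$src destination=$dst\n";
if (-d $src) {
    print "$_\n" for top_level_files($src);
}
```

The -f test is what keeps subdirectories (and the . and .. entries) out of the list.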

This is done by looking for the character strings [SPAM] or {SPAM} (capitalization matters) in the Subject: header. Remember, headers are only found before the first blank line; file8, for instance, has a line that matches at line 266, but it is outside the headers area.

If the Subject: header indicates spam, or if there is no Subject: header, then no further processing happens. This should happen for both test files file1 and file6, each of which has a subject line labeled as spam, but not for file8, which has a matching line, but the matching line occurs after the headers.
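
The rule above might be sketched as follows. The helper name subject_status and its three return labels are hypothetical. As a simplification, the sketch looks only at the first line of the Subject: header and ignores any continuation lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: inspect only the header section of a message,
# given as a list of newline-stripped lines.
# Returns 'spam', 'ok', or 'no-subject'.
sub subject_status {
    my (@lines) = @_;
    for my $line (@lines) {
        last if $line =~ /^$/;              # first blank line ends the headers
        next unless $line =~ /^Subject:/;
        # Case-sensitive match, since capitalization matters.
        return ($line =~ /\[SPAM\]|\{SPAM\}/) ? 'spam' : 'ok';
    }
    return 'no-subject';
}

# A matching line in the body must not count, as with file8.
my @message = (
    "From: Tom Kitterman",
    "Subject: [Nolenet] Mailman listserv website down",
    "",
    "Subject: [SPAM] this line is in the body",
);
print subject_status(@message), "\n";   # prints "ok"
```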

If the file is not spam and does contain a Subject: header, your program should create a file in the destination directory that has the same filename as the original file; the contents of the file should be only the body of the message, with none of the header lines at all.
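
A minimal sketch of this step, assuming the whole message has been read into a single string, is shown below. The helper names message_body and write_body are hypothetical. Splitting on the first blank line is one of several reasonable approaches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return everything after the first blank line.
sub message_body {
    my ($text) = @_;
    # The limit of 2 keeps later blank lines inside the body intact.
    my ($headers, $body) = split /\n\n/, $text, 2;
    return defined $body ? $body : '';
}

# Hypothetical helper: write $body to a file called $name in $dest_dir.
sub write_body {
    my ($dest_dir, $name, $body) = @_;
    my $path = "$dest_dir/$name";
    open(my $out, '>', $path) or die "Cannot write $path: $!\n";
    print {$out} $body;
    close($out) or die "Cannot close $path: $!\n";
    return $path;
}

my $message = "Subject: hello\nX-Spam-Status: No\n\nHi,\n\nTom K.\n";
print message_body($message);   # prints "Hi,", a blank line, then "Tom K."
```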

Thus, if you processed the above example /tmp/test/testfile1, the session would look like this:

$ cat /tmp/test/testfile1
Received: from mail.cs.fsu.edu (mail.cs.fsu.edu [128.186.120.4])
	by newmail.cs.fsu.edu (Postfix) with ESMTP id 06476175D4C
Received: by mail.cs.fsu.edu (Postfix)
	id 95D01F2DC4; Sat,  7 Jun 2008 03:54:40 -0400 (EDT)
Delivered-To: langley
Message-ID: <484A3E21.4090704@fsu.edu>
Date: Sat, 07 Jun 2008 03:52:01 -0400
From: Tom Kitterman
To: nolenet,
	OTC Help Desk Staff
Subject: [Nolenet] Mailman listserv website down
X-fsucs-MailScanner-SpamCheck: not spam, SpamAssassin (cached, score=-2.599,
	required 5, autolearn=not spam, BAYES_00 -2.60)
X-Spam-Status: No

Hi,
There's something wrong with the mailman listserv website on lists.fsu.edu.
This happened  when we moved it to the new hardware.  It's almost
4AM and I've run out of ideas on how to fix it at the moment so I'm 
going home
to get some sleep and try again tomorrow.

So for now that website is non-functional.  The mailman list software is 
processing
messages so this should mostly affect list owners.  Until we get it 
fixed list owners should
open a ticket through the help desk in the normal manner for any 
critical issues.

Sorry for the inconvenience.

Tom K.
_______________________________________________
https://lists.fsu.edu/mailman/listinfo/nolenet

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.
$ bin/filter.pl -s /tmp/test -d filter-results/
$ cat filter-results/testfile1 
Hi,
There's something wrong with the mailman listserv website on lists.fsu.edu.
This happened  when we moved it to the new hardware.  It's almost
4AM and I've run out of ideas on how to fix it at the moment so I'm
going home
to get some sleep and try again tomorrow.

So for now that website is non-functional.  The mailman list software is
processing
messages so this should mostly affect list owners.  Until we get it
fixed list owners should
open a ticket through the help desk in the normal manner for any
critical issues.

Sorry for the inconvenience.

Tom K.
_______________________________________________
https://lists.fsu.edu/mailman/listinfo/nolenet

--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

When you run your program over the directory /usr/local/filter-source, the resulting destination directory should look like:

$ bin/filter.pl -s /usr/local/filter-source/ -d filter-results/
$ ls -li filter-results/
total 920
124528 -rw-r--r-- 1 COP4342_test COP4342_test   3178 Nov 22 17:47 file2
124505 -rw-r--r-- 1 COP4342_test COP4342_test  54705 Nov 22 17:47 file3
124529 -rw-r--r-- 1 COP4342_test COP4342_test 804616 Nov 22 17:47 file4
123542 -rw-r--r-- 1 COP4342_test COP4342_test  12916 Nov 22 17:47 file5
124527 -rw-r--r-- 1 COP4342_test COP4342_test  47131 Nov 22 17:47 file7
124502 -rw-r--r-- 1 COP4342_test COP4342_test   4879 Nov 22 17:47 file8

(Note that file1 and file6 are not there because they were discovered to be spam messages; note that the files that are in the results subdirectory are shorter than the originals since they no longer have any headers.)

Your Perl program must be saved on your account on 45.56.74.139 in ~/bin/filter.pl so that I can test it. Please also submit the program on Blackboard by 11:59pm on Wednesday, November 30.