Web services and email

Two of the most popular services visibly provided by servers are email and web-type services.

Full email setups generally consist of an MTA such as sendmail or postfix, a delivery agent such as procmail or dropmail, a pop/imap server, and perhaps a webmail interface such as openwebmail, Outlook Web Access (OWA), horde, or squirrelmail.

They may also include various anti-spam and anti-virus programs, such as MailScanner, spamassassin, avis, clamav, dcc, razor, and many others, as well as other types of mail filters such as the popular milter library programs (e.g., milter-ahead).

Web services generally center around an Apache web server, some CGI-friendly regime such as Perl (anywhere from embedded Perl to mod_perl with any of the numerous CGI packages), Python, PHP, Ruby, JSP, or ASP, and a database such as MySQL, PostgreSQL, Oracle, or SQLite. They may also include other bits such as SOAP or RSS services.

Email: sendmail

Sendmail functions as an MTA (and also an RFC 2476 MSA). It is generally configured to listen on port 25 (and 587 for MSA functions), and the configuration files are now generally stored in /etc/mail.

The primary configuration file for administrators is typically /etc/mail/sendmail.mc. It contains m4 directives that control the creation of /etc/mail/sendmail.cf.

Sendmail is quite powerful. A common application for sendmail is to serve as a gateway mail server.

(You can also do this type of thing with Exchange; see Microsoft's website for a document called ``Using a Windows SMTP Relay Server in a Perimeter Network'' which gives an overview, and for details, look at ``How to Configure a Windows Server 2003 Server as a Relay Server or Smart Host''.)

One quite clever idea came from MailScanner's author, Julian Field at the University of Southampton. Email going into sendmail is put into a queue, and instead of the usual process of another sendmail process acting as a queue handler to deliver it, MailScanner first processes the mail (looking for spam and viruses, and comparing it against blacklists and whitelists), and then enqueues the message into a different queue directory for the second sendmail queue handler to find. (You can often view mail queues with the alias ``mailq'', which is actually ``sendmail -bp''; the postfix equivalent is ``postqueue -p''.)

As we saw from the .mc files, sendmail doesn't actually do local delivery of email. Ordinary delivery is typically done by procmail (other candidates include the old binmail program or dropmail).

procmail is a very powerful mail delivery agent; it can be configured to do many, many things. See http://www.procmail.org for ``recipes''. For instance, a typical procmail recipe might look like:


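# Discard mail from this sender; /dev/null needs no lock.
:0
* ^From:.*unpleasant@user\.com
/dev/null

# Deliver everything else to the default mailbox; the
# trailing colon asks procmail to lock the mailbox first.
:0:
${DEFAULT}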

Heads up: procmail is very picky about such items as colons. A single missing colon can be very bad, since it might be the one that indicates that a mailbox is to be locked before it receives a delivery -- and failing to lock a shared mailbox file might prove unpleasant.

Finally, you have to decide one (or perhaps two) more things about delivery: do you want email to go into a traditional mbox, which is just one long file of email separated by the delimiter ^From .*\n, or do you want to use the more modern maildir approach, where each email is written to a separate file? I think that the latter is preferable. If you do choose to go with mbox format, you will also have to make sure that procmail, imap/pop, and any other client software such as openwebmail all agree on a common locking mechanism.
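
To make the mbox delimiter concrete, here is a minimal Python sketch that splits mbox-style text into messages at lines beginning with ``From ''. The function name and the sample data are made up for illustration, and the sketch ignores details that real mbox readers must handle, such as body lines that themselves begin with ``From ''.

```python
def split_mbox(text):
    """Split mbox-style text into messages at lines that start with 'From '."""
    messages = []
    current = None
    for line in text.splitlines(keepends=True):
        if line.startswith("From "):
            # A delimiter line closes the previous message and opens a new one.
            if current is not None:
                messages.append("".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append("".join(current))
    return messages


# Hypothetical two-message mailbox, for illustration only.
sample = (
    "From alice@example.org Thu Feb 23 16:26:31 2006\n"
    "Subject: first\n"
    "\n"
    "hello\n"
    "\n"
    "From bob@example.org Thu Feb 23 16:30:00 2006\n"
    "Subject: second\n"
    "\n"
    "world\n"
)
print(len(split_mbox(sample)))
```

Because every message lives in the same file, anything that appends to or rewrites that file has to coordinate with every other reader and writer, which is where the locking requirement comes from.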

Main SMTP commands

HELO / EHLO
MAIL FROM:
RCPT TO:
DATA
QUIT
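
As a rough illustration of the order in which these commands appear, the following Python sketch builds the lines a client would send for one message. The function name and addresses are hypothetical, server replies are ignored entirely, and the only extra detail included is that a message line starting with a dot gets the dot doubled, since a lone dot ends the DATA section.

```python
def smtp_client_lines(helo_name, sender, recipients, message_lines):
    """Return, in order, the lines a client would send for one message."""
    lines = ["EHLO " + helo_name, "MAIL FROM:<%s>" % sender]
    for rcpt in recipients:
        # One RCPT TO per recipient.
        lines.append("RCPT TO:<%s>" % rcpt)
    lines.append("DATA")
    for line in message_lines:
        # Double a leading dot so it is not mistaken for the terminator.
        lines.append("." + line if line.startswith(".") else line)
    lines.append(".")  # a lone dot ends the message
    lines.append("QUIT")
    return lines


for line in smtp_client_lines(
    "client.example.org",
    "alice@example.org",
    ["bob@example.org"],
    ["Subject: test", "", "Hello Bob"],
):
    print(line)
```

In real code you would normally let a library such as Python's smtplib carry out the dialogue and check each server reply; the sketch only shows the shape of the conversation.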

Maildirs, from Dr. Bernstein (see http://cr.yp.to/proto/maildir.html)

Maildirs are safer in many ways than the traditional mbox format. On USAH p. 549, the problems with traditional mailbox locking are discussed, as they are on the maildir webpage.

Maildirs keep every email message in a separate file, and never use any type of locking mechanism.

Traditional mailbox (mbox) format is not safe over NFS.

Every maildir setup will have the subdirectories tmp, new, and cur, and may have others. Mail is first delivered to tmp, then safely moved to new.

     HOW A MESSAGE IS DELIVERED

          The tmp directory is used to ensure reliable delivery, as
          discussed here.

          A program delivers a mail message in six steps.  First, it
          chdir()s to the maildir directory.  Second, it stat()s the
          name tmp/time.pid.host, where time is the number of seconds
          since the beginning of 1970 GMT, pid is the program's
          process ID, and host is the host name.  Third, if stat()
          returned anything other than ENOENT, the program sleeps for
          two seconds, updates time, and tries the stat() again, a
          limited number of times.  Fourth, the program creates
          tmp/time.pid.host.  Fifth, the program NFS-writes the
          message to the file.  Sixth, the program link()s the file to
          new/time.pid.host.  At that instant the message has been
          successfully delivered.

           [ ... ]

          NFS-writing means (1) as usual, checking the number of bytes
          returned from each write() call; (2) calling fsync() and
          checking its return value; (3) calling close() and checking
          its return value.  (Standard NFS implementations handle
          fsync() incorrectly but make up for it by abusing close().)
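
The delivery steps quoted above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function name and retry count are made up, absolute paths are used instead of chdir(), and the final unlink() of the tmp file is cleanup that is not part of the six quoted steps.

```python
import os
import socket
import time


def deliver(maildir, message, retries=5):
    """Write message (bytes) into maildir/new via maildir/tmp; return the new path."""
    for _ in range(retries):
        name = "%d.%d.%s" % (int(time.time()), os.getpid(), socket.gethostname())
        tmp_path = os.path.join(maildir, "tmp", name)
        try:
            os.stat(tmp_path)
        except FileNotFoundError:
            break  # ENOENT: the name is unused, so go ahead
        time.sleep(2)  # name is taken; wait and retry with an updated time
    else:
        raise RuntimeError("no unused name found in tmp/")

    # Create the file; O_EXCL makes creation fail if the name appeared meanwhile.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(message)
        while view:
            written = os.write(fd, view)  # check the byte count of every write
            view = view[written:]
        os.fsync(fd)  # raises OSError on failure
    finally:
        os.close(fd)  # raises OSError on failure

    new_path = os.path.join(maildir, "new", name)
    os.link(tmp_path, new_path)  # the message counts as delivered at this point
    os.unlink(tmp_path)  # cleanup, not one of the quoted steps
    return new_path
```

Note that nothing here takes a lock: a reader only ever looks in new and cur, so it never sees a partially written file.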

imap and pop

dovecot: an increasingly popular imap and pop server, which handles both mbox and maildir formats with aplomb. It also handles virtual users quite well, including those existing only in databases.

courier: also popular.

cyrus: uses its own mailbox format; it is more formidable to configure than other imap setups.

What is imap/pop? These are protocols that allow a user to remotely retrieve email from a mailhost. imap (RFC 3501), unlike pop (RFC 1939), supports the idea of separate folders on the server machine, and it has more functionality built in. Generally, you leave your mail messages on an imap server, whereas you download them from a pop server and then delete the server copies.

The main commands for POP are

USER username
PASS password
LIST
RETR item
DELE item
QUIT
RSET
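
A typical download-and-delete POP session built from these commands might look like the following Python sketch, which uses the standard poplib module. The function names are hypothetical, there is no error handling, and open_pop() assumes plain POP3 on the default port.

```python
import poplib


def open_pop(host, username, password):
    """Connect and authenticate (sends USER and PASS)."""
    conn = poplib.POP3(host)
    conn.user(username)
    conn.pass_(password)
    return conn


def drain_mailbox(conn):
    """LIST the mailbox, then RETR and DELE each message, then QUIT."""
    _response, listing, _octets = conn.list()
    messages = []
    for entry in listing:
        # Each LIST entry looks like b"<message number> <size in octets>".
        number = int(entry.split()[0])
        _response, lines, _octets = conn.retr(number)
        messages.append(b"\r\n".join(lines))
        conn.dele(number)
    conn.quit()  # deletions take effect when the session ends with QUIT
    return messages
```

Usage would be along the lines of drain_mailbox(open_pop("mail.example.org", "username", "password")), where the host and credentials are placeholders.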

IMAP commands are ``tagged''. This means that you need to put a short, unique identifier before you use a command; the response to that command will use the same tag. The main commands for IMAP checking are

[tag] LOGIN username password
[tag] SELECT mailbox
[tag] LIST "" *
[tag] LOGOUT
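
Tagging can be illustrated with a few lines of Python. This is a sketch with made-up helper names and a made-up tag format (a letter followed by a counter); any short string that is unique within the session would do.

```python
import itertools


def make_tagger(prefix="a"):
    """Return a function that prepends a fresh tag to each command."""
    counter = itertools.count(1)

    def tag(command):
        return "%s%03d %s" % (prefix, next(counter), command)

    return tag


def is_completion(line, tag):
    """True if a server line is the tagged completion response for tag."""
    return line.split(" ", 1)[0] == tag


tag = make_tagger()
print(tag("LOGIN username password"))
print(tag("SELECT INBOX"))
print(tag("LOGOUT"))
```

Server lines that begin with ``*'' are untagged data; the client keeps reading until it sees a line that starts with the tag it sent. Python's imaplib module takes care of the tagging for you.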

Email Clients

There are two types of clients: (1) those that read email via a protocol such as IMAP, POP, or the ``Microsoft'' way, and (2) those that access mail via a filesystem.

Web clients: The very popular squirrelmail (http://www.squirrelmail.org) is an example of type (1) that uses IMAP. openwebmail (http://www.openwebmail.org) is an example of (2). It reads directly from either MBOX or Maildir format.

Dedicated interface clients: most of these now handle both file stores and IMAP/POP. Examples include Outlook, Thunderbird, Evolution, Sylpheed, Eudora, Pegasus, and a host of others.

Working on the latter setups can be interesting, since the client may also be silently fetching its email from entirely different machines.

I have worked on a setup where just determining where the client's email was coming from required tcpdump and lots of patience. In that case, a single user was having a problem accessing his mailbox. It turned out that the client interface (a very old version of a web email client) could not handle bad headers in email messages, could not handle very large messages, and was configured to terminate any handler that took longer than 30 seconds, so it could never handle a mailbox that had a large number of messages to move. It used POP instead of IMAP, so it first issued RETR for the messages and then DELE once it had pulled them into a maildir-like format.

Web services and email: web services

An important web service is simple delivery of html over http (hypertext transfer protocol).

The current version of http in use is 1.1, defined in RFC 2616. (There was an early stab at an http 1.2, but it didn't jell.)

Netcraft survey

The most popular webserver is Apache, which has an overall 46% market share according to Netcraft's current webserver survey and powers 67% of the most active sites.

http://news.netcraft.com/archives/web_server_survey.html

Apache has two versions, 1.3 and the 2.x versions, but 1.3 is now considered a ``legacy'' system and Apache now recommends:

``Apache 1.3.41 is the current stable release of the Apache 1.3 family. We strongly recommend that users of all earlier versions, including 1.3 family release, upgrade to to the current 2.2 version as soon as possible.''

http://www.apache.org/dist/httpd/Announcement1.3.html

Security Space survey

Security Space runs another webserver survey that uses a somewhat different methodology than Netcraft's:

http://www.securityspace.com/s_survey/data/200902/index.html

There you can view many different server statistics.

Conversations over http

What does a typical conversation look like? Here's one request and answer for a page:

Hypertext Transfer Protocol
    GET /rfcs/rfc2612.html HTTP/1.1
        Request Method: GET
        Request URI: /rfcs/rfc2612.html
        Request Version: HTTP/1.1
    Host: www.faqs.org
    User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) 
                Gecko/20060202 Red Hat/1.7.12-1.1.3.4.centos3
    Accept: text/xml,application/xml,application/xhtml+xml,text/html;
            q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    Accept-Language: en-us,en;q=0.5
    Accept-Encoding: gzip,deflate
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Keep-Alive: 300
    Connection: keep-alive


Hypertext Transfer Protocol
    HTTP/1.1 200 OK
        Request Version: HTTP/1.1
        Response Code: 200
    Date: Thu, 23 Feb 2006 16:26:31 GMT
    Server: Apache
    Last-Modified: Thu, 23 Feb 2006 07:01:53 GMT
    ETag: "5f8977-910a-43fd5de1"
    Accept-Ranges: bytes
    Content-Length: 37130
    Keep-Alive: timeout=5, max=100
    Connection: Keep-Alive
    Content-Type: text/html

Line-based text data: text/html
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
    <HTML>
    <HEAD>
    <TITLE>RFC 2612 (rfc2612) - The CAST-256 Encryption Algorithm</TITLE>
    <META  name="description" content="RFC 2612 - The CAST-256 Encryption
                 Algorithm">
    <script language="JavaScript1.2">
    function erfc(s)
    {document.write("<A href="/rfccomment.php?rfcnum="+s+""
     target="_blank" onclick="window.open('/rfccomment.php?rfcnum="+s+"',
     'Popup','toolbar=no,location=no,status=no,menubar=no,scrollbars=yes,
     resizable=yes,width=680,height=530,left=30
    //-->
    </script>
    </HEAD>
    <BODY BGCOLOR="#ffffff" TEXT="#000000">
    <P ALIGN=CENTER><IMG SRC="/images/library.jpg" HEIGHT=62 WIDTH=150 BORDER="0"
     ALIGN="MIDDLE" ALT=""></P>
    <H1 ALIGN=CENTER>RFC 2612 (RFC2612)</H1>
    <P ALIGN=CENTER>Internet RFC/STD/FYI/BCP Archives</P>

    <DIV ALIGN=CENTER>[ <a href="/rfcs/">RFC Index</a> | <A HREF="/rfcs/rfcsearch.html">
    RFC Search</A> | <a href="/faqs/">Usenet FAQs</a> | <a href="/contrib/">Web FAQs</a>
    | <a href="/docs/">Documents</a> | <a href="http://www.city-data.com/"
    <P>
    <STRONG>Alternate Formats:</STRONG>
     <A HREF="/ftp/rfc/rfc2612.txt">rfc2612.txt</A> |
     <A HREF="/ftp/rfc/pdf/rfc2612.txt.pdf">rfc2612.txt.pdf</A></DIV>
    <p align=center><script language="JavaScript"><!--
    erfc("2612");
    // --></script></p>
    <h3 align=center>RFC 2612 - The CAST-256 Encryption Algorithm</h3>
    <HR SIZE=2 NOSHADE>
    <TT>
    
    Network Working Group                                        C. Adams
    Request for Comments: 2612                               J. Gilchrist
    Category: Informational                          Entrust Technologies
                                                                June 1999
    
                       The CAST-256 Encryption Algorithm
    
    Status of this Memo
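
On the wire, the response head is just text lines ending in CRLF, with an empty line separating the headers from the body. A minimal Python sketch of picking it apart follows; the function name is made up, and the sketch ignores repeated headers and other corner cases that a real client has to handle.

```python
def parse_response_head(raw):
    """Return (version, status code, reason, headers) from raw response bytes."""
    # Everything before the first empty line is the head; the rest is the body.
    head, _separator, _body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, since values such as dates contain colons.
        name, _colon, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers


raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Date: Thu, 23 Feb 2006 16:26:31 GMT\r\n"
    b"Server: Apache\r\n"
    b"Content-Length: 37130\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<HTML>"
)
version, code, reason, headers = parse_response_head(raw)
print(code, reason, headers["content-type"])
```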

Another HTTP Request

Here's a request and ``not modified'' answer for a page:

Hypertext Transfer Protocol
    GET /rfcs/rfc2616.html HTTP/1.1
        Request Method: GET
        Request URI: /rfcs/rfc2616.html
        Request Version: HTTP/1.1
    Host: www.faqs.org
    User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) 
                Gecko/20060202 Red Hat/1.7.12-1.1.3.4.centos3
    Accept: text/xml,application/xml,application/xhtml+xml,text/html;
            q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    Accept-Language: en-us,en;q=0.5
    Accept-Encoding: gzip,deflate
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Keep-Alive: 300
    Connection: keep-alive
    Referer: http://www.google.com/search?num=100&hl=en&lr=&q=http+protocol+rfc&btnG=Search
    If-Modified-Since: Thu, 23 Feb 2006 07:01:53 GMT
    If-None-Match: "5f897b-63239-43fd5de1"
    Cache-Control: max-age=0


Hypertext Transfer Protocol
    HTTP/1.1 304 Not Modified
        Request Version: HTTP/1.1
        Response Code: 304
    Date: Thu, 23 Feb 2006 16:11:36 GMT
    Server: Apache
    Connection: Keep-Alive
    Keep-Alive: timeout=5, max=100
    ETag: "5f897b-63239-43fd5de1"
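
The server-side decision behind that 304 can be sketched as follows. This is a simplification with a made-up function name: it compares only the If-None-Match header against the current ETag and ignores If-Modified-Since, weak validators, and the ``*'' wildcard.

```python
def conditional_status(request_headers, current_etag):
    """Return 304 if the client's cached ETag is still current, else 200."""
    offered = request_headers.get("If-None-Match", "")
    # The header may carry several comma-separated entity tags.
    candidates = [candidate.strip() for candidate in offered.split(",")]
    if current_etag in candidates:
        return 304  # the client's copy is current; send no body
    return 200  # send the full page


etag = '"5f897b-63239-43fd5de1"'
print(conditional_status({"If-None-Match": etag}, etag))
```

The saving is the body: the 304 answer above consists of a handful of headers, while a full answer would carry the entire page again.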

Encoding and chunking

While in theory encoding allows for arbitrary encoding of the body, http-level encoding is used in practice to let a server optimize its use of bandwidth: when the client has advertised support (as in the ``Accept-Encoding: gzip,deflate'' request header shown earlier), the server may choose to compress the body and label it with a Content-Encoding header.
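
A server's choice about compression might be sketched like this in Python. The function name is hypothetical, and the sketch ignores quality values such as ``gzip;q=0'' that a real server would need to honor.

```python
import gzip


def encode_body(body, accept_encoding):
    """Return (headers, payload), compressing only if the client offered gzip."""
    offered = [
        token.split(";")[0].strip().lower()
        for token in accept_encoding.split(",")
    ]
    if "gzip" in offered:
        payload = gzip.compress(body)
        headers = {
            "Content-Encoding": "gzip",
            "Content-Length": str(len(payload)),
        }
        return headers, payload
    # No common encoding: send the body unchanged.
    return {"Content-Length": str(len(body))}, body


headers, payload = encode_body(b"hello " * 500, "gzip,deflate")
print(headers["Content-Encoding"], len(payload))
```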

Chunking is almost the reverse: it adds extra framing information to the message body (each chunk is preceded by its size) so that the server can start sending before it knows the total length, and the client can make decisions about buffering and early rendering of data. If chunking occurs, it is usually for dynamically generated data.
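
The chunked framing itself is simple: each chunk is preceded by its size in hexadecimal, and a zero-size chunk ends the body. The following Python sketch (function names are made up, and trailers are ignored) encodes and decodes such a body.

```python
def encode_chunked(pieces):
    """Frame each non-empty piece as '<hex size>CRLF<data>CRLF', then end with size 0."""
    out = []
    for piece in pieces:
        if piece:  # a zero-length chunk would signal the end of the body
            out.append(b"%x\r\n" % len(piece) + piece + b"\r\n")
    out.append(b"0\r\n\r\n")
    return b"".join(out)


def decode_chunked(data):
    """Reassemble the body from chunked data."""
    body = b""
    position = 0
    while True:
        end_of_line = data.index(b"\r\n", position)
        # The size line may carry ';extension' text after the hex number.
        size = int(data[position:end_of_line].split(b";")[0], 16)
        if size == 0:
            return body
        start = end_of_line + 2
        body += data[start:start + size]
        position = start + size + 2  # skip the CRLF that follows the chunk data


wire = encode_chunked([b"Hello, ", b"world"])
print(wire)
print(decode_chunked(wire))
```

The sender never needs to know the total length in advance, which is why this suits dynamically generated pages.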

Configuring Apache

Where you put your configuration data varies widely; while /etc/httpd is certainly common, you also might see /etc/apache2 and other places. Also widely varying is where you might find your actual html files, the ``DocumentRoot''. On Red Hat machines, /var/www/html has been the default directory.

On openSUSE, you would see /srv/www/htdocs.

The most important configuration file is httpd.conf.