Fixing a Perl "CGI-like" Script

Observation of the problem.

Recently, around 1:00am or so we started to see sporadic error messages from our "Graduate Student Tracker" (GST) scripts.

Since these scripts only run once an hour, and then merely update some cached web pages that would be otherwise quite expensive to regenerate on the fly (going back to an on-demand user model, the user response time would go from a few milliseconds to many seconds), I had guessed that this was a transient problem related to the database being refreshed during the early morning hours, and that the issue would resolve itself as the fury of early semester updates subsided.

However, this morning I saw that the errors were now happening every hour as of this morning's refreshing of the database. I decided that it would be worthwhile to investigate this although the user impact was still slight, with only the cached webpages now being a few hours stale.

Tell me again why we have this script?

The "Graduate Student Tracker" information is very useful to our faculty since it tells where our numerous graduate students have offices, the students' major professors, and so forth. This information is kept centrally, and our Linux/Apache webserver can access the data via a Microsoft system that provides it as a SOAP service. We have a number of these pages that we keep up.

The most common way to write a CGI script to display such information is to have it

Receive a request for a page.
Go out and query the database for all of the data.
Form a web page based on that data.

However step 2, querying the database, is painfully slow in this case, with the query scripts typically running 10 seconds or more. Since there are no arguments to step 1, the requests, and the information only changes slowly (particularly after the first week of classes) it makes a lot of sense to just generate the webpage occasionally and cache it.

Since the format and the flow of data retrievals are quite different for each webpage, each of the webpages has a separate generator script.

The scripts are run in an hourly batch process, and only after all of the scripts have successfully formulated a new page are these pages moved into place. If any one of the scripts fails, then the script exits and no webpages are updated. This is the safest strategy, and usually the right one since most failures are failures to connect to the database during its refresh period and if one script fails to connect, then all are very likely to do so.

Solving the problem

The first step was to observe the problem directly. I wanted to see the actual invocation, so I looked in the crontab for the scripts:

$ crontab -l                                                                                                              
# run the GST scripts every hour at 7 minutes after
7 * * * * ~/bin/wrapper.sh

Seeing that the batch script is in bash, we can run it by hand in "verbose" mode:

$ bash -x ~/bin/wrapper.sh
+ cd ~/bin
+ umask 022
+ dest=$DESTINATION_ROOT/gradsys/cached/
+ ~/bin/GST_advisor.pl
+ '[' 0 -ne 0 ']'
+ ~/bin/GST_phdcommittee.pl
+ '[' 0 -ne 0 ']'
+ ~/bin/GST_room.pl
+ '[' 2 -ne 0 ']'
+ echo 'error in GST_room.pl'
error in GST_room.pl
+ exit 1

So the script that is failing is GST_room.pl. In order to observe the failure, we give it a run:

$ ./GST_room.pl                                                                                   
Entity: line 1: parser error : XML declaration allowed only at the start of the document
(much gibberish)

It appears desirable to run this in a verbose mode since the observed XML parsing error isn't very informative — we are not even getting a line number. Looking at the code,

#use SOAP::Lite ( +trace => 'all', readable => 1, outputxml => 1, );
use CGI::Carp qw/fatalsToBrowser/;
use CGI;
use CGI::Session;
use HTML::Template;
use Data::Dumper;
use SOAP::Lite;
use SOAP::XML::Client::DotNet;
use XML::SAX::ParserFactory;
use xmlleaves;

it appears that we can get a nice view of the conversation between the script and the Microsoft server just by re-enabling one of the commented lines (and suppressing the silent version):

use SOAP::Lite ( +trace => 'all', readable => 1, outputxml => 1, );
use CGI::Carp qw/fatalsToBrowser/;
use CGI;
use CGI::Session;
use HTML::Template;
use Data::Dumper;
#use SOAP::Lite;
use SOAP::XML::Client::DotNet;
use XML::SAX::ParserFactory;
use xmlleaves;

However, we don't want to modify the production script since it is going to be invoked every hour and we don't want to have a production script running and displaying verbose test data.

$ cp -i GST_room.pl GST_room-testing.pl

Editing and running the "-testing" version, we now get a flood of information. Printing out the value of the XML passed to the parser, we can see that the last valid one and the first invalid one, which is causing the stoppage, are quite different:

The last valid entry looks like this:

<wrapperwrapper>
<row>
<STUDENTOBJECTID>XXXXX</STUDENTOBJECTID>
<OBJECTSTATUS></OBJECTSTATUS>
<STATUSDESCR></STATUSDESCR>
<OBJECTDATE></OBJECTDATE>
<COMMENTS>104B LOVE</COMMENTS>
<SORTCODE>2</SORTCODE>
</row>
</wrapperwrapper>

The erroneous entry looks like this:

<wrapperwrapper>
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xml ns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetStudentObjectResponse ><GetStudentObjectResult>
<isOK>true</isOK>
<ErrorMessage/>
<Response>
<?xml version="1.0" encoding="ISO-8859-1"?>                                                            
<GST_Service>                                                                                                                                          
</GST_Service>
</Response>
</GetStudentObjectResult>
</GetStudentObjectResponse>
</soap:Body>
</soap:Envelope>
</wrapperwrapper> 
Entity: line 1: parser error : XML declaration allowed only at the start of the document

Okay, that's interesting. We have only some row data in the valid entry, but the invalid entry looks like the head of a full XML document.

Staring at the code to see what's going on, we see that what's happening is a five step process of rewriting the original XML data from Microsoft's somewhat unique style (i.e., the bizarre mixing of '<'/'>' and '<'/'>' willy-nilly) and then isolating the "row" lines.

      my $xml_string = $soap_client->results();
      $xml_string =~ s/&lt;/</g;
      $xml_string =~ s/&gt;/>/g;
      $xml_string =~ s/.*?<row>/<row>/s;
      $xml_string =~ s/<\/row>\s*<\/GST_Service>.*/<\/row>/s;
      $xml_string = "<wrapperwrapper>$xml_string</wrapperwrapper>";
      my $handler = xmlleaves->new();
      my $parser = XML::SAX::ParserFactory->parser(Handler => $handler);

Ah-ha. Look at step 3, which is intended to strip everything before the first <row> tag. But there is no <row> tag in the failing case, so step 3 has no effect at all, leaving all the header XML information in and confusing our poor parser which is expecting only row information.

So we need a to write a check to make sure that there are some rows to parse, otherwise just output a bland "NO OFFICE ASSIGNED" message:

      my $xml_string = $soap_client->results();
      $xml_string =~ s/&lt;/</g;
      $xml_string =~ s/&gt;/>/g;
      if( $xml_string =~ /<row>/ )                                                                                                             
        {
          $xml_string =~ s/.*?<row>/<row>/s;
          $xml_string =~ s/<\/row>\s*<\/GST_Service>.*/<\/row>/s;
          $xml_string = "<wrapperwrapper>$xml_string</wrapperwrapper>";
          my $handler = xmlleaves->new();
          my $parser = XML::SAX::ParserFactory->parser(Handler => $handler);
          ...
          $office = $parser->var{"office"};
        }
      else
        {
          $office = "NO OFFICE ASSIGNED";
        }

The upshot

What did we learn from this? Well, it's likely that today's string of error messages is not related to the occasional error message that we were seeing before, which is still likely from transient conditions where the database is not available due to refreshment — and we should be able to observe this over the next few days.

It's also likely that this same "corner case" is not handled in the other similar "GST" scripts, and it would behoove us to check to see if this is the case.