From: Michael Barker
To: nolenet
Date: Fri, 16 May 2008
Subject: [Nolenet] SAN issue -- root cause explanation

Hello --

Our SVC vendor has gotten back to us with a root cause analysis for Tuesday's (5/13) SAN performance issue that degraded service on many of our campus-wide server systems (including JES, Bb, www.fsu.edu, VMware hosts, Exchange, Active Directory, etc.).

Short explanation...

There is/was a race condition bug in IBM's SVC code that caused portions of the SVC's memory structures to lock such that they could not be flushed/released. The race condition occurs when two particular events happen at the same moment. Both events are normal on their own, but the likelihood of their coinciding is extremely low, almost vanishingly small. In our case, the two events did coincide -- lucky us. We expect an interim fix from IBM next week. In the meantime, we can simply avoid one of the two events, ensuring that the race condition does not recur.

Long explanation...

We use a SAN Volume Controller (SVC) to manage several disk arrays, which comprise many terabytes of storage, and to present them to the server hosts. The SVC allows us to organize the disks in the arrays into "managed disk groups" (which we arrange, per array, according to performance characteristics) and carve those groups into "virtual disks," which are then presented to the server hosts. The SVC consists of node pairs (for redundancy's sake, to provide high availability if a node fails and to allow for upgrades). Each disk array is attached to more than one node, via redundant SAN directors. So every server host has multiple paths to the SVC cluster (which itself has multiple nodes), which in turn has multiple paths to each disk array (and each disk array consists of arrays of disks with spares).

Each SVC node has memory areas used for caching and various other things. For example, they are used for caching (transfers in and out of memory are orders of magnitude faster than in and out of physical disk), and they are used when a virtual disk is migrated from one managed disk group to another. For instance, if we have a virtual disk on an old array and we want to move it to a new array with no downtime to the host, the SVC lets us do exactly that: the data are moved, using the memory structures of the SVC, from one managed disk group (presented to the SVC by one disk array) to another managed disk group (presented to the SVC by another disk array). When this is done, the virtual disk -- unchanged so far as the server host is concerned -- has been migrated from one managed disk group to another. This is a good thing, since we've moved the storage with no impact on the server host; it is something we do on a regular basis -- indeed, it is one of the reasons for the SVC even to exist and for us to use one. The memory structures of the SVC are used by multiple functions (all handled internally by IBM's SVC code) and are periodically flushed as those functions are carried out.

As part of our overall storage growth, disk I/O growth, and lifecycle replacement, we recently acquired a new SVC node pair and a new disk array. The new SVC node pair had been up and running for a handful of weeks with no aberrant behavior. The new disk array had likewise been up and running with no aberrant behavior.
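For those who find code easier to picture than prose, here is a rough, purely illustrative sketch (in Python, with made-up names -- this is not IBM's SVC code) of the layering described above: physical arrays are grouped into managed disk groups, virtual disks are carved out of those groups, and a virtual disk can be migrated from one group to another while the host keeps seeing the same disk the whole time.

    # Purely illustrative model of SVC-style storage virtualization.
    # All class and variable names are invented; this is not IBM's code.

    class ManagedDiskGroup:
        def __init__(self, name, array):
            self.name = name      # e.g. "mdg_new_fast"
            self.array = array    # which physical array backs this group
            self.vdisks = []      # virtual disks currently carved from it

    class VirtualDisk:
        def __init__(self, name, size_gb, group):
            self.name = name      # the identity the server host sees (never changes)
            self.size_gb = size_gb
            self.group = group    # which managed disk group backs it today
            group.vdisks.append(self)

    def migrate(vdisk, target_group):
        """Move a virtual disk's backing storage to another managed disk group.

        The host keeps addressing the same virtual disk throughout; only the
        backing group changes. (On the real SVC the data move is staged
        through the nodes' memory structures, which is where the bug lived.)
        """
        old_group = vdisk.group
        old_group.vdisks.remove(vdisk)
        target_group.vdisks.append(vdisk)
        vdisk.group = target_group
        print(f"{vdisk.name}: moved from {old_group.array} to {target_group.array}")

    # Example: the kind of move described below -- into an empty group on a new array.
    old_mdg = ManagedDiskGroup("mdg_old", "old_array")
    new_mdg = ManagedDiskGroup("mdg_new", "new_array")    # brand new, still empty
    backup_vdisk = VirtualDisk("backup_vdisk", 500, old_mdg)
    migrate(backup_vdisk, new_mdg)                        # host sees no change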
On May 3rd we had a downtime to bring the SVC software up to code levels that allowed us to add this new SVC node pair to the cluster and to migrate virtual disks across these nodes onto the new array. In the intervening days we created new managed disk groups on the new array and migrated virtual disks onto them. All of this went without a hitch; it is something we do on a regular basis.

On Saturday, May 10th, we migrated a virtual disk from a managed disk group onto an empty managed disk group presented from the new array. This was an initial step in moving one of our Tivoli Storage Manager backup servers off of the old disk array we are replacing with the new one. Doing this exercises the SVC's memory structures in a certain way. At the same moment, one of the SVC nodes involved tried to destage its write cache (which, of course, also involves the SVC memory structures). It was this very specific and simultaneous conjunction of events that exposed the code bug; we had done such migrations several times already without negative impact. The likelihood of the two events coinciding is very low, so the bug was not exposed in IBM's testing -- which is why they do not already have a fix in hand. Now that they have identified it, with help from our team, they are working on a fix, with an expectation that it might be forthcoming next week.

The bug locked portions of one SVC node's memory structures, and the condition eventually propagated to the other nodes in the cluster -- hence the gradual degradation of performance that manifested itself fully the morning of May 13th. The net effect was that SVC write caching eventually stopped because the memory structures were locked. When we restarted the SVC node around midnight May 14th, it cleared all the memory structures on the SVC.

The good news is that none of our current managed disk groups is empty. IBM tells us that migrating a virtual disk into an empty managed disk group is one of the two factors (the other being that the migration coincide with the subsecond window in which normal cache destaging occurs) required for the race condition to occur. So, as long as we do not create a new, empty managed disk group and migrate a virtual disk into it before the code fix arrives, we should have no recurrence. The other good news is that IBM has found the root cause and will provide us with a fix.
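To make the "two simultaneous events" idea a bit more concrete, here is a toy model (again Python, again entirely invented -- this is not how IBM's code actually works, and the collision is represented by a simple flag rather than real thread timing). It only illustrates how two individually harmless operations, colliding in a tiny window, can leave shared buffers permanently locked until a restart clears them:

    # Toy illustration only: invented names, invented mechanics.
    class NodeMemory:
        """Shared buffers used both for write caching and for migration staging."""
        def __init__(self, total=8):
            self.total = total
            self.held = 0       # buffers briefly held by a normal operation
            self.stuck = 0      # buffers the (hypothetical) bug never releases

        def free(self):
            return self.total - self.held - self.stuck

    def destage_write_cache(mem, colliding_with_migration_into_empty_group):
        """Flush dirty write-cache data to disk, briefly holding some buffers."""
        mem.held += 2
        if colliding_with_migration_into_empty_group:
            # The rare interleaving: destage fires during the subsecond window
            # in which a migration into an *empty* managed disk group is using
            # the same structures. In the buggy path the buffers never come back.
            mem.held -= 2
            mem.stuck += 2
            return "buffers stuck"
        mem.held -= 2           # normal case: buffers are released immediately
        return "ok"

    mem = NodeMemory()
    print(destage_write_cache(mem, False), "free buffers:", mem.free())  # ok, 8
    print(destage_write_cache(mem, True),  "free buffers:", mem.free())  # stuck, 6
    # As stuck buffers accumulate (and the condition spreads to other nodes),
    # write caching eventually cannot proceed at all -- roughly the slow
    # degradation we saw, cleared only by restarting the node.

_______________________________________________
Nolenet maillist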