From: Michael Barker
To: nolenet
Date: Fri, 16 May 2008
Subject: [Nolenet] SAN issue -- root cause explanation

Hello --

Our SVC vendor has gotten back to us with a root cause analysis for Tuesday's (5/13) SAN performance issue that degraded service on many of our campus-wide server systems (including JES, Bb, www.fsu.edu, VMware hosts, Exchange, Active Directory, etc.).

Short explanation...

There is/was a race condition bug in IBM's SVC code that caused portions of the SVC's memory structures to lock such that they could not be flushed/released. The race condition occurs when two particular events happen at the same moment. Both events are normal on their own, but the likelihood of their coinciding is extremely low, almost vanishingly small. In our case, the two events did coincide -- lucky us. We expect an interim fix from IBM next week. In the meantime, we can simply avoid one of the two events, ensuring that the race condition does not recur.

Long explanation...

We use a SAN Volume Controller (SVC) to manage several disk arrays, which comprise many terabytes of storage, and to present them to the server hosts. The SVC allows us to organize the disks in the arrays into "managed disk groups" (which we arrange, per array, according to performance characteristics) and carve those groups into "virtual disks," which are then presented to the server hosts. The SVC consists of node pairs (for redundancy's sake, to provide high availability if a node fails and to allow for upgrades). Each disk array is attached to more than one node, via redundant SAN directors. So every server host has multiple paths to the SVC cluster (which itself has multiple nodes), which in turn has multiple paths to each disk array (and each disk array consists of arrays of disks with spares).

Each SVC node has memory areas used for caching and various other things. For example, they are used for caching (transfers in and out of memory are orders of magnitude faster than in and out of physical disk), and they are used when a virtual disk is migrated from one managed disk group to another. For instance, if we have a virtual disk on an old array and we want to move it to a new array with no downtime to the host, the SVC lets us do exactly that: the data are moved, using the memory structures of the SVC, from one managed disk group (presented to the SVC by one disk array) to another managed disk group (presented to the SVC by another disk array). When this is done, the virtual disk -- unchanged so far as the server host is concerned -- has been migrated from one managed disk group to another. This is a good thing, since we've moved the storage with no impact on the server host; it is something we do on a regular basis -- indeed, it is one of the reasons for the SVC even to exist and for us to use one. The memory structures of the SVC are used by multiple functions (all handled internally by IBM's SVC code) and are periodically flushed as those functions are carried out.

As part of our overall storage growth, disk I/O growth, and lifecycle replacement, we recently acquired a new SVC node pair and a new disk array. The new SVC node pair had been up and running for a handful of weeks with no aberrant behavior. The new disk array had likewise been up and running with no aberrant behavior.
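For those who find code easier to picture than prose, here is a rough, purely illustrative sketch (in Python, with made-up names -- this is not IBM's SVC code) of the layering described above: physical arrays are grouped into managed disk groups, virtual disks are carved out of those groups, and a virtual disk can be migrated from one group to another while the host keeps seeing the same disk the whole time.

    # Purely illustrative model of SVC-style storage virtualization.
    # All class and variable names are invented; this is not IBM's code.

    class ManagedDiskGroup:
        def __init__(self, name, array):
            self.name = name      # e.g. "mdg_new_fast"
            self.array = array    # which physical array backs this group
            self.vdisks = []      # virtual disks currently carved from it

    class VirtualDisk:
        def __init__(self, name, size_gb, group):
            self.name = name      # the identity the server host sees (never changes)
            self.size_gb = size_gb
            self.group = group    # which managed disk group backs it today
            group.vdisks.append(self)

    def migrate(vdisk, target_group):
        """Move a virtual disk's backing storage to another managed disk group.

        The host keeps addressing the same virtual disk throughout; only the
        backing group changes. (On the real SVC the data move is staged
        through the nodes' memory structures, which is where the bug lived.)
        """
        old_group = vdisk.group
        old_group.vdisks.remove(vdisk)
        target_group.vdisks.append(vdisk)
        vdisk.group = target_group
        print(f"{vdisk.name}: moved from {old_group.array} to {target_group.array}")

    # Example: the kind of move described below -- into an empty group on a new array.
    old_mdg = ManagedDiskGroup("mdg_old", "old_array")
    new_mdg = ManagedDiskGroup("mdg_new", "new_array")    # brand new, still empty
    backup_vdisk = VirtualDisk("backup_vdisk", 500, old_mdg)
    migrate(backup_vdisk, new_mdg)                        # host sees no change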
On May 3rd we had a downtime to bring the SVC software up to code levels that allowed us to add this new SVC node pair to the cluster and to migrate virtual disks across these nodes onto the new array. In the intervening days we created new managed disk groups on the new array and migrated virtual disks onto them. All of this went without a hitch; it is something we do on a regular basis.

On Saturday, May 10th, we migrated a virtual disk from a managed disk group onto an empty managed disk group presented from the new array. This was an initial step in moving one of our Tivoli Storage Manager backup servers off of the old disk array we are replacing with the new one. Doing this exercises the SVC's memory structures in a certain way. At the same moment, one of the SVC nodes involved tried to destage its write cache (which, of course, also involves the SVC memory structures). It was this very specific and simultaneous conjunction of events that exposed the code bug; we had done such migrations several times already without negative impact. The likelihood of the two events coinciding is very low, so the bug was not exposed in IBM's testing -- which is why they do not already have a fix in hand. Now that they have identified it, with help from our team, they are working on a fix, with an expectation that it might be forthcoming next week.

The bug locked portions of one SVC node's memory structures, and the condition eventually propagated to the other nodes in the cluster -- hence the gradual degradation of performance that manifested itself fully the morning of May 13th. The net effect was that SVC write caching eventually stopped because the memory structures were locked. When we restarted the SVC node around midnight May 14th, it cleared all the memory structures on the SVC.

The good news is that none of our current managed disk groups is empty. IBM tells us that migrating a virtual disk into an empty managed disk group is one of the two factors (the other being that the migration coincide with the subsecond window in which normal cache destaging occurs) required for the race condition to occur. So, as long as we do not create a new, empty managed disk group and migrate a virtual disk into it before the code fix arrives, we should have no recurrence. The other good news is that IBM has found the root cause and will provide us with a fix.
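To make the "two simultaneous events" idea a bit more concrete, here is a toy model (again Python, again entirely invented -- this is not how IBM's code actually works, and the collision is represented by a simple flag rather than real thread timing). It only illustrates how two individually harmless operations, colliding in a tiny window, can leave shared buffers permanently locked until a restart clears them:

    # Toy illustration only: invented names, invented mechanics.
    class NodeMemory:
        """Shared buffers used both for write caching and for migration staging."""
        def __init__(self, total=8):
            self.total = total
            self.held = 0       # buffers briefly held by a normal operation
            self.stuck = 0      # buffers the (hypothetical) bug never releases

        def free(self):
            return self.total - self.held - self.stuck

    def destage_write_cache(mem, colliding_with_migration_into_empty_group):
        """Flush dirty write-cache data to disk, briefly holding some buffers."""
        mem.held += 2
        if colliding_with_migration_into_empty_group:
            # The rare interleaving: destage fires during the subsecond window
            # in which a migration into an *empty* managed disk group is using
            # the same structures. In the buggy path the buffers never come back.
            mem.held -= 2
            mem.stuck += 2
            return "buffers stuck"
        mem.held -= 2           # normal case: buffers are released immediately
        return "ok"

    mem = NodeMemory()
    print(destage_write_cache(mem, False), "free buffers:", mem.free())  # ok, 8
    print(destage_write_cache(mem, True),  "free buffers:", mem.free())  # stuck, 6
    # As stuck buffers accumulate (and the condition spreads to other nodes),
    # write caching eventually cannot proceed at all -- roughly the slow
    # degradation we saw, cleared only by restarting the node.

_______________________________________________
Nolenet maillist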