Questions and Answers
Jim McKinstry and Amy Rich
Here's a problem and solution that was sent to me by Bryce Nutter and Steve Harad of u1.net in Marlton, New Jersey. Please keep these coming. Thanks -- Jim.
We are running Solaris 2.5.1 with the latest patches on a Sun UE4500 with a Sun 5200 storage array and six 9-GB disks, and we are running Veritas 3.0.3 with DMP. DMP is Dynamic Multi-Pathing, which provides host bus adapter failover as well as load balancing across the host bus adapters. We ran a test in which we did a tar of a large filesystem on the 5200 array and pulled the fiber connection from the storage array while the tar was running. The time to "find and take" the alternate path was consistently 40-50 seconds; we tested this half a dozen times. Our question was: Why is DMP taking so long to fail over to the alternate path?
We tried setting the DMP restore interval to 10, 30, 60, and 500 seconds (for example, vxdmpadm start restore interval=10). These values had no effect on the failover time, which was always 40-50 seconds. This makes sense because the restore interval defines the restore time, not the failover time. When we plugged the fiber connection back in, it took interval seconds to re-enable the path.
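To see why the restore interval governs only how quickly a repaired path comes back, here is a minimal Python sketch of a polling restore check. It is a toy model, not Veritas code: the function name, the assumption that polls fire at exact multiples of the interval, and the 125-second replug time are all made up for illustration.

```python
import math

def seconds_until_reenabled(replug_s, interval_s):
    """Wait between re-plugging the cable and the next restore poll.

    Hypothetical model: polls fire at t = interval_s, 2 * interval_s, ...
    and a poll re-enables a healthy path immediately.
    """
    next_poll_s = math.ceil(replug_s / interval_s) * interval_s
    return next_poll_s - replug_s

# The four intervals tried in the test, with a made-up replug time of 125 s.
for interval_s in (10, 30, 60, 500):
    print(interval_s, seconds_until_reenabled(replug_s=125, interval_s=interval_s))
```

In this model the wait is bounded by the interval, so a longer interval means a longer worst-case delay before the path is re-enabled, and nothing in it touches the failover time.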
We worked with Sun and Veritas, and asked Jim McKinstry. The general consensus is that it does take at least a minute to fail over. The reason is that the lost connection is not detected until the I/O request times out, which can be a "long" time. The error detection is passive: the DMP software does not actively poll devices for errors. It waits for I/Os to time out and then searches for the failover device. Check the kernel driver for your FC card. It should have a timeout parameter. Don't set it too low, or you will fail over every time there is contention on the card/bus, which would really degrade your performance.
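The passive-detection behavior described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the DMP implementation: the function names, the 40-second timeout, the 5-second path search, and the latency samples are all hypothetical.

```python
def failover_delay_s(io_timeout_s, path_search_s, restore_interval_s):
    """Time from cable pull to I/O resuming on the alternate path.

    Passive detection: nothing happens until the outstanding I/O times out,
    and only then does the search for an alternate path begin.
    restore_interval_s is accepted only to show that it plays no part.
    """
    return io_timeout_s + path_search_s

def spurious_failovers(latencies_s, io_timeout_s):
    """Count healthy-but-slow I/Os that would be mistaken for a dead path."""
    return sum(1 for latency_s in latencies_s if latency_s > io_timeout_s)

# Hypothetical numbers: a 40 s I/O timeout and a 5 s path search.
for restore_interval_s in (10, 30, 60, 500):
    print(restore_interval_s, failover_delay_s(40, 5, restore_interval_s))

# Made-up I/O latencies (seconds) on a busy card/bus with no real failure.
busy_bus_latencies_s = [0.02, 0.5, 3.0, 8.0, 0.1]
print(spurious_failovers(busy_bus_latencies_s, io_timeout_s=60))
print(spurious_failovers(busy_bus_latencies_s, io_timeout_s=2))
```

The first loop gives the same delay for every restore interval, which matches what the test showed. The second part shows the trade-off: lowering the timeout shortens failover, but slow I/Os on a healthy path start to look like failures.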
The details, based on the message logs, are as follows: When the cable is pulled, the ssd driver fails the I/O.