Ugh - Ed's journal
Lurgy day yesterday and today.
Not 'I'm dying' level of ill, more a 'well, maybe some lemsip would be good' kind of ill. So, last night I prescribed myself pizza-therapy. It's good. Get a spicy pizza. Oreganos Mexicano pizza is perfect. And eat it, with some barbecue sauce, or mayonnaise, or both.

It's really very therapeutic.

But anyway, I was thinking that a bit of doing not very much, and an early night would be lovely. Sadly, this was not to be.

You see, earlier in the day (i.e. the morning) we had a 'director fault' on one of our Symms. A director is a SCSI controller for one of our drive control loops. This is not a big deal - it's a bit of hardware to be swapped out; the biggest annoyance is arranging with security to get the engineer and the delivered part in the same place.

The engineer went on site, and I handed over to a colleague to go home at the usual time.

Sadly, it was not my day. You see, that director fault was the result of a fault on the Fibre Channel loop on the back end. Two drives had decided that they were 'unhappy' and were going to have a bit of a sulk. Now, the idea of RAID is that it's redundant disks. So one can fail, and you're fine. Even if two go wrong, then ... well, that's usually OK too, because you only have a problem if the second failure hits a matching drive.

You can see where I'm going, can't you? Yes, it seems on this particular occasion our luck was 'in'. We hit the long-odds shot: the two drives in question were a matched pair. Out of 144 total disks, the two that failed were the two sharing the same bit of 'redundancy'. Any of the other 143 drives going down would have triggered more error lights, but everything would still have been fine. In fact, if we lose the 'right' ones, we can survive losing something like 20 separate drives - and more still if the interval between failures is long enough for the hot spare drives to come online.

But anyway. Basically, that was an 'oops, your data is gone' moment. Cue some panic, and problems, as I (as the current on-call dude) got to try and figure out which servers had lost data as a result. They hadn't lost _much_ data, but ... well, even a little bit means data corruption, which is ... probably actually even worse than completely gone.

Now, given mean time between failure, and probability of the two drives going down, the odds are really very low indeed.
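To put a rough number on that long-odds shot - this is my own back-of-the-envelope sketch, assuming the 144 disks are arranged as mirrored pairs, which the post implies but doesn't state outright:

```python
def matched_pair_odds(total_disks: int) -> float:
    """Given that one disk has already failed, return the chance that a
    second simultaneous failure hits its mirror partner.

    Assumes disks are grouped into mirrored pairs: exactly 1 of the
    remaining (total_disks - 1) drives is the failed disk's partner.
    """
    return 1 / (total_disks - 1)

# With 144 disks, a double failure lands on a matched pair
# only about 0.7% of the time.
print(f"{matched_pair_odds(144):.4%}")
```

Fold in the mean time between failure and the chance of even having two drives down in the same window, and the overall odds get very small indeed - which is exactly why it stings when it happens to you.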

This morning, it appears we've managed to get away with it - a 'chkdsk' on each filesystem was all that was needed to put things back together again, because the 6 servers affected were all on 'small', lightly-used drives - so the little bit 'missing' was unimportant.

However it's still disturbing, as that could have been much much nastier.

(Yes, we have backups. But recovering the data from tape is ... fairly straightforward, if a little time-consuming. The real problem is the outage as you try and put stuff back together in the shape that it was - full restores are undesirable, and actually don't always work all that well.)