Server crashing: The saga continues - Ed's journal
Well, it looks like we've got to the bottom of the problem with our NAS crashing.

We got official word back: the version of mpfs (multiprotocol filesystem) used by our EDM (backup server) is not compatible with the version of NAS code (the software running on kite, vulture and harrier). So our EDM _had_ been cacking NTFS permissions right, left and centre on the box, which led to our problems. I.e. after a certain level of corruption, it just started bombing out, and when it failed over to the standby, that responded in exactly the same way, because it was using the same data and had the same software version.

So over the weekend there was an outage for an update, and the filesystems were checked (which resulted in various people getting woken at odd hours to 'verify').

Partial bedlam and carnage on Friday, which I also managed to completely miss.

So not only did I have a marvelous day on Friday at the all-day bar crawl, I also managed to completely dodge all the problems over that 3-day period.
Which I would say I was upset about, but ... well I'd be lying through my teeth.

I'm sure my co-workers'll be happy that I'm not hogging _all_ the overtime and callouts :)

Now all we've got to do is tidy up the fallout from having 'broken' NTFS permissions all over the place. Thankfully it's a fairly simple 'one group permission per share' sort of setup, so it's not too bad to redo.

The good news, though, is that it's not something I should, or even could, have been aware of as being a problem. The vendor in question is rather apologetic about the failure of their change control processes.


