Server room joy - Ed's journal
Well, what happened was this.

Our aircon units are on our 'dirty' supply, not the 'clean' UPS supply. They're designed to fail over to the clean supply in an outage, restart and carry on running. (Our 'clean' supply is a flywheel generator affair, and it's relatively more expensive to run.)
There was a power spike in Rugby at about 19:30. It wasn't enough to trigger a 'fail over', but it _was_ enough to cause our aircon controller modules (there are several independent ones, with sufficient overcapacity to compensate for a single failure) to crash.

And one of our servers did this:

When someone got on site, the computer room temperature was hovering around the 40-45 degree mark. And as you can see from that lil' graph, the system board and processor on that computer were getting up to the realms of 'bad news'. (The gap is the point where the server in question decided it had had enough and shut down.)

Consequence? Things with thermal cutouts activated them and shut down. Things without either carried on and survived, or stewed, crashed, or burned out power supplies or memory modules.
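For anyone wondering what a 'thermal cutout' amounts to in logic terms, here's a minimal sketch. It is not what our hardware actually runs: the function name, the `Action` enum and both thresholds are made up for illustration, and real kit sets its own limits in firmware.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"
    WARN = "warn"
    SHUTDOWN = "shutdown"


# Hypothetical thresholds in degrees Celsius, chosen only for this example.
WARN_AT_C = 60.0
SHUTDOWN_AT_C = 75.0


def thermal_action(temps_c, warn_at=WARN_AT_C, shutdown_at=SHUTDOWN_AT_C):
    """Pick an action from the latest sensor readings; the hottest sensor wins."""
    if not temps_c:
        raise ValueError("no temperature readings supplied")
    hottest = max(temps_c)
    if hottest >= shutdown_at:
        return Action.SHUTDOWN
    if hottest >= warn_at:
        return Action.WARN
    return Action.OK
```

A box with something like this shuts itself down cleanly once any sensor crosses the upper limit. A box without it just keeps cooking until something gives.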

The following morning we had 6 dead boxes, 3 of which would start up after reseating 'stuff' and telling them to do a full test. The other 3 required assorted bits of hardware replaced.

And of course, the major casualties were hard disk drives. One server had 3 out of 4 'dead', but mostly we had mirrored disks and only one went 'pop'. The problem is, though, that we've severely stressed all the hardware in our computer room. I think we can expect a notably increased failure rate for the next few times we restart any of them.

But all in all, thanks to a rapid response from several of the people involved (ironically, not including the guy in our department who was _actually_ on call), the level of disaster was 'couple of hours down, and a few things not quite right' rather than the complete disaster it could have been, had one of my workmates not noticed early enough. (Things that shut down were just down. Things that didn't would have been dead by morning.) Annoyingly, one of the first things to fail was our systems monitoring server.

Sometimes these things just happen, but I'm pleased to see that all our contingencies kicked into play and the 'disaster' was well controlled.