Server grief - Ed's journal
sobrique
Server grief
Well, the NAS box finally sputtered into life when we restarted it at 4pm.
All good fun. Glad I can go home with a 'job complete'.

Of course, because we had 'useful' numbers of people unable to do anything most of the day, I'll be having to write an incident report.

I'm not worried there, as really there wasn't anything else I could have done, and it wasn't my screwup. Actually, I don't think it was anyone's screwup, but it depends a lot on whether there's witch hunting going on.

Thankfully, I was able to restore stuff from last night's backup, so the factory didn't have to halt production. That could have sucked.

It went something like this.

Tuesday, I get in to work. I'm told that there'd been problems with this fileserver yesterday, a case had been raised with the vendor, and it was failed over onto the standby. I restored it to the primary, and all was ok.

Wednesday, it went splat again, and when restored, went splat once more after 2 hours.

Thursday, it was running on the standby, and around 50 users were 'having problems' with their files. Our vendor came back saying they suspected that filesystem corruption (not data, just filesystem) was causing it to crash, and causing the problems on the standby.

So we took the NAS device offline at 2100 or so, to do a filesystem consistency check, and an ACL check.
At 0730 the next morning, I got a call that it hadn't yet completed. We tried to bring the datamover online, but it just crashed again. We started remedial action, repairing the filesystems one at a time. By lunchtime, filesystems 1 and 4 were available again. 3 was available at 2pm. 2 couldn't be brought online without a reboot of the datamover, which after consultation with customer representatives we did at 4pm.

All fixed. Once 'serious issues' were being reported, the response was very quick, and we kept working until the problem was resolved. Unfortunately, with 800GB of data, in the form of lots of small files, checking and repairing filesystems can take quite a bit of time.
