Backups - Ed's journal
So, my most recent mission at work has been backups. If you've ever really had to think about backups, then... well, it's one of those little floating chunks of ice on the surface that hide an iceberg underneath.
What we're doing is database backups. For around 100 Oracle databases. And we're doing it on them all whilst they're 'hot' - all running, with no interference to the running system.
This is a usefully large volume of data - of the order of 50TB.
The objective is to be able to recover any of them within 4 hours with no data lost. Given they're live and active systems... well, it's non-trivial to pull that off.

When talking about backups, you talk about Recovery Time Objective, and Recovery Point Objective.
RTO is the time you have to bring a system online after a failure.
RPO is how much data you're allowed to 'lose'. Whilst 'losing data' sounds scary, bear in mind that chances are most of the things you back up have some data loss - because everything you do _after_ the backup isn't backed up any more.
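
To make the two terms concrete, here is a minimal Python sketch that checks one recovery against its targets. It is an illustration only, not part of the actual backup scripts: the function name `check_recovery` and the timestamps are made up, and the defaults (4 hours, zero loss) are simply the objectives stated above.

```python
from datetime import datetime, timedelta

def check_recovery(failure, last_recoverable, back_online,
                   rto=timedelta(hours=4), rpo=timedelta(0)):
    """Compare one recovery against its RTO and RPO targets."""
    downtime = back_online - failure        # time taken to bring the system back
    data_lost = failure - last_recoverable  # work done after the last recoverable point
    return {
        "downtime": downtime,
        "data_lost": data_lost,
        "rto_met": downtime <= rto,
        "rpo_met": data_lost <= rpo,
        "rto_margin": rto - downtime,       # negative if the RTO was missed
    }

# Made-up timestamps, purely for illustration.
failure = datetime(2011, 5, 26, 22, 0)
result = check_recovery(
    failure=failure,
    last_recoverable=failure,  # nothing lost: recoverable right up to the failure
    back_online=failure + timedelta(hours=3, minutes=56),
)
print(result["rto_met"], result["rto_margin"])  # True 0:04:00
```

The margin is the useful number to watch: it shows how much slack is left for callouts and getting people out of bed.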

So an RTO of 4 hours and an RPO of zero (well, near zero) is pretty aggressive, given that you may need to call someone out, get people out of bed, etc.
What we're doing to achieve this is using some storage array tricks. We've got two Symmetrix VMAX storage arrays, in two separate datacentres. On the 'primary' side, we're taking snapshot copies of the databases, at 4-hourly intervals through the day.
On the 'remote' side, we're taking a clone copy, and backing that up.

More storage terminology: A snapshot of a disk is a point-in-time copy of a disk. It's achieved by keeping track of changes. So initially, a snap is zero size, but every time you change something after that snap is taken, the change gets recorded. So you can quickly 'flip it back' to how it was.
A clone is a full copy of a disk, at a disk level - disk signatures, deleted files, the lot.
The advantage of these - as opposed to copying files - is they're actually fairly quick to finalize. Which means we can 'pause' our databases for a matter of a couple of minutes (or less) whilst we take our snap or clone, without anyone really noticing.
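
The snapshot/clone difference can be sketched with a toy in-memory 'disk'. This is a simplified illustration of the copy-on-write idea described above, not how a Symmetrix array actually implements it; the `Disk` class and its methods are invented for the example.

```python
class Disk:
    """A toy disk: a list of blocks, plus any snapshots taken of it."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        """Take a snapshot. It starts empty (zero size) and fills up on writes."""
        snap = {}  # block index -> content as it was when the snap was taken
        self.snapshots.append(snap)
        return snap

    def clone(self):
        """Take a clone: a full, independent copy of every block."""
        return Disk(self.blocks)

    def write(self, index, data):
        """Write a block, first saving the old content into any snapshot
        that has not already recorded that block."""
        for snap in self.snapshots:
            if index not in snap:
                snap[index] = self.blocks[index]
        self.blocks[index] = data

    def restore(self, snap):
        """'Flip back' to the snapshot by putting the saved blocks back."""
        for index, original in list(snap.items()):
            self.write(index, original)


disk = Disk(["a", "b", "c"])
snap = disk.snapshot()   # instant, nothing copied yet
copy = disk.clone()      # every block copied up front
disk.write(1, "B")       # the snapshot now holds the old block 1
disk.restore(snap)       # disk.blocks is ["a", "b", "c"] again
```

The snapshot only ever costs as much space as has changed since it was taken, while the clone costs a full copy but no longer depends on the source disk at all.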

So what I've had to do is write some scripts to make this happen - there are products that can do this, but because of the environment (and timescales) we're working with, they're not an option.
I've been writing a set of Perl scripts that are run via a product called Tivoli Workload Scheduler (TWS). They run on Solaris, and 'do the business' of setting hot backup on a (remote) database, creating a clone copy, mounting the clone copy, and streaming it to tape.
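
The order of those steps can be sketched as follows. This is a Python outline rather than the Perl that actually runs, and every step is passed in as a hypothetical callable, since the real commands for the database, the array and the tape system aren't shown here. One assumption is made explicit: the database only stays in hot backup mode while the clone is taken, and it is taken out of that mode even if the clone fails.

```python
def run_backup(db, begin_hot_backup, end_hot_backup,
               create_clone, mount_clone, stream_to_tape):
    """Run the backup steps in order for one database."""
    begin_hot_backup(db)
    try:
        clone = create_clone(db)   # the only step the database is 'paused' for
    finally:
        end_hot_backup(db)         # always release the database, even on failure
    mount_point = mount_clone(clone)
    stream_to_tape(mount_point)    # the slow part runs against the clone only
    return mount_point


# Stand-in steps that just record what was called (all names hypothetical).
log = []

def step(name, result=None):
    def run(arg):
        log.append(f"{name}({arg})")
        return result
    return run

run_backup(
    "DB1",
    step("begin_hot_backup"),
    step("end_hot_backup"),
    step("create_clone", "DB1_clone"),
    step("mount_clone", "/mnt/DB1_clone"),
    step("stream_to_tape"),
)
print(log)
```

Keeping the tape streaming outside the hot backup window is the point of the clone: the live database is only held for as long as the array needs.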

I've also been working on a scripted solution that does the snapshots - I'm quite proud of this, as it's not really a trivial matter to manage rotating snapshots - automatically, and on a large number of 'source' devices, which means you can't easily/feasibly do a 'brute force' approach (of defining all the individual relationships).
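
One way to avoid defining every relationship by hand is to derive the snapshot target from the source device and the time of the run. The sketch below is an assumption about how such a rotation could work, not a description of the real scripts: the device IDs, the `_snap` naming pattern and the choice of 6 slots (one day's worth at 4-hourly intervals) are all made up for illustration.

```python
from datetime import datetime

def slot_for(when, interval_hours=4, slots=6):
    """Pick the snapshot slot to overwrite for a run at time `when`.

    Runs are numbered consecutively across days, so slots are reused
    strictly in turn and the oldest snapshot is always the one replaced.
    Assumes interval_hours divides evenly into 24.
    """
    runs_per_day = 24 // interval_hours
    run_number = when.toordinal() * runs_per_day + when.hour // interval_hours
    return run_number % slots

def snapshot_plan(devices, when, interval_hours=4, slots=6):
    """Map every source device to the snapshot name to refresh on this run."""
    slot = slot_for(when, interval_hours, slots)
    return {dev: f"{dev}_snap{slot}" for dev in devices}


devices = ["0A1F", "0A20", "0A21"]  # hypothetical source device IDs
plan = snapshot_plan(devices, datetime(2011, 5, 27, 9, 30))
print(plan)
```

Because the plan is computed rather than listed, adding a source device means adding one entry to `devices`, not one entry per device per snapshot.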

But ... well, we're coming up to 'go live' on the project, and are just dotting Is and crossing Ts. Last night was the first time we'd done a recovery 'in anger' as it were (involving callouts, support etc.), and I was very pleased to find that we'd managed it with 4 minutes to spare. (I'm expecting that to only get better, as we get things a bit more streamlined.)

So it's all good, really. I'm kind of watching and prodding it as we go, and writing oodles of documentation, trying to clarify how it all fits together.
On the plus side - because I've known all along that this will be so - I've done my best to ensure that the scripts require minimal amounts of 'hardcore knowledge' to make them work. I think I'm mostly achieving that. I expect to find out that I'm wrong as I start doing knowledge transfer sessions, and doing 'early life support'.

But still, I've been quite enjoying this so far - it's a challenge that uses skills I've built up with storage arrays, Solaris, backups and Perl scripting (and XML parsing), and I've had to bootstrap myself into learning how Veritas Filesystem, NetBackup and TWS work.
It's been a satisfying challenge.
(So far, at least. Expect a rage post in a week, when it all falls apart again.)
5 comments
From: jorune Date: May 27th, 2011 11:13 am (UTC)
Good luck with your scripts.
From: gingerboy Date: May 27th, 2011 05:28 pm (UTC)
Sounds like you're having fun there with all that :-)

We've only just got around to migrating off our CX500/700/3-40 that we migrated to when you were still here - and discovered that there's still one server with a SAN LUN mounted as /fish. How many years is it since you left and there's still things that are explained with simply "That was Ed" :-)
From: sobrique Date: May 27th, 2011 05:31 pm (UTC)
Hmm, that's almost certainly one of mine, yes.
I'm trying to remember now which one it would be.

Still, nice to know my legacy
From: gingerboy Date: May 27th, 2011 07:52 pm (UTC)
BVGRUGS11 - the server which received the BV snapshots for synchronised backups across all BV servers. The only thing under /fish were backup tarballs from 2005, nothing actively using it :-)
From: sobrique Date: May 31st, 2011 08:50 pm (UTC)
Hmm, the irony is this is very nearly the same sort of thing. Somewhat larger environment, and a higher frequency of snap/clone, but .... basically the same operation.

Only this time, I'm the one that's doing the implementing. I _think_ I may have been working with the same guy in EMC too.