Backups - Ed's journal
I've just been told by 'directorial fiat' that the backups of our Oracle cluster shall be stopped henceforth, because one of the apps is running a bit slow.
I don't like that.

We are tasked with looking for workarounds for the problem. The problem is that our backup data volumes have increased dramatically (we're moving around 2-3 TB a night) and our network infrastructure hasn't been upgraded in 5 years. (We have 100Mb to every desktop, and a 155Mb backbone.) So we are faced with wallpapering over cracks.
I don't like that either.

Oh, and the person having the problem whinged at his director, who has in turn whinged at our director, who has whinged at our management team, who have whinged at us. Which is good really, because that was the first we'd heard about it.
I don't like that either.
14 comments
From: tenuous Date: January 20th, 2006 11:18 am (UTC)
Other than that, you're happy, though?
From: sobrique Date: January 20th, 2006 11:59 am (UTC)
Friday; beer soon.
From: jorune Date: January 20th, 2006 12:01 pm (UTC)


No backups, are they mad?

Are they sure it's a network capacity issue? Are they even sure what a network is?

Just when is this random act of management supposed to kick in?
From: sobrique Date: January 20th, 2006 12:07 pm (UTC)
Backups have already been stopped. :/

They are not sure it's a capacity issue. That's only what my colleague and I have been saying; we haven't had a consultant in, so we cannot have an official opinion on the matter.

We have a network bottleneck that's running at 40-80% all day. We have a backup server that's (over)running at 100GB/hour or more for 14-16 hours per day. With 8 tape drives, any 'glitch' has notable knock-on effects (such as a server with a bad network link tying up a tape drive for 16 hours).
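
As a rough sanity check on those figures, here is a back-of-envelope sketch in Python. It assumes the backup volumes are in bytes (GB and TB) and the link speeds in bits (Mb), uses decimal units, and ignores protocol overhead (ATM cell overhead in particular would make the real numbers lower). The function names are made up for illustration.

```python
def link_gb_per_hour(mbit_per_s: float, utilisation: float = 1.0) -> float:
    """Convert a link speed in megabits/s to gigabytes/hour (decimal units)."""
    bytes_per_s = mbit_per_s * 1_000_000 / 8 * utilisation
    return bytes_per_s * 3600 / 1_000_000_000


def hours_to_move(volume_gb: float, rate_gb_per_hour: float) -> float:
    """Hours needed to move volume_gb at a steady rate_gb_per_hour."""
    return volume_gb / rate_gb_per_hour


# Best case for the 155Mb backbone, with nothing else using it.
backbone = link_gb_per_hour(155)
print(f"155Mb backbone: {backbone:.2f} GB/hour")

# A single 100Mb server link, again with nothing else using it.
print(f"100Mb link:     {link_gb_per_hour(100):.2f} GB/hour")

# Assumed nightly volume of 2.5 TB (midpoint of the 2-3 TB in the post).
print(f"2.5 TB over the backbone: {hours_to_move(2500, backbone):.1f} hours")
```

On those assumptions a completely idle 155Mb link tops out at roughly 70 GB/hour, which suggests a 100 GB/hour backup stream cannot all be crossing the backbone, and whatever does cross it is competing with the daytime traffic.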
From: darkgodfred Date: January 20th, 2006 12:59 pm (UTC)
Is it wrong of me to suggest a data failure to demonstrate precisely why the God of Backups must be appeased?
From: jorune Date: January 20th, 2006 01:04 pm (UTC)
Just what do they expect to happen when they need to go back to a backup? I imagine the business's expectation will be that it'll just happen. What do your manager and director expect to happen, knowing that with each day that goes by the business is not supported and stands to lose all its effort?

Presumably your manager and the director have gone back to the business and told them that if they want a faster network/no backups then they will have to accept the risk, in writing. Or would that be putting too much trust and credibility in the management?
From: sobrique Date: January 20th, 2006 02:03 pm (UTC)
I have overused the:
If a car park is full, it is not faulty, it is just full.
If a road is congested, it is not faulty, it's just busy.
From: jorune Date: January 20th, 2006 02:34 pm (UTC)
I think Adam and I are wondering whether there are service agreements in place. If so, does the choice to stop the backups now violate those service agreements?
From: sobrique Date: January 20th, 2006 02:37 pm (UTC)
Probably. I'm planning to start the backups again this evening anyway, simply because I'm not prepared to accept being the scapegoat.
From: xarrion Date: January 20th, 2006 01:21 pm (UTC)
Make sure you have it in writing/hardcopy that you've been told to stop backups. If it goes wrong now, you know who they're likely to blame ;)

I'd also raise the issue with management about correct error reporting procedure. Do you have Service Level Requests or equivalent business mumbo-jumbo over at your end?

Having said that, it could be one of those unintentional Chinese-whisper cascades you get with management, where you say 'this network is a bit slow', your boss overhears and goes to his manager with 'my people can't work due to slow IT', and so forth.
From: sobrique Date: January 20th, 2006 02:02 pm (UTC)
In writing? What, so I can prove they screwed up? I don't think so.
Chinese whispers is a definite possibility, especially when you have techy -> manager -> director -> director -> manager -> techy.

Which is one of the reasons We Don't Do It.
(Deleted comment)
From: sobrique Date: January 20th, 2006 02:44 pm (UTC)
I'm doing arse covering at the moment.
I _do_ have a fallback plan on how to recover, but it's ugly.
Oracle databases really suck to rebuild :/
From: warmage Date: January 20th, 2006 03:10 pm (UTC)
Oh man do I feel you on the "we haven't had a talking head brought in to tell us what we already know" thing.

When was the last time you were asked to submit a buildup plan for disapproval? I recall in the last year you've added at least one storage appliance, and most of your kit is ready for gigabit or fibre-channel, right?

The way I feel it, the data is looming ever closer to crashdown, simply by the way the fates conspire to make it increasingly difficult to do effective recovery.

What happens if you start cold calling offsite data storage shops and get big fat blue-sky quotes from them to contrast against a backbone upgrade?

This may not be a terribly dangerous situation now, Ed. You know, though, that the length of time left before it is one is directly proportional to the MTBFR! (At this point you don't *really* trust this kit, so much as add another wad of chewing gum, aye?) And didn't you just come back up from some nightmare a couple of weeks ago??
From: sobrique Date: January 20th, 2006 03:22 pm (UTC)
We have a SAN. It's about 30TB these days. Our backup server was sized for the previous incarnation, which was 9TB. This is excluding the 'server storage creep' where 100-200GB of storage is now the norm.
We have a _fair_ number of gigabit-capable machines. However, our backbone remains ATM, 155Mb. And our 'server' network is 95% 100Mb, with a few gigabit links that don't do a lot of good, because they have nowhere to go.

Basically, we've already had the 'we're understaffed' meltdown, we're getting to the 'our kit is too old' meltdown.