Ed's journal - sobrique
Compound Probabilities, RAID 5, RAID 6, mean time between failures
This week, I've been trying to work out the relative merits of RAID5 vs. RAID6 as a method of disk protection.
I won't bore you with details of implementation, but the essence of it is this - RAID 5 is a set of disks, of which one is set aside for parity and error correction.
Losing any single disk within a RAID 5 set means you're fine, but losing a second means you lose the lot.

RAID 6 is - more or less - the same thing, but with dual parity. I.e. in a given sized set of disks, you use two for parity - such that you can lose any two from your set safely, and a third will take your set of disks out.

Into this mix, you have hot spares - a hot spare is _another_ disk, set aside on its own, to take the place of a failed drive.

So what I'm trying to figure out is - given a mean time between failures of the drives (1 million hours), how much better - or worse - are the different RAID types?

When you 'lose' a drive, you have a window of exposure until the rebuild occurs, or your drive is replaced. I know the chance of a failure in that window is (very) low. However, I'm talking in terms of large arrays of drives - 1000 disks or so - and at that scale, 'pretty remote' odds actually start to rack up, and even a 'fairly remote' chance of a critical data loss is bad.

So I'm working on 3 'choices' here:
RAID 5, 3+1
RAID 6, 6+2
Both these 'types' waste 25% of capacity, and therefore cost the same.
For comparison, I'm also considering RAID 5, 7+1.

Now, the number crunching goes thus:
MTBF of 1 million hours.
Assume a maximum window of 96 hours before a failed drive is replaced and back in service. (Typically it'll be less).

Given 240 drives to put my data on, in which _any_ RAID set loss results in total data loss. (So if one group of drives goes pop, I have to recover the whole lot.) (In case you're interested, they're probably 300GB drives, so we're talking 54 TB of data - this is a lot of data to recover, so we'd rather not have to.)

And over a 3 year time period, how likely that circumstance is to show up.

So I make it:
MTBF 1 million hours.
Chance of failure in a given 96 hour block - 0.000096

Taking a 4 disk set - chance of any single failure is:
1 - ( 1 - 0.000096 ) ^ 4 = 0.00038
From an 8 disk set, same logic:
1 - ( 1 - 0.000096 ) ^ 8 = 0.00076

Twice as many drives, nearly twice the chance of a failure occurring. (It's not -quite-).
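
If you want to sanity-check that arithmetic, here's a minimal Python sketch of the sums so far (the constant and function names are mine, purely for illustration):

```python
# Figures assumed from the post: 1 million hour MTBF, 96 hour exposure window.
MTBF_HOURS = 1_000_000
WINDOW_HOURS = 96

# Approximate chance that any single drive fails in a given 96 hour block.
p_drive = WINDOW_HOURS / MTBF_HOURS   # 0.000096

def p_at_least_one(n_drives, p=p_drive):
    """Chance that at least one of n_drives fails inside the window."""
    return 1 - (1 - p) ** n_drives

print(p_at_least_one(4))   # ~0.000384 - the 4 disk set
print(p_at_least_one(8))   # ~0.000768 - the 8 disk set
```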

So with the R5 set, the first drive failure is OK. A second is a total loss.
So what's the chance of losing a second drive out of your 4 disk set?
For R5, 3+1 we've got:
99.961% chance that of 4 drives none fail.
99.971% chance that of 3 drives, none fail.

So -
3.839 x 10^-4 x 2.88 x 10^-4
= 1.1 x 10^-7.
An 11 in 100,000,000 chance of occurring.

For the RAID 5, 7+1:
Chance of none of the 8 failing is 'chance of not failing' ^ 8.
So 99.923%.

Chance of the remaining 7 drives all surviving the same 96 hour
window is:
'chance of not failing' ^ 7.
So 99.933%.

7.677 x 10^-4 x 6.72 x 10^-4 = 5.16 x 10^-7.
A 52 in 100,000,000 chance of occurring.

And for the RAID 6, 6+2, the chance of no failure at each stage:
First drive: 99.92322580%
Second drive: 99.93281935%
Third drive: 99.94241382%

Which means RAID 6, 6+2 has a 2.97 x 10^-10 chance of that scenario.
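
Here's the same per-set calculation rolled into one small Python function - a sketch that just repeats the simplification above (every 'extra' failure is another independent any-of-the-remaining-drives event inside the same 96 hour window), not a proper reliability model:

```python
MTBF_HOURS = 1_000_000
WINDOW_HOURS = 96
p_drive = WINDOW_HOURS / MTBF_HOURS   # per-drive chance of failing in the window

def p_any(n, p=p_drive):
    # Chance that at least one of n drives fails in the window.
    return 1 - (1 - p) ** n

def p_set_loss(set_size, parity_drives, p=p_drive):
    """Chance of losing more drives than the set's parity can absorb,
    all within one shared 96 hour window, failures treated as independent."""
    prob = 1.0
    for k in range(parity_drives + 1):
        prob *= p_any(set_size - k, p)
    return prob

print(p_set_loss(4, 1))   # RAID 5, 3+1: ~1.1e-7
print(p_set_loss(8, 1))   # RAID 5, 7+1: ~5.2e-7
print(p_set_loss(8, 2))   # RAID 6, 6+2: ~3.0e-10
```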

Now, that's where I get stuck - on the face of it, R6 seems 1000x more reliable than either RAID5, 3+1 or RAID5, 7+1.

If you multiply out across 240 drives, you've got 60 4-drive sets, or 30 8-drive sets.
I think you can apply the same rationale to that:
Probability of failure is 1 - ( 1 - probability for one set ) ^ number of sets.

So 240 drives:
R5, 3+1 = 6.63E-006
R5 7+1 = 1.55E-005
R6 6+2 = 8.91E-009
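
Scaling a per-set figure up to the 240 drive pool is one more compound step - the same sketch style as before, feeding in the per-set numbers from above:

```python
def p_pool_loss(p_one_set, n_sets):
    # Chance that at least one of n_sets suffers an unrecoverable loss in the window.
    return 1 - (1 - p_one_set) ** n_sets

print(p_pool_loss(1.1e-7, 60))    # RAID 5, 3+1: 60 sets of 4 -> ~6.6e-6
print(p_pool_loss(5.16e-7, 30))   # RAID 5, 7+1: 30 sets of 8 -> ~1.55e-5
print(p_pool_loss(2.97e-10, 30))  # RAID 6, 6+2: 30 sets of 8 -> ~8.9e-9
```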

Now, the bit where I get a bit stuck - rolling the time window over 3 years. We're talking about a Poisson distribution (I think?). Can I just take my '96 hour' chance of failure, and do compound probability?
Making the R6, 6+2 scenario - over 3 years = 26280 hours.
Our number is over 96 hours - of which there are 273 chunks.
So ... 1 - ( 1 - 8.91 E-009 ) ^ 273
= 2.43E-006

So, a 2 in a million chance of having a really, really bad week.
Does my number crunching work out correctly though?

The same sum for the RAID 5 options gives:
R5, 3+1 = 1.81E-3
R5, 7+1 = 4.22E-3
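
And the 3 year figures come from the same 'at least once in N windows' step - again a Python sketch of the compounding as done above, not a Poisson model:

```python
WINDOWS_IN_3_YEARS = 26280 // 96   # 273 full 96 hour chunks in 3 years

def p_over_lifetime(p_per_window, n_windows=WINDOWS_IN_3_YEARS):
    # Chance the loss scenario turns up in at least one window over the 3 years.
    return 1 - (1 - p_per_window) ** n_windows

print(p_over_lifetime(6.63e-6))   # RAID 5, 3+1: ~1.8e-3
print(p_over_lifetime(1.55e-5))   # RAID 5, 7+1: ~4.2e-3
print(p_over_lifetime(8.91e-9))   # RAID 6, 6+2: ~2.4e-6
```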

So ... looking at it, R6 - in terms of pure reliability - is a thousand times safer than R5 in either configuration.
The tradeoff would be performance - RAID 6 carries a write penalty, since it must perform extra reads and writes to calculate parity for each write, and that penalty is higher than it would be with RAID 5 (approximately doubled - so roughly halving your write performance).
Comments
From: darkgodfred  Date: December 2nd, 2011 10:21 pm (UTC)
It's been ages since I did the relevant statistics but I get the feeling that you might want to use geometric distribution. Once a drive has failed you cease to care about any further probability for it - it's not going to fail again.

Thus the questions are

1) What is the probability that one of my drives will fail within three years?

and 2) Given that one of my drives has failed, what is the probability that a second will fail within 96 hours.

And then RAID 6 makes things complicated by asking what the probability is of a 3rd drive failure in whatever of my 96 hours is left, which I can't think of an elegant way of defining without actually specifying all 96 possible situations (P = P(failure of drive two in one hour and drive three within 95 hours) + P(failure of drive two in two hours and drive three within 94 hours) + ....)

Plus there's the fact that your probability of failure is not uniform - the same drive type likely bought from the same manufacturing batch and doing the same work is likely to fail at the same time.

I think you're safe to make the assertion that the same distortion applies to both RAID5 and RAID6 and can therefore be disregarded for means of comparison. But by the same logic you should also be able to substitute much easier numbers and still make the comparison between methodologies - 1/10 chance of failure per hour, 10 hours to replace, 1000 hours operating window, etc. Because the only thing that matters is the relative failure rates - not how likely a failure actually is.
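
The hour-by-hour sum described above doesn't need an elegant closed form - it can simply be brute-forced. A rough Python sketch, assuming a flat 1-in-a-million per-hour failure rate and the post's 96 hour rebuild window:

```python
P_HOUR = 1 / 1_000_000   # assumed per-drive chance of failing in any given hour
WINDOW = 96

def p_any_within(n_drives, hours):
    # Chance that at least one of n_drives fails within 'hours' hours.
    return 1 - (1 - P_HOUR) ** (n_drives * hours)

# RAID 6 case: one drive is already dead. Sum over the hour t in which a
# second drive (of the 7 survivors) fails, then require a third (of the
# remaining 6) to fail within whatever is left of the 96 hours.
p_two_more = sum(
    (1 - P_HOUR) ** (7 * (t - 1))      # the 7 survivors all make it to hour t
    * (1 - (1 - P_HOUR) ** 7)          # then one of them fails during hour t
    * p_any_within(6, WINDOW - t)      # and a third fails in the time remaining
    for t in range(1, WINDOW + 1)
)
print(p_two_more)
```
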
From: sobrique  Date: December 2nd, 2011 10:36 pm (UTC)
True. Linear failure rate on MTBF is a flawed assumption - you've the usual bathtub curve on drive failures, above and beyond the correlated failure.
I've ignored that, because it's a nuisance, and just sort of assumed an approximation.

I'll be happy if I'm in the right order of magnitude, but ... it seems a bit wrong to me that the difference would be quite as high as it looks like it is...
*shrug*.

I know the odds are remote, but it's because I'm looking at big environments that it's a concern. We're looking at doing virtual provisioning - it's a good trick that lets you define disk devices and allocate storage 'on demand' rather than in advance - allowing oversubscription.
But the drawback is, it splatters data across all the volumes in your 'pool' like muck out of a muckspreader. So you potentially lose the whole damn lot if you have a double fault. (Where in normal situations, it's painful, but the volume of data is an 8 disk 'set' to recover, rather than a 240 disk set).

The next trick will be storage tiering - that's another cool trick, that lets you make your virtual provisioned LUNs up out of different tiers of device.
So you could create a device that's 10% solid state, 40% fiber channel, and 50% SATA.
But the good bit is, the array will automatically reshuffle the data, based on usage profiles - so your 'intensive' bit stages up to SSD, and your 'junk' falls down into SATA.
Which given most usage profiles, saves a fortune on your disks - you see SSD performance, but end up buying more SATA as your array fills up.
From: mister_jack  Date: December 3rd, 2011 03:38 pm (UTC)
I would suggest that hard drive failures are not independent events. The drives in a computer are subjected to the same heat/cool cycles, the same electrical spikes and roughly the same usage patterns as well as usually being the same model of drive, purchased at the same time from the same supplier.

This all means that while, in principle, you can multiply up individual failure rates to get array failure rates; in practice, the chance of multiple failures is much higher than predicted by this method.
From: sobrique  Date: December 4th, 2011 12:54 am (UTC)
Modelling an accurate pattern would be useful, certainly. I'm not sure how it could be accomplished though - at least, not without making it completely inscrutable to the people weighing up the business case. We try to minimize power and thermal variance in a data centre. But drives are generally the same model and age. I was hoping a million-hour MTBF, though, would help, as you're generally talking tens of thousands of hours of operational life.
From: mister_jack  Date: December 4th, 2011 03:30 pm (UTC)
It cries out for empirical rather than modelled data, doesn't it? I wonder whether anyone has collected it.
From: sobrique  Date: December 4th, 2011 03:44 pm (UTC)
Probably - I was having a read of someone's thesis about RAID and reliability and the like.
I have no doubt that hardware manufacturers know, but are trying very hard to muddy the waters.
Hunting on Google doesn't elicit anything interesting.
(Deleted comment)
From: sobrique  Date: December 4th, 2011 04:17 pm (UTC)
Unfortunately in the 3 year time span we're talking about, by then it'll be a moot point.
From: mister_jack  Date: December 3rd, 2011 03:35 pm (UTC)
Me, I'd set up as a mirror array and sod the extra usage. RAID 5/6 aren't worth the performance drop.
From: sobrique  Date: December 4th, 2011 12:46 am (UTC)
It's a reasonable view. Write penalty of RAID 1+0 is much lower.
But on the other hand RAID 5 isn't as painful as it looks with reasonable prefetch and write cache. And the cost overhead of 12.5% for 7+1 (25% for the others) adds up to disgusting amounts - it's not just drive cost (although EFD and FC drives aren't exactly cheap) so much as data centre space, enclosures, clean power feeds, air conditioning, controllers and maintenance. The costs clock up quickly.