This week, I've been trying to work out the relative merits of RAID5 vs. RAID6 as a method of disk protection.

I won't bore you with the implementation details, but the essence is this - RAID 5 is a set of disks, of which one disk's worth of capacity is set aside for parity and error correction.

Losing any single disk within a RAID 5 set means you're fine, but losing a second means you lose the lot.

RAID6 is - more or less - the same thing, but with dual parity. I.e. in a given sized set of disks, you use two for parity - such that you can lose any two from your set safely, and a third will take your set of disks out.

Into this mix, you have hot spares - a hot spare is _another_ disk, that's set aside, on its own, to take the place of a failed drive.

So what I'm trying to figure out - given a mean time between failure of the drives (1 million hours), how much better - or worse - are the different RAID types?

When you 'lose' a drive, you have a window of exposure while the rebuild occurs or the drive is replaced. I know the chance of a further failure in that window is (very) low. However, I'm talking in terms of large arrays of drives - 1000 disks or so - and at that scale 'pretty remote' odds actually start to rack up, and even a 'fairly remote' chance of a critical data loss is bad.

So I'm working on 3 'choices' here.

RAID 5, 3+1

RAID 6, 6+2

Both of these 'types' waste 25% of capacity, and therefore cost the same.

For comparison, I'm considering RAID 5, 7+1.

Now, the number crunching goes thus:

MTBF of 1 million hours.

Assume a maximum window of 96 hours before a failed drive is replaced and back in service. (Typically it'll be less).

Given 240 drives to put my data on, in which _any_ RAID set loss results in total data loss. (So if one group of drives goes pop, I have to recover the whole lot). (In case you're interested, they're probably 300GB drives, so we're talking 54 TB of data - this is a lot of data to recover, so we'd rather not have to).

And over a 3 year time period, how likely is that circumstance to show up?

So I make it:

MTBF 1 million hours.

Chance of failure in a given 96 hour block - 0.000096
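As a sanity check on that figure, here is a minimal Python sketch. It assumes a constant failure rate (the usual exponential lifetime model, which is an assumption on top of the post, as are the variable names) and compares the simple window / MTBF division with the exact exponential form. At these scales the two barely differ.

```python
import math

MTBF_HOURS = 1_000_000   # quoted drive MTBF
WINDOW_HOURS = 96        # worst-case time to replace a failed drive

# Simple approximation: window / MTBF.
p_linear = WINDOW_HOURS / MTBF_HOURS

# Exponential lifetime model (assumed): P(fail within t) = 1 - exp(-t / MTBF).
p_exact = 1 - math.exp(-WINDOW_HOURS / MTBF_HOURS)

print(f"linear: {p_linear:.9f}")
print(f"exact:  {p_exact:.9f}")
```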

Taking a 4 disk set - chance of any single failure is:

1 - ( 1 - .000096 ) ^4 = 0.00038

From an 8 disk set, same logic:

1 - ( 1 - 0.000096 ) ^ 8 = 0.00077

Twice as many drives, nearly twice the chance of a failure occurring. (It's not -quite-).
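The "at least one of n drives fails" sum can be sketched in a few lines of Python. The function name is made up for illustration, and the whole thing assumes drive failures are independent.

```python
def p_any_failure(p_drive: float, n_drives: int) -> float:
    """Chance that at least one of n_drives fails (independent failures assumed)."""
    return 1 - (1 - p_drive) ** n_drives

P_DRIVE = 96 / 1_000_000  # per-drive chance of failing in one 96 hour window

p4 = p_any_failure(P_DRIVE, 4)
p8 = p_any_failure(P_DRIVE, 8)

print(f"4 disks: {p4:.6f}")
print(f"8 disks: {p8:.6f}")
print(f"ratio:   {p8 / p4:.6f}")  # a touch under 2
```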

So with the R5 set first drive fail is ok. Second is a total loss.

So chance of losing a second drive out of your 4 disk set is:

For R5, 3+1 we've got:

99.961% chance that of 4 drives none fail.

99.971% chance that of 3 drives, none fail.

So the chance of both happening is -

3.839 x 10^-4 x 2.880 x 10^-4

= 1.1 x 10^-7.

11 in 100,000,000 chance of occurring.
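The two-step RAID 5, 3+1 sum can be reproduced with a short Python sketch. It follows the same simplification as the working above (both failures land in the same 96 hour window, failures are independent), and the names are hypothetical.

```python
P_DRIVE = 96 / 1_000_000  # per-drive chance of failing in one 96 hour window

def p_any_failure(p_drive, n_drives):
    """Chance that at least one of n_drives fails (independent failures assumed)."""
    return 1 - (1 - p_drive) ** n_drives

# RAID 5, 3+1: one of the 4 drives fails, then one of the remaining 3 fails.
p_first = p_any_failure(P_DRIVE, 4)
p_second = p_any_failure(P_DRIVE, 3)
p_loss_r5_3p1 = p_first * p_second

print(f"first failure:  {p_first:.4e}")
print(f"second failure: {p_second:.4e}")
print(f"set loss:       {p_loss_r5_3p1:.3e}")
```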

For the RAID 5, 7+1:

Chance of none of the 8 failing is 'chance of not failing' ^ 8.

So 99.923%.

Chance of none of the remaining 7 drives failing, in the same 96 hour window, is:

'chance of not failing' ^ 7.

So 99.932%

7.677 x 10^-4 x 6.718 x 10^-4 = 5.158 x 10^-7.

A 52 in 100,000,000 chance of occurring.

And for the RAID 6, 6+2:

First drive (none of 8 fail): 99.92322580%

Second drive (none of 7 fail): 99.93281935%

Third drive (none of 6 fail): 99.94241382%

Which means RAID 6, 6+2 has a 2.97E-10 chance of that scenario.
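The same chain generalises to any number of tolerated failures. Below is a minimal Python sketch under the same assumptions as the working above (independent failures, all inside one 96 hour window). `p_set_loss` is an illustrative name, not anything standard.

```python
def p_set_loss(p_drive: float, set_size: int, failures_to_lose: int) -> float:
    """Chance that `failures_to_lose` drives fail one after another in one set,
    each time picking from the drives still alive (independent failures assumed)."""
    p = 1.0
    for already_failed in range(failures_to_lose):
        survivors = set_size - already_failed
        p *= 1 - (1 - p_drive) ** survivors
    return p

P_DRIVE = 96 / 1_000_000  # per-drive chance of failing in one 96 hour window

r5_3p1 = p_set_loss(P_DRIVE, 4, 2)  # RAID 5 dies on the 2nd failure
r5_7p1 = p_set_loss(P_DRIVE, 8, 2)
r6_6p2 = p_set_loss(P_DRIVE, 8, 3)  # RAID 6 dies on the 3rd failure

print(f"R5 3+1: {r5_3p1:.3e}")
print(f"R5 7+1: {r5_7p1:.3e}")
print(f"R6 6+2: {r6_6p2:.3e}")
```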

Now, that's where I get stuck - on the face of it, R6 seems somewhere around 1000x more reliable than RAID 5 (roughly 370x better than 3+1, and roughly 1700x better than 7+1).

If you multiply out across 240 drives, you've 60 4 drive sets, and 30 8 drive sets.

I think you can apply the same rationale to that:

Probability of failure is 1 - ( 1 - one set ) ^ number of sets.

So 240 drives:

R5, 3+1 = 6.63E-006

R5 7+1 = 1.55E-005

R6 6+2 = 8.91E-009
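For anyone who wants to rerun the 240 drive figures, here is a self-contained Python sketch. It recomputes the per-set numbers from scratch rather than reusing the rounded ones, and uses `math.log1p` / `math.expm1` only to keep precision with such tiny probabilities. The layout table and helper names are assumptions for illustration.

```python
import math

P_DRIVE = 96 / 1_000_000  # per-drive chance of failing in one 96 hour window
TOTAL_DRIVES = 240

def p_set_loss(set_size, failures_to_lose):
    """Chance one set suffers enough failures in a single window to be lost."""
    return math.prod(
        1 - (1 - P_DRIVE) ** (set_size - k) for k in range(failures_to_lose)
    )

def p_any_of(p_one, count):
    """1 - (1 - p_one) ** count, computed stably for very small p_one."""
    return -math.expm1(count * math.log1p(-p_one))

# name -> (drives per set, failures needed to lose the set)
layouts = {"R5 3+1": (4, 2), "R5 7+1": (8, 2), "R6 6+2": (8, 3)}

results = {}
for name, (size, k) in layouts.items():
    n_sets = TOTAL_DRIVES // size
    results[name] = p_any_of(p_set_loss(size, k), n_sets)
    print(f"{name}: {n_sets} sets, {results[name]:.2e}")
```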

Now, the bit where I get a bit stuck - rolling the time window over 3 years. We're talking about a Poisson distribution (I think?). Can I just take my '96 hour' chance of failure, and do compound probability?

Taking the R6, 6+2 scenario - 3 years = 26280 hours.

Our number is per 96 hours - of which there are 273 whole chunks.

So ... 1 - ( 1 - 8.91 E-009 ) ^ 273

= 2.43E-006

So, 2 in a million chance of having a really really bad week.
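On the Poisson question, a minimal Python sketch comparing the two approaches for the R6, 6+2 figure. It assumes the 96 hour windows are independent and don't overlap, which is a simplification. For probabilities this small the compound form and the Poisson form should agree to many decimal places.

```python
import math

P_WINDOW_R6 = 8.91e-9      # R6 6+2, 240 drives, per 96 hour window (from above)
HOURS = 3 * 365 * 24       # 26280 hours in 3 years
WINDOWS = HOURS // 96      # 273 whole windows

# Compound probability: 1 - (1 - p) ^ n, computed stably for tiny p.
compound = -math.expm1(WINDOWS * math.log1p(-P_WINDOW_R6))

# Poisson view: expected number of loss events is n * p,
# and P(at least one event) = 1 - exp(-n * p).
poisson = -math.expm1(-WINDOWS * P_WINDOW_R6)

print(f"windows:  {WINDOWS}")
print(f"compound: {compound:.3e}")
print(f"poisson:  {poisson:.3e}")
```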

Does my number crunching work out correctly though?

R5, 3+1 = 1.81E-3

R5, 7+1 = 4.22E-3
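To check the 3 year figures for all three layouts side by side, here is a small Python sketch that starts from the 240 drive per-window numbers quoted earlier (6.63E-006, 1.55E-005, 8.91E-009) and rolls each over 273 windows. The dictionary names are hypothetical, and it carries the same independence assumptions as the rest of the working.

```python
import math

WINDOWS = 273  # whole 96 hour windows in 3 years (26280 hours)

# Per-window loss probabilities for 240 drives, as quoted earlier.
per_window = {"R5 3+1": 6.63e-6, "R5 7+1": 1.55e-5, "R6 6+2": 8.91e-9}

# 1 - (1 - p) ^ WINDOWS for each layout, computed stably for tiny p.
three_year = {
    name: -math.expm1(WINDOWS * math.log1p(-p)) for name, p in per_window.items()
}

for name, p in three_year.items():
    ratio = p / three_year["R6 6+2"]
    print(f"{name}: {p:.2e}  ({ratio:,.0f}x the R6 6+2 risk)")
```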

So ... looking at it, R6 - in terms of pure reliability - is on the order of a thousand times safer than R5 in either configuration.

The tradeoff would be performance - RAID 6 carries a write penalty - it must perform extra reads and writes to update both parity blocks on each write - which is higher than it would be with RAID 5 (the usual rule of thumb is 6 back-end I/Os per small write against 4 for RAID 5 - so roughly two-thirds of the small-write performance).