Performance metrics and analyzing a lot of data. - Ed's journal
At the moment, I'm trying to do some data analysis.
I have a storage array. In this storage array are somewhere around 15,000 logical devices, 1200 physical drives, and a whole assortment of other redundant subcomponents.

For all the components in there, I have a list of assorted performance metrics.
ios per second, average io size, sampled average read/write time, reads per sec, writes per sec... well, yeah.

Problem is, it's a lot of devices, and a lot of metrics. I can draw pretty pictures of any one of these metrics quite easily - but that doesn't necessarily tell me what I need to know.
When I'm troubleshooting, I need to try and get a handle on ... well, quite _why_ the response time of a device increased dramatically.

So I thought what I'd do is ... some kind of way of working out correlation across the data set.
My life is made simpler by every one of my metrics having been sampled at a defined frequency - so they, in a sense, all line up.

So far, I'm going down the line of 'smoothing' the data set as a moving average (http://www.mail-archive.com/rrd-users@lists.oetiker.ch/msg02018.html) and then comparing that to the original.

The idea being that a 'bit of wiggle' won't make much odds, and nor will an upward curve during the day - but a step change will cause a deviation, depending upon the weighting of the smoothing function.
From there, I'm thinking I take the deviation, square it - to 'amplify' differences, and then apply a threshold filter - so any small-ish deviations disappear entirely.
Now so far, that is giving me what I want - I'm thinking that I can now start ... figuring out when a 'deviation' begins, and how long it lasts from that - and try to match that pattern against another metric - look for other variances that fit a similar profile - starting concurrently, and ideally lasting about the same sort of time.

I'm matching against the longest duration, because I figure that that's most likely to be matching up if the two deviations are correlated.
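
A rough sketch of the steps described above, in Python rather than perl purely for brevity. Everything here is an assumption made for illustration: the trailing (rather than centred or weighted) moving average, the function names, the window size, the threshold and the `max_start_gap` tolerance are all hypothetical, not what the original script does.

```python
def moving_average(values, window):
    """Trailing moving average; early points average over what is available."""
    out = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]  # drop the sample that left the window
        out.append(total / min(i + 1, window))
    return out


def deviation_runs(values, window, threshold):
    """Return (start_index, length) for each run where |value - smoothed| > threshold."""
    smoothed = moving_average(values, window)
    runs = []
    start = None
    for i, (v, s) in enumerate(zip(values, smoothed)):
        flagged = abs(v - s) > threshold
        if flagged and start is None:
            start = i                      # a deviation begins here
        elif not flagged and start is not None:
            runs.append((start, i - start))  # deviation ended; record its duration
            start = None
    if start is not None:                  # series ended while still deviating
        runs.append((start, len(values) - start))
    return runs


def matching_runs(runs_a, runs_b, max_start_gap=1):
    """Pair runs from two metrics whose start indices are within max_start_gap samples."""
    return [(a, b) for a in runs_a for b in runs_b
            if abs(a[0] - b[0]) <= max_start_gap]


# Hypothetical data: a step change at sample 20 in both metrics.
response_time = [10] * 20 + [50] * 20
controller_throughput = [100] * 20 + [180] * 20

runs_rt = deviation_runs(response_time, window=8, threshold=5)
runs_ct = deviation_runs(controller_throughput, window=8, threshold=5)
print(runs_rt, runs_ct, matching_runs(runs_rt, runs_ct))
```

Note that with a trailing average the flagged run only lasts until the window has caught up with the new level, so the 'duration' measured this way depends on the window size as much as on the underlying event.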

Er. But I suspect I'm re-inventing the wheel a bit here, because I'm vaguely remembering some bits of stats and maths, but not really enough detail to remember what it's called, and what I need to look up further.

So ... anyone able to help out, and point me in the right direction? Bonus points if it gives a really neat method of linking up metrics with a vaguely useful level of reliability.

What I'm ideally wanting to do is be able to match e.g. an increased response time on a device, with an increased throughput on a disk controller, or elevated seek activity on a set of disks - such that I can in theory filter down my data to a level where I've got a fairly good idea which bits are correlated, and then I can try and figure out where the root cause lies.

Oh, and the other question is - are there any massive flaws in my logic, that mean I'm wasting my time?

Oh, and I should add - I get to use perl for this - that's 'approved' software, but the list of approved stuff is quite short. I'm also talking around 500Mb of comma separated values (80Mb compressed).
10 comments
From: queex Date: March 15th, 2010 11:51 pm (UTC)
If you're applying a threshold, squaring the deviation does nothing in particular.

Essentially, what you have is a series of indicator values that represent abnormal conditions for each drive and metric, and you want some form of signalling when a number of metrics for a drive start to go out of whack.

It's a subtle problem, and there are 3 distinct parts to it:
a) what constitutes a deviation (filtering out transient disturbances)
b) how many indicators deviating constitutes an overall deviation
c) a means of tracking all of the above to perform retrospective analysis.

The approach I'd take (primarily because I'm familiar with it) would be dynamic linear models. They're kind of a generalisation of MA and ARIMA (Autoregressive Integrated Moving Average) models. Their advantage would be that they're Bayesian models that would adapt to the data with a little lead time and follow gradual changes without any manual intervention. One of the values the model generates can be interpreted as deviation from the standard and I *think* there are established methods for tracking cumulative deviation for exactly this kind of analysis. I don't have my copy of West and Harrison with me, but I'll try to remember to look through it tomorrow.
From: sobrique Date: March 16th, 2010 07:44 am (UTC)
Ah lovely, that's something I can start with at least.
I think my thinking in the squaring was... that a larger step would provide a proportionally larger square. But I see what you mean - the only thing that square is doing is essentially square-rooting the threshold.
Well, and making all my numbers positive.

I think I was remembering something about least squares fit, but that's not really relevant here ;p. Perhaps taking a cumulative sum of a block, vs. a cumulative square sum - the latter would reflect a 'spike' more than a smaller, longer divergence.
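
To make that concrete, here is a small sketch (Python for brevity; the function names and the sample numbers are made up for illustration). It checks that thresholding the square flags the same samples as thresholding the absolute value at the square root, and shows how a sum of squares separates a short spike from a longer, smaller divergence that has the same plain sum.

```python
import math


def same_flags(deviations, threshold_sq):
    """True if thresholding d*d at T flags the same samples as |d| at sqrt(T)."""
    by_square = [d * d > threshold_sq for d in deviations]
    by_abs = [abs(d) > math.sqrt(threshold_sq) for d in deviations]
    return by_square == by_abs


def block_scores(deviations):
    """Return (sum of absolute deviations, sum of squared deviations) for a block."""
    abs_sum = sum(abs(d) for d in deviations)
    sq_sum = sum(d * d for d in deviations)
    return abs_sum, sq_sum


spike = [0, 0, 12, 0, 0, 0]   # one large excursion
drift = [2, 2, 2, 2, 2, 2]    # small but sustained divergence

print(same_flags([-7, -3, 0, 2, 5, 9], 16))
print(block_scores(spike))    # same absolute sum as drift, much larger square sum
print(block_scores(drift))
```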

And yes, what constitutes a deviation is the one I'm wrestling with. And may mean I'm just barking up the wrong tree from the start.

We shall see. I'll have a look at dynamic linear models and see what I can draw from it.
From: queex Date: March 16th, 2010 11:55 am (UTC)
In fact there are ways of using cumulative Bayes factors to detect breakdowns in predictive performance.

'Bayesian Forecasting and Dynamic Models 2nd Ed.' West & Harrison (1999) Springer

is the touchstone, but it might be heavy going. There is a DLM package for R, but if you're not familiar with R that's little help.

Possibly the best way to start would be to set up a DLM for the individual metrics and see if that works tolerably well first. It might be handy to work with historical data with a known deviation or two somewhere in it while you're exploring the problem.

(Actually, one thing that's handy about the Bayes factors approach is that it's quantitative - so you can set your detection threshold equal to 'drive explodes in a ball of flame' and have a means of equating that level of failure with performance degraded by X over a span of time Y.)
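
For a feel of what a DLM on a single metric might look like, here is a minimal sketch of the simplest case, a local level (first-order polynomial) model, written in Python for brevity. It assumes the observation and state variances are known constants, which a real analysis would have to estimate or discount; the function name, parameter names and numbers are hypothetical. It only produces the standardised one-step forecast error, which is the 'deviation from the standard' quantity; the cumulative Bayes factor monitoring itself is not shown.

```python
import math


def local_level_filter(ys, obs_var, state_var, m0, c0):
    """Local level DLM with known variances.

    Returns a list of (forecast, standardised one-step forecast error),
    one pair per observation.
    """
    m, c = m0, c0                      # posterior mean and variance of the level
    out = []
    for y in ys:
        r = c + state_var              # prior variance of the level at this step
        f = m                          # one-step-ahead forecast
        q = r + obs_var                # forecast variance
        e = y - f                      # forecast error
        a = r / q                      # adaptive coefficient (gain)
        m = m + a * e                  # updated level
        c = a * obs_var                # updated level variance
        out.append((f, e / math.sqrt(q)))
    return out


# Hypothetical metric: flat at 10, then a step up to 20 at sample 30.
series = [10.0] * 30 + [20.0] * 5
results = local_level_filter(series, obs_var=1.0, state_var=0.01, m0=10.0, c0=1.0)
for i in (29, 30, 31):
    print(i, results[i])
```

Large standardised errors mark the samples the model did not see coming, and the forecast then adapts towards the new level at a rate set by the ratio of the two variances.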
From: mister_jack Date: March 16th, 2010 09:25 am (UTC)
How often are you sampling? If the sampling interval is small relative to the period you wish to recognise problems in, just perform a Student's t-test comparing the last 10-20 records to the 10-20 before that, set it to notify on a very low probability (say p < 0.005 or so) and Robert is very much your Mother's Brother.

Although depending on the fail curve you may want to compare to a base line rather than to the last reading. If you do baseline it, use a bigger sample size for that baseline.
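
A minimal sketch of that comparison, in Python for brevity. It computes the pooled-variance (Student's) t statistic for the last `n` records against the `n` before them and compares it to a fixed critical value instead of computing a p-value, since that needs the t distribution. The function names are hypothetical, and the default `critical_t=3.2` is an assumption: it is roughly the two-sided p < 0.005 cut-off for 18 degrees of freedom as read from standard t tables, and is worth double-checking before relying on it.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(recent, previous):
    """Student's t statistic (pooled variance) for two samples."""
    n1, n2 = len(recent), len(previous)
    pooled_var = ((n1 - 1) * variance(recent) +
                  (n2 - 1) * variance(previous)) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    diff = mean(recent) - mean(previous)
    if std_err == 0:
        # Both samples are perfectly flat: any difference is 'infinitely' significant.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / std_err


def step_detected(series, n=10, critical_t=3.2):
    """Compare the last n records with the n before them; True if |t| exceeds critical_t."""
    if len(series) < 2 * n:
        return False                    # not enough history yet
    previous = series[-2 * n:-n]
    recent = series[-n:]
    return abs(pooled_t_statistic(recent, previous)) > critical_t


# Hypothetical response times: a noisy baseline, then the same noise shifted up by 5.
baseline = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
shifted = [x + 5 for x in baseline]
print(step_detected(baseline + shifted))
print(step_detected(baseline + baseline))
```

For the baseline variant, `previous` would be replaced by a larger, fixed reference sample, and the critical value changes with the degrees of freedom.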
From: queex Date: March 16th, 2010 12:15 pm (UTC)
You might have to tune that approach to deal with 'normal' fluctuations without missing key ones, but it's an easy option to try out to see if it works.

I envisaged Ed wanting to detect something like

50% deviation now
20% deviation over the last 50 samples
10% deviation over the last 100 samples

more-or-less equally, which a single t-test might struggle with. You could try having a series of tests for each metric to help with that.
From: mister_jack Date: March 16th, 2010 12:38 pm (UTC)
Without knowing the data set it's hard to say, but since what a t-test tests is specifically whether the two samples are drawn from the same or different populations it seems a natural candidate.

Especially as it's pretty easy to implement.
From: sobrique Date: March 16th, 2010 11:01 pm (UTC)
Well, if you want a copy of the data :).
But ... I'm mostly trying to think of 'some kind' of enhanced analysis that allows me to link together related stats.
Sampling frequency is 15m through the day, so I don't have all _that_ many data points.
From: mister_jack Date: March 16th, 2010 12:02 pm (UTC)
Oh, and "Oh, and I should add - I get to use perl for this - that's 'approved' software, but the list of approved stuff is quite short. I'm also talking around 500Mb of comma separated values (80Mb compressed)." - is Excel on your list of approved software? It has remarkably powerful built-in statistical analysis which would take a lot of the donkey work out of your implementation.
From: sobrique Date: March 16th, 2010 12:05 pm (UTC)
Excel is also allowable.
Although, it tends to throw up when I try and stuff 500Mb of assorted data down its neck, which is why I've started out with perl.
From: queex Date: March 16th, 2010 12:17 pm (UTC)
Excel really, really hates large data sets. Yesterday it took four hours to do a simple task that R would have done in less than a minute, consuming 100% of both processors all the time.