The large filesystem problem - Ed's journal
I'm currently musing on a difficult problem: given a large storage estate, which contains some large filesystems, what is an efficient way to process 'the whole lot'?
As an illustrative case, take virus scanning. It's desirable to periodically scan 'everything'. There are other scenarios too, such as backups, accounting and probably a few others.
But it's led me to consider it - given an order of magnitude of a petabyte, distributed over a billion or so files, what is an efficient way to do it?
Again, take the same illustrative case: a virus scanner that can process 100k files per hour. At that rate, you're looking at 10,000 hours - or a little over a year. Even if you could keep a system doing that all the time, you're still faced with - potentially - having to keep track of how far you got, on something that's changing as you go.
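The arithmetic behind that estimate:

```python
# Back-of-envelope numbers from the text: a billion files at an
# illustrative scan rate of 100k files per hour.
files_total = 1_000_000_000
rate_per_hour = 100_000

hours = files_total / rate_per_hour
print(hours)                # 10000.0 hours
print(hours / (24 * 365))   # ~1.14 - a little over a year
```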

So with that in mind, I'm thinking about ways to scale the problem. The good bit is, as you end up with substantial numbers, you also have a lot of infrastructure to make use of - you can't physically get to a petabyte without a lot of spindles and controllers. And that usually means array-level readahead caching too.
Which means optimally you'll 'go wide' - try to make use of every spindle and every controller at once, ideally whilst maximising readahead efficiency and minimising contention. (And of course, given the timescale, you almost certainly have to 'fit in' with a real production workload, including backups.)

The problem can be simplified to: given a really large directory structure, what's an efficient way to traverse it and break it down into 'bite size pieces'? Again, following on from the virus checking example - maybe you want to break it down into '100k file' pieces, because then each chunk is about an hour of processing, which can be queued and distributed. You then scale this by taking each filesystem as a standalone object, to be traversed and subdivided.
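The 'bite size pieces' idea can be sketched as a walk that yields fixed-size batches. A minimal sketch, assuming a POSIX-style tree and using Python's os.walk as a stand-in for find; the batch size is the illustrative 100k from the text:

```python
import os

def batches(root, batch_size=100_000):
    """Walk `root` in a deterministic order, yielding lists of at most
    `batch_size` file paths. Each batch is a self-contained unit of
    work that can be queued out to a scanner."""
    batch = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()            # fix the traversal order
        for name in sorted(filenames):
            batch.append(os.path.join(dirpath, name))
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch                # final, possibly short, batch
```

Each yielded batch is then roughly an hour of scanner work at the rate above, and can be handed off to whatever does the actual processing.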

You may also end up having to do something similar in future too - again, with virus checking you probably want to repeat the process, but you can then apply some sort of incremental checking (e.g. checking file modification times perhaps, although that may be unwise unless you can verify that the file actually is unchanged).
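A sketch of that mtime-based incremental filter, with the caveat from the text carried in as a comment; `changed_since` and its arguments are illustrative names, not an existing tool:

```python
import os

def changed_since(paths, last_scan_time):
    """Yield the paths whose mtime is newer than the last scan.

    Caveat: mtime can be set arbitrarily (and some tools preserve it),
    so a stronger check - size plus checksum, say - may be needed
    before trusting this as 'unchanged'.
    """
    for p in paths:
        try:
            st = os.stat(p)
        except FileNotFoundError:
            continue                 # deleted since the walk
        if st.st_mtime > last_scan_time:
            yield p
```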

The other part of the problem is that you can't easily maintain a long list of 'every file' - for starters, you already essentially have one: it's called 'your filesystem'. And otherwise you're looking at a billion-record database, which is also... well, a different scale of problem.

So I've started reading about Belief Propagation (https://en.wikipedia.org/wiki/Belief_propagation) - but what I'm thinking of in terms of approach is, essentially, to use checkpoints to subdivide a filesystem. You use a recursive traversal (similar to Unix's 'find'), but you work with a 'start' and an 'end' checkpoint: skip everything until 'start', then process and batch everything up until the 'end' checkpoint.
Ideally, you'll measure the distance between your checkpoints as you go, and 'mark off' each time you complete a batch.
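A minimal sketch of the start/end checkpoint traversal, assuming checkpoints are plain file paths and the walk is made deterministic by sorting directory entries (so two runs visit files in the same order):

```python
import os

def scan_range(root, start=None, end=None):
    """Yield file paths between two checkpoints.

    The walk visits files in a fixed (sorted) order; everything up to
    and including the `start` path is skipped, and the walk stops after
    the `end` path. Re-running with the last path processed as the new
    `start` resumes where the previous batch left off.
    """
    started = start is None
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # deterministic order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not started:
                started = (path == start)    # skip until the start checkpoint
                continue
            yield path
            if path == end:                  # end checkpoint is inclusive
                return
```

With no checkpoints given it degenerates to a full traversal; with both, it is one 'bite size piece'.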

For the sake of parallelising and distributing, though - I'm thinking that given you _can_ tell the number of inodes allocated in a filesystem (which approximates the number of files), you can then tell how many 'checkpoints' you would need within that filesystem. At which point you start traversing downwards, in depth order, until you get a number of directories in the right order of magnitude - and use each of those as your first set of checkpoints. As you run, redistribute the checkpoints by simply taking - for a batch size of n - a new checkpoint every n/2 files, and if the distance from a checkpoint to the next is less than n/2, simply delete it. That should mean your 'checkpoints' end up between n/2 and n apart. There'll be some drift between iterations, but as long as it stays within the same order of magnitude, that doesn't matter overly.
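One reading of that n/2 rule, sketched over file counts rather than paths (positions here are indices into the traversal order, and 'drop the last checkpoint if the tail is short' is my interpretation of deleting checkpoints that are too close together):

```python
def make_checkpoints(file_count, n):
    """Place a checkpoint every n//2 files, then drop the last one if
    it would leave a final gap shorter than n//2. The surviving gaps
    all fall between n//2 and n, so every batch is within a factor of
    two of the target size."""
    half = n // 2
    cps = list(range(half, file_count, half))
    if cps and file_count - cps[-1] < half:
        cps.pop()      # merge the short tail into the previous batch
    return cps
```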
Start 'finding', accumulate 'a few' batches, and then leave them to be processed while moving on to a different 'part' of the storage system to do the same (separate server, location, whatever). You don't want your search to get too far ahead of your processing - you're probably looking at memory-buffering your batches, and having too much buffered is a waste.
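Keeping the search from outrunning the processing is what a bounded queue gives you for free: once the buffer is full, the walker blocks until a worker catches up. A sketch, where `batch_iter` is any source of batches and `scan` stands in for whatever the real per-batch processor is:

```python
import queue
import threading

def run(batch_iter, scan, workers=4, max_buffered=8):
    """Feed batches from `batch_iter` to `workers` scanner threads,
    buffering at most `max_buffered` batches in memory."""
    q = queue.Queue(maxsize=max_buffered)

    def worker():
        while True:
            batch = q.get()
            if batch is None:      # sentinel: no more work
                return
            scan(batch)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for batch in batch_iter:
        q.put(batch)               # blocks while the buffer is full
    for _ in threads:
        q.put(None)                # one sentinel per worker
    for t in threads:
        t.join()
```

The maxsize is the knob: small enough that you never hold much more than 'a few' batches in memory, large enough that workers are never starved.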

But it'll always be a bit of a challenge - fundamentally, there's only so fast you can move substantial volumes of data.