Perl - finding duplicate files
Because I imported a bunch of files from various directories, I wasn't quite sure whether I'd duplicated my photo collection.

The solution? Well, there are probably quite a few. But here's one in Perl.

First, download ActiveState Perl:
http://www.activestate.com/activeperl/downloads

Then, open your favourite text editor (I'm starting to quite like TextPad - http://www.textpad.com/download/ ), but Notepad will do just fine.

Place into it the following code:


#!/usr/bin/perl

use strict;
use warnings;

#File::Find - a very easy way to walk a directory tree.
use File::Find;

#Does MD5sum type things. Again, why make it hard?
#qw means 'quoted words' - it builds a space-delimited list
#(and with only one word here, it makes no difference).
#We write 'qw( md5_hex )' because we want to import the 'md5_hex'
#subroutine from the module Digest::MD5.
use Digest::MD5 qw( md5_hex );
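#For example, these two lines mean exactly the same thing:
#  use Digest::MD5 qw( md5_hex );
#  use Digest::MD5 ( 'md5_hex' );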

my %filehashes;  #this is our 'database' of file fingerprints that we compare
my @dupe_warn;  #this gets printed at the end, with all the duplicates. 

#add more to this list, or edit as you wish. 
my @paths_to_process = ( 'C:\Users\Ed\Pictures',
                         'C:\Users\Public' );

my $output_filename = 'duplicate_files.txt';

#File::Find runs this subroutine for each file it finds.
#It takes a fingerprint of the file, then looks for duplicates.
sub process
{
  my $full_path = $File::Find::name;
  
  #skip if this is a directory.
  return if ( -d $full_path );

  #otherwise open it for reading - and skip it if we can't.
  open ( my $file_h, "<", $full_path ) or do { warn "$full_path: $!"; return; };

  #read raw bytes - photos are binary, and without binmode, Windows
  #translates line endings and gives the wrong fingerprint.
  binmode($file_h);

  #take the MD5 digest of all of this file.
  my $md5sum = md5_hex(<$file_h>);
  close ($file_h);

  #print it. 
  print "$full_path : $md5sum\n";
  #See if that fingerprint exists already
  if ( $filehashes{$md5sum} )
  {
    #and if it does, print a message, and stuff it in the 'dupe warn' list
    print "-- $full_path is probably a dupe of $filehashes{$md5sum} ($md5sum)\n";
    push ( @dupe_warn, "$full_path is probably a dupe of $filehashes{$md5sum} ($md5sum)" );
  }
  #either way, remember this fingerprint and the path we saw it at.
  $filehashes{$md5sum} = $full_path;
} #end of 'process' subroutine. 


#main bit of program.
foreach my $dir_to_search ( @paths_to_process )
{
  #run a find. And tell it to run the subroutine 'process' on every file it finds. 
  find(\&process, $dir_to_search );
}

print "Duplicates found:\n";
print join ("\n", @dupe_warn);

#and in case you double clicked on this file - and the output disappeared right after running it -
#we put it in a file - $output_filename - for future reference.
open ( my $output_filehandle, ">", $output_filename ) or warn $!;
#plain "\n" is fine here - Perl's text mode on Windows writes CRLF for us.
print $output_filehandle join("\n", @dupe_warn), "\n";
close ( $output_filehandle ); 
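
One thing worth knowing: md5_hex(<$file_h>) reads the entire file into memory before fingerprinting it, which is fine for photos but gets slow and hungry on very large files. Digest::MD5 also has an object interface whose addfile method streams the file instead. A minimal sketch of the same fingerprinting step done that way (a drop-in for the open/md5_hex lines above):

use Digest::MD5;

open ( my $file_h, "<", $full_path ) or do { warn "$full_path: $!"; return; };
binmode($file_h);
#addfile reads the file in chunks and feeds them to the digest as it goes.
my $md5sum = Digest::MD5->new->addfile($file_h)->hexdigest;
close ($file_h);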


You'll - probably - need to edit '@paths_to_process'.
In Perl, that's a list. Lists are values separated by commas, with a round bracket on each end.
We use single quotes because \ has a special meaning inside double-quoted strings, and we just want the literal text.
You could therefore do
@paths_to_process = ( 'C:\\' );

(Note the doubled backslash - even inside single quotes, a lone \ right before the closing quote would escape the quote and break the string.)

and this would do your whole C drive. I wouldn't suggest that as a good idea, as it'll take a long time (because it has to open and read every file on your hard disk).
So I'd suggest sticking with directories where you know you've got stuff that might be duplicated (e.g. pictures directories - though the script doesn't really care what the file type is). There's a short example of path styles just below.
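
For instance (these particular directories are made up - use your own):

@paths_to_process = ( 'C:\Users\Ed\Pictures',
                      'D:\Backups\Photos',
                      'C:/forward/slashes/work/too' );

Perl on Windows accepts forward slashes in paths as well, which saves worrying about backslash escapes entirely.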

Anyway - then save that as 'duplicate_finder.pl' (or anything you like, basically, as long as it ends in '.pl', which tells Windows to hand the file to Perl when you double click it). I'd suggest running it from a command prompt, but that's personal taste. (It prints text, so if you double click it the window will probably vanish afterwards - but don't worry, there'll be a text file there with the results.)
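
From a command prompt, you just hand the script to Perl. The output below is made up to show the shape of things - your paths and checksums will differ:

C:\Users\Ed> perl duplicate_finder.pl
C:\Users\Ed\Pictures\img001.jpg : 9e107d9d372bb6826bd81d3542a419d6
C:\Users\Ed\Pictures\copy of img001.jpg : 9e107d9d372bb6826bd81d3542a419d6
-- C:\Users\Ed\Pictures\copy of img001.jpg is probably a dupe of C:\Users\Ed\Pictures\img001.jpg (9e107d9d372bb6826bd81d3542a419d6)
...
Duplicates found:
C:\Users\Ed\Pictures\copy of img001.jpg is probably a dupe of C:\Users\Ed\Pictures\img001.jpg (9e107d9d372bb6826bd81d3542a419d6)

The same list also ends up in duplicate_files.txt.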