I’ve recently been spending some time recovering from yet another hard disk crash, and as part of that I’m taking advantage of my ages-old approach to storing photographs: using the wonders of jhead and a few other tidbits, everything is stored in a filesystem tree with nested folders for year and month, such as this one:
Photos-+-2001-+-01-+-200101012230.jpg
       |      |    |
       |      |    +-SHA1SUMS
       .      |
       .      +-02-+-200102010830.jpg
       .      .    .
This results in a simple, straightforward structure that is easy to navigate and archive and a unique filename at the end of the pathname, which also helps considerably (YYYY/MM/YYYYMMDDHHMMSS.foo).
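As a rough illustration of the naming scheme, here is a minimal sketch of how the relative path for a photo could be derived from its timestamp. The `photo_path` helper is hypothetical and only there to show the layout; it assumes you already have the capture time in hand, and it is not how the actual renaming is done.

```python
import os
from datetime import datetime

def photo_path(taken, ext=".jpg"):
    """Return the relative YYYY/MM/YYYYMMDDHHMMSS.ext path for a timestamp."""
    return os.path.join(
        taken.strftime("%Y"),               # year folder
        taken.strftime("%m"),               # month folder
        taken.strftime("%Y%m%d%H%M%S") + ext  # unique filename
    )

# e.g. 2001/01/20010101223000.jpg with Unix-style path separators
print(photo_path(datetime(2001, 1, 1, 22, 30, 0)))
```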
And yes, I know that iPhoto and suchlike will do a (reasonably) decent job of managing my photos, and I have used it since the days when it was incapable of handling more than a couple of thousand images. The problem here is long-term storage and archival, and iPhoto can work off such a filesystem tree without messing around with my originals too much.
For each folder, I’ve so far been using an MD5SUMS file, which helps me ensure that when I back this up to DVD (or, as is the case, try to save my files from a bad disk) the data I’m getting back is what I saved in the first place.
Thing is, times have moved on and now SHA1 is the thing to use, but taking the long view (i.e., more than a couple of years), I’ve seen both md5 and sha1 utilities that go and do their own thing regarding storing digest files and whatnot (from using parentheses to a lot of extraneous junk around both the digest and the filename), so I decided to keep mine simple:
hexdigest filename
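To make that format concrete, here is a small sketch of writing and reading one index line. The `format_entry` and `parse_entry` names are made up for illustration; the parser assumes the only oddities it needs to tolerate are one or two separating spaces and the `*` that some md5sum-style tools put in front of the filename.

```python
def format_entry(hexdigest, filename):
    """Build one index line in the plain 'hexdigest filename' format."""
    return "%s %s\n" % (hexdigest, filename)

def parse_entry(line):
    """Split an index line into (hexdigest, filename)."""
    # split only on the first run of whitespace so filenames may contain spaces
    hexdigest, filename = line.strip().split(None, 1)
    # drop the leading '*' some tools use to mark binary mode
    return hexdigest, filename.lstrip("*")
```

Splitting on the first run of whitespace only means a filename with spaces in it survives a round trip.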
Furthermore, since OSX has shown a regrettable tendency to either not include md5sum or to twiddle its output format over the years, I’ve decided to go and code my own set of utility functions in Python to compute and confirm the hash values – the trick here is to use mmap(), which makes it as fast as (if not faster than) C utilities.
Here’s the set of utility functions I’m using, which will take both kinds of hash indexes, compute sha1 for a given directory, and move out of the way for manual checking any files that don’t match either of the indexes for some reason:
#!/usr/bin/env python
# encoding: utf-8
"""
integritycheck.py

Created by Rui Carmo on 2009-10-24.
Published under the MIT license.
"""

import hashlib, os, sys, mmap, glob

testpath = '/Users/Shared/Pictures/Photos/2008/06'

hashes = {
    'md5': {'index': 'MD5SUMS', 'function': hashlib.md5},
    'sha1': {'index': 'SHA1SUMS', 'function': hashlib.sha1}
}

checkfolder = "Check"

def parsehashindex(path, compute=False):
    """Try to parse all known existing hash index files and build a data structure with them"""
    files = {}
    for h in hashes.keys():
        files[h] = {}
        hashfile = os.path.join(path, hashes[h]['index'])
        if os.path.exists(hashfile):
            # this rather naïvely assumes that we'll be getting pretty standard MD5SUM files
            # with a hash and a filename separated with one or two spaces and some spurious characters (like *)
            fh = open(hashfile)
            listing = [x.strip().replace("*", '').split(None, 1) for x in fh if x.strip()]
            fh.close()
            files[h] = dict([(x[1], x[0]) for x in listing])
        # now search for files we don't know about - assume we're always looking for files with extensions
        for f in glob.glob(os.path.join(path, '*.*')):
            i = os.path.basename(f)
            if i not in files[h]:
                if compute:
                    hexdigest = getsinglehash(f, h)
                    files[h][i] = hexdigest
                    # append to the index, storing the bare filename like the parsed entries
                    fh = open(hashfile, "a")
                    fh.write("%s %s\n" % (hexdigest, i))
                    fh.close()
                else:
                    files[h][i] = "#"  # this effectively flags the file as unverified for this hash function
    return files

def getsinglehash(filename, type):
    """Compute the hash value for a single file"""
    m = hashes[type]['function']()
    fh = open(filename, 'rb')
    fs = os.path.getsize(filename)
    if fs:  # mmap() refuses to map an empty file
        data = mmap.mmap(fh.fileno(), fs, mmap.MAP_PRIVATE, mmap.PROT_READ)
        m.update(data)
        data.close()
    fh.close()
    return m.hexdigest()

def checkhashes(path, hashdata):
    """Compute hashes of known files and confirm them against previously parsed hash data"""
    for h in hashdata.keys():
        print("Checking %s" % h)
        files = hashdata[h]
        for i in files.keys():
            f = os.path.join(path, i)
            try:
                hexdigest = getsinglehash(f, h)
            except IOError:
                print("Error trying to compute hash for %s" % os.path.basename(f))
                # not going to move the file since it may be a serious filesystem error and worth investigating further
                continue
            if hexdigest != files[i]:
                print("File %s has incorrect %s hash, moving it" % (i, h))
                # This is the most common case - most often there have only been EXIF tweaks
                # after the original index was created, so these should be easy to check
                if not os.path.exists(os.path.join(path, checkfolder)):
                    os.makedirs(os.path.join(path, checkfolder))
                # rename using the full path, so this works regardless of the current directory
                os.rename(f, os.path.join(path, checkfolder, os.path.basename(i)))

def computeSHA1SUMS(path):
    """This is a nice standalone function for folk to steal and use that demonstrates the mmap trick"""
    h = {}
    for f in glob.glob(os.path.join(path, '*.*')):
        (dummy, i) = os.path.split(f)
        s = hashlib.sha1()
        fh = open(f, 'rb')
        fs = os.path.getsize(f)
        if fs:  # mmap() refuses to map an empty file
            data = mmap.mmap(fh.fileno(), fs, mmap.MAP_PRIVATE, mmap.PROT_READ)
            s.update(data)
            data.close()
        fh.close()
        h[i] = s.hexdigest()
    fh = open(os.path.join(path, "SHA1SUMS"), "w")
    # write the index sorted by filename
    for i in sorted(h.keys()):
        fh.write("%s %s\n" % (h[i], i))
    fh.close()

if __name__ == '__main__':
    files = parsehashindex(testpath)
    checkhashes(testpath, files)
    computeSHA1SUMS(testpath)
These are a toolkit and not a finished solution, but they (and the amazing Index sheet view in Quicklook in Snow Leopard, which affords me effortless visual inspection of folders with hundreds of images) go a long way towards helping me make sure (or at least check that) my files are correctly preserved – and hopefully will be so for many years to come, regardless of storage.
One improvement I’ll be adding (besides better cleanup of the way I handle paths and deal with the problem of having non-indexed files in the same directory) is automated JPEG loading and checking, although (since my media collection has started including more and more movie files) that will necessarily be somewhat limited.
