Yojimbo on the Cheap

This post is part of my Data For The Ages category, which groups all my attempts at making sure the data I use will be minimally readable in the long term.

Update: Those in a hurry (and with some expertise) might want to play around with , which is my first pass at archiving web pages in a standard MIME container (including inline images).

Prompted by another round of enthusiastic posts regarding Yojimbo, and finding myself looking for a way to easily keep track of research material, I decided to take another look at it - which included exploring its innards a bit.

Under The Kimono

It turns out that Yojimbo stores all of its stuff in a SQLite database at /Users/you/Library/Application Support/Yojimbo/Database.sqlite, which is a clean, fast and native way to do this kind of thing on the Mac. And it ought to be portable across platforms - if you bother to tease out the schema, of course, as well as deal with the internal formats.

Using a SQLite browser (a demo version, limited to 20 rows per query), it took me no time at all to figure out that Yojimbo stores stuff like web pages as BLOBs, which has the advantages of being both simple to implement and retaining the original format in its maximum fidelity - but which is, well, as non-portable as you can possibly make it.

Data for the Ages? I don't think so.

You see, although it's perfectly feasible to throw a PDF file into Yojimbo and have some hope of getting it out later - since the binary format is more or less well known, and it should be readable ten years from now - I'm somewhat skeptical of the ability of any non-Apple software to read a "web archive" (which is not RFC 2387 compliant), or any of the other data you stick in it.

To Apple's credit, the .webarchive format seems to be reasonably straightforward - it is essentially the serialization of an NSDictionary, storing the HTML and all inline imagery, etc. as a binary property list (including some of the headers from the server for each chunk). It is, however, as proprietary as you can get.
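Since the container is a property list, any language with a plist parser can at least enumerate what is inside one. Here is a minimal sketch in Python, assuming the key names commonly seen in Safari web archives (WebMainResource, WebSubresources, WebResourceURL, WebResourceMIMEType, WebResourceData); those names are not taken from this post, so treat them as assumptions to check against your own files.

```python
import plistlib


def list_webarchive_resources(data: bytes):
    """Return (url, mime_type, size_in_bytes) for every resource in a .webarchive.

    Key names are assumptions based on commonly observed Safari archives.
    """
    # plistlib.loads auto-detects binary vs. XML property lists.
    archive = plistlib.loads(data)
    resources = [archive["WebMainResource"]]
    resources.extend(archive.get("WebSubresources", []))
    return [
        (
            res.get("WebResourceURL"),
            res.get("WebResourceMIMEType"),
            len(res.get("WebResourceData", b"")),
        )
        for res in resources
    ]
```

Feeding it the raw bytes of a .webarchive file (or of a BLOB pulled out of the database) should give a quick inventory of the HTML and images it carries.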

So, after having some fun throwing a few things into Yojimbo and dumping the ZBLOB.ZBYTES field to confirm my theory, I started wondering precisely what its added value was.
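For the curious, dumping that field does not need a GUI tool. The following is a rough sketch using Python's built-in sqlite3 module; it assumes only the ZBLOB table and ZBYTES column mentioned above (the rest of the schema is unknown to me here), and the output file naming is made up for illustration. Run it against a copy of the database, not the live file.

```python
import sqlite3
from pathlib import Path


def dump_blobs(db_path, out_dir, table="ZBLOB", column="ZBYTES"):
    """Write every BLOB in table.column to its own file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        # Only rows that actually hold binary data are selected.
        query = (
            f'SELECT rowid, "{column}" FROM "{table}" '
            f"WHERE typeof(\"{column}\") = 'blob'"
        )
        written = []
        for rowid, data in conn.execute(query):
            # Hypothetical naming scheme: <table>-<rowid>.bin
            target = out / f"{table}-{rowid}.bin"
            target.write_bytes(bytes(data))
            written.append(target)
        return written
    finally:
        conn.close()
```

Each resulting file can then be inspected by hand to see which format a given item was stored in.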

Which is pretty obvious, really - it's pretty damn simple to use, lets you tag and encrypt items, and lets you find things pretty fast (as well as arbitrarily slicing and dicing your data in views of your own choosing).

I didn't look into the encryption used, but I assume it's using the native AES stuff and see no reason to tinker with it, although I'm curious to see how it interacts with search (it shouldn't store cleartext indexes of encrypted data, for one thing).

So Yojimbo is fine if you only use Macs (which I don't, not exclusively), and will probably be more than enough for most people.

For me, however, the allure quickly faded away.

Why Use Another Wheel?

As it turns out, I've got two native applications that let me:

  • File, group and search arbitrary data
  • Store it centrally on a server
  • Flag items of interest
  • Share it in standard formats with other platforms (and people!)

And, with some add-ons, either of them can even do:

...provided, of course, you are willing to jump through some hoops.

And the first application is (you might be surprised to know)... - plus an IMAP server.

The Rationale

Having kept all my stuff in an IMAP store for years, I'm used to storing a bunch of stuff in it, as well as accessing and searching it from a number of mail clients - and provided you put stuff in the right way, I've found I can always get it out again regardless of the platform I'm using.

So, let's go through the list above, shall we?

  • Quicksilver's "Email To... (Compose)" makes it trivial to create a draft message from anything, which can then be saved, dragged and filed in an IMAP folder.
  • You can e-mail a PDF of anything straight from the Print dialog box.
  • Safari has a "Mail Contents Of This Page" option.

Regardless of how you create the draft, the act of "filing" itself can be performed and enhanced with a custom action that creates the message, saves it to the server (without actually e-mailing it anywhere, of course) and moves it somewhere else (I'm tinkering with that at the moment, and will post the results when I'm happy with them).
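The "save it to the server without e-mailing it" step maps onto the IMAP APPEND command, which any IMAP library exposes. This is not the custom action itself, just a hedged sketch of the idea in Python with imaplib; the mailbox name, host and credentials are placeholders, and the choice of flags is my own assumption.

```python
import imaplib
import time
from email.message import EmailMessage


def file_message(conn, mailbox, msg, flagged=False):
    """Store msg in an IMAP mailbox via APPEND; nothing is sent over SMTP.

    conn is a logged-in imaplib.IMAP4 (or IMAP4_SSL) connection.
    """
    flags = r"(\Seen \Flagged)" if flagged else r"(\Seen)"
    stamp = imaplib.Time2Internaldate(time.time())
    status, _ = conn.append(mailbox, flags, stamp, msg.as_bytes())
    return status == "OK"


# Usage (placeholders, not real settings):
#   conn = imaplib.IMAP4_SSL("imap.example.org")
#   conn.login("user", "password")
#   note = EmailMessage()
#   note["Subject"] = "Research note"
#   note.set_content("Something worth keeping.")
#   file_message(conn, "Archive", note, flagged=True)
```

Because the message never leaves the server it was appended to, it needs no recipient at all, only the headers you want to search on later.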

As to grouping, you can group stuff by folders (obviously) or create search folders. Spotlight indexes everything in my IMAP store (provided full copies of your messages are stored locally, of course), and flagging is, well - trivial.

As to sharing stuff with other platforms, the only thing you don't get right off the bat is - you guessed it - web archives.

The Missing Bits

Safari does have a "Mail Contents Of This Page" command, but that creates a mail message containing only the HTML (i.e., it does not include inline images as MIME parts) - it seems complete (and messages created in this way are readable by Thunderbird, for instance), but it's not suitable for long-term storage.

Well, as it happens, I read my RSS/Atom feeds as e-mail - i.e., has code to create a properly MIME-formatted message containing inline images (which I decode for my ).

It doesn't handle CSS (or Flash, or other embedded media), but then again, those are not usually the sort of things you want to file (and I'm not sure Safari's .webarchive deals with them flawlessly, which means Yojimbo will suffer from the same caveat).

But it shouldn't be much trouble to take that code and create a script that grabs the HTML content from a specific URL and e-mails it to my archive account directly (plus the requisite custom action, of course).
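To give an idea of what such a script has to produce, here is a minimal sketch (not the feed-reading code mentioned above) that wraps already-fetched HTML and images into a single multipart/related message using Python's email package. Fetching the page and finding its images is left out; the function name, the X-Archived-URL header and the naive string replacement of image URLs are all assumptions of mine.

```python
from email.message import EmailMessage
from email.utils import make_msgid


def build_archive_message(url, title, html, images):
    """Build a multipart/related message: HTML root part plus inline images.

    images maps each image URL as it appears in the HTML to (bytes, subtype),
    e.g. {"http://example.com/a.png": (data, "png")}.
    """
    msg = EmailMessage()
    msg["Subject"] = title
    msg["X-Archived-URL"] = url  # hypothetical header, for later reference

    # One Content-ID per image; the domain is fixed to avoid a hostname lookup.
    cids = {src: make_msgid(domain="archive.invalid") for src in images}
    for src, cid in cids.items():
        # Naive rewrite: every occurrence of the URL becomes a cid: reference.
        html = html.replace(src, "cid:" + cid[1:-1])

    msg.set_content(html, subtype="html")
    for src, (data, subtype) in images.items():
        msg.add_related(data, maintype="image", subtype=subtype, cid=cids[src])
    return msg
```

The result is a standard MIME container that mail clients on any platform should be able to render with its images, which is the whole point of the exercise.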

As to tagging, I have two obvious solutions:

  • Use (nice, -native and to store its tagging information in IMAP headers, although the base64 encoding makes it impossible to use for simple cross-platform searching...)
  • Store them in the Subject: field (easy to use with Spotlight searches, smart folders and just about anything else, can be done at creation time from a custom action)
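The Subject: approach only needs a convention that survives any mail client. A small sketch of one possible convention follows (square-bracketed tags appended to the subject); the bracket syntax and function names are purely illustrative, not something any of the tools above mandates.

```python
import re

# Matches "[tag]" where the tag contains no brackets itself.
TAG_RE = re.compile(r"\[([^\[\]]+)\]")


def tag_subject(subject, tags):
    """Append tags to a subject line as "[tag]" suffixes."""
    suffix = " ".join(f"[{tag}]" for tag in tags)
    return f"{subject} {suffix}".strip()


def extract_tags(subject):
    """Return the list of bracketed tags found in a subject line."""
    return TAG_RE.findall(subject)
```

A plain-text search for "[research]" (in Spotlight, a smart folder, or an IMAP SUBJECT search) then finds every item carrying that tag.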

Finally, as regards encryption, both S/MIME and PGP are able to deal with encrypted MIME multipart data - I haven't experimented much yet, but storing encrypted drafts on an IMAP server appears to be entirely possible (searching, obviously, is completely out of the question, unless you rely exclusively on Subject: lines).

Hey! You Mentioned Two Applications!?

Oh, yes, of course. You see, the other application that lets me do pretty much everything does is... the .

You just have to set up a few search folders, really. Takes a bit more patience (and forethought), but has a tagging plugin that works just fine, and that's why gave us Spotlight anyway...
