Even amidst all the ruckus caused by my home renovations, my temporary loss of two other machines and a whole new set of personal logistics (besides work, of course), I can’t stop pondering solutions for information management – and that includes RSS.
Now, if you happen to recall my latest piece regarding my RSS setup, you’ll remember that I am still tackling the issue of how to go about doing Bayesian classification, and that I was using newspipe as an archiver.
A lot of people missed the point there and pointed out that Mail.app has built-in RSS support – which is correct, except that Mail.app does not store inline images or enclosures along with the feed items. I find that a rather myopic omission (just filed as #5777759), since both are absolutely essential for archiving.
Now, Mail.app stores its RSS items in .emlx format (which Jamie Zawinski documented), just like “ordinary” messages. And the format is pretty straightforward:
- byte count for the message as the first line
- MIME dump of the message
- XML plist with flags
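To make that layout concrete, here is a minimal sketch of splitting a file into its three parts. It is written for Python 3 (the script further down is Python 2), the helper name `split_emlx` is made up for illustration, and the sample data is synthetic rather than a real Mail.app file.

```python
def split_emlx(raw):
    """Split raw .emlx bytes into (byte_count, mime_bytes, plist_bytes)."""
    # The first line holds the length of the MIME payload in bytes
    first_line, _, rest = raw.partition(b"\n")
    count = int(first_line.strip())
    # The next `count` bytes are the message; whatever is left is the plist
    return count, rest[:count], rest[count:]


# Synthetic example data (an assumption, not taken from Mail.app)
mime = b"Subject: hello\n\nbody\n"
plist = b'<?xml version="1.0"?><plist><dict/></plist>\n'
raw = str(len(mime)).encode("ascii") + b"\n" + mime + plist

count, message_bytes, plist_bytes = split_emlx(raw)
print(count, message_bytes == mime, plist_bytes == plist)
```

Working on bytes rather than decoded text matters here, because the count on the first line is a byte offset and not a character offset.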
And the MIME dump invariably contains a part with well-formed HTML (at least the samples I looked at), with neat direct references to inline images and stuff.
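Pulling those references out of the HTML part takes very little code. The sketch below uses Python 3's standard `html.parser` as a stand-in for Beautiful Soup, so it is not what my script does; `ImageCollector` and `image_urls` are names invented for this example, and the sample markup is made up.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src of every <img> that points at an http(s) URL."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        # attrs is a list of (name, value) pairs; value may be None
        src = dict(attrs).get("src") or ""
        if src.startswith(("http://", "https://")):
            self.sources.append(src)


def image_urls(html):
    """Return the remote image URLs referenced by an HTML string."""
    collector = ImageCollector()
    collector.feed(html)
    return collector.sources


# Made-up sample: one remote image, one already-inlined image
sample = '<p>Hi <img src="http://example.com/a.png"> <img src="cid:1.png"></p>'
urls = image_urls(sample)
print(urls)
```

Images that already use a `cid:` reference are skipped, which means running this over an already processed item should find nothing new to download.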
So I had one of those shower epiphanies: Why not parse the .emlx file, download the referenced images (since the URLs are low-hanging fruit), and add the images back into the .emlx file as inline attachments?
That way I could just use Mail.app (without newspipe in the middle) and run a simple archival script every now and then.
Lo and behold, after 20 minutes of Python coding (and thanks to the ineffable miracle that is Beautiful Soup), I have a little proof of concept that does just that – images in RSS items are downloaded, injected into a new MIME message, and the whole thing is written back into the .emlx file, updating the byte count appropriately.
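The injection step boils down to wrapping the HTML and the images in a multipart/related message and recomputing the leading byte count. Here is a rough Python 3 sketch of just that part, using the standard `email.mime` classes. The helper names `inline_message` and `emlx_bytes` are hypothetical, the image bytes are a placeholder, and I have not checked this exact output against Mail.app.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def inline_message(html, images):
    """Build a multipart/related message.

    `images` maps a Content-ID to (subtype, data); the HTML is expected
    to reference each image as src="cid:<Content-ID>".
    """
    message = MIMEMultipart("related")
    # The HTML goes first so that it is treated as the root part
    message.attach(MIMEText(html, "html", "utf-8"))
    for cid, (subtype, data) in images.items():
        part = MIMEImage(data, subtype, name=cid)
        part.add_header("Content-ID", "<%s>" % cid)
        part.add_header("Content-Disposition", "inline", filename=cid)
        message.attach(part)
    return message


def emlx_bytes(message, plist):
    """Serialise message and plist, prefixed with the payload byte count."""
    payload = message.as_bytes()
    return str(len(payload)).encode("ascii") + b"\n" + payload + plist


# Placeholder data: just the PNG signature, not a complete image
fake_png = b"\x89PNG\r\n\x1a\n"
msg = inline_message('<img src="cid:0.png">', {"0.png": ("png", fake_png)})
blob = emlx_bytes(msg, b"<plist/>\n")
```

The message is serialised exactly once, so the count on the first line is guaranteed to match the bytes that follow it.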
And Mail.app seems to like it, too.
Update: Here’s the source code, after a few cleanups and some re-structuring towards making it a Python class that I can re-use later:
    #!/usr/bin/env python
    # encoding: utf-8
    """
    emlx.py

    Created by Rui Carmo on 2008-03-03.
    Released under the MIT license
    """

    from BeautifulSoup import BeautifulSoup
    import os, re, codecs, email, urllib2
    from email.MIMEImage import MIMEImage
    from email.MIMEMultipart import MIMEMultipart

    # Message headers used by Mail.app that we want to preserve
    preserved_headers = [
        "X-Uniform-Type-Identifier",
        "X-Mail-Rss-Source-Url",
        "X-Mail-Rss-Article-Identifier",
        "X-Mail-Rss-Article-Url",
        "Received",
        "Subject",
        "X-Mail-Rss-Author",
        "Message-Id",
        "X-Mail-Rss-Source-Name",
        "Reply-To",
        "Mime-Version",
        "Date"
    ]

    class emlx:
        """emlx parser"""

        def __init__(self, filename):
            """initialization"""
            self.filename = filename
            self.opener = urllib2.build_opener()
            # Mimic Mail.app User-agent
            self.opener.addheaders = [('User-agent', 'Apple-PubSub/59')]
            self.load()

        def load(self):
            # open the .emlx file as binary (and not using codecs) to ensure byte offsets work
            self.fh = open(self.filename,'rb')
            # get the payload length
            self.bytes = int(self.fh.readline().strip())
            # get the MIME payload
            self.message = email.message_from_string(self.fh.read(self.bytes))
            # the remaining bytes are the .plist
            self.plist = ''.join(self.fh.readlines())
            self.fh.close()

        def save(self, filename):
            fh = open(filename,'wb')
            # serialize once so the byte count matches what is written
            payload = str(self.message)
            fh.write("%d\n%s%s" % (len(payload), payload, self.plist))
            fh.close()

        def grab(self, url):
            """grab images (not very sophisticated yet, doesn't handle redirects and such)"""
            h = self.opener.open(url)
            mtype = h.info().getheader('Content-Type')
            data = h.read()
            return (mtype,data)

        def parse(self):
            for part in self.message.walk():
                if part.get_content_type() == 'text/html':
                    self.rebuild(part)
                    return

        def rebuild(self,part):
            # parse the HTML
            soup = BeautifulSoup(part.get_payload())
            # strain out all images referenced by HTTP/HTTPS
            images = soup('img',{'src':re.compile('^http')})
            count = 0
            # prepare new MIME message
            newmessage = MIMEMultipart('related')
            for h in preserved_headers:
                # skip headers this particular item does not carry
                if self.message[h] is not None:
                    newmessage.add_header(h,self.message[h])
            attachments = []
            for i in images:
                # Grab the image
                (mtype, data) = self.grab(i['src'])
                # Build a cid for it (dropping any parameters after the MIME type)
                subtype = mtype.split(';')[0].strip().split('/')[1]
                cid = '%(count)d.%(subtype)s' % locals()
                # Create and attach new MIME part
                # we use all reference methods to ensure cross-MUA compatibility
                image = MIMEImage(data, subtype,name=cid)
                image.add_header('Content-ID', '<%s>' % cid)
                image.add_header('Content-Location', cid)
                image.add_header('Content-Disposition','inline', filename=("%s" % cid))
                attachments.append(image)
                # update references to images
                i['src'] = 'cid:%s' % cid
                count = count + 1
            # inject rewritten HTML first
            part.set_payload(str(soup))
            newmessage.attach(part)
            # now add inline images as extra MIME parts
            for a in attachments:
                newmessage.attach(a)
            # replace the message
            self.message = newmessage

    if __name__ == "__main__":
        a = emlx('320611.emlx')
        a.parse()
        a.save('injected.emlx')
Right now, I’m considering tweaking the plist flags a bit, and since I absolutely loathe the bright blue header Mail.app uses to display feed items (which often hides large portions of item titles) I will be doing outright conversion to “normal” e-mail messages.
Plus, of course, I still need a decent way to invoke it upon an entire folder crammed with RSS items. That is easy enough to do, but I’d rather try to code something that can be re-used by other folk, and as such I’m looking into developing an Automator action for this.
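The plain scripted version of the folder run would look roughly like the Python 3 sketch below, short of the Automator packaging. `find_emlx` is a name I made up for illustration, and the demo walks a temporary directory because I am not going to guess at where Mail.app keeps its feed folders on any given machine.

```python
import os
import tempfile


def find_emlx(root):
    """Yield the path of every .emlx file below `root`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(".emlx"):
                yield os.path.join(dirpath, name)


# Demo on a throwaway directory with two fake items and one unrelated file
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "Messages"))
    for name in ("1.emlx", "2.emlx", "notes.txt"):
        open(os.path.join(root, "Messages", name), "wb").close()
    found = [os.path.basename(p) for p in find_emlx(root)]

print(found)
```

Each path it yields could then be handed to whatever does the parse and save steps, ideally writing to a copy first so a failed download never clobbers the original item.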
Time (my scarcest resource) will tell if it’s doable. Still, I wonder why Apple doesn’t allow for archival of RSS items with inline images – it’s not as if they don’t have all the pieces (and Automator already has plenty of RSS support…).