Patching .emlx files

Even amidst all the ruckus caused by my home renovations, my temporary loss of two other machines and a whole new set of personal logistics (besides work, of course), I can’t stop pondering solutions for information management – and that includes RSS.

Now, if you happen to recall what I wrote regarding my RSS setup, you’ll remember that I am still tackling the issue of how to go about doing Bayesian classification, and that I was using as an archiver.

A lot of people missed the point there and pointed out that Mail.app has built-in RSS support – which is correct, except that it does not store inline images or enclosures along with the feed items. I find that a rather myopic omission (just filed as Radar #5777759), since those are absolutely essential for archiving.

Now, Mail.app stores its RSS items in .emlx format (which Jamie Zawinski documented), just like “ordinary” messages. And the format is pretty straightforward:

byte count for message as first line
MIME dump of message
XML plist with flags
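
To make that three-part layout concrete, here is a minimal sketch (written for Python 3, unlike the Python 2 script further down) that splits raw .emlx bytes apart and puts them back together. The names split_emlx and join_emlx are made up for illustration, and the sketch assumes the first line holds nothing but the byte count of the MIME dump.

```python
def split_emlx(raw):
    """Split raw .emlx bytes into (message_bytes, plist_bytes)."""
    # The first line is the byte count of the MIME dump that follows it
    first_line, _, rest = raw.partition(b"\n")
    count = int(first_line.strip())
    # Everything after those bytes is the XML plist with the flags
    return rest[:count], rest[count:]


def join_emlx(message, plist):
    """Reassemble an .emlx blob, recomputing the byte count."""
    return str(len(message)).encode("ascii") + b"\n" + message + plist


# Tiny made-up sample, not a real Mail.app message
sample_message = b"Subject: hi\n\nbody\n"
sample = join_emlx(sample_message, b"<plist/>")
message, plist = split_emlx(sample)
```

Working on bytes rather than decoded text matters here, since the count on the first line is a byte offset and not a character count.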

And the MIME dump invariably contains a part with well-formed HTML (at least the samples I looked at), with neat direct references to inline images and stuff.

So I had one of those shower epiphanies: Why not parse the .emlx file, download the referenced images (since the URLs are low-hanging fruit), and add the images back into the .emlx file as inline attachments?
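
The "low-hanging fruit" part might look roughly like this. It is only a sketch for Python 3 that swaps in the standard library's html.parser so it runs without extra packages (the actual script below uses Beautiful Soup); ImageCollector and remote_images are hypothetical names.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src of every <img> that points at an HTTP(S) URL."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />
        if tag == "img":
            src = dict(attrs).get("src") or ""
            if src.startswith(("http://", "https://")):
                self.urls.append(src)


def remote_images(html):
    """Return remote image URLs in document order."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.urls
```

Relative or already-inlined references are skipped on purpose, since only remote images need downloading.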

That way I could just use (without in the middle) and run a simple archival script every now and then.

Lo and behold, after 20 minutes of coding (and thanks to the ineffable miracle that is Beautiful Soup), I have a little proof of concept that does just that – images in RSS items are downloaded, injected into a new MIME message, and the whole thing is written back into the .emlx file, updating the byte count appropriately.
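
For the curious, the injection step can be sketched with the Python 3 email package along these lines. This is an illustration under assumptions, not the script itself: build_related and to_emlx are hypothetical names, the images are passed in already downloaded, and references are rewritten to cid: URLs, which is a variation on what the script below does.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_related(html, images):
    """Wrap html plus (url, data, subtype) tuples in multipart/related."""
    message = MIMEMultipart("related")
    attachments = []
    for index, (url, data, subtype) in enumerate(images):
        cid = "%d.%s" % (index, subtype)
        # Point the HTML at the inline part instead of the remote URL
        html = html.replace(url, "cid:" + cid)
        image = MIMEImage(data, _subtype=subtype, name=cid)
        image.add_header("Content-ID", "<%s>" % cid)
        image.add_header("Content-Disposition", "inline", filename=cid)
        attachments.append(image)
    # Rewritten HTML goes first, the images follow as extra parts
    message.attach(MIMEText(html, "html", "utf-8"))
    for image in attachments:
        message.attach(image)
    return message


def to_emlx(message, plist):
    """Serialize once, then prefix the byte count of that exact payload."""
    payload = message.as_bytes()
    return str(len(payload)).encode("ascii") + b"\n" + payload + plist
```

Serializing the message exactly once is the important bit: the count on the first line has to describe the very bytes that follow it.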

And Mail.app seems to like it, too.

Update: Here’s the source code, after a few cleanups and some re-structuring towards making it a class that I can re-use later:

#!/usr/bin/env python
# encoding: utf-8
"""
emlx.py

Created by Rui Carmo on 2008-03-03.
Released under the MIT license
"""

from BeautifulSoup import BeautifulSoup
import os, re, codecs, email, urllib2
from email.MIMEImage import MIMEImage
from email.MIMEMultipart import MIMEMultipart

# Message headers used by Mail.app that we want to preserve
preserved_headers = [
  "X-Uniform-Type-Identifier",
  "X-Mail-Rss-Source-Url",
  "X-Mail-Rss-Article-Identifier",
  "X-Mail-Rss-Article-Url",
  "Received",
  "Subject",
  "X-Mail-Rss-Author",
  "Message-Id",
  "X-Mail-Rss-Source-Name",
  "Reply-To",
  "Mime-Version",
  "Date"
]

class emlx:
  """emlx parser"""
  def __init__(self, filename):
    """initialization"""
    self.filename = filename
    self.opener = urllib2.build_opener()
    # Mimic Mail.app User-agent
    self.opener.addheaders = [('User-agent', 'Apple-PubSub/59')]
    self.load()
  
  def load(self):
    # open the .emlx file as binary (and not using codecs) to ensure byte offsets work
    self.fh = open(self.filename,'rb')
    # get the payload length
    self.bytes = int(self.fh.readline().strip())
    # get the MIME payload
    self.message = email.message_from_string(self.fh.read(self.bytes))
    # the remaining bytes are the .plist
    self.plist = ''.join(self.fh.readlines())
    self.fh.close()
    
  def save(self, filename):
    fh = open(filename,'wb')
    # serialize once so the byte count always matches what gets written
    payload = str(self.message)
    fh.write("%d\n%s%s" % (len(payload), payload, self.plist))
    fh.close()
    
  def grab(self, url):
    """grab images (not very sophisticated yet, doesn't handle redirects and such)"""
    h = self.opener.open(url)
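    # keep only "type/subtype", dropping any parameters such as "; charset=..."
    mtype = h.info().getheader('Content-Type', '').split(';')[0].strip()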
    data = h.read()
    return (mtype,data)
  
  def parse(self):
    for part in self.message.walk():
      if part.get_content_type() == 'text/html':
        self.rebuild(part)
        return
        
  def rebuild(self,part):
    # parse the HTML
    soup = BeautifulSoup(part.get_payload())
    # strain out all images referenced by HTTP/HTTPS
    images = soup('img',{'src':re.compile('^http')})
    count = 0
    
    # prepare new MIME message 
    newmessage = MIMEMultipart('related')
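    for h in preserved_headers:
      value = self.message[h]
      # skip headers missing from the original instead of emitting empty ones
      if value is None:
        continue
      # drop any default (e.g. MIME-Version) so the header is not duplicated
      del newmessage[h]
      newmessage.add_header(h, value)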
    
    attachments = []
    for i in images:
      # Grab the image
      (mtype, data) = self.grab(i['src'])
      # Build a cid for it
      subtype = mtype.split('/')[1]
      cid = '%(count)d.%(subtype)s' % locals()
      # Create and attach new MIME part
      # we use all reference methods to ensure cross-MUA compatibility
      image = MIMEImage(data, subtype,name=cid)
      image.add_header('Content-ID', '<%s>' % cid)
      image.add_header('Content-Location', cid)
      image.add_header('Content-Disposition','inline', filename=("%s" % cid))       
      attachments.append(image)
      # update references to images
      i['src'] = '%s' % cid
      count = count + 1
    # inject rewritten HTML first
    part.set_payload(str(soup))
    newmessage.attach(part)
    # now add inline images as extra MIME parts
    for a in attachments:
      newmessage.attach(a)
    # replace the message
    self.message = newmessage
        
if __name__ == "__main__":
  a = emlx('320611.emlx')
  a.parse()
  a.save('injected.emlx')

Right now, I’m considering tweaking the plist flags a bit, and since I absolutely loathe the bright blue header Mail.app uses to display feed items (which often hides large portions of item titles), I will be doing outright conversion to “normal” e-mail messages.

Plus, of course, I still need a decent way to invoke it upon an entire folder crammed with RSS items. That is easy enough to do, but I’d rather try to code something that can be re-used by other folk, and as such I’m looking into developing an action for this.
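
The "easy enough" version would be something like the sketch below (Python 3; find_emlx is a hypothetical helper, and the folder to scan is left as a parameter because the right location depends on the setup):

```python
import os


def find_emlx(root):
    """Yield the path of every .emlx file under root, in a stable order."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Sort in place so the traversal order is deterministic
        dirnames.sort()
        for name in sorted(filenames):
            if name.endswith(".emlx"):
                yield os.path.join(dirpath, name)
```

Each path could then be handed to the emlx class above, taking care to work on copies until the results have been checked.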

Time (my scarcest resource) will tell if it’s doable. Still, I wonder why doesn’t allow for archival of RSS items with inline images – it’s not as if they don’t have all the pieces (and already has plenty of RSS support…).
