From RSS to my Kindle

Building website EPUBs with Python

Last year I wrote about how I built feedi, a personal feed reader, and started using it as my front page to the web. In the months since I published that post, I continued to tweak the app, observing my reading habits, experimenting with new features, and discarding the ones I didn’t need. I’ve now gotten it to a place where I can count on seeing fresh and interesting content a couple of times a day, and the interface conveniently lets me keep what I plan to read and discard the rest.

But while I’m an avid reader on paper, I struggle with lack of concentration and eye strain when trying to read on a laptop or a desktop monitor, and it only gets worse on the phone. In practice, I use feedi as a mix of news feed, content finder, and organizer; I prefer to send longer blog posts or essays to my Kindle, so I can get back to them when I’m offline: in bed, in the bathroom, at a cafe, or on the bus.

So a Kindle integration was a natural extension to feedi: not only because it streamlined my reading workflow, but also because Amazon’s Chrome extension and iOS app do a poor job of extracting content from most websites. I was already getting better results with the readability library in feedi’s embedded article view, so I just needed to figure out how to send that cleaned-up HTML over to my Kindle. I learned a couple of things to get that working, so it seemed interesting to document the implementation process here.


My first instinct was to try to get away with a Kindle integration that didn’t require sending emails from my app. I found a Python library that “impersonated” an Amazon client and wrote my first implementation around it, but it turned out to be brittle: it required storing device credentials in the database and manually authenticating every few days, which hurt the user experience, ultimately discouraging me from using the feature at all.

So a few months later I took another stab at it, opting to send articles via email. At a high level, I needed to: fetch the HTML from the website, extract the cleaned-up article content from it, package it into an EPUB file, and attach it to an email to my Kindle device. This is what it looked like from the Flask route:

# feedi/routes.py

import flask

from flask import current_app as app
from flask_login import current_user, login_required

from feedi import email, scraping

@app.post("/entries/kindle")
@login_required
def send_to_kindle():
    url = flask.request.args['url']
    article = scraping.extract(url)
    attach_data = scraping.package_epub(url, article)
    email.send(current_user.kindle_email, attach_data, filename=article['title'])
    return '', 204
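
As a quick sanity check, the route can be exercised with Flask’s test client (a hypothetical snippet, not part of feedi; because of @login_required, an authenticated session is needed to actually get the 204):

# hypothetical smoke test using Flask's test client
with app.test_client() as client:
    resp = client.post("/entries/kindle", query_string={"url": "https://example.com/post"})
    # 204 on success; without a logged-in session, a redirect to the login page
    assert resp.status_code in (204, 302)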

Let’s go through each of these steps. For the extraction, I tried every Python library I could find[1], but none seemed to do as good a job as Firefox’s reader view, so I decided to use the JavaScript library that powers it, through a little Node.js script[2]:

#!/usr/bin/env node
// feedi/extract_article.js

const { JSDOM } = require("jsdom");
const { Readability } = require('@mozilla/readability');

const url = process.argv[2];

JSDOM.fromURL(url).then(function (dom) {
  // run Mozilla's Readability over the fetched page
  let reader = new Readability(dom.window.document);
  let article = reader.parse();
  // print the result as JSON and exit once stdout is flushed
  process.stdout.write(JSON.stringify(article), process.exit);
});

The script’s output looks like this:

{
  "title": "From RSS to my Kindle",
  "byline": "Facundo Olano",
  "content": "<div id=\"readability-page-1\" class=\"page\"><div lang=\"en\"><header><h3>Building website EPUBs with Python</h3></header><p>Last year I wrote about <a href=\"https://olano.dev/blog/reclaiming-the-web-with-a-personal-reader\">how I built feedi</a>, a personal feed reader, and started using it as my front page to the web. (...)",
  "textContent": "Building website EPUBs with Python\n\nLast year I wrote about how I built feedi, a personal feed reader, and started using it as my front page to the web. (...)",
  "length": 2793,
  "excerpt": "A Kindle integration was a natural extension to my feed reader. I had to learn some subtleties to get it working, so it seemed interesting to document the implementation process.",
  "siteName": "olano.dev"
}

And this is how I call it from Python:

# feedi/scraping.py
import json
import subprocess

def extract(url):
    # run the Node script and parse the JSON it writes to stdout
    r = subprocess.run(["feedi/extract_article.js", url],
                       capture_output=True, text=True, check=True)

    article = json.loads(r.stdout)
    return article
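
Since check=True makes subprocess.run raise on a non-zero exit status, a caller could guard against flaky websites with something along these lines (a sketch of my own; feedi’s real error handling may differ):

# sketch: defensive wrapper around the extraction script; the timeout is arbitrary
def try_extract(url):
    try:
        r = subprocess.run(["feedi/extract_article.js", url],
                           capture_output=True, text=True,
                           check=True, timeout=30)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None
    return json.loads(r.stdout)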

I found that some websites rely on JavaScript to load images lazily, so I rewrote the tags to force them to render (both in the app and on the Kindle):

 import json
 import subprocess

+from bs4 import BeautifulSoup

 def extract(url):
     r = subprocess.run(["feedi/extract_article.js", url],
                        capture_output=True, text=True, check=True)

     article = json.loads(r.stdout)

+    # load lazy images by setting data-src into src
+    soup = BeautifulSoup(article['content'], 'lxml')
+    LAZY_DATA_ATTRS = ['data-src', 'data-lazy-src', 'data-srcset',
+                       'data-td-src-property']
+    for data_attr in LAZY_DATA_ATTRS:
+        for img in soup.findAll('img', attrs={data_attr: True}):
+            img.attrs = {'src': img[data_attr]}
+
+    article['content'] = str(soup)

     return article
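
To see what that rewrite does, here’s a quick before/after on a made-up lazy image tag (illustration only):

# illustration: a lazily-loaded <img> before and after the rewrite
html = '<img class="lazy" data-src="https://example.com/photo.jpg" src="spinner.gif">'
soup = BeautifulSoup(html, 'lxml')
for img in soup.findAll('img', attrs={'data-src': True}):
    img.attrs = {'src': img['data-src']}
print(soup.find('img'))  # <img src="https://example.com/photo.jpg"/>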

Next, I needed to put together a valid EPUB file from this HTML content. Some superficial research revealed that EPUB files are just ZIP archives with a few metadata files. So I started by zipping the article into a byte sequence:

# feedi/scraping.py
import io
import urllib.parse
import zipfile

def package_epub(url, article):
    output_buffer = io.BytesIO()
    with zipfile.ZipFile(output_buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zip:
        zip.writestr('article.html', article['content'])

    return output_buffer.getvalue()

Based on this sample repository, I added the mimetype, container, and content files pointing to the single article.html file, turning it into an EPUB:

zip.writestr('mimetype', "application/epub+zip")
zip.writestr('META-INF/container.xml', """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
<rootfiles>
<rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
</rootfiles>
</container>""")

author = article['byline'] or article['siteName']
if not author:
    # if no explicit author in the website, use the domain
    author = urllib.parse.urlparse(url).netloc.replace('www.', '')

zip.writestr('content.opf', f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" xml:lang="en" unique-identifier="uid" prefix="cc: http://creativecommons.org/ns#">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title id="title">{article['title']}</dc:title>
<dc:creator>{author}</dc:creator>
<dc:language>{article.get('lang', '')}</dc:language>
</metadata>
<manifest>
<item id="article" href="article.html" media-type="text/html" />
</manifest>
<spine toc="ncx">
<itemref idref="article" />
</spine>
</package>""")
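
One subtlety worth noting: the EPUB container spec requires the mimetype entry to be the first file in the archive, stored without compression. The Kindle seems tolerant of the deflated version above, but a stricter implementation would look like this (a sketch, not what feedi ships):

# per the EPUB (OCF) spec, mimetype must be the first entry, uncompressed
with zipfile.ZipFile(output_buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zip:
    zip.writestr('mimetype', "application/epub+zip",
                 compress_type=zipfile.ZIP_STORED)
    # ... then container.xml, content.opf and article.html as above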

This was enough to get the text working, but I needed to download the images if I wanted them to show up on the Kindle:

 import io
 import zipfile

+import requests
+from bs4 import BeautifulSoup

 def package_epub(url, article):
     output_buffer = io.BytesIO()
     with zipfile.ZipFile(output_buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zip:
-        zip.writestr('article.html', article['content'])
+        soup = BeautifulSoup(article['content'], 'lxml')
+        for img in soup.findAll('img'):
+            img_url = img['src']
+            img_filename = 'article_files/' + img['src'].split('/')[-1].split('?')[0]
+
+            # update each img src url to point to the local copy of the file
+            img['src'] = img_filename
+
+            # download the image and save into the files subdir of the zip
+            response = requests.get(img_url)
+            if not response.ok:
+                continue
+            zip.writestr(img_filename, response.content)
+
+        zip.writestr('article.html', str(soup))
     return output_buffer.getvalue()
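
At this point the output can already be sanity-checked by dumping it to a file and opening it with any EPUB reader (a hypothetical test, with a made-up URL):

# hypothetical smoke test: write the packaged EPUB to disk for inspection
article = extract("https://example.com/some-post")
data = package_epub("https://example.com/some-post", article)
with open("article.epub", "wb") as f:
    f.write(data)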

Note how I also rewrite the img src attributes so they point to the local files instead of online ones (much like the browser does when downloading a page). Since the Kindle can’t render WebP images, my next step was to convert those to JPEGs:

 import io
 import zipfile

 import requests
 from bs4 import BeautifulSoup
+from PIL import Image

 def package_epub(url, article):
     output_buffer = io.BytesIO()
     with zipfile.ZipFile(output_buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zip:
         soup = BeautifulSoup(article['content'], 'lxml')
         for img in soup.findAll('img'):
             img_url = img['src']
             img_filename = 'article_files/' + img['src'].split('/')[-1].split('?')[0]
+            img_filename = img_filename.replace('.webp', '.jpg')

             # update each img src url to point to the local copy of the file
             img['src'] = img_filename

             # download the image and save into the files subdir of the zip
             response = requests.get(img_url)
             if not response.ok:
                 continue

-            zip.writestr(img_filename, response.content)
+            with zip.open(img_filename, 'w') as dest_file:
+                if img_url.endswith('.webp'):
+                    jpg_img = Image.open(io.BytesIO(response.content)).convert("RGB")
+                    jpg_img.save(dest_file, "JPEG")
+                else:
+                    dest_file.write(response.content)

         zip.writestr('article.html', str(soup))
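
One caveat: detecting WebP by URL suffix misses images served without the .webp extension. A variant could sniff the Content-Type response header instead (my assumption, not the code feedi actually ships):

# assumption: detect WebP via the Content-Type header, not just the URL
content_type = response.headers.get('Content-Type', '')
if 'image/webp' in content_type or img_url.endswith('.webp'):
    jpg_img = Image.open(io.BytesIO(response.content)).convert("RGB")
    jpg_img.save(dest_file, "JPEG")
else:
    dest_file.write(response.content)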

Now I just needed to email this zip file. I didn’t want to depend on a paid service and remembered from my old web developer days that a regular Gmail account did the trick to send a few emails from a web app. Things had changed since the last time I’d tried this, though: I had to enable two-factor authentication and generate an “app password” (at https://myaccount.google.com/apppasswords) for Google to accept my SMTP requests. This is what the email boilerplate looked like:

# feedi/email.py
import smtplib
import urllib.parse
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart

def send(recipient, attach_data, filename):
    server = "smtp.gmail.com"
    port = 587
    sender = "my.reader.email@gmail.com"
    password = "some gmail app pass"

    msg = MIMEMultipart()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = f'feedi - {filename}'

    part = MIMEBase('application', 'epub')
    part.set_payload(attach_data)
    encoders.encode_base64(part)

Where attach_data is the EPUB zip byte sequence.

The Kindle uses the filename from the Content-Disposition header as the title displayed in the device library; this is a problem when the title contains spaces or non-ASCII characters, as is the case for Spanish articles. I got it working after a few tries with the escaping syntax suggested by this StackOverflow answer:

filename = urllib.parse.quote(filename)
part.add_header('Content-Disposition', f"attachment; filename*=UTF-8''{filename}.epub")
msg.attach(part)
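
For example, a Spanish title is percent-encoded like this:

>>> import urllib.parse
>>> urllib.parse.quote("Cómo funciona")
'C%C3%B3mo%20funciona'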

Finally, the email is sent like this:

smtp = smtplib.SMTP(server, port)
smtp.ehlo()
smtp.starttls()
smtp.login(sender, password)
smtp.sendmail(sender, recipient, msg.as_string())
smtp.quit()

Of course, for the Kindle to accept it, I had to whitelist the reader email address in my Amazon device settings.


This implementation works well enough for my needs, but there’s still room for improvement:

  • Some websites regrettably rely on JavaScript to load their HTML, so their content isn’t picked up by the readability package. I experimented with a headless browser to fetch the content, but that made the app slow and brittle, so I just choose not to read content from JavaScript-centric websites. (A similar rule applies to paywalls.)
  • This Kindle integration is very convenient when using feedi, but I’d also want to use it from the browser. Right now I need to copy the URL and paste it into feedi, but I’m toying with the idea of a Firefox extension that would work similarly to Amazon’s, and that could also be used for other URL operations, like RSS feed discovery.
  • Similarly, I’d like feedi, which is already a Progressive Web App, to work as a share target on my phone, so it can receive URLs from other applications. Unfortunately, this feature is not supported on iOS.

Notes


[2] I could have called the library from the browser instead, saving me from this additional Node.js dependency, but I preferred the extra complexity on the server over adding scripting to an otherwise declarative htmx client. The server-side approach also allows me to pre-fetch article content in the background.


