From RSS to my Kindle
Building website EPUBs with Python
Last year I wrote about how I built feedi, a personal feed reader, and started using it as my front page to the web. In the months since I published that post, I kept tweaking the app, observing my reading habits, experimenting with new features, and discarding the ones I didn’t need. I’ve now got it to a place where I can count on seeing fresh, interesting content a couple of times a day, and the interface conveniently lets me keep what I plan to read and discard the rest.
But while I’m an avid reader on paper, I struggle with lack of concentration and eye strain when trying to read on a laptop or a desktop monitor —and it only gets worse on the phone. In practice, I use feedi as a mix of news feed, content finder, and organizer; I prefer to send longer blog posts or essays to my Kindle, so I can get back to them when I’m offline: in bed, in the bathroom, at a cafe or on the bus.
So a Kindle integration was a natural extension to feedi: not only would it streamline my reading workflow, but Amazon’s Chrome extension and iOS app do a poor job of extracting content from most websites. I was already getting better results with the readability library in feedi’s embedded article view, so I just needed to figure out how to send that cleaned-up HTML over to my Kindle. I learned a couple of things to get it working, so it seemed worth documenting the implementation process here.
My first instinct was to try to get away with a Kindle integration that didn’t require sending emails from my app. I found a Python library that “impersonated” an Amazon client and wrote my first implementation around it, but it turned out to be brittle: it required storing device credentials in the database and manually authenticating every few days, which hurt the user experience, ultimately discouraging me from using the feature at all.
So a few months later I took another stab at it, opting to send articles via email. At a high level, I needed to: fetch the HTML from the website, extract the cleaned-up article content from it, package it into an EPUB file, and attach it to an email to my Kindle device. This is what it looked like from the Flask route:
# feedi/routes.py
import flask
from flask import current_app as app
from flask_login import current_user, login_required
from feedi import email, scraping

@app.post("/entries/kindle")
@login_required
def send_to_kindle():
    url = flask.request.args['url']
    article = scraping.extract(url)
    attach_data = scraping.package_epub(url, article)
    email.send(current_user.kindle_email, attach_data, filename=article['title'])
    return '', 204
Let’s go through each of these steps. For the extraction, I tried every Python library I could find [1], but none seemed to do as good a job as Firefox’s reader view, so I decided to use the JavaScript library that powers it, through a little Node.js script [2]:
#!/usr/bin/env node
// feedi/extract_article.js
const { JSDOM } = require("jsdom");
const { Readability } = require('@mozilla/readability');
const url = process.argv[2];
JSDOM.fromURL(url).then(function (dom) {
  let reader = new Readability(dom.window.document);
  let article = reader.parse();
  process.stdout.write(JSON.stringify(article), process.exit);
});
The script’s output looks like this:
{
  "title": "From RSS to my Kindle",
  "byline": "Facundo Olano",
  "content": "<div id=\"readability-page-1\" class=\"page\"><div lang=\"en\"><header><h3>Building website EPUBs with Python</h3></header><p>Last year I wrote about <a href=\"https://olano.dev/blog/reclaiming-the-web-with-a-personal-reader\">how I built feedi</a>, a personal feed reader, and started using it as my front page to the web. (...)",
  "textContent": "Building website EPUBs with Python\n\nLast year I wrote about how I built feedi, a personal feed reader, and started using it as my front page to the web. (...)",
  "length": 2793,
  "excerpt": "A Kindle integration was a natural extension to my feed reader. I had to learn some subtleties to get it working, so it seemed interesting to document the implementation process.",
  "siteName": "olano.dev"
}
And this is how I call it from Python:
# feedi/scraping.py
import json
import subprocess
def extract(url):
    r = subprocess.run(["feedi/extract_article.js", url],
                       capture_output=True, text=True, check=True)
    article = json.loads(r.stdout)
    return article
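One caveat worth guarding against: when Readability can’t make sense of a page, its parse() returns null, which json.loads happily turns into None. A sketch of a more defensive variant (the timeout value and the parse_extractor_output helper are my additions, not part of feedi):

```python
import json
import subprocess

def parse_extractor_output(stdout):
    # Readability emits the string "null" when it gives up on a page,
    # which json.loads turns into None -- guard against that here
    article = json.loads(stdout)
    if article is None:
        raise ValueError("readability could not parse the page")
    return article

def extract(url, timeout=10):
    # same subprocess call as above, with a timeout so a slow website
    # can't hang the request (the 10s default is an arbitrary choice)
    r = subprocess.run(["feedi/extract_article.js", url],
                       capture_output=True, text=True,
                       check=True, timeout=timeout)
    return parse_extractor_output(r.stdout)
```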
I found that some websites rely on JavaScript to load images lazily, so I rewrote the img tags to force them to render (both in the app and on the Kindle):
import json
import subprocess
+from bs4 import BeautifulSoup

def extract(url):
    r = subprocess.run(["feedi/extract_article.js", url],
                       capture_output=True, text=True, check=True)
    article = json.loads(r.stdout)

+    # load lazy images by setting data-src into src
+    soup = BeautifulSoup(article['content'], 'lxml')
+    LAZY_DATA_ATTRS = ['data-src', 'data-lazy-src', 'data-srcset',
+                       'data-td-src-property']
+    for data_attr in LAZY_DATA_ATTRS:
+        for img in soup.findAll('img', attrs={data_attr: True}):
+            img.attrs = {'src': img[data_attr]}
+
+    article['content'] = str(soup)
    return article
Next, I needed to put together a valid EPUB file from this HTML content. Some superficial research revealed that EPUB files are just zip archives with a few metadata files. So I started by zipping the article into a byte sequence:
# feedi/scraping.py
import io
import zipfile
def package_epub(url, article):
    output_buffer = io.BytesIO()
    with zipfile.ZipFile(output_buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zip:
        zip.writestr('article.html', article['content'])

    return output_buffer.getvalue()
Based on this sample repository I added mimetype, container, and content files pointing to the single article.html file, to turn it into an EPUB:
        zip.writestr('mimetype', "application/epub+zip")

        zip.writestr('META-INF/container.xml', """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>""")

        author = article['byline'] or article['siteName']
        if not author:
            # if no explicit author in the website, use the domain
            author = urllib.parse.urlparse(url).netloc.replace('www.', '')

        zip.writestr('content.opf', f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" xml:lang="en" unique-identifier="uid" prefix="cc: http://creativecommons.org/ns#">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title id="title">{article['title']}</dc:title>
    <dc:creator>{author}</dc:creator>
    <dc:language>{article.get('lang', '')}</dc:language>
  </metadata>
  <manifest>
    <item id="article" href="article.html" media-type="text/html" />
  </manifest>
  <spine toc="ncx">
    <itemref idref="article" />
  </spine>
</package>""")
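One detail worth mentioning for strict spec compliance: the EPUB container format requires the mimetype entry to be the first file in the archive, stored without compression, so that readers can sniff it at a fixed offset. The Kindle pipeline is forgiving about this, but a safer variant would write that one entry with ZIP_STORED:

```python
import io
import zipfile

output_buffer = io.BytesIO()
with zipfile.ZipFile(output_buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zip:
    # first entry, stored uncompressed, as the EPUB container spec requires
    zip.writestr('mimetype', "application/epub+zip",
                 compress_type=zipfile.ZIP_STORED)
    # ...the rest of the entries can use the default DEFLATE compression
```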
This was enough to get the text working, but I needed to download the images if I wanted them to show up on the Kindle:
import io
import zipfile
+import requests
+from bs4 import BeautifulSoup

def package_epub(url, article):
    output_buffer = io.BytesIO()
    with zipfile.ZipFile(output_buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zip:
-        zip.writestr('article.html', article['content'])
+        soup = BeautifulSoup(article['content'], 'lxml')
+        for img in soup.findAll('img'):
+            img_url = img['src']
+            img_filename = 'article_files/' + img['src'].split('/')[-1].split('?')[0]
+
+            # update each img src url to point to the local copy of the file
+            img['src'] = img_filename
+
+            # download the image and save into the files subdir of the zip
+            response = requests.get(img_url)
+            if not response.ok:
+                continue
+            zip.writestr(img_filename, response.content)
+
+        zip.writestr('article.html', str(soup))

    return output_buffer.getvalue()
Note how I also rewrite the img src attributes so they point to the local files instead of the online ones (much like the browser does when downloading a page). Since the Kindle can’t render WebP images, my next step was to convert those to JPEGs:
import io
import zipfile
import requests
from bs4 import BeautifulSoup
+from PIL import Image

def package_epub(url, article):
    output_buffer = io.BytesIO()
    with zipfile.ZipFile(output_buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zip:
        soup = BeautifulSoup(article['content'], 'lxml')
        for img in soup.findAll('img'):
            img_url = img['src']
            img_filename = 'article_files/' + img['src'].split('/')[-1].split('?')[0]
+            img_filename = img_filename.replace('.webp', '.jpg')

            # update each img src url to point to the local copy of the file
            img['src'] = img_filename

            # download the image and save into the files subdir of the zip
            response = requests.get(img_url)
            if not response.ok:
                continue
-            zip.writestr(img_filename, response.content)
+            with zip.open(img_filename, 'w') as dest_file:
+                if img_url.endswith('.webp'):
+                    jpg_img = Image.open(io.BytesIO(response.content)).convert("RGB")
+                    jpg_img.save(dest_file, "JPEG")
+                else:
+                    dest_file.write(response.content)

        zip.writestr('article.html', str(soup))
Now I just needed to email this zip file. I didn’t want to depend on a paid service, and I remembered from my old web developer days that a regular Gmail account did the trick to send a few emails from a web app. Things had changed since the last time I’d tried this, though: I had to enable two-factor authentication and generate an “app password” (at https://myaccount.google.com/apppasswords) for Google to accept my SMTP requests. This is what the email boilerplate looked like:
# feedi/email.py
import smtplib
import urllib.parse
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
def send(recipient, attach_data, filename):
    server = "smtp.gmail.com"
    port = 587
    sender = "my.reader.email@gmail.com"
    password = "some gmail app pass"

    msg = MIMEMultipart()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = f'feedi - {filename}'

    part = MIMEBase('application', 'epub')
    part.set_payload(attach_data)
    encoders.encode_base64(part)
Where attach_data is the EPUB zip byte sequence.
The Kindle uses the filename from the Content-Disposition header as the title displayed in the device library; this is a problem when the title contains spaces or non-ASCII characters, as is the case for Spanish articles. I got it working after a few tries, with the escaping syntax suggested by this StackOverflow answer:
    filename = urllib.parse.quote(filename)
    part.add_header('Content-Disposition', f"attachment; filename*=UTF-8''{filename}.epub")
    msg.attach(part)
Finally, the email is sent like this:
    smtp = smtplib.SMTP(server, port)
    smtp.ehlo()
    smtp.starttls()
    smtp.login(sender, password)
    smtp.sendmail(sender, recipient, msg.as_string())
    smtp.quit()
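As an aside, the newer email.message.EmailMessage API can replace most of this boilerplate, including the RFC 2231 filename escaping, which it applies automatically for non-ASCII names. A sketch of an equivalent send (the account values are placeholders, as above, and the build_message split is my own structuring):

```python
import smtplib
from email.message import EmailMessage

def build_message(sender, recipient, attach_data, filename):
    # EmailMessage handles the MIME structure, base64 encoding, and
    # Content-Disposition filename escaping by itself
    msg = EmailMessage()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = f'feedi - {filename}'
    msg.add_attachment(attach_data, maintype='application',
                       subtype='epub+zip', filename=f'{filename}.epub')
    return msg

def send(recipient, attach_data, filename):
    sender = "my.reader.email@gmail.com"
    password = "some gmail app pass"
    msg = build_message(sender, recipient, attach_data, filename)
    with smtplib.SMTP("smtp.gmail.com", 587) as smtp:
        smtp.starttls()
        smtp.login(sender, password)
        smtp.send_message(msg)
```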
Of course, for the Kindle to accept it, I had to whitelist the reader email address in my Amazon device settings.
This implementation works well enough for my needs, but there’s still room for improvement:
- Some websites regrettably rely on JavaScript to load their HTML, so their content isn’t picked up by the readability script. I experimented with a headless browser to fetch the content, but that made the app slow and brittle, so I just choose not to read content from JavaScript-centric websites. (A similar rule applies to paywalls.)
- This Kindle integration is very convenient when using feedi, but I’d also like to use it from the browser. Right now I need to copy the URL and paste it into feedi, but I’m toying with the idea of a Firefox extension that would work similarly to Amazon’s, and that could also be used for other URL operations, like RSS feed discovery.
- Similarly, I’d like feedi, which is already a Progressive Web App, to work as a share target on my phone, so it can receive URLs from other applications. Unfortunately, this feature is not supported on iOS.
Notes
[2] I could have called the library from the browser instead, saving me from this additional Node.js dependency, but I preferred the extra complexity on the server over adding scripting to an otherwise declarative htmx client. The server-side approach also allows me to pre-fetch article content in the background.