Exporting Wordpress Posts To Markdown


I’ve been running my technology blog on top of Wordpress for the past 12 years. It was a great choice when I started, but the core product has morphed into more than I need. Combine that with a constant stream of security vulnerabilities, and last month I decided it was time to move to a static website generation tool. Like any new venture, I sat down one Saturday morning and jotted down the requirements for my new website generator.

I experimented with Jekyll, Pelican, and Hugo, and after several weeks of testing I fell in love with Hugo. Not only was it super easy to install (it’s a single binary written in Go), but I had the bulk of my website converted after watching the Hugo video series from Giraffe Academy.

The biggest challenge I faced was getting all of my old posts (1200+) out of my existing Wordpress installation. Pelican comes with the pelican-import utility, which can take a Wordpress XML export file and convert each post to markdown. Even though I decided to use Hugo to create my content, I figured I would use the best tool for the job to perform the conversion:

$ pelican-import -m markdown --wpfile -o posts blogomatty.xml

In the example above I’m passing a file that I exported through the Wordpress UI and generating one markdown file in the posts directory for each blog post. The output files had the following structure:

Title: Real world uses for OpenSSL
Date: 2005-02-13 23:42
Author: admin
Category: Articles, Presentations and Certifications
Slug: real-world-uses-for-openssl
Status: published

If you are interested in learning more about all the cool things you can
do with OpenSSL, you might be interested in my article [Real world uses
for OpenSSL](/articles/realworldssl.html). The article covers
encryption, decryption, digital signatures, and provides an overview of 
[ssl-site-check](/code) and [ssl-cert-check](/code).

These files didn’t work correctly out of the gate, since Hugo requires you to encapsulate the front matter (the metadata describing the post) with “---” for YAML or “+++” for TOML formatting. To add the necessary delimiters I threw together a bit of shell:

#!/bin/sh

# Wrap the Pelican metadata block in "---" delimiters so Hugo treats it as
# YAML front matter. Posts are read from posts_to_process/ and the converted
# copies are written to posts_processed/.
for post in `ls posts_to_process`; do
   echo "Processing post ${post}"
   # Open the front matter block.
   echo "---" > "posts_processed/${post}.md"
   header=0
   cat "posts_to_process/${post}" | while read -r line; do
       if echo "$line" | egrep -i "^Status:" > /dev/null; then
            # Status is the last metadata field Pelican writes out. Drop it
            # and close the front matter block.
            echo "---" >> "posts_processed/${post}.md"
            header=1
       elif [ ${header} -eq 1 ]; then
           # Everything after the metadata is the post body; copy it verbatim.
           echo "$line" >> "posts_processed/${post}.md"
       elif echo "$line" | egrep -i "^Title:" > /dev/null; then
            # Strip the "Title:" prefix, trim leading whitespace, escape any
            # embedded quotes, and emit a quoted "title:" field.
            echo "$line" | awk -F':' '{print $2$3}' | sed 's/^ *//g' | sed 's/"/\\"/g' | \
                 awk '{ print "title:", "\""$0"\"" }' >> "posts_processed/${post}.md"
       else
           # Remaining metadata lines (Date, Author, Category, Slug) pass through.
           echo "$line" >> "posts_processed/${post}.md"
       fi
   done
done
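
Run against the example post above, the script should produce front matter along these lines (a sketch of the expected output; the Date and Author fields still need the cleanup described below):

---
title: "Real world uses for OpenSSL"
Date: 2005-02-13 23:42
Author: admin
Category: Articles, Presentations and Certifications
Slug: real-world-uses-for-openssl
---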

This takes the existing post and adds a “---” delimiter before and after the front matter. It also escapes quotes and handles titles that contain a single “:”. My posts still had issues with the date format, and the author field wasn’t consistent. To clean up the date I used my good buddy sed:

$ sed -i 's/Date: \(.*\) \(.*\)/Date: \1T\2:00-04:00/g' posts_processed/*.md
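
For the example post above, this rewrites the Date line from the Pelican export as follows:

Before: Date: 2005-02-13 23:42
After:  Date: 2005-02-13T23:42:00-04:00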

To fix the issue with the author I once again turned to sed:

$ sed -i 's/^[Aa]uthor.*/author: matty/' posts_processed/*.md
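
This matches both “Author:” and “author:” lines and normalizes them, so the example post’s author field becomes:

Before: Author: admin
After:  author: matty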

I had to create a bunch of additional hacks to work around content consistency issues (NB: content consistency is my biggest takeaway from this project), but the end product is a blog that runs from statically generated content. In a future post I will dive into Hugo and the gotchas I encountered while converting my site. It was a painful process, but luckily the worst is behind me. Now I just need to finish automating a couple of manual processes and blogging will be fun again.

This article was posted by Matty on 2017-11-24 10:22:36 -0500