Exporting Wordpress Posts To Markdown


I’ve been running my technology blog on top of Wordpress for the past 12 years. It was a great choice when I started, but the core product has morphed into more than I need. Combine that with a constant stream of security vulnerabilities, and last month I decided it was time to move to a static website generation tool. Like any new venture, I sat down one Saturday morning and jotted down the requirements for my new website generator.

I experimented with Jekyll, Pelican, and Hugo, and after several weeks of testing I fell in love with Hugo. Not only was it super easy to install (it’s a single binary written in Go), but I had the bulk of my website converted after watching the Hugo video series from Giraffe Academy.

The biggest challenge I faced was getting all of my old posts (1200+) out of my existing Wordpress installation. Pelican comes with the pelican-import utility, which can take a Wordpress XML export file and convert each post to markdown. Even though I decided to use Hugo to create my content, I figured I would use the best tool for the job to perform the conversion:

$ pelican-import -m markdown --wpfile -o posts blogomatty.xml

In the example above I’m passing a file that I exported through the Wordpress UI and generating one markdown file in the posts directory for each blog post. The output files had the following structure:

Title: Real world uses for OpenSSL
Date: 2005-02-13 23:42
Author: admin
Category: Articles, Presentations and Certifications
Slug: real-world-uses-for-openssl
Status: published

If you are interested in learning more about all the cool things you can
do with OpenSSL, you might be interested in my article [Real world uses
for OpenSSL](/articles/realworldssl.html). The article covers
encryption, decryption, digital signatures, and provides an overview of 
[ssl-site-check](/code) and [ssl-cert-check](/code).

These files didn’t work correctly out of the gate, since Hugo requires you to encapsulate the front matter (the metadata describing the post) with “---” for YAML or “+++” for TOML formatting. To add the necessary delimiters I threw together a bit of shell:

#!/bin/sh

# Wrap the Pelican metadata block in "---" delimiters so Hugo treats it as
# YAML front matter. Posts are read from posts_to_process/ and the converted
# copies are written to posts_processed/.
for post in `ls posts_to_process`; do
   echo "Processing post ${post}"
   # Open the front matter block.
   echo "---" > "posts_processed/${post}.md"
   header=0
   cat "posts_to_process/${post}" | while read -r line; do
       if echo "$line" | egrep -i "^Status:" > /dev/null; then
            # Status is the last metadata field Pelican writes out. Drop it
            # and close the front matter block.
            echo "---" >> "posts_processed/${post}.md"
            header=1
       elif [ ${header} -eq 1 ]; then
           # Everything after the metadata is the post body; copy it verbatim.
           echo "$line" >> "posts_processed/${post}.md"
       elif echo "$line" | egrep -i "^Title:" > /dev/null; then
            # Strip the "Title:" prefix, trim leading whitespace, escape any
            # embedded quotes, and emit a quoted "title:" field.
            echo "$line" | awk -F':' '{print $2$3}' | sed 's/^ *//g' | sed 's/"/\\"/g' | \
                 awk '{ print "title:", "\""$0"\"" }' >> "posts_processed/${post}.md"
       else
           # Remaining metadata lines (Date, Author, Category, Slug) pass through.
           echo "$line" >> "posts_processed/${post}.md"
       fi
   done
done
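
Run against the example post above, the script should produce front matter along these lines (a sketch of the expected output; the Date and Author fields still need the cleanup described below):

---
title: "Real world uses for OpenSSL"
Date: 2005-02-13 23:42
Author: admin
Category: Articles, Presentations and Certifications
Slug: real-world-uses-for-openssl
---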

This takes the existing post and adds a “---” delimiter before and after the front matter. It also escapes quotes and handles titles that contain a single “:”. My posts still had issues with the date format, and the author field wasn’t consistent. To clean up the date I used my good buddy sed:

$ sed -i 's/Date: \(.*\) \(.*\)/Date: \1T\2:00-04:00/g' posts_processed/*.md
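
For the example post above, this rewrites the Date line from the Pelican export as follows:

Before: Date: 2005-02-13 23:42
After:  Date: 2005-02-13T23:42:00-04:00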

To fix the issue with the author I once again turned to sed:

$ sed -i 's/^[Aa]uthor.*/author: matty/' posts_processed/*.md
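
This matches both “Author:” and “author:” lines and normalizes them, so the example post’s author field becomes:

Before: Author: admin
After:  author: matty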

I had to create a bunch of additional hacks to work around content consistency issues (NB: content consistency is my biggest takeaway from this project), but the end product is a blog that runs from statically generated content. In a future post I will dive into Hugo and the gotchas I encountered while converting my site. It was a painful process, but luckily the worst is behind me. Now I just need to finish automating a couple of manual processes and blogging will be fun again.

This article was posted by Matty on 2017-11-24 10:22:36 -0500