Automating Upload for Jekyll Website to Amazon S3

March 26, 2013

As a follow-up to my earlier post on how I’m hosting this website on Amazon S3 and generating it with Jekyll, I’d like to describe how I’m uploading the contents to S3. I ended up writing a bash script that “publishes” the site and automates syncing with my S3 bucket.

I’m currently holding all of my Jekyll files (the raw files to generate this website) in a Dropbox folder. This way, I have access to the files anywhere I go, and I can make incremental updates to posts or write new ones.

When you run jekyll to compile your website, the entire site is re-generated from scratch. At first, I would just compile, then upload all the files directly to S3 (via s3cmd*). There are a couple of disadvantages to doing this. First, S3 charges for each request made (currently, the pricing is $0.01 for every 1,000 PUT requests). My website is currently around 200 files. If the website gets bigger or I want to upload more often, those request charges will start to add up. The second disadvantage is that stale files can get left in the bucket when my Dropbox folder and the S3 bucket fall out of sync. For instance, if I delete an image file in Dropbox, that file would continue to live in S3.

* s3cmd is a great command line tool for working with your Amazon S3 buckets. It’s easy to set up too - I followed this quick guide from populationjim.com.
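For reference, the original “just upload everything” workflow was essentially two commands, along these lines (with bucketname standing in for my actual bucket; the exact trailing-slash behavior of put -r can vary between s3cmd versions):

jekyll tagging
s3cmd put -r tagging/ s3://bucketname/

Every run re-uploads all ~200 files, which is where the PUT request charges come from.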

I considered using the syncing capability in s3cmd. This removes the issue of re-uploading files that haven’t changed, since s3cmd figures that out for you. However, I would still have the problem of a file lingering in S3 after being deleted from Dropbox (I want to make sure that Dropbox is the “golden” source for the website, with no files existing in one place and not the other).
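For comparison, the sync-based approach I considered would be roughly a one-liner (bucketname is a placeholder again):

s3cmd sync --exclude '.DS_Store' tagging/ s3://bucketname/

sync can also be told to delete remote files that no longer exist locally, but it does that silently; I’d rather the publish step stop and tell me when the two copies disagree.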

I came up with a pretty simple solution. I wrote a bash script that compiles the Jekyll site, checks to see which files are new since the last publish (and hence need to be uploaded to S3), and errors out if it finds a file that was previously published but is missing from the current compile. If all is successful (i.e. it didn’t error out), I make a copy of the compiled site and treat that as the “published” version to be checked against next time.

The idea here is that in my typical use case for publishing, not many files change. Typically, it’s just the new blog post, some image files, and then summary pages (such as xingdig.com/blog/ or the topics pages) that need to be uploaded.

My bash script is shown below. At a high level, I first compile using Jekyll, then diff the newly compiled folder against the old published folder, then go line by line through the diff output to determine which files need to be uploaded. Once I find such a file, I run s3cmd to put it in S3. Finally, I replace the “published” folder with the one I just compiled, so that next time I run the script, it’ll be comparing against this compile.
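For context, diff -q only ever prints two kinds of lines, which is what the loop below keys off of (the file names here are made up):

Files tagging/index.html and tagging_published/index.html differ
Only in tagging/images: new-photo.jpg
Only in tagging_published/images: old-photo.jpg

A “Files ... differ” line means an existing page changed and needs re-uploading. An “Only in” line means a file exists on just one side: if that side is tagging, it’s a new file to upload; if it’s tagging_published, it’s the error case where something was published before but is missing from the fresh compile.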

#!/bin/bash

# make sure to add "tagging_published" and "publish.sh" to the list of excludes in _config.yml
folder1=tagging
folder2=tagging_published
homedir=/Users/documents/website
cd "$homedir" || exit 1 # bail out if the website directory is missing

# first clean up the existing tagging folder and re-compile
rm -rf "$folder1"
jekyll "$folder1"

# do diff and exclude .DS_Store files, which are hidden metadata files in Mac OS
diff -B -b -r -q -x .DS_Store "$folder1" "$folder2" > diff_output

# read the diff output line by line and figure out which files need to be uploaded
while read -r f; do
	set -- $f # split the line into whitespace-separated fields (assumes no spaces in file names)
	if [ "$1" = "Only" ];
	then
		if [[ $3 =~ ^$folder2 ]]; # file is in published version but not in current version
		then
			echo "Error! Found file in $folder2 with no match in $folder1"
			echo "$f"
			exit 1
		else # must be a file that is new
			temp="${3%?}"/$4 # drop the trailing colon from the directory, then append the file name
			s3cmd put -r "$temp" "s3://bucketname/${temp:8}" # strip the leading "tagging/" from the S3 key
		fi
	else # a "Files ... differ" line: $2 is the file that changed
		s3cmd put -r "$2" "s3://bucketname/${2:8}"
	fi
done < diff_output # read from the diff output generated above

rm -f diff_output

# delete tagging_published and replace with latest published version
rm -rf "$folder2"
cp -r "$folder1"/ "$folder2"
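
With the script saved as publish.sh alongside the Jekyll source (and made executable once with chmod +x publish.sh), publishing comes down to running ./publish.sh; the script handles the cd into the website directory itself.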
Topics: Technology, Technology:Website, Technology:AWS
