Remote Mass Image Optimization with AWS CLI
“Remote Mass Image Optimization with AWS CLI”, now there’s a title! So what is it? Well it’s basically a term I came up with to describe a problem I recently encountered with a client’s website. For years their site allowed users to upload any image they wanted, with no restrictions on dimensions or file size.
Most problematically, these images were then displayed on public website pages where SEO was a factor! Needless to say, this made for a terrible user experience (and PageSpeed score): visitors had to download images that were multiple megabytes in size, which almost certainly hurt page ranking. Below, you’ll see the steps I took to carry out the mass image optimization using Amazon’s AWS CLI and ImageMagick.
Fix the SOURCE of the Problem
The first thing to do was to stop huge images from being uploaded to the server. There were two possible routes to go here:
1) add validation that checks the filesize and image dimensions, and if over a certain threshold, inform the user they can only upload images of a certain dimension and/or filesize.
2) still allow the user to upload anything they want, but then optimize and resize in postback processing.
Always looking to provide the best user experience, I went with option #2. The file upload postback routine was modified to resize and convert images if they were over a certain width and/or filesize. Also, if a PNG was still over a certain filesize after resizing, it was converted to JPG.
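The client’s actual postback code is specific to their app, but the core idea can be sketched in shell with ImageMagick. The thresholds, the `needs_resize` helper, and the `process_upload` name below are all illustrative, not the real values:

```shell
# Illustrative thresholds -- pick values that fit your site
MAX_WIDTH=600
MAX_BYTES=500000   # ~500 KB

# Decide whether an upload needs processing.
# $1 = image width in px, $2 = file size in bytes
needs_resize() {
  [ "$1" -gt "$MAX_WIDTH" ] || [ "$2" -gt "$MAX_BYTES" ]
}

# Hypothetical post-upload handler (assumes ImageMagick is installed)
process_upload() {
  local f="$1"
  local width size
  width=$(identify -format '%w' "$f")   # image width in pixels
  size=$(stat -c '%s' "$f")             # file size in bytes
  if needs_resize "$width" "$size"; then
    mogrify -resize "${MAX_WIDTH}x>" "$f"
  fi
  # If a PNG is still too large after resizing, convert it to JPG
  if [ "${f##*.}" = "png" ] && [ "$(stat -c '%s' "$f")" -gt "$MAX_BYTES" ]; then
    mogrify -background white -flatten -format jpg "$f"
  fi
}
```

The decision logic is separated from the ImageMagick calls so the thresholds are easy to test and tune on their own.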
Fix the SYMPTOM of the Problem
With the SOURCE of the problem out of the way, I then had to deal with the SYMPTOM of the problem: the thousands of images that had been uploaded over the past years.
There were images that were over 4000 pixels wide and up to 7 megabytes in size. It was bad. To further complicate things, all these images were stored remotely in an AWS S3 (Amazon Simple Storage Service) bucket.
My research revealed that you cannot directly manipulate images in an S3 bucket. You must download them locally, make changes, and then re-upload. I found the best way to do this was with the AWS CLI tools (version 2), which let you copy and sync entire bucket contents.
Install Amazon AWS CLI version 2
To install AWS CLI version 2, go to this page and install as noted for your OS.
https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html
Since I use linux, all my command examples will be for that environment.
Step 1: connect to your bucket
> aws configure
This will prompt you for your Access Key ID and Secret Access Key. It will also prompt for a Default region name and Default output format; these are optional, but may be required for your situation.
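If you’d rather skip the interactive prompts (handy when scripting), the same values can be set directly with `aws configure set`. The key values and region below are placeholders, not real credentials:

```shell
aws configure set aws_access_key_id YOUR_ACCESS_KEY_ID
aws configure set aws_secret_access_key YOUR_SECRET_ACCESS_KEY
aws configure set region us-east-1
aws configure set output json
```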
Step 2: see if the connection is working by reading the contents of your bucket:
> aws s3 ls
You should see a list of your buckets. For a general CLI reference, go here:
https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3-commands.html
Download (sync) the Remote Bucket
To make an exact copy of the remote S3 bucket on your local machine, use the “sync” command.
This command uses the format:
> aws <protocol> sync <protocol>://<bucket name> <local path>
Let’s call the s3 bucket “remote_bucket” and our local copy “local_bucket”, so the command would be:
> aws s3 sync s3://remote_bucket /home/me/local_bucket
Note:
Once you have your local copy, make a backup of the original bucket files in case you need to do a complete restore.
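If you have the disk space, a plain local copy is the simplest backup. The helper function and paths below are just an example of one way to do it:

```shell
# Snapshot the freshly synced files before modifying anything.
# backup_bucket <source dir> <backup dir>
backup_bucket() {
  cp -a "$1" "$2"   # -a preserves permissions and timestamps
}

# e.g.: backup_bucket /home/me/local_bucket /home/me/local_bucket_backup
```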
Convert, Resize, and Optimize Images
Now that you have a copy of the bucket on your local machine, image resizing and optimization can take place. For this, my research led to ImageMagick, which seems to be the de facto open source standard for just about any image manipulation imaginable.
Install ImageMagick on your local machine:
https://imagemagick.org/script/download.php
(for linux: most package managers should have it, e.g. yum, apt, etc.)
The next set of commands will take place at the top level of your local bucket copy.
For example, if you ran:
> aws s3 sync s3://remote_bucket /home/me/local_bucket
You would:
> cd /home/me/local_bucket
and run all commands from there.
And I will be referring to the local copy of the bucket files as ‘local_bucket’ from this point forward.
All of the commands shown below run recursively by using the “find” command with a regex for filtering. The backslashes escape characters that have other uses on the command line. The output of “find” is then piped to ImageMagick’s “mogrify” command. “mogrify” overwrites the source file with its output, which is what we want for the JPGs. PNGs, however, are written to new JPG files, so the original PNGs are not overwritten. You’ll have to clean up the PNGs afterwards according to your circumstances.
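Before running mogrify for real, it’s worth previewing exactly which files the find filter will touch. This dry run needs no ImageMagick at all:

```shell
# List every file the PNG filter would hand to mogrify, with sizes
find . -regextype posix-egrep -regex ".*\.(png|PNG)$" -print0 | xargs -0 -r ls -lh
```

The `-r` flag tells xargs to do nothing if find matches no files.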
The first thing I did was convert all PNG to JPG. The majority of the PNGs were of photos taken on digital cameras, but you may have some logo images. As mentioned above, the original PNGs won’t be touched and will remain in the bucket. You may need to write a script to selectively remove converted PNGs, or original PNGs, depending on your circumstances.
Convert PNG to JPG
> find . -regextype posix-egrep -regex ".*\.(png|PNG)$" -print0 | xargs -0 mogrify -background white -flatten -format jpg -auto-orient
Note:
PNG supports transparency, and JPG does not. In the example above the transparency is converted to white for the JPG version.
Resize & Optimize JPGs
> find . -regextype posix-egrep -regex ".*\.(jpg|JPG|jpeg|JPEG)$" -print0 | xargs -0 mogrify -filter Triangle -define filter:support=2 -thumbnail 600x\> -unsharp 0.25x0.08+8.3+0.045 -dither None -posterize 136 -quality 90 -define jpeg:fancy-upsampling=off -interlace none -colorspace sRGB -auto-orient
Notes:
-thumbnail 600x> is the parameter that resizes the image, but only if the image width is greater than 600px. Please refer to the ImageMagick documentation for resizing options.
There are a lot of parameters in the command above, most of which are beyond the scope of this overview, but there is plenty of info on everything ImageMagick has to offer. Do a few test runs, examine the output, and adjust the parameters to your needs.
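One quick way to sanity-check a test run is to list the largest files remaining in the local copy; anything still in the multi-megabyte range deserves a second look. This uses GNU find’s -printf, which is standard on linux:

```shell
# Show the ten largest files under the current directory (size in bytes first)
find . -type f -printf '%s %p\n' | sort -rn | head -10
```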
Sync Back to S3 Bucket
For this operation, you must be one level above the local_bucket. So using our example, with our local path being:
/home/me/local_bucket
we would:
> cd /home/me
Then, to sync everything back to the S3 bucket:
> aws s3 sync local_bucket s3://remote_bucket
Additional Considerations
Once the files have been updated, you must then consider any PNGs that were converted. Any web content referring to the old PNGs must be updated to use the new JPG versions of the images. In my case, webpage content was stored in the database. REGEX is always your friend in these types of situations. We know the files of interest are stored in an S3 bucket, and we know they end in either .png or .PNG. So a typical image SRC might look like this:
https://s3.amazonaws.com/remote_bucket/upload/123/images/caterpillar.png
We only want to change image SRCs that start with
“s3.amazonaws.com/remote_bucket/”
and end with
“.png” or “.PNG”.
For maximum control, flexibility, and logging, I wrote a quick PHP routine to do this.
// find all webpage recs with s3 .png imgs
// ex. src="//s3.amazonaws.com/remote_bucket/upload/123/images/caterpillar.png"
$mysql_regex = 's3.amazonaws.com/remote_bucket/upload/[0-9]{1,3}/images/[^"]+\.(png|PNG)';
$limit = 10; // set as desired to view initial results and for testing

// select pages whose content references an s3 .png
$sql = <<<SQL
SELECT wp.id, wp.name, wp.content
FROM webpages wp
WHERE wp.content REGEXP '{$mysql_regex}'
LIMIT {$limit}
SQL;

$recs = $db->query($sql);
echo "REC COUNT: " . count($recs) . "\n";

$regex = '/(s3\.amazonaws\.com\/remote_bucket\/upload\/[0-9]{1,3}\/images\/.+?\.)(png)/i';

foreach ($recs as $r) {
    $content = $r['wp']['content'];
    $wp_id   = $r['wp']['id'];

    #echo "==========================================\nCONTENT BEFORE\n".$content."\n\n";
    #preg_match_all($regex, $content, $pma_matches);
    #echo "==========================================\nPREG_MATCH_ALL\n".print_r($pma_matches,1)."\n\n";

    $content = preg_replace($regex, '$1jpg', $content);

    #echo "\n\n\n==========================================\nCONTENT AFTER\n".$content."\n\n";

    $db->query("UPDATE webpages SET content = :content WHERE id = :wp_id",
               ['content' => $content, 'wp_id' => $wp_id]);

    echo "Updated page: id={$wp_id}, name={$r['wp']['name']}\n";
}

echo "Op Complete on " . date('m/d/Y @ h:iA') . "\n";
A closer look at the PHP REGEX:
$regex = '/(s3\.amazonaws\.com\/remote_bucket\/upload\/[0-9]{1,3}\/images\/.+?\.)(png)/i';
The main thing to note here, and it’s incredibly important, is the question mark. This makes the match non-greedy. Imagine if you had this line with two images:
<img src="http://s3.amazonaws.com/remote_bucket/upload/123/images/caterpillar.png"> <img src="http://s3.amazonaws.com/remote_bucket/upload/123/images/beetle.png">
Without the question mark, the regex would match everything between the first SRC’s “images/” and the second SRC’s “.png”. You’d wind up with only the last image tag’s SRC being updated, e.g.:
<img src="http://s3.amazonaws.com/remote_bucket/upload/123/images/caterpillar.png"> <img src="http://s3.amazonaws.com/remote_bucket/upload/123/images/beetle.jpg">
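You can see the greedy vs. non-greedy difference right on the command line with grep’s PCRE mode (-P, available in GNU grep). This is just an illustration of the matching behavior, not part of the PHP routine:

```shell
line='<img src="http://s3.amazonaws.com/remote_bucket/upload/123/images/caterpillar.png"> <img src="http://s3.amazonaws.com/remote_bucket/upload/123/images/beetle.png">'

# Non-greedy (.+?): two separate matches, one per image
printf '%s\n' "$line" | grep -oP 'images/.+?\.png'

# Greedy (.+): one match that swallows everything between the first
# "images/" and the last ".png"
printf '%s\n' "$line" | grep -oP 'images/.+\.png'
```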
Conclusion
This should set you on the right path for updating all your S3 bucket images, or even images in local storage. ImageMagick really is a great open source tool, and can modify images in almost any way conceivable. You can also install it on your server and enable it for use within PHP.