February 16, 2016


Identical Docker, Identical OS – Crashing on 50% of the Hardware

Gather round, li’l ones, and let old man Jon tell you a little story about the time he learned to never say never. ’Twas not terribly long ago when he was trying to spin up a fleet of Squid proxies. Being the hip DevOps guru he is, he stored the proxies safely inside Docker containers. Those containers were all pulled from the same single source, so they were identical (as one would expect from Docker). It also happens that these Docker containers were being spun up on a pair of brand new AWS EC2 instances. Same AMI, same bootstrap script, same software updates. Everything on every container should have been identical. However, when it came time to spin up the containers… 50% of them failed. W. T. F?

The error came from Squid, which crashed on startup. Of course, this being Docker, when Squid dies, the container dies. That’s expected behavior. But why was Squid dying?

Illegal instruction (core dumped)

That might seem like a vague error message, but for those who’ve been around a while it does point us in the right direction. Messages like “illegal instruction” typically indicate that the application asked the CPU to execute an instruction it doesn’t recognize or support. In the modern era of JavaScript, Ruby, and Python, that’s almost never seen. However, Squid is a bit more hardcore: it is written in C/C++ and compiled to native machine code, so the binary really can tell the CPU to do something “illegal”.
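For the curious, here is a minimal Python sketch of how that crash tends to surface in tooling. On Linux, "Illegal instruction" corresponds to the SIGILL signal. Python's `subprocess` reports a signal death as a negative return code, while shells (and, as far as I know, Docker) conventionally report it as 128 plus the signal number. The `describe_exit` helper is a hypothetical name made up for illustration, not part of any library.

```python
import signal


def describe_exit(code):
    """Translate a process or container exit code into a readable cause.

    Hypothetical helper for illustration only.
    """
    # subprocess reports death-by-signal as a negative number (-N);
    # shells conventionally report it as 128 + N.
    if code < 0:
        signum = -code
    elif code > 128:
        signum = code - 128
    else:
        return f"exited normally with status {code}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Not a signal number known on this platform.
        return f"killed by unknown signal {signum}"
    return f"killed by {name} (signal {signum})"


# 128 + SIGILL is the status a shell would typically show for this crash.
print(describe_exit(128 + signal.SIGILL))
```

So an exit status of 132 on a Linux container is a decent hint that the process hit an illegal instruction rather than, say, a bad config file.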

It was confusing as all hell, that’s for sure. I’d never seen a case where identical application stacks failed on only one of two brand new instances. However, astute readers will have deduced that “identical” actually means “identical software”, but not necessarily identical hardware. In fact, after only a little googling I found a Fedora bug report which was spot on for my troubles. The short version is that my particular version of Squid has a very particular bug in it. It also just so happens that AWS launched one instance on Intel Xeon Processor E5 v2 hardware (which was the crashing machine) and one instance on Intel Xeon Processor E5 v3 hardware (which worked).
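If you want to check whether two "identical" hosts really sit on the same silicon, one rough approach is to compare the CPU model and feature flags that Linux exposes in `/proc/cpuinfo`. The sketch below assumes you have already captured that file's text from each host. The function names and the two sample strings are made up for illustration (the flag lists are heavily abbreviated), and it makes no claim about which instruction the Squid bug actually tripped over.

```python
def parse_cpuinfo(text):
    """Return (model name, set of feature flags) from /proc/cpuinfo text."""
    model, flags = None, set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between CPU entries
        key = key.strip()
        # Only the first CPU entry is read; cores on one host normally match.
        if key == "model name" and model is None:
            model = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    return model, flags


def compare_hosts(text_a, text_b):
    """Report the CPU models and the flags present on only one host."""
    model_a, flags_a = parse_cpuinfo(text_a)
    model_b, flags_b = parse_cpuinfo(text_b)
    return {
        "models": (model_a, model_b),
        "only_a": sorted(flags_a - flags_b),
        "only_b": sorted(flags_b - flags_a),
    }


# Hypothetical, abbreviated samples; real files list far more flags.
HOST_A = "model name\t: Example Xeon E5 v2\nflags\t\t: fpu sse2 avx\n"
HOST_B = "model name\t: Example Xeon E5 v3\nflags\t\t: fpu sse2 avx avx2 bmi2\n"

for field, value in compare_hosts(HOST_A, HOST_B).items():
    print(field, value)
```

On a real host you would feed in `open("/proc/cpuinfo").read()` from each machine. The E5 v3 generation did add instruction set extensions such as AVX2 that the E5 v2 generation lacks, so a non-empty difference between two instances of the same type is exactly the sort of thing that can bite a natively compiled binary.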