Amazon S3 Outage: Another Opinion Piece


So Amazon S3 had some “issues” last week and it’s taken me a few days to put my thoughts together around this. Hopefully I’ve caught the tail-end of the still-interested-enough-to-find-this-blog-valuable period.

Trying to make the best of a bad situation, the good news, in my opinion, is that this shows that infrastructure people still have a place in the automated, cloudy world of the future. At least that’s something, right?

What happened:

You can read the detailed explanation in Amazon’s summary here.

In a nutshell:

  • There was a small problem
  • They tried to fix it
  • Things went bad for a relatively short time
  • They fixed it

What happened during:

The internet lost its mind. Or more accurately, some parts of the internet went down, some of them in extremely ironic ways.


Initial thoughts

The reaction to this event is amusing and it drives home the point that infrastructure engineers are as critical as ever, if not even more important considering the complete lack of architecture that seems to have gone into the majority of these “applications”.

First let’s talk about availability. Looking at the Amazon AWS S3 SLA, available here, it looks like they did fall below their 99.9% SLA for availability. A quick check at https://uptime.is/ shows that for the monthly period, they were aiming for no more than 43m 49.7s of outage. The outage ran somewhere around 6-8 hours, so clearly they missed that target. Looking at the S3 SLA page, it looks like customers might be eligible for 25% service credits. I’ll let you guys work that out with AWS.
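If you want to sanity-check that downtime budget yourself, here’s a rough sketch of the arithmetic. The month length is an assumption on my part (roughly 30.4 days on average, which appears to be the convention uptime.is uses):

    # Back-of-the-envelope downtime budget for a given monthly SLA percentage.
    # Assumes an average month of ~30.44 days.

    def monthly_downtime_budget(sla_percent: float, days_in_month: float = 30.44) -> float:
        """Return the allowed downtime per month, in minutes."""
        total_minutes = days_in_month * 24 * 60
        return total_minutes * (1 - sla_percent / 100)

    print(f"99.9% allows ~{monthly_downtime_budget(99.9):.1f} minutes of downtime per month")
    # -> ~43.8 minutes, versus the roughly 6-8 hours S3 was unavailable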

Don’t “JUST CLICK NEXT”

One of the first things that struck me as funny here was the fact that it was the US-EAST-1 Region which was affected. US-EAST-1 is the default region for most of the AWS services. You have to intentionally select another region if you want your service to be hosted somewhere else. But because it’s easier to just click next, it seems that the majority of people just clicked past that part and didn’t think about where they were actually hosting their services, or about the implications of hosting everything in the same region and probably the same availability zone. For more on this topic, take a look here.
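If you’re scripting against AWS rather than clicking through the console, the region choice is just as easy to make explicit. A minimal sketch using boto3; the bucket name and region here are placeholders, not a recommendation:

    import boto3

    # Be explicit about the region instead of silently landing in us-east-1.
    s3 = boto3.client("s3", region_name="us-west-2")

    # Outside of us-east-1, S3 wants the region spelled out as a LocationConstraint.
    # "my-example-bucket" is a hypothetical bucket name.
    s3.create_bucket(
        Bucket="my-example-bucket",
        CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
    )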

There’s been a lot of criticism of infrastructure people, now that anyone with a credit card can go to Amazon, sign up for an AWS account, and start consuming their infrastructure. This gets thrown around like it’s actually a good thing, right?

Well, this is exactly what happens when “anyone” does that: you end up with all your eggs in one basket.

“Design your infrastructure for the four S’s: Stability, Scalability, Security, and Stupidity” — Jeff Kabel

Again, this is not an issue with AWS, or with any cloud provider’s offerings. This is an issue with people who think that infrastructure and architecture don’t matter and can just be “automated” away. Automation is important, but it’s there so that your infrastructure people can free up some time from mind-numbing tasks to help you properly architect the infra components your applications rely upon.

Why oh Why oh Why

Why anyone would architect their revenue-generating system on an infrastructure that is only guaranteed to 99.9% is beyond me. The right answer, at least from an infrastructure engineer’s point of view, is obvious, right?

You would use redundant architecture to raise the overall resilience of the application, relying on the fact that it’s highly unlikely you’ll lose the different redundant pieces at the same time. Put simply, what are the chances that two different systems, each guaranteed to a 99.9% SLA, go down at the exact same time?

Doing some really basic probability calculations, and assuming the outages are independent events, we multiply the probability of non-SLA’d downtime (0.1%, or 0.001) in system 1 by the same metric in system 2 and we get:

0.001 * 0.001 = 0.000001 probability of both systems going down at the same time.

Or, put another way, 99.9999% uptime. Pretty great, right?

Note: I’m not an availability calculation expert, so if I’ve messed up a basic assumption here, someone please feel free to correct me. Always looking to learn!
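To make that arithmetic concrete, here’s a small sketch of the calculation. It leans on the same independence assumption, which, as a commenter points out below, is optimistic when both systems share hardware designs, software, and operational processes:

    # Combined availability of N redundant systems, assuming their outages
    # are truly independent events.

    def combined_availability(*availabilities: float) -> float:
        """Probability that at least one system is up at any given moment."""
        p_all_down = 1.0
        for a in availabilities:
            p_all_down *= (1 - a)
        return 1 - p_all_down

    print(combined_availability(0.999, 0.999))  # -> 0.999999, i.e. 99.9999% uptime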

So application people made the mistake of just signing responsibility for their application uptime over to “the cloud”, and most of them probably didn’t even read the SLA for the S3 service or sit down to think it through.

Really? We had people armed with an IDE and a credit card move our apps to “the cloud” and wonder why things failed.

What could they have done?

There’s a million ways to answer this I’m sure, but let’s just look at what was available within the AWS list of service offerings.

CloudFront is AWS’s content delivery network. It’s extremely easy to use, easy to set up, and it takes care of automatically distributing your content across multiple AWS Regions and Availability Zones.

Route 53 is AWS’s DNS service that will allow you to perform health checks and only direct DNS queries to resources which are “healthy” or actively available.
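As a rough illustration of the Route 53 approach, here’s a minimal failover sketch using boto3. The hosted zone ID, domain name, and endpoint addresses are all hypothetical placeholders, and a real secondary would normally get its own health check or point at a static fallback:

    import uuid
    import boto3

    route53 = boto3.client("route53")

    # Health check against the primary endpoint (placeholder address and path).
    health = route53.create_health_check(
        CallerReference=str(uuid.uuid4()),
        HealthCheckConfig={
            "IPAddress": "203.0.113.10",
            "Port": 443,
            "Type": "HTTPS",
            "ResourcePath": "/healthz",
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )

    # Primary/secondary failover records: queries go to the primary while its
    # health check passes, and shift to the secondary when it doesn't.
    route53.change_resource_record_sets(
        HostedZoneId="Z_EXAMPLE_ZONE_ID",  # placeholder hosted zone
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "app.example.com", "Type": "A", "TTL": 60,
                "SetIdentifier": "primary", "Failover": "PRIMARY",
                "HealthCheckId": health["HealthCheck"]["Id"],
                "ResourceRecords": [{"Value": "203.0.113.10"}],
            }},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "app.example.com", "Type": "A", "TTL": 60,
                "SetIdentifier": "secondary", "Failover": "SECONDARY",
                "ResourceRecords": [{"Value": "198.51.100.20"}],
            }},
        ]},
    )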

There are probably a lot of other options as well, both within AWS and without, but my point is that the applications that went down most likely didn’t bother. Or they were denied the budget to properly architect resiliency into their system.

On the bright side, the latter just had a budget opening event.

Look who did it right

Unsurprisingly, there were companies who weathered the S3 storm like nothing happened. In fact, I was able to sit and binge watch Netflix while the rest of the internet was melting down. Yes, it looks like it cost them 25% more, but then again, I had no problems with season 4 of Big Bang Theory at all last week, so I’m a happy customer.

Companies still like happy customers, don’t they?

The Cloud is still a good thing

I’m hoping that no one reads this as an anti-cloud post. There’s enough anti-cloud rhetoric happening right now, which I suppose is inevitable considering last week’s highly visible outage, and I don’t want to add to that.

What I do want is for people who read this to spend a little bit of time thinking about their applications and the infrastructure that supports them. This type of thing happens in enterprise environments every day. Systems die. Hardware fails. Get over it, and design your architecture to treat these failures as a foregone conclusion. It IS going to happen; it’s just a matter of when. So shouldn’t we design up front around that?

Alternately, we could also choose to take the risk for those services that don’t generate revenue for the business. If it’s not making you money, maybe you don’t want to pay for it to be resilient. That’s OK too. Just make an informed decision.

For the record, I’m a network engineer well versed in the arcane discipline of plumbing packets. Cloud and application architectures are pretty far away from the land of BGP peering and routing tables where I spend my days. But for the low low price of $15 and a bit of time on Udemy, I was able to dig into AWS and build some skills that let me look at last week’s outage with a much more informed perspective. To all my infrastructure engineer peeps: I highly encourage you to take the time, learn a bit, and get involved in these conversations at your companies. Hoping we can all raise the bar together.

Comments, questions?

@netmanchris

2 thoughts on “Amazon S3 Outage: Another Opinion Piece”

  1. Hi Chris, well analyzed and well said. A few thoughts:

    First, you met me as a network person. Before that I was in storage, and before that in servers, and once upon a time I actually wrote application code. So I look at each of these situations from multiple points of view. The Amazon outage a couple of years ago was root caused to a network partition within their data center which caused lots of storage nodes to think they had the only remaining copy of certain data, triggering a storm of high priority work to make copies. This Amazon outage came from a drastic capacity reduction in the index and placement servers. In both cases, the underlying storage system (if you can think of hundreds of millions of dollars worth of equipment in a warehouse size data center as a single storage system) was not architected to limit the blast radius of a single failure or fat-finger mistake. Storage’s job is first to be sure it always has the data (and the data is always correct), and Amazon achieved that in both failures. Storage’s second job is to always have that data available to the application (hiding failures as best it can using RAID or other data redundancy as well as multipathing), and failing that to keep as many applications up as it can (isolating failures so that the blast radius of a failed disk or port, or more importantly a storage system software crash or corruption of a key table, affects the minimum number of ports and/or objects/files/LUNs). As a storage system architecture, Amazon has recognized it needs to work harder on the blast radius of the kinds of problems storage system people worked on 20 years ago.

    Second, decades ago a very experienced IBM mainframe storage veteran gave me a way to think about not just log files but remote replication: assume the server goes insane as it’s going down. The key to recovery is analyzing the end of the log and finding the last point at which it is sane, and then restoring the application’s data at that point. This is really easy for a person who knows the data structures to do, and really hard to program (hence my enormous respect for the people who write that recovery code for Oracle). What this means is that if the application isn’t involved in vetting the data being replicated from Amazon here to Amazon there, it would be very easy for those last moments of insanity to corrupt the disaster recovery copy. This is *not* just an infrastructure problem!

    Third, enough of the failure modes are common (same hardware design, same software, same processes used for operations) that failures in two separate 99.9% Amazon sites are not entirely independent. At least the uplinks go through different Internet backbone routers 🙂 although it wouldn’t surprise me if there were an anycast DDOS which could take down network connectivity to multiple Amazon sites simultaneously. (Not enough disclosed here to know for sure.)

    -steve
    @FStevenChalmers

    • As always, I sit down to read a Steve Chalmers post. 🙂 The comments are much appreciated and I couldn’t agree with you more that this is *not* an infrastructure problem. Or I guess I mean that this is not *just* an infrastructure problem, this is an architecture problem where infrastructure is a core piece of the puzzle that many organizations seem to be abdicating responsibility for because that’s “handled by the cloud”.

      Hopefully, people will learn from this, but unfortunately, I’m starting to lose faith in people’s ability, as a generalization, to learn from past mistakes. Some will learn, evolve, and move on. Others will just be destined to repeat the same thing over and over and wonder why their website is still down. 🙂

      @netmanchris
