The Internet of Broken Things

Posted by · February 28, 2017 12:09 pm

At around 12:30 p.m. EST on February 28th, Amazon Web Services’ S3 storage service in the Northern Virginia region (the company’s largest) began experiencing “increased error rates.” And the whole world cried out in despair. S3, which is one of Amazon’s flagship offerings and historically the most stable, hosts 148,213 websites. Amazon itself has services that rely on S3. So far the outage has created a ripple effect across the web.

Case in point: isitdownrightnow.com, a popular site for monitoring which sites are operational and which are not, is down right now.

Hours later, S3 is still down and the internet is still panicking. That’s because S3 backs many widely used services needed for daily tasks, such as Quora, Imgur, Slack, and Atlassian platforms like Jira, HipChat, and Trello. Even popular entertainment sites like Netflix, which famously migrated exclusively to AWS this time last year, are feeling the pain. Currently, most of these services are experiencing errors, delays, or complete processing failures, causing a massive halt in productivity across the global workforce. (Which means workers have more time to run to Twitter and discuss the issue with entertaining memes.)

[UPDATE] As of 2:08 p.m. PST on February 28th, the latest update from Amazon Web Services stated, “As of 1:49 PM PST, we are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally.”

What It Means for Experts Exchange

Experts Exchange uses a hybrid cloud storage system, with server space spread across multiple AWS regions, leaving most of our site unaffected by the outage. The portion that is housed in Northern Virginia, however, is the system that stores user uploads of files, photos, and videos. These systems on our site are currently down. 

[UPDATE] All systems within Experts Exchange are fully functional as of 1:20 p.m. PST on February 28th.

A word of advice from our very own Phil Phillips, director of DevOps, on how to avoid catastrophe even if the system fails: “If you’re using AWS, don’t rely on a single region.”

For continuing updates: The AWS Service Health Dashboard is now operational, and you can track the situation here: https://status.aws.amazon.com/