Fastly outage explained: How one customer broke Amazon, Reddit and the wider internet


The internet was brought to its knees on Tuesday, with 503 errors on news outlets and websites. Now we know exactly what happened.

Not Fastly’s proudest moment.

Tuesday will be remembered as the day the internet broke, before making a speedy recovery. Early this morning, websites including Amazon, Reddit, Spotify, eBay, Twitch, Pinterest and, unfortunately, Nerdshala went offline due to a major outage at a service called Fastly. Everywhere you looked, there were 503 errors, and people were complaining that they couldn’t access major services and news outlets. All of this demonstrates how much of the internet depends on this relatively little-known cloud computing service.

After investigating what happened, Fastly published a blog post explaining exactly what went wrong, and it turned out that the entire incident was triggered by just one unnamed Fastly customer.

In mid-May, Fastly released a software deployment that contained a bug that, if triggered in specific circumstances, could take down large parts of its network. The bug lay dormant until June 8, when a Fastly customer inadvertently triggered the bug during a “valid configuration change”, which caused 85% of the company’s network to return errors.

“We detected the disruption within 1 minute, then identified and isolated the cause, and disabled the configuration,” said Nick Rockwell, Fastly’s senior vice president of engineering and infrastructure, in a blog post. “Within 49 minutes, 95% of our network was operating normally. This outage was widespread and severe, and we are truly sorry for the impact to our customers and everyone who depends on them.”

What happened during the Fastly outage?

At around 2:58 a.m. PT, Fastly’s status page posted a notice saying, “We are currently investigating the potential impact on performance with our CDN [content delivery network] services.” Soon after, reports surfaced on Twitter of major news publications, including the BBC, CNN and The New York Times, going offline. Twitter itself was still running, although the server hosting its emoji went down, resulting in some strange-looking tweets.

Rather than isolated incidents affecting different sites, this turned out to be a single massive outage that brought the internet to its knees. Worldwide, people were receiving 503 error messages when trying to access sites, including certain critical services such as the UK government’s gov.uk web property.

About an hour later, at 3:44 a.m. PT (6:44 a.m. ET, the start of the US East Coast weekday, and coming up to noon in the UK), Fastly updated its status page again to say that the issue had been identified and a fix was being implemented. At 4:10 a.m. PT, the company tweeted: “We have identified a service configuration that caused disruption to our POPs globally and have disabled that configuration. Our global network is coming back online.”

A Fastly spokesperson sent the same message to Nerdshala as a comment.

What is Fastly?

Fastly is a cloud computing services provider that has been based in San Francisco since 2011. In 2017, it launched an edge cloud platform designed to bring websites closer to the people who use them. Effectively, this means that if you are accessing a website hosted in another country, Fastly will store some of that website closer to you, so there is no need to waste bandwidth fetching all of the website’s content from far away every time you need it.

This makes for faster website load times, and optimizes images, videos, and other high-payload content to show up quickly and easily when you visit a web page. Among the claims on the company’s website, it says it created 50% faster loading pages on BuzzFeed and allowed the New York Times to simultaneously handle 2 million readers on election night. Edge computing also performs important cybersecurity functions, protecting sites from DDoS attacks and bots, as well as providing a web application firewall.

Because Fastly sits between the back-end web server and the front-end internet as we see it, any error on its part can cause entire websites to become unavailable. The localized nature of the edge cloud platform also means that errors do not necessarily affect all regions equally at the same time (although people around the world reported experiencing problems on Tuesday).
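To make the idea concrete, here is a minimal sketch of an edge node that keeps copies of pages close to the reader and only goes back to the origin server when its copy is missing or stale. It is a toy model, not Fastly’s actual software: the class name `EdgeCache`, the `healthy` flag and the time-to-live value are all hypothetical and chosen purely for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class EdgeCache:
    """Toy edge node (hypothetical, not Fastly's implementation)."""

    def __init__(self, fetch_from_origin: Callable[[str], str],
                 ttl_seconds: float = 60.0) -> None:
        self._fetch = fetch_from_origin      # slow, far-away origin server
        self._ttl = ttl_seconds              # how long a cached copy stays fresh
        self._store: Dict[str, Tuple[float, str]] = {}
        self.healthy = True                  # assumption: models an edge-side fault
        self.origin_fetches = 0              # counts trips to the origin

    def get(self, path: str, now: Optional[float] = None) -> Tuple[int, str]:
        # If the edge itself is broken, the reader never reaches the origin,
        # even though the origin may be working perfectly well.
        if not self.healthy:
            return 503, "Service Unavailable"

        now = time.monotonic() if now is None else now
        entry = self._store.get(path)
        if entry is not None and now - entry[0] < self._ttl:
            return 200, entry[1]             # cache hit: served from nearby

        body = self._fetch(path)             # cache miss: go to the origin
        self.origin_fetches += 1
        self._store[path] = (now, body)
        return 200, body
```

In this sketch, repeated requests for the same page within the time-to-live are answered from the local copy, which is where the speed and bandwidth savings come from. Setting `healthy` to `False` shows the flip side: every site behind the edge starts returning 503 at once, regardless of the state of its own servers.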

What is a 503 error?

When you see a website displaying a 503 error instead of the page you were expecting, it means that the server hosting the website is currently unable to handle the request. It also indicates that the problem is expected to be temporary.

Typically, this happens when a server is down for maintenance, or when a website is overloaded – for example, if too many people are trying to access it at once.
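Because a 503 signals a temporary condition, a server may include a standard `Retry-After` header telling clients how long to wait before trying again, either as a number of seconds or as an HTTP date. The following is a minimal sketch of how a client might read it. The function name `retry_delay` and the five-second default are assumptions made for illustration, and the sketch assumes the headers are passed in with that exact capitalization.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_delay(status: int, headers: Mapping[str, str],
                default: float = 5.0,
                now: Optional[datetime] = None) -> Optional[float]:
    """Return seconds to wait before retrying, or None if not a 503.

    Hypothetical helper for illustration only.
    """
    if status != 503:
        return None                      # not a "try again later" response

    value = headers.get("Retry-After")
    if value is None:
        return default                   # server gave no hint: use a fallback

    value = value.strip()
    if value.isdigit():
        return float(value)              # form 1: delay in seconds

    try:
        when = parsedate_to_datetime(value)   # form 2: an HTTP date
    except (TypeError, ValueError):
        return default                   # unparseable header: use the fallback

    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A client would typically wait for the returned number of seconds and then repeat the request, ideally with a cap on the number of attempts so that it does not add to the load on a struggling server.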

Fastly delivered service updates throughout the outage.

Why did Fastly fail on Tuesday and will it happen again?

We now know that Tuesday’s internet outage was caused by a service configuration change made by one of Fastly’s customers, which triggered a hidden bug in Fastly’s network. The bug had been lying dormant since Fastly deployed a software update in mid-May.

Many speculated on Twitter that the outage was caused by a cyberattack, but we now know that this was not the case. There are many technical reasons why CDNs fail, and cyberattacks are only one of them.

To make sure the problem doesn’t repeat itself, Fastly says it’s taking a number of actions. It is deploying a bug fix across its network and conducting a full post-mortem of its processes and practices during the incident. It is also going to investigate why it didn’t catch the bug during its testing procedures and evaluate ways to improve its remediation time.

“Although there were specific conditions that triggered this outage, we should have anticipated it,” Rockwell said. “We provide mission critical services, and we treat any action that may cause service issues with the utmost sensitivity and priority.”

Why were so many websites affected by the Fastly outage?

Fastly is widely used by web publishers, and just how widely became abundantly clear on Tuesday, when much of the internet was unavailable.

The reason it is so popular is that the services it provides are considered essential by many online web properties, but not many companies provide them. As a result, a large number of websites rely on a very small group of companies to run. Similar problems were observed when Cloudflare was hit with an outage last July, and when Amazon Web Services went down last November.

As Corinne Cath-Speth, a Ph.D. candidate at the Oxford Internet Institute and the Alan Turing Institute, said on Twitter, it means “technical bottlenecks in a single company can have very large implications.”

“This, in turn, raises major questions about the dangers of (power) consolidation in the cloud market and the undeniable impact these often invisible actors have on access to information,” she said.
