Facebook has had a rough week. Internal documents leaked by a whistleblower show that the company purposefully promoted anger and misinformation on its platform, forcing Facebook to defend itself in front of Congress. But as it struggled to control its public image on Monday, something strange happened: Facebook and its services went offline for six hours.
The unanticipated outage, which was Facebook’s longest since 2008, brought down all of its apps and services. During the outage, people all over the world were unable to access Instagram, WhatsApp, Oculus, and other Facebook-owned platforms. And because many people use their Facebook account to log in to third-party apps, they were also locked out of games, fitness apps, and other software.
So What Happened Exactly?
We’ve known the fundamentals since yesterday. Routes to Facebook and its domains were withdrawn from global routing tables (the BGP tables that routers use to find paths between networks), preventing anyone from connecting to the company’s servers. The “facebook.com” domain vanished from the internet and even appeared as “for sale” on some domain websites (an accident, but still).
Because the company operates its own registrar, we concluded that something had gone wrong inside Facebook itself. And because a successful hacking attempt on this scale is unlikely, we were left with two options: either Facebook’s server infrastructure had experienced a critical failure, or a Facebook employee had deliberately pulled the plug. Given the shocking 60 Minutes interview with a Facebook whistleblower that aired on Sunday, the latter option appeared to be a strong possibility.
However, Facebook now says that the outage was caused by a “routine maintenance job.” Engineers at the company issued a command to assess Facebook’s global network capacity, and for whatever reason, the command “inadvertently took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally.”
With the backbone down, Facebook’s networks were no longer able to respond to DNS queries, so browsers and apps could not translate names like “facebook.com” into IP addresses, and the company’s services became completely inaccessible. The issue required a hands-on fix from engineers, who had difficulty getting on-site because Facebook facilities are protected by smart security systems such as network-connected keycards. Regrettably, Facebook hosts these security systems on its own servers, which were unavailable.
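To make the DNS failure concrete, here is a minimal Python sketch of what a client sees when a name cannot be resolved. The `resolves` helper and its `resolver` parameter are hypothetical names chosen for illustration; only `socket.getaddrinfo` and `socket.gaierror` come from Python's standard library. It shows the general failure mode, not Facebook's actual tooling.

```python
import socket


def resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves to at least one address.

    `resolver` defaults to the real system lookup and can be swapped
    for a stand-in function (a hypothetical hook, useful for testing).
    """
    try:
        # Port 443 is used because these services are reached over HTTPS.
        return len(resolver(hostname, 443)) > 0
    except socket.gaierror:
        # Raised when the lookup fails, for example when the domain's
        # DNS servers cannot be reached.
        return False
```

On a normal day, calling `resolves("facebook.com")` would be expected to succeed. During the outage, a lookup like this would most likely have failed before the client ever tried to open a connection.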
We don’t know how Facebook engineers gained access to the company’s servers. Reports that they used an angle grinder to break down doors and cages have not been confirmed by Facebook or independent sources. In any case, Facebook was able to resolve the issue, but it had to bring its services back online gradually to avoid a sudden surge in traffic, which could have caused a dramatic spike in power consumption and damaged Facebook’s server hardware.
The ramifications of this outage may not be obvious. After all, without Instagram, you probably had a pretty productive workday! However, in some countries, such as India, WhatsApp is the primary means of mobile communication. If the Facebook outage had lasted a week, or even a few days, it could have had serious consequences for Indian business, emergency medicine, and society.
And, as Cloudflare documented, people began to repeatedly refresh Facebook and its services after they went down, resulting in a 30x increase in DNS queries for Facebook’s domains. While the increased traffic is unlikely to have hampered Facebook’s efforts to restart its servers, it did place a minor strain on non-Facebook networks, suggesting that future outages could cripple internet infrastructure as a whole.
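One common way for client software to avoid adding to a retry storm like this is exponential backoff with random jitter, where each failed attempt waits longer before trying again. The sketch below is a generic illustration, not something taken from Facebook’s or Cloudflare’s reports. The `backoff_delays` function and its `base`, `cap`, and `rng` parameters are hypothetical names.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=random.random):
    """Return a wait time in seconds for each retry attempt.

    The upper bound doubles with every attempt (base, 2*base, 4*base, ...)
    until it reaches `cap`. Each delay is a random fraction of that bound,
    so many clients do not all retry at the same moment.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng() * ceiling)
    return delays
```

A client that sleeps for these delays between retries sends far fewer requests during a long outage than one that retries immediately, at the cost of reconnecting slightly later once service returns.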