Cloudflare Shares Behind-the-Scenes Look at Facebook Outage’s Cause

Getty: In this photo illustration, the Facebook logo is displayed next to a screen showing that Facebook service is down on October 4, 2021.

Reports are coming in that the Facebook outage has been so bad that some employees were even locked out of their buildings because their badges wouldn’t work. Executives with Cloudflare, including the CTO, are sharing their thoughts on the cause of this massive, ongoing outage affecting Facebook, Instagram, WhatsApp, and other Facebook properties. In an article and on Twitter, they shared what they saw happening behind the scenes.


The Cloudflare CTO Wrote: ‘About Five Minutes Before Facebook’s DNS Stopped Working, We Saw a Large Number of BGP Changes’

John Graham-Cumming, CTO of Cloudflare, wrote on Twitter: “About five minutes before Facebook’s DNS stopped working we saw a large number of BGP changes (mostly route withdrawals) for Facebook’s ASN.”

BGP stands for “Border Gateway Protocol.” Cloudflare explained that this is a “mechanism to exchange routing information between autonomous systems (AS) on the Internet.”
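To make that idea concrete, here is a minimal, purely illustrative Python sketch (not real BGP, and not anything Cloudflare published) of how one network’s view of another network’s address blocks changes when routes are announced and then withdrawn. The prefixes and the peer AS number below are placeholders; AS32934 is the autonomous system number publicly registered to Facebook.

```python
# Purely illustrative sketch of BGP-style announcements and withdrawals.
# Not real BGP; the prefixes and the peer ASN (64512) are placeholders.

# What a neighboring network currently knows: prefix -> AS path to reach it.
# AS32934 is Facebook's autonomous system number.
routing_table = {
    "203.0.113.0/24": [64512, 32934],
    "198.51.100.0/24": [64512, 32934],
}

def announce(prefix, as_path):
    """A BGP update announcing a route: the prefix becomes reachable via as_path."""
    routing_table[prefix] = as_path

def withdraw(prefix):
    """A BGP update withdrawing a route: the prefix disappears from the table."""
    routing_table.pop(prefix, None)

# A flurry of withdrawals like the one Cloudflare observed: once the routes
# are gone, outside networks no longer know how to deliver packets to those prefixes.
for prefix in list(routing_table):
    withdraw(prefix)

print(routing_table)  # {} -> the network is effectively "off the Internet"
```

In real BGP these announcements and withdrawals are exchanged continuously between routers over long-lived TCP sessions; the dictionary here only stands in for the routing state those routers maintain.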

In answer to someone asking what an ASN is, he wrote: “The number (the N) that identifies the entirety of Facebook’s network (oddly called an Autonomous System or AS) to the rest of the Internet. The Internet is a network of networks; a collection of ASNs.”

He also shared: “Between 15:50 UTC and 15:52 UTC Facebook and related properties disappeared from the Internet in a flurry of BGP updates. This is what it looked like to @Cloudflare.”

When someone asked how this could happen, he pointed to a Reddit thread where someone claiming to be connected to Facebook shared their insights. That person’s account was later deleted and the thread was locked.

Heavy reported on the Reddit thread before the comments were deleted. The person had written at the time:

“DNS for FB services has been affected and this is likely a symptom of the actual issue, and that’s that BGP peering with Facebook peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC)…”

They continued, writing: “There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified… Part of this is also due to lower staffing in data centers due to pandemic measures… No discussion that I’m aware of yet that is considering a threat/attack vector… I believe the original change was ‘automatic’ (as in configuration done via a web interface). However, now that connection to the outside world is down, remote access to those tools don’t exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally…”

Facebook has not confirmed if this information from the Reddit thread is accurate. However, later reports from other sources did confirm some of the elements shared in these comments.


Cloudflare Posted an Article About How Facebook Was ‘Disconnected’ from the Internet

Cloudflare posted an article about what contributed to the outage and what they saw behind the scenes. They started by sharing that at 1651 UTC on Monday, October 4, they opened an internal incident called “Facebook DNS lookup returning SERVFAIL.” The authors wrote: “[Facebook’s] DNS names stopped resolving, and their infrastructure IPs were unreachable.”

Cloudflare explained that the Internet’s “big routers” “have huge, constantly updated lists of the possible routes that can be used to deliver every network packet to their final destinations.”
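As a rough illustration of what such a route list does, the sketch below (an assumption made for explanation only, using Python’s standard ipaddress module and placeholder prefixes) performs the longest-prefix match that routers use to pick where a packet goes next. If every route covering a destination has been withdrawn, there is simply nowhere to send the packet.

```python
import ipaddress

# Toy forwarding table: destination prefix -> next hop. Real backbone routers
# carry hundreds of thousands of these entries, updated continuously via BGP.
forwarding_table = {
    ipaddress.ip_network("203.0.113.0/24"): "peer-A",     # placeholder prefixes
    ipaddress.ip_network("203.0.113.128/25"): "peer-B",
}

def next_hop(destination: str):
    """Longest-prefix match: choose the most specific route covering the address."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in forwarding_table if addr in net]
    if not matches:
        return None  # no route at all: the packet is dropped
    best = max(matches, key=lambda net: net.prefixlen)
    return forwarding_table[best]

print(next_hop("203.0.113.200"))  # "peer-B": the /25 is more specific than the /24
print(next_hop("203.0.113.5"))    # "peer-A": only the /24 covers this address
print(next_hop("192.0.2.9"))      # None: with its routes withdrawn, it is unreachable
```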

They continued: “At 1658 UTC we noticed that Facebook had stopped announcing the routes to their DNS prefixes.” The surge of routing changes from Facebook that caused this had actually begun earlier, around 15:40 UTC.

Cloudflare wrote: “Routes were withdrawn, Facebook’s DNS servers went offline, and one minute after the problem occurred, Cloudflare engineers were in a room wondering why 1.1.1.1 couldn’t resolve facebook.com and worrying that it was somehow a fault with our systems.”
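The failure Cloudflare describes is the kind of thing anyone could have reproduced from the outside during the outage. The sketch below is only an assumption of how such a check might look, using the third-party dnspython library (the article does not say what tooling Cloudflare used internally) to ask 1.1.1.1 for facebook.com and report a failure such as SERVFAIL.

```python
import dns.exception
import dns.resolver  # third-party "dnspython" package: pip install dnspython

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["1.1.1.1"]  # Cloudflare's public resolver

try:
    answer = resolver.resolve("facebook.com", "A")
    print("resolved:", [rr.to_text() for rr in answer])
except dns.resolver.NoNameservers as exc:
    # Raised when every queried server fails to give a usable answer,
    # e.g. it returns SERVFAIL -- the symptom in Cloudflare's incident title.
    print("lookup failed:", exc)
except dns.exception.Timeout as exc:
    print("lookup timed out:", exc)
```

During the outage a query like this would have landed in the failure branch, because with Facebook’s routes withdrawn, resolvers such as 1.1.1.1 could not reach Facebook’s nameservers at all.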

The withdrawals, they said, basically disconnected Facebook from the Internet.

You can read the full explanation, with more examples to help users understand what was happening behind the scenes, in Cloudflare’s article. It doesn’t reveal exactly what happened inside Facebook to cause the issues, but it helps readers begin to understand what was going on and how the whole thing started.


The New York Times Reported That the Cause Likely Wasn’t a Cyberattack, But a ‘Misconfiguration’ of the Servers

Sheera Frenkel of the New York Times reported that this was a domain problem that affected all of Facebook’s systems, and it was so bad that some employees were locked out of their offices.

Frenkel wrote: “Yes, it’s a domain problem so affecting all their systems. We are hearing stories of employees texting each other/using other tech messaging platforms to try and communicate about what is happening.”

Frenkel also noted: “Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.”

The New York Times reported that the cause was not likely a cyberattack. The Times noted that security experts told them it was likely a “misconfiguration of Facebook’s server computers.”
