Cloudflare outage affects global traffic in 19 data centers

Cloud Security, Security Operations

Huge outage is the result of a network configuration change going awry

Michael Novinson •
June 22, 2022

Although the affected data centers represented only 4% of Cloudflare’s total network, the outage affected 50% of total requests. (Source: Cloudflare Blog)

An errant network configuration change caused a massive outage at Cloudflare, rendering many of the world’s most popular websites inaccessible for 75 minutes.


The San Francisco-based internet infrastructure provider said the ill-fated configuration change was intended to improve the resiliency of Cloudflare’s 19 busiest data centers, which handle a significant portion of global traffic. Instead, the change caused an outage on Tuesday that knocked everything from Amazon Web Services and Minecraft to UPS and DoorDash offline, according to the company’s blog.

Tom Strickx, Cloudflare’s head of edge networking technology, and Jeremy Hartman, senior vice president of production engineering, wrote: “While Cloudflare has invested heavily … to improve service availability, we clearly fell short of our customers’ expectations during this very painful incident. We apologize for the disruption caused to our customers and to all the users who were unable to access internet properties during the outage.”

Strickx and Hartman said Cloudflare has spent the past 18 months working to convert its busiest locations to a more flexible and resilient architecture known as the Multi-Colo PoP. Nineteen data centers have been converted to this architecture, including Atlanta, Chicago and Los Angeles in the Americas; London, Frankfurt and Madrid in Europe; and Singapore, Sydney and Tokyo in Asia Pacific.

“While these locations represent only 4 percent of our entire network, the outage impacted 50 percent of total requests,” Strickx and Hartman said.

What went wrong?

As part of this new architecture, Strickx and Hartman said, there is an additional layer of routing that allows Cloudflare to easily disable and enable parts of its internal network for maintenance or to deal with issues. They said the new architecture has delivered significant reliability improvements and lets the company perform maintenance without disrupting customer traffic.

Strickx and Hartman said Cloudflare uses the BGP protocol to define which IP addresses are advertised to, or accepted from, the other networks Cloudflare connects to. A change to this policy can mean that previously advertised IP addresses are withdrawn and become unreachable on the internet, according to the blog post.

While deploying a change to the IP address advertisement policy, a reordering of terms caused Cloudflare to withdraw a critical subset of IP addresses. That withdrawal also made it harder for Cloudflare engineers to reach the affected locations to revert the problematic change, Strickx and Hartman said.
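
To illustrate why term ordering matters, here is a minimal, hypothetical Python sketch of a first-match routing policy evaluator. This is not Cloudflare’s actual configuration; the prefixes, term names and functions are invented. The point is simply that moving a catch-all “reject” term ahead of the term that advertises critical prefixes causes those prefixes to be withdrawn.

    # Simplified, hypothetical illustration of first-match routing policy
    # evaluation. Prefixes and term names are invented for the example.

    CRITICAL_PREFIXES = {"203.0.113.0/24"}   # e.g. prefixes engineers need to reach a site

    def evaluate(policy, prefix):
        """Return 'advertise' or 'withdraw' using first-match semantics."""
        for term_name, matcher, action in policy:
            if matcher(prefix):
                return action
        return "withdraw"  # implicit default: anything unmatched is not advertised

    # Intended ordering: advertise the critical prefixes, then reject the rest.
    intended = [
        ("advertise-critical", lambda p: p in CRITICAL_PREFIXES, "advertise"),
        ("reject-the-rest",    lambda p: True,                   "withdraw"),
    ]

    # Misordered deployment: the catch-all term now matches first, so the
    # critical prefixes are withdrawn before their own term is ever reached.
    misordered = [
        ("reject-the-rest",    lambda p: True,                   "withdraw"),
        ("advertise-critical", lambda p: p in CRITICAL_PREFIXES, "advertise"),
    ]

    print(evaluate(intended,   "203.0.113.0/24"))  # advertise
    print(evaluate(misordered, "203.0.113.0/24"))  # withdraw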

Cloudflare’s alarms began firing five minutes after the problematic change was pushed, and the first change was made on a router to verify the root cause 24 minutes after the 19 data centers unexpectedly went offline. The root cause was found and understood 31 minutes into the incident, at which point work began on reverting the problematic change, Cloudflare said.

All of the problematic changes were reverted 44 minutes after that. During this period, Cloudflare said, the issue occasionally resurfaced as network engineers stepped on each other’s changes, reverting previous reverts and causing problems that had been mitigated to reappear. The incident was closed less than 90 minutes after it was first declared, according to Cloudflare.

What will change?

Going forward, Strickx and Hartman said, Cloudflare plans to review its processes, architecture and automation, with some changes being implemented immediately to ensure this doesn’t happen again. From a process standpoint, Cloudflare acknowledged that its staggered deployment procedure did not include any of the busiest Multi-Colo PoP data centers until the final step.

Strickx and Hartman said future Cloudflare change procedures and automation will need to include Multi-Colo PoP-specific test and deployment procedures to avoid unintended consequences.

From an architectural standpoint, Cloudflare said, the incorrect router configuration prevented the proper routes from being announced, which in turn prevented traffic from flowing into the company’s infrastructure. Going forward, Strickx and Hartman said, the routing policy statements will be redesigned to prevent an unintentional incorrect ordering of terms.

The pair also said automation could have mitigated some or all of the impact of this outage. The automation work will primarily focus on an improved stagger policy for rolling out network configurations, which Strickx and Hartman said would significantly reduce the overall impact of outages.
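
As a rough illustration of what a staggered rollout can look like, the Python sketch below pushes a configuration in waves, placing the busiest sites in the final wave and gating each wave on a health check. The site names and the deploy and health-check functions are placeholders, not Cloudflare’s tooling.

    # Hypothetical sketch of a staggered (wave-by-wave) rollout, with the
    # busiest sites deliberately placed in the final wave and a health check
    # gating each step. Names and functions are invented for illustration.
    import time

    WAVES = [
        ["test-colo-1"],                      # canary site
        ["small-colo-1", "small-colo-2"],     # low-traffic sites
        ["atl", "ord", "lax", "lhr", "nrt"],  # busiest sites go last
    ]

    def deploy(site, config):
        print(f"deploying to {site}")         # placeholder for the real push

    def healthy(site):
        return True                           # placeholder for real telemetry checks

    def rollout(config):
        for wave in WAVES:
            for site in wave:
                deploy(site, config)
            time.sleep(1)                     # soak time before checking
            if not all(healthy(s) for s in wave):
                raise RuntimeError(f"rollout halted: wave {wave} unhealthy")

    rollout({"policy": "new-advertisement-policy"})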

Finally, Strickx and Hartman said an automated “commit-confirm” rollback would greatly reduce the time needed to resolve issues during an incident.
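
The commit-confirm pattern, familiar from router operating systems such as Junos (“commit confirmed”), automatically reverts a change unless it is explicitly confirmed within a time window. The Python sketch below shows the general idea; the class and timings are invented for illustration.

    # Rough sketch of the "commit-confirm" idea: a configuration change
    # reverts automatically unless an operator (or a health check) confirms
    # it within a time window. The class and timings are invented.
    import threading

    class CommitConfirm:
        def __init__(self, apply_fn, rollback_fn, timeout_s=600):
            self.apply_fn, self.rollback_fn = apply_fn, rollback_fn
            self.timer = threading.Timer(timeout_s, self._auto_rollback)
            self.confirmed = False

        def commit(self):
            self.apply_fn()           # push the candidate configuration
            self.timer.start()        # start the confirmation countdown

        def confirm(self):
            self.confirmed = True
            self.timer.cancel()       # keep the change

        def _auto_rollback(self):
            if not self.confirmed:
                self.rollback_fn()    # no confirmation arrived: revert automatically

    change = CommitConfirm(
        apply_fn=lambda: print("applying candidate config"),
        rollback_fn=lambda: print("timeout hit, rolling back"),
        timeout_s=5,
    )
    change.commit()
    # change.confirm() would keep the change; without it, rollback fires after 5s.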

“This incident had wide-ranging impact, and we take availability very seriously,” they wrote. “We have identified several areas of improvement and will continue to work on uncovering any other gaps that could cause a recurrence.”

