How malformed packets caused CenturyLink’s 37-hour, nationwide outage

The switching module sent these malformed packets “as network management instructions to a line module,” and the packets “were delivered to all connected nodes,” the FCC said. Each node that received the packet then “retransmitted the packet to all its connected nodes.”

Source: How malformed packets caused CenturyLink’s 37-hour, nationwide outage | Ars Technica

But the outage continued because “the malformed packets continued to replicate and transit the network, generating more packets as they echoed from node to node,” the FCC wrote. Just after midnight, at least 20 hours after the problem began, CenturyLink engineers “began instructing nodes to no longer acknowledge the malformed packets.” They also “disabled the proprietary management channel, preventing it from further transmitting the malformed packets.”
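The replication the FCC describes can be sketched as a toy simulation. This is only an illustration under stated assumptions: the four-node full mesh, the node names, and the `simulate_flood` function are hypothetical, and the model simply assumes that every node resends each malformed packet to all of its connected nodes (including the one it came from) unless it has been told to stop acknowledging them.

```python
def simulate_flood(topology, origin, rounds, drop_malformed=False):
    """Return the number of malformed packets in flight after each round.

    topology: dict mapping a node name to the list of its connected nodes.
    drop_malformed: if True, receiving nodes discard the packet instead of
    retransmitting it (a stand-in for the fix described in the report).
    """
    # Round 0: the origin sends one packet to each connected node.
    in_flight = [(origin, neighbor) for neighbor in topology[origin]]
    counts = [len(in_flight)]
    for _ in range(rounds):
        next_round = []
        for _sender, receiver in in_flight:
            if drop_malformed:
                continue  # receiver ignores the packet; nothing is resent
            # Receiver retransmits to every connected node.
            for neighbor in topology[receiver]:
                next_round.append((receiver, neighbor))
        in_flight = next_round
        counts.append(len(in_flight))
    return counts


# Hypothetical four-node full mesh, not CenturyLink's real topology.
MESH = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["A", "B", "C"],
}

print(simulate_flood(MESH, "A", 3))                       # unfiltered growth
print(simulate_flood(MESH, "A", 3, drop_malformed=True))  # with the filter
```

In this toy model the packet count triples every round until nodes start dropping the packets, at which point it falls to zero after one round. That mirrors why the outage persisted until engineers told nodes to ignore the packets.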

Why Google Went Offline Today and a Bit about How the Internet Works

Unfortunately, if a network announces that a particular IP address or network sits behind it when in fact it does not, and that network is trusted by its upstreams and peers, packets can end up misrouted. That is what was happening here.

I looked at the BGP routes for a Google IP address. The route traversed Moratel (23947), an Indonesian ISP. Given that I’m looking at the routing from California and Google is operating data centres not far from our office, packets should never be routed via Indonesia. The most likely cause was that Moratel was announcing a network that wasn’t actually behind them.
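The check described above, spotting an AS in the path that has no business being there, can be sketched in a few lines. This is a minimal illustration, not the author's tooling: the `unexpected_hops` function, the example path, and the allow-list are assumptions. AS 64500 is a documentation-range ASN standing in for a local upstream, 23947 is Moratel as quoted above, and 15169 is Google's ASN.

```python
def unexpected_hops(as_path, expected_asns):
    """Return the ASNs in as_path that are not in the expected set.

    as_path: list of AS numbers, ordered from the observer to the origin.
    expected_asns: set of AS numbers the observer considers plausible
    transit or origin networks for this destination (hypothetical policy).
    """
    return [asn for asn in as_path if asn not in expected_asns]


# Illustrative path only: local upstream -> Moratel -> Google.
observed_path = [64500, 23947, 15169]

# Hypothetical allow-list for a Google prefix seen from California.
expected = {64500, 15169}

suspicious = unexpected_hops(observed_path, expected)
if suspicious:
    print("Unexpected AS in path:", suspicious)
```

A static allow-list like this is a crude stand-in for operator judgment; in practice an engineer would read the AS path from a router or looking glass and ask whether each hop makes geographic and commercial sense.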

via Why Google Went Offline Today and a Bit about How the Internet Works – CloudFlare blog.

When I figured out the problem, I contacted a colleague at Moratel to let him know what was going on. He was able to fix the problem at around 2:50 UTC / 6:50pm PST. Around 3 minutes later, routing returned to normal and Google’s services came back online.