At 7AM (UTC) on Tuesday, May 20th 2025, a BGP message was propagated that triggered surprising (to many) behaviours in two major BGP implementations that are often used for carrying internet traffic.
This caused a large number of “internet facing” BGP sessions to automatically shut down, resulting at the very least in some routing instability, and at worst in a brief loss of connectivity for some networks.
Using the sessions that people feed to bgp.tools, we can see here a version of the update that caused this behaviour. It is a relatively unremarkable BGP Update for a /16, except that it carried a BGP Prefix-SID Attribute that was not only unwelcome (it is unexpected to see this on internet table BGP updates) but also corrupt, with all of its internal data being 0x00.
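To make "a Prefix-SID attribute full of 0x00" concrete, here is a minimal Python sketch that walks the path attributes of an UPDATE and flags that case. It assumes you already have the raw path attribute bytes in hand. The function names are made up for illustration, and the all-zero test is only a heuristic matching what was seen in this incident, not the validation logic of any vendor.

```python
import struct

BGP_ATTR_PREFIX_SID = 40      # path attribute type code for BGP Prefix-SID (RFC 8669)
FLAG_EXTENDED_LENGTH = 0x10   # attribute flag: length field is 2 octets (RFC 4271)


def iter_path_attributes(data: bytes):
    """Yield (flags, type_code, value) for each path attribute in `data`."""
    offset = 0
    while offset < len(data):
        if offset + 3 > len(data):
            raise ValueError("truncated attribute header")
        flags, type_code = data[offset], data[offset + 1]
        offset += 2
        if flags & FLAG_EXTENDED_LENGTH:
            if offset + 2 > len(data):
                raise ValueError("truncated extended length")
            (length,) = struct.unpack_from("!H", data, offset)
            offset += 2
        else:
            length = data[offset]
            offset += 1
        value = data[offset:offset + length]
        if len(value) != length:
            raise ValueError("truncated attribute value")
        offset += length
        yield flags, type_code, value


def has_zeroed_prefix_sid(data: bytes) -> bool:
    """Heuristic (assumption): Prefix-SID present and every byte of it is 0x00."""
    return any(
        type_code == BGP_ATTR_PREFIX_SID and not any(value)
        for _flags, type_code, value in iter_path_attributes(data)
    )
```

Something like this could be run over an archive of updates to pull out the offending messages, which is roughly the kind of filtering described in this post.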
Most implementations (IOS-XR/Nokia SR-OS) correctly filtered this out without causing any problems, assuming their systems had been set up for RFC7606 (“BGP error tolerance”). However, an interesting interaction between JunOS and Arista EOS caused JunOS to carry the corrupt message, and Arista EOS devices to reset sessions when receiving the message from (likely) a JunOS device.
Since a lot of internet transit carriers use Juniper hardware running JunOS, this meant that those running Arista EOS and connected to an upstream transit carrier router running JunOS would have had their access to the internet severed for a period (likely up to 10 minutes).
After filtering through the whole bgp.tools archive for that period, it would appear that a number of AS origins were involved with this incident, suggesting that rather than the attribute having been added by the network that originated the prefix, it was added by a carrier in the middle on its way to the wider internet.
The 4 candidates that appear in all of the offending messages are:
However, bgp.tools has captured routes for the impacted prefixes without the faulty BGP attribute from “[…] 151326 138077 […]”, meaning the culprit that added the attribute was likely Starcloud (AS135338) or Hutchison (AS9304).
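The narrowing-down above is essentially set arithmetic on AS paths: take the ASNs common to every offending path, then remove any ASN that was also seen carrying the same prefixes cleanly. A rough Python sketch follows. The function names are hypothetical, and the paths in the example are invented for illustration (padded with documentation ASNs 64496-64511), not real captures from the archive.

```python
def common_ases(paths):
    """ASNs that appear in every one of the given AS paths."""
    sets = [set(path) for path in paths]
    return set.intersection(*sets) if sets else set()


def narrow_suspects(offending_paths, clean_paths):
    """Candidates present in all offending paths, minus ASNs that were
    also seen on paths for the same prefixes without the attribute."""
    candidates = common_ases(offending_paths)
    exonerated = set().union(*(set(path) for path in clean_paths))
    return candidates - exonerated


# Invented example paths; only the four candidate ASNs come from this post,
# and the relative order of 9304 and 135338 is an assumption.
offending = [
    [64496, 9304, 135338, 151326, 138077, 64500],
    [64497, 64498, 9304, 135338, 151326, 138077, 64501],
]
clean = [
    [64499, 151326, 138077, 64500],
]

suspects = narrow_suspects(offending, clean)
```

With these inputs `suspects` ends up holding 9304 and 135338. The sketch ignores path order entirely, which is enough here but would not be if a suspect appeared on both sides of the point where the attribute was added.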
Some prefixes seen in updates carrying the attribute (despite very likely not being the ones that added the offending attribute) are
This incident was further amplified by Hutchison/AS9304 being on a large number of internet exchanges, meaning that the offending messages were sent to IX route servers, which typically run BIRD. Since BIRD does not support the BGP Prefix-SID attribute, the message was distributed to many multi-terabit internet exchanges without being filtered, spreading the chaos to more than just internet transit sessions.
The BGP Prefix-SID Attribute should generally only be seen in internal BGP sessions, as its purpose (as defined in RFC8669) is to help define the route the traffic will take within a single network to get to the destination.
One of these attributes could have leaked out into the global routing table in the first place because an external BGP session was configured as an internal one.
While it is hard to definitively claim who was impacted, after looking at networks with very large churn (compared to their size) immediately after the initial problematic BGP message was emitted, I count around 100 separate networks that hit issues. Some high confidence examples include:
In “normal” times the bgp.tools route collector ingests around 20,000 to 30,000 messages per second. During this incident the average 10 second message rate was well over 150,000/s, indicating significant disruption to many internet paths.
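For anyone wanting to reproduce that kind of number from their own feeds, a 10 second average rate is just a bucketed count divided by the window length. The Python sketch below is an assumption about how one might compute it from message timestamps. It is not the collector's actual code, and the function names and the spike factor are made up.

```python
from collections import Counter


def windowed_rates(timestamps, window=10.0):
    """Map each fixed window's start time to its average messages per second."""
    counts = Counter(int(ts // window) for ts in timestamps)
    return {bucket * window: n / window for bucket, n in sorted(counts.items())}


def spike_windows(rates, baseline_max=30_000.0, factor=5.0):
    """Window starts whose rate is at least `factor` times the normal ceiling.
    30,000/s is the upper end of the normal range quoted above; the factor
    of 5 (i.e. 150,000/s) is an illustrative threshold, not a tuned one."""
    threshold = baseline_max * factor
    return [start for start, rate in rates.items() if rate >= threshold]
```

Feeding `windowed_rates` the receive times (in seconds) of every update and passing the result to `spike_windows` would give the windows in which the collector was seeing incident-level churn.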
While the root cause (or even perpetrator) is not entirely clear, the fact that it propagated over the internet at scale is a demonstration of the situation/risk that I described in my previous post “Grave flaws in BGP Error handling” - August 2023.
In this case, while other vendors detected the faulty attribute and suppressed the announcement, Juniper allowed it to propagate to peers, until it ultimately hit Arista devices that did not have (or contained faulty) BGP error tolerance code.
Juniper's own documentation for JunOS’s BGP error tolerance points out that it does not look at all parts of the message, despite it potentially being able to understand that it is faulty.
This is a curious decision, in which JunOS will save itself from a remotely induced session reset, but then forward such messages to other peers (or in business words, likely towards your customers).
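The interaction described above can be put into a toy Python model: each hop either suppresses the corrupt update, forwards it untouched, or resets the session it arrived on. The behaviour table only encodes what was observed in this incident as described in this post. It is not a statement of how any of these implementations behave in general or in every version, and the names are hypothetical.

```python
from enum import Enum


class Behaviour(Enum):
    SUPPRESS = "detects the faulty attribute and does not pass it on"
    FORWARD = "passes the update on to its peers unchanged"
    RESET = "tears down the session the update arrived on"


# Assumption: simplified per-implementation behaviour as seen in this incident.
OBSERVED = {
    "IOS-XR": Behaviour.SUPPRESS,
    "SR-OS": Behaviour.SUPPRESS,
    "JunOS": Behaviour.FORWARD,
    "BIRD": Behaviour.FORWARD,
    "EOS": Behaviour.RESET,
}


def walk_chain(chain):
    """Follow the corrupt update hop by hop until something stops it."""
    events = []
    for implementation in chain:
        behaviour = OBSERVED[implementation]
        events.append((implementation, behaviour))
        if behaviour is not Behaviour.FORWARD:
            break  # the update goes no further along this chain
    return events
```

Calling `walk_chain(["JunOS", "BIRD", "EOS"])` shows the problem in miniature: the two hops that protect only themselves hand the message to the one hop that cannot cope with it, whereas a chain starting with a suppressing implementation ends at the first hop.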
I have no happy ending for this. While the outage was short, the impact could have been worse. These kinds of incidents/bugs keep me up at night. As more and more services move to be IP based the scope of internet outages is no longer “consumers cannot get to their email”, but it starts to become “TV broadcasts fail” and “emergency service calls no longer work”. These begin to increase the chance of real world human casualties triggered (or at least exacerbated) by bugs such as this.
Filtering through the updates and piecing together this incident was a lot of fun. If you run a network yourself with a full routing table, and you are not part of the 2570 sessions that already give bgp.tools data, you can help the debugging of future incidents by setting up such data feeds!
If you want to stay up to date with the blog you can use the RSS feed or you can follow me on Fediverse @benjojo@benjojo.co.uk
Until next time!
Related Posts:
Grave flaws in BGP Error handling (2023)
Better IX network quality monitoring (2024)