BGP is the glue between all of the thousands of border routers that make up the internet (see this post (battleships) and this post (EvE) for a crash course on how BGP works).
With the current “default free zone” containing around 1,000,000 routes, the table is full of up-to-date routing information on how to get to almost everything. However, while working on a side project (bgp.tools), it slowly came to haunt me that routers don’t always have up-to-date information…
To understand a stuck route, you have to first understand the natural life-cycle of a route/prefix with BGP.
In the above example, you see that reachability information is passed along the chain of connected routers. When the originating router withdraws a prefix/route, that information is passed along to peers as well, and assuming those peers do not have alternative information on how to get to that IP prefix, the IP prefix becomes unreachable.
However in the stuck route case something else happens:
In this example the middle router does get sent the withdrawal message, but for whatever reason it loses it.
This is bad since the routers downstream of that middle one do not become aware of the prefix/route’s unavailability, meaning they might continue to send traffic that way. This also means that memory is still consumed to hold a route that is no longer reachable. Depending on the situation that memory can be very valuable (and limited).
To check how bad the issue is, I set up a BGP route cycler that announces a different IPv6 prefix for every calendar day of the month. In theory this means that other systems should only ever see two routes from me: a /40 IPv6 block, and a smaller “hole punch” /48 that changes daily.
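A minimal sketch of that addressing scheme in Python. It assumes the covering block is 2a0b:6b86:d00::/40 and that the day of the month is written directly into the prefix, both of which are only inferred from the stuck /48 shown further down; the function names are made up for illustration and this is not the actual cycler.

```python
import datetime
import ipaddress

# Assumption: covering /40 inferred from the stuck /48 shown in this post.
COVERING = ipaddress.IPv6Network("2a0b:6b86:d00::/40")


def daily_prefix(day: int) -> ipaddress.IPv6Network:
    """Return the hole-punch /48 for a day of the month (1-31)."""
    if not 1 <= day <= 31:
        raise ValueError("day must be between 1 and 31")
    # The day's decimal digits are reused as hex digits: day 28 -> ...:d28::/48
    return ipaddress.IPv6Network(f"2a0b:6b86:d{day:02d}::/48")


def stuck_prefixes(seen, today: datetime.date):
    """Return every /48 inside the /40 that is visible but is not today's."""
    expected = daily_prefix(today.day)
    return sorted(
        p for p in map(ipaddress.IPv6Network, seen)
        if p.prefixlen == 48 and p.subnet_of(COVERING) and p != expected
    )


# The month and year here are arbitrary; only the day (the 12th) matters.
seen = ["2a0b:6b86:d00::/40", "2a0b:6b86:d12::/48", "2a0b:6b86:d28::/48"]
print(stuck_prefixes(seen, datetime.date(2021, 2, 12)))
```

Anything this returns is a /48 that should already have been withdrawn, which makes stuck routes easy to spot in any looking glass or route collector.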
However, as expected, after a while of running this, a small pile of stuck routes started to pop up on various sites, like bgp.he.net and my own site bgp.tools.
So what gives? How do these routes get stuck? Where are they getting stuck? Can we un-stick them?
So I attempted to track them down by looking into the RIPE-NCC RIS MRT files.
PREFIX: 2a0b:6b86:d28::/48
PREFIX_AS_PATH: 2a0b:6b86:d28::/48 [[50673 33891 1299 6939 42615 212232]]
PREFIX_AS_PATH: 2a0b:6b86:d28::/48 [[553 33891 1299 6939 42615 212232]]
PREFIX_AS_PATH: 2a0b:6b86:d28::/48 [[25220 33891 1299 6939 42615 212232]]
PREFIX_AS_PATH: 2a0b:6b86:d28::/48 [[47692 33891 1299 6939 42615 212232]]
PREFIX_AS_PATH: 2a0b:6b86:d28::/48 [[49697 61438 33891 1299 6939 42615 212232]]
PREFIX_AS_PATH: 2a0b:6b86:d28::/48 [[51184 47692 33891 1299 6939 42615 212232]]
Here we can see a prefix that is stuck (at the time of writing it is the 12th, but this prefix was for the 28th), and we can infer a few things from these paths. AS212232 is the announcing ASN; this is me. The path follows my upstream (42615), then Hurricane Electric (6939), then Telia (1299), and then hits Core-Backbone (33891), where it appears to get stuck. We can see this because that is the point where the paths start to fork out to other networks, so it’s more than likely that a router inside Core-Backbone failed to process the withdraw message that my AS sent out on that day.
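The "where do the paths fork" reasoning can be sketched as finding the longest run of ASNs that every observed path shares, counting from the origin. This is only an illustration in Python using the six paths above (the function name is hypothetical), and the result is a hint about where to look rather than proof of which router is at fault.

```python
def common_suffix(paths):
    """Return the longest run of ASNs shared by the origin end of every path."""
    suffix = []
    # Walk all paths from the origin (rightmost ASN) towards the collectors.
    for hops in zip(*(reversed(p) for p in paths)):
        if len(set(hops)) != 1:
            break  # the paths fork here
        suffix.append(hops[0])
    return suffix[::-1]


paths = [
    [50673, 33891, 1299, 6939, 42615, 212232],
    [553, 33891, 1299, 6939, 42615, 212232],
    [25220, 33891, 1299, 6939, 42615, 212232],
    [47692, 33891, 1299, 6939, 42615, 212232],
    [49697, 61438, 33891, 1299, 6939, 42615, 212232],
    [51184, 47692, 33891, 1299, 6939, 42615, 212232],
]

shared = common_suffix(paths)
print(shared)     # [33891, 1299, 6939, 42615, 212232]
print(shared[0])  # 33891, the last AS that every stuck path has in common
```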
In this case it’s pretty harmless, but as the global routing table continues to grow the graveyard of “casually” stuck prefixes could get rather annoying.
There are times, however, when not withdrawing can be catastrophic. On August 30th 2020 one of the largest carriers (Lumen, better known as Centurylink, who bought Level 3) pushed out a firewall rule that blocked all internal BGP traffic across their network. This flowspec rule caused mass chaos within their network and widespread unreachability, but the difference between this and a normal carrier outage is that their routers did not send BGP withdraws to tell their peers and customers that they could no longer reach parts of their network. Since this outage went on for many hours before it was resolved, many networks had to scramble to alter configs to avoid Lumen/Level 3 at all costs. The exact cause of their routers not withdrawing routes is not known, but there are other public write-ups of the chaos the event caused.
There is one way this can happen however, and it’s caused by a more general flaw in the BGP protocol itself.
BGP runs over TCP (although it also has UDP and SCTP port numbers reserved) and is subject to TCP socket semantics, which Daniel Morsing best described as “TCP is an underspecified two-node consensus algorithm”.
Keeping that in mind, it’s also worth understanding BGP’s own internal timers:
# bird2c s p a rr1
BIRD e7aa14a-x ready.
Name Proto Table State Since Info
rr1 BGP --- up 2021-01-27 Established
BGP state: Established
Neighbor address: xxx.xxx.xxx.xxx
Neighbor AS: 206924
Neighbor ID: xxx.xxx.xxx.xxx
...
Hold timer: 198.758/240
Keepalive timer: 6.716/80
...
BGP has two internal timers. The Hold timer counts down from the last time a keepalive packet was received from the peer, and the Keepalive timer counts down to when the process itself will next send a Keepalive packet to the peer.
If the Hold timer ever reaches 0, then it is assumed that the peer has become unavailable (since TCP cannot quickly discover when a connection blackholes).
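To make the two timers concrete, here is a toy model in Python. It is a sketch only, not how bird or any other daemon implements them: the class and method names are invented, and the keepalive interval is assumed to be a third of the hold time, which matches the 240/80 values in the output above.

```python
import time


class SessionTimers:
    """Toy model of the two BGP session timers (not from any real daemon)."""

    def __init__(self, hold_time=240.0, now=time.monotonic):
        self.hold_time = hold_time
        # Assumption: keepalive interval is one third of the hold time (80s here).
        self.keepalive_time = hold_time / 3
        self.now = now
        self.last_received = now()
        self.last_sent = now()

    def on_message_received(self):
        # A KEEPALIVE (or UPDATE) from the peer restarts the hold timer.
        self.last_received = self.now()

    def on_keepalive_sent(self):
        self.last_sent = self.now()

    def hold_remaining(self):
        return self.hold_time - (self.now() - self.last_received)

    def keepalive_due(self):
        return self.now() - self.last_sent >= self.keepalive_time

    def peer_is_dead(self):
        # Hold timer reached zero: nothing heard from the peer in time.
        return self.hold_remaining() <= 0
```

Note that both timers are driven by what is received from the peer and by when we decide to send. Nothing in this model checks whether our own keepalives are actually leaving the machine.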
The fatal flaw inside the protocol, which Job Snijders pointed out in a mailing list post to the IETF, is that BGP daemons might not handle the edge case of a peer having a zero-sized TCP window.
Zero windows are a commonly forgotten edge case of TCP sockets. They normally happen when the application on the remote side of the socket is not reading data out of its own TCP receive queue. Since that queue is finite, it fills up with unread data, and at that point the remote TCP stack has no option but to say that it cannot accept any more data (aka its TCP window is 0 bytes large). This means that the local side’s own send queue will begin to fill, but that is also finite, and when the local send queue is full the write call to the socket will block or return EWOULDBLOCK, meaning that data can no longer be written into the socket.
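This behaviour is easy to see outside of BGP. The following Python sketch (a generic loopback demonstration, not the test mentioned in this post; the function name is made up) connects to a TCP peer that never reads and keeps writing until the kernel refuses to take more data.

```python
import socket


def fill_until_blocked(chunk=b"x" * 4096, limit=1 << 30):
    """Write to a TCP peer that never reads.

    Returns (bytes_accepted, blocked), where blocked is True if the kernel
    eventually refused more data with EWOULDBLOCK.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free loopback port
    listener.listen(1)

    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.connect(listener.getsockname())
    receiver, _ = listener.accept()  # accepted, but deliberately never read

    sender.setblocking(False)
    sent = 0
    blocked = False
    try:
        while sent < limit:
            sent += sender.send(chunk)
    except BlockingIOError:
        # The peer's receive queue filled up and then our own send queue did
        # too, so the kernel will not accept any more bytes from us.
        blocked = True
    finally:
        for s in (sender, receiver, listener):
            s.close()
    return sent, blocked


sent, blocked = fill_until_blocked()
print(f"accepted {sent} bytes before blocking: {blocked}")
```

How many bytes get accepted depends on the operating system's buffer sizes, but the important part is that the connection itself stays up the whole time; only the writes stop working.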
In BGP this seems to leave the router on the sending side unable to deliver KEEPALIVE, NOTIFICATION (such as Cease), or UPDATE messages (which is also how withdrawals are carried) to the stalled peer.
I built a test that can reproduce this, and Job Snijders confirmed that Juniper, Arista, Cisco, as well as bird, quagga/FRR all hang, unable to send keepalives or any other data, but they do not terminate the session.
The problem is that depending on the network conditions, this can get ugly. For example:
Assume you have a running session against a malicious BGP peer. The peer could reduce its TCP window to 0 on demand.
This would prevent you from sending messages to that router, while still allowing the malicious router to send messages to you. So it can keep sending the keepalives required to stop your hold timer from expiring, and it can also prevent you from fully shutting down the session or sending out route updates.
While I have no reason to believe this is being maliciously exploited in the real world, I suspect it is happening by mistake in a few places.
With all of this in mind, how would one stop this from happening?
A solution to this would be to add a second kind of timer that works similarly to the already existing “Hold Timer” in BGP, except that it counts down from the last time we were able to send something successfully. Since sending is normally almost instant, this timer hitting zero signals that something is critically wrong with the router on the other end, and that would likely impede its ability to pass traffic responsibly. So shutting down the BGP peering would be the responsible thing to do.
For this to happen though, we will need to patch the RFCs for BGP to add this in, and that means getting an RFC for it. In the meantime you can follow my, Job Snijders’, and maybe others’ adventures in writing up an RFC (or in its current state, an internet-draft) over at https://datatracker.ietf.org/doc/html/draft-spaghetti-idr-bgp-sendholdtimer
Until then, the zero window edge case may be around for quite a while in routers.
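A rough Python sketch of what such a send-side timer could look like. This is my own illustration rather than the text of any draft: the class, the helper function, and the 480 second value used in the comments are all hypothetical.

```python
import time


class SendHoldTimer:
    """Tracks how long it has been since data was last written successfully."""

    def __init__(self, send_hold_time, now=time.monotonic):
        self.send_hold_time = send_hold_time  # e.g. 480.0, an arbitrary choice
        self.now = now
        self.last_success = now()

    def on_send_success(self):
        self.last_success = self.now()

    def expired(self):
        return self.now() - self.last_success >= self.send_hold_time


def try_send(sock, data, timer):
    """Attempt a non-blocking send.

    Returns False once the timer has expired, meaning the session should be
    torn down locally. A real daemon would also queue any unsent remainder.
    """
    try:
        if sock.send(data) > 0:
            timer.on_send_success()
    except BlockingIOError:
        pass  # zero window or full send queue: no progress this time
    return not timer.expired()
```

The key design point is that the decision to give up is made purely from local information, so it still works when the peer keeps feeding us keepalives while refusing to read anything we send.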
If you want to stay up to date with the blog you can use the RSS feed or you can follow me on Twitter
Until next time!