At the end of March 2019 I gave a talk at INEX’s (Ireland’s biggest internet exchange point) Annual General Meeting. I was supposed to record it, but in a brief panic over HDMI not working on my laptop I forgot to start the recording. Since people found it interesting, I figured I would turn it into a blog post instead:
I spent so much time on this opening slide that I feel the need to include it even if I’m not doing an introduction this time.
So a long time ago… in a job fa- well, a year back, we were dealing with the lovely routing design that is anycast.
For those who need a quick primer: it’s a routing design that allows more natural region-based load balancing, letting you put server clusters in different regions and serve traffic local to those regions without having to play tricks with DNS:
This works by having all participating nodes announce the same IP prefixes globally. With some careful routing tuning (mainly careful selection of upstream transit/peering providers) you can get good load balancing and latency results, since traffic ends up being served closer to the visitor’s region.
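To make the mechanism a little more concrete, here is a toy sketch (my own illustration, not something from the talk) of the shortest-AS_PATH preference that makes anycast work. All AS numbers, paths and site names below are made up:

```python
# Toy illustration of why anycast works: every site announces the same
# prefix, and each remote network simply prefers the announcement with the
# shortest AS_PATH, which usually means the geographically closer site.
# All AS numbers and paths here are invented for the example.

ANYCAST_PREFIX = "192.0.2.0/24"

# Paths as seen from one hypothetical eyeball ISP in Europe: (AS_PATH, site)
candidate_routes = [
    ((3333, 64511), "frankfurt"),         # short path to the EU site
    ((3333, 2914, 64511), "san jose"),    # longer path to the US site
    ((3333, 4134, 4809, 64511), "singapore"),
]

def best_route(routes):
    """Pick the route with the shortest AS_PATH, the core tie-breaker in
    BGP best-path selection once local-pref and friends are equal."""
    return min(routes, key=lambda r: len(r[0]))

path, site = best_route(candidate_routes)
print(f"{ANYCAST_PREFIX} is reached via {site} (AS_PATH {path})")
# -> from this viewpoint, traffic lands on the Frankfurt cluster
```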
Sadly, a lot of networks struggle to get consistent network announcements to work right, often resulting in totally backwards-from-logic routing:
However, even for networks that get most regions right, regions like Asia are much harder to route correctly, partly because local ISPs are either dealing with overloaded links or have links that don’t always follow logical geographic paths.
The crux of the problem is ensuring your routing announcements are consistent across all regions and across almost all of the major interconnection ISPs (Tier 1s).
In simple setups, this really just means you need to keep your AS_PATHs as close to identical as possible across all the regions and carriers you want routing control over.
This basically means you should be using the same providers and traffic engineering parameters in all regions, AS_PATH prepending being one of the more basic ones.
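As a rough illustration of why that consistency matters, here is a small sketch (again my own, not tooling from the job) that compares the AS_PATH lengths your prefix shows from a few vantage points and flags the kind of asymmetry a missing prepend creates. The vantage points and paths are invented:

```python
# Sketch: flag inconsistent AS_PATH lengths for the same prefix across
# regions. A forgotten (or accidentally dropped) prepend shows up as one
# region looking "shorter" than the rest, which then attracts traffic.

observed_paths = {
    "lhr": [64511, 64511, 64511],   # 2x prepend, as intended
    "fra": [64511, 64511, 64511],
    "sin": [64511],                 # oops: prepend missing here
}

lengths = {site: len(path) for site, path in observed_paths.items()}
shortest, longest = min(lengths.values()), max(lengths.values())

for site, length in sorted(lengths.items()):
    note = ""
    if length == shortest and shortest < longest:
        note = "  <-- shorter than elsewhere, will attract disproportionate traffic"
    print(f"{site}: AS_PATH length {length}{note}")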
However, as systems get larger and more complex, eventually a mistake is going to be made. In the case of this job, a configuration misunderstanding during maintenance on a router caused it to drop a traffic engineering prepend. This caused a huge traffic shift globally towards that router, almost instantly overloading the site.
This was a regrettable incident, and it became clear that while the traffic engineering prepend was useful in the past, at that point in the network it was more of a liability than a useful tool. So it was time to remove it.
But what if we were to make the same mistake again? This time we are changing a lot more router configuration at once. It’s worth thinking about failure modes here. There are two ways the change could fail:
The first is that the change applies to a large percentage of the routers across the network, but some of them fail. This would cause traffic to mostly shift away from those locations and head to other nearby sites. As long as not too many sites have this issue, this is the best way it can fail.
The nastier way it can fail is that most routers don’t end up being changed but a small percentage of them do.
This would be a repeat of the first incident, except more routers would be involved, and we would be dealing with a lot more routers needing a rescue configuration rollback or hotfix.
The story of how this change was made in a sane way is not mine to tell; however, someone at the front of it did a talk at RIPE NCC’s twice-yearly meeting about how it was done:
The good news is that the change went through fine, and no router got left behind! A small amount of traffic churn happened while routers globally updated their routing tables and informed the other internal routers they were connected to.
During this time, it was observed that not all of the providers accepted this change at the same time: some providers seemed to reconverge almost instantly, while others were noticeably slow.
This begs the question: how long is this sort of thing generally supposed to take? Are some providers better than others? Who is the fastest? Who is the slowest?
However, for this we first need to define what it means for a route to be propagated. There are two valid ways (in my eyes) this could be defined:
The “First Announcement Wins” method is quite literally what it says on the tin: when we see a BGP update message for our prefix, that provider+location combo wins (or, if they are late, loses).
This could be slightly flawed, since some networks might have hard-to-observe mechanisms for quickly sending routing information around inside their network, and those initial internal route updates may not be sensible network paths.
In “First Stable Announcement Wins”, testing is done to ensure that whatever route becomes “stable” (stops changing its internal routing in the provider’s backbone) is declared the winner.
In my eyes this is what most network engineers are looking for; however, it also has a large issue attached to it:
Figuring out what counts as a stable route involves a non-trivial amount of complexity, and no matter how I do it, I don’t think it is measurable down to tens of milliseconds.
For this reason, the experiment uses “First Announcement Wins.”
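In code terms, “First Announcement Wins” is roughly the following (a minimal sketch, assuming you already have a per-collector feed of parsed BGP UPDATE messages; the field names and timestamps are my own):

```python
# "First Announcement Wins": record the first time each provider+location
# combo shows us the test prefix, and ignore every update after that.
# The update feed below is a stand-in for a real parsed BGP session log.

TEST_PREFIX = "192.0.2.0/24"

updates = [
    # (unix timestamp, collector location, upstream ASN, prefix)
    (1554000000.523, "sea", 6453, "192.0.2.0/24"),
    (1554000000.586, "cdg", 6453, "192.0.2.0/24"),
    (1554000000.601, "cdg", 6453, "192.0.2.0/24"),  # later churn, ignored
]

first_seen = {}
for ts, location, asn, prefix in sorted(updates):
    if prefix != TEST_PREFIX:
        continue
    # Only the first sighting counts; later updates are just routing churn.
    first_seen.setdefault((asn, location), ts)

for (asn, location), ts in sorted(first_seen.items(), key=lambda kv: kv[1]):
    print(f"AS{asn} @ {location}: first seen at {ts:.3f}")
```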
The propagation race works like so:
The high precision timestamps are important here, and they are a detail that actually ended up being slightly devastating for the first few runs due to the inaccuracy of system clocks.
You see, I now have a stronger respect (in that I now actually believe they have worth) for the PPS and 10MHz clock inputs on a lot of high-end carrier routers, since time syncing is actually incredibly hard once you go beyond two systems. Locking all systems to a stable clock source is immensely nice, and before you ask, NTP does not really get that close in real-life situations with a wide range of geographically separated targets.
After a lot of time syncing and timestamp offset correction, I ended up with a linear list of announcements by server location (airport codes to signify where they are, since that’s generally what the networking industry seems to use).
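The post-processing is roughly the following (a reconstruction/sketch with hypothetical clock offsets, not my exact pipeline): apply a measured clock offset per collector, sort everything onto one timeline anchored at the announcement time, and emit the delta-since-previous-update and wall-time columns you see in the tables below:

```python
# Sketch of the timestamp correction + timeline build. The per-collector
# clock offsets and raw sightings are invented; in reality the offsets came
# out of a lot of painful time syncing.

announce_time = 1554000000.000  # when the prefix was announced at the origin

clock_offset = {"sea": +0.012, "cdg": -0.007, "fra": +0.001}  # seconds, hypothetical

raw_sightings = [  # (collector, uncorrected unix timestamp)
    ("sea", 1554000000.511),
    ("cdg", 1554000000.593),
    ("fra", 1554000000.592),
]

# Correct each timestamp for its collector's clock error, then rebase onto
# a timeline that starts at the moment of announcement.
corrected = sorted(
    (ts - clock_offset[loc] - announce_time, loc) for loc, ts in raw_sightings
)

prev = 0.0
print("Time since last update | Wall time (s) | Location")
for wall, loc in corrected:
    delta_ms = (wall - prev) * 1000
    print(f"{delta_ms:.1f}ms | {wall:.3f} | {loc}")
    prev = wall
```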
For AS6453 (Tata Communications), times looked decent to start with. Given that none of the BGP route update collector nodes had Tata as a direct provider, this is basically racing how fast Tata’s peers can send routes around.
It’s interesting that it seems to have a 500ms-ish minimum, but after that routes start to move around the globe very fast, with the exception of EWR (New York area), which is likely an outlier.
Time since last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
0 | 0 | Origin | 6453 | No |
523.2ms | 0.523 | sea | 6453 | No |
62.5ms | 0.586 | cdg | 6453 | No |
6.8ms | 0.593 | fra | 6453 | No |
25ms | 0.618 | lax | 6453 | No |
107.5ms | 0.725 | yyz | 6453 | No |
92.8ms | 0.818 | nrt | 6453 | No |
12.9ms | 0.831 | sjc | 6453 | No |
9ms | 0.84 | mia | 6453 | No |
5.4ms | 0.845 | ams | 6453 | No |
58.4ms | 0.903 | lhr | 6453 | No |
22.9ms | 0.926 | dfw | 6453 | No |
68.4ms | 0.994 | ord | 6453 | No |
26.5ms | 1.021 | fra | 6453 | No |
45ms | 1.066 | sin | 6453 | No |
455.3ms | 1.521 | lhr | 6453 | No |
191.2ms | 1.712 | syd | 6453 | No |
19764.8ms | 21.477 | dfw | 6453 | No |
171947.8ms | 193.425 | ewr | 6453 | No |
For AS174 (Cogent Communications), propagation seems to take a little longer: due to policy on the upstream ISPs used for route collection, Cogent routes were only imported via other carriers, so there is a similar effect to Tata here. However, it is odd that Toronto (YYZ) sees the route first after announcement, since the announcement is done in London (LHR).
This is likely the impact of a route reflector or something similar inside the network.
Time since last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
0 | 0 | Origin | 174 | No |
2425.1ms | 2.425 | yyz | 174 | No |
2202.1ms | 4.627 | yyz | 174 | No |
3328ms | 7.955 | yyz | 174 | No |
1141.3ms | 9.096 | sea | 174 | No |
85.1ms | 9.181 | lhr | 174 | No |
158ms | 9.339 | syd | 174 | No |
230.7ms | 9.57 | dfw | 174 | No |
56.2ms | 9.626 | ewr | 174 | No |
65.8ms | 9.692 | nrt | 174 | No |
107.7ms | 9.8 | fra | 174 | No |
18.7ms | 9.819 | lax | 174 | No |
49.8ms | 9.869 | mia | 174 | No |
33.3ms | 9.902 | cdg | 174 | No |
18.2ms | 9.92 | ams | 174 | No |
74.2ms | 9.994 | sjc | 174 | No |
4ms | 9.998 | sin | 174 | No |
16898.5ms | 26.897 | ord | 174 | No |
531.6ms | 27.429 | dfw | 174 | No |
For AS3257 (GTT) we are finally seeing some timing data based on providers we are locally connected to. GTT does seem to send things around the world reasonably fast, at a shiny 1.9 seconds (apart from EWR again, supporting the idea that the EWR result is more of a data point error than anything else).
Time since last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
0ms | 0 | Origin | 3257 | No |
721.1ms | 0.721 | yyz | 3257 | No |
64.2ms | 0.785 | lax | 3257 | Yes |
9.4ms | 0.794 | dfw | 3257 | Yes |
52.9ms | 0.847 | sea | 3257 | Yes |
21.4ms | 0.868 | sjc | 3257 | Yes |
44.9ms | 0.913 | mia | 3257 | Yes |
15.8ms | 0.929 | nrt | 3257 | No |
82.9ms | 1.012 | fra | 3257 | No |
20.1ms | 1.032 | cdg | 3257 | Yes |
36.6ms | 1.069 | sin | 3257 | No |
19.7ms | 1.089 | ams | 3257 | No |
19.1ms | 1.108 | lhr | 3257 | No |
256.1ms | 1.364 | syd | 3257 | No |
19.1ms | 1.383 | ord | 3257 | Yes |
208.5ms | 1.592 | lhr | 3257 | No |
281.2ms | 1.873 | fra | 3257 | Yes |
88.7ms | 1.962 | nrt | 3257 | No |
114745ms | 116.708 | ewr | 3257 | No |
AS1299 (Telia) has more logical timing: 0.6 seconds after we announce in London it appears in Paris and Frankfurt directly, and it is fully propagated to all nodes less than 2 seconds after that. However, other carriers beat Telia to their own route! If you look at ORD (Chicago) and MIA (Miami), you can see other carriers pick up the route from Telia at another location and hand it to our provider, before the route arrives as a direct route some 20 seconds later.
Time since last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
0 | 0 | Origin | 1299 | No |
632.3ms | 0.632 | cdg | 1299 | Yes |
11.3ms | 0.643 | fra | 1299 | Yes |
107.1ms | 0.75 | sea | 1299 | No |
76.8ms | 0.827 | ams | 1299 | Yes |
17.6ms | 0.845 | lhr | 1299 | No |
40.5ms | 0.886 | yyz | 1299 | No |
27.6ms | 0.914 | mia | 1299 | No |
59.9ms | 0.974 | sjc | 1299 | No |
8.2ms | 0.982 | dfw | 1299 | Yes |
9.4ms | 0.991 | lax | 1299 | No |
10.9ms | 1.002 | ewr | 1299 | Yes |
5.9ms | 1.008 | yyz | 1299 | Yes |
61.5ms | 1.07 | lhr | 1299 | Yes |
137.1ms | 1.207 | ord | 1299 | No |
211.8ms | 1.419 | nrt | 1299 | No |
12.3ms | 1.431 | sin | 1299 | No |
499ms | 1.93 | nrt | 1299 | No |
198.6ms | 2.129 | syd | 1299 | No |
21135.5ms | 23.265 | mia | 1299 | Yes |
3196.4ms | 26.461 | ord | 1299 | Yes |
Level 3 (AS3356) does by far the worst in this test, taking 18 seconds from announcing the test prefix until it appears anywhere on the internet, and it appears in SEA (Seattle) of all places; from there, other carriers pick up that route and propagate it faster than Level 3 itself. Some 30 seconds later Level 3 has caught up and the route is seen in all places with Level 3 peering.
Time since last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
0 | 0 | Origin | 3356 | No |
18508.1ms | 18.508 | sea | 3356 | Yes |
365.6ms | 18.874 | yyz | 3356 | Yes |
241.9ms | 19.116 | lhr | 3356 | No |
251.2ms | 19.367 | cdg | 3356 | No |
174.5ms | 19.541 | mia | 3356 | No |
87.3ms | 19.628 | fra | 3356 | No |
6.1ms | 19.634 | sin | 3356 | No |
6.9ms | 19.641 | ewr | 3356 | No |
53.3ms | 19.694 | ams | 3356 | No |
212.3ms | 19.906 | dfw | 3356 | No |
72.1ms | 19.978 | lax | 3356 | No |
0.9ms | 19.979 | nrt | 3356 | No |
187.5ms | 20.166 | sjc | 3356 | No |
194.1ms | 20.36 | syd | 3356 | No |
10094.8ms | 30.455 | ewr | 3356 | Yes |
3963.6ms | 34.419 | mia | 3356 | Yes |
1207.5ms | 35.627 | ord | 3356 | No |
1684.1ms | 37.311 | sjc | 3356 | Yes |
476.2ms | 37.787 | fra | 3356 | Yes |
434.5ms | 38.222 | dfw | 3356 | Yes |
1264.5ms | 39.487 | ord | 3356 | No |
5106.7ms | 44.594 | ams | 3356 | Yes |
1426.2ms | 46.02 | ord | 3356 | Yes |
695.5ms | 46.715 | lhr | 3356 | Yes |
3801.6ms | 50.517 | cdg | 3356 | Yes |
Last but not least is AS2914 (NTT Communications), who, while not the fastest at sending routes globally, did appear to be the smoothest and most consistent.
Time since last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
0 | 0 | Origin | 2914 | No |
814.9ms | 0.815 | ewr | 2914 | Yes |
90.6ms | 0.906 | fra | 2914 | Yes |
290.2ms | 1.196 | lax | 2914 | Yes |
0.6ms | 1.197 | cdg | 2914 | Yes |
121.4ms | 1.318 | yyz | 2914 | No |
5.1ms | 1.323 | lhr | 2914 | Yes |
24ms | 1.347 | nrt | 2914 | Yes |
34ms | 1.381 | ams | 2914 | No |
1.1ms | 1.382 | dfw | 2914 | No |
29.6ms | 1.412 | sea | 2914 | No |
81.4ms | 1.493 | sjc | 2914 | No |
68.8ms | 1.562 | sea | 2914 | Yes |
63.4ms | 1.625 | ord | 2914 | No |
57.1ms | 1.682 | mia | 2914 | No |
75.7ms | 1.758 | ams | 2914 | Yes |
108.9ms | 1.867 | sjc | 2914 | Yes |
196.8ms | 2.064 | mia | 2914 | Yes |
75.8ms | 2.14 | syd | 2914 | Yes |
54.9ms | 2.195 | dfw | 2914 | Yes |
0ms | 2.195 | ord | 2914 | Yes |
17.5ms | 2.212 | sin | 2914 | Yes |
Now that we have covered all the carriers you may think that is it; however, there is a different kind of propagation we can observe:
Just as we can race networks in sending out BGP routes, we can also race them in withdrawing them!
This is a test that is harder to see on the routing table itself, so it’s easier (and much more fun) to observe it by simply doing a traceroute to a prefix and then withdrawing it from all providers:
Here you can see the route slowly being released out of all of the carriers, then the carrier backbones, and then the carrier inter-peering relationships. It also exposes some interestingly strange routing as the options to route the prefix begin to run out!
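If you want to reproduce that kind of view yourself, a rough sketch is just to run traceroute in a loop with timestamps while the prefix is being withdrawn. This is my own reconstruction of the demo, and it assumes a Unix-like system with traceroute installed; 192.0.2.1 is a documentation address standing in for the real test IP:

```python
# Repeatedly traceroute towards an address in the (about to be withdrawn)
# test prefix, printing a timestamp before each run, so you can watch the
# path fall apart as carriers drop the route one by one.

import subprocess
import time
from datetime import datetime, timezone

TARGET = "192.0.2.1"

for _ in range(30):  # ~30 runs is plenty to watch a withdrawal propagate
    print(f"--- {datetime.now(timezone.utc).isoformat()} ---")
    # -n: no DNS, -w 1: 1s per-hop wait, -q 1: one probe per hop
    subprocess.run(["traceroute", "-n", "-w", "1", "-q", "1", TARGET])
    time.sleep(2)
```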
Anyway, as I said to the audience, we have had the fast bit, now we can have the furious part! If you generally like this kind of post, I aim to post once a month on various (mostly networking related) matters. If you want to stay up to date with that, you can either use my blog’s RSS feed or follow me on Twitter for updates when the next post happens.
I would like to thank AS57782 / Cynthia Revstrom for lending some IPv4 space for this post, and for helping out with the traceroute demo you see above.
If you do have any questions about this talk, please feel free to reach out on the email that is on the slide above! Until next time!