At the end of March 2019 I gave a talk at INEX’s (Ireland’s biggest internet exchange point) Annual General Meeting. I was supposed to record it, but in a brief panic over HDMI not working on my laptop I forgot to start the recording. Since people found it interesting, I figured I would turn it into a blog post instead:
I spent so much time on this opening slide that I feel the need to include it even if I’m not doing an introduction this time.
So a long time ago… in a job fa- well, a year back, we were dealing with the lovely routing design that is anycast.
For those who need a quick primer: it’s a routing design that allows more natural region-based load balancing, letting you put server clusters in different regions and serve traffic local to those regions without having to play tricks with DNS:
This works by having all participating nodes announce the same IP prefixes globally. With some careful routing tuning (mainly careful selection of upstream transit/peering providers) you can get good load balancing and latency results, since traffic ends up being served closer to the visitor’s region.
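To make the mechanism a little more concrete, here is a toy sketch (my own illustration, not something from the talk) of the shortest-AS_PATH preference that makes anycast work. All AS numbers, paths and site names below are made up:

```python
# Toy illustration of why anycast works: every site announces the same
# prefix, and each remote network simply prefers the announcement with the
# shortest AS_PATH, which usually means the geographically closer site.
# All AS numbers and paths here are invented for the example.

ANYCAST_PREFIX = "192.0.2.0/24"

# Paths as seen from one hypothetical eyeball ISP in Europe: (AS_PATH, site)
candidate_routes = [
    ((3333, 64511), "frankfurt"),         # short path to the EU site
    ((3333, 2914, 64511), "san jose"),    # longer path to the US site
    ((3333, 4134, 4809, 64511), "singapore"),
]

def best_route(routes):
    """Pick the route with the shortest AS_PATH, the core tie-breaker in
    BGP best-path selection once local-pref and friends are equal."""
    return min(routes, key=lambda r: len(r[0]))

path, site = best_route(candidate_routes)
print(f"{ANYCAST_PREFIX} is reached via {site} (AS_PATH {path})")
# -> from this viewpoint, traffic lands on the Frankfurt cluster
```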
Sadly, a lot of networks struggle to get consistent network announcements to work right, often resulting in totally backwards-from-logic routing:
However, even for networks that get most regions right, regions like Asia are much harder to route correctly, partly because local ISPs are either dealing with overloaded links or have links that don’t always follow logical geographic paths.
The crux of the problem is ensuring your routing announcements are consistent across all regions and across almost all of the major interconnection ISPs (Tier 1s).
In simple setups, this really just means you need to keep your AS_PATHs as close to identical as possible across all the regions and carriers you want routing control over.
This basically means you should be using the same providers and traffic engineering parameters in all regions, AS_PATH prepending being one of the more basic ones.
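As a rough illustration of why that consistency matters, here is a small sketch (again my own, not tooling from the job) that compares the AS_PATH lengths your prefix shows from a few vantage points and flags the kind of asymmetry a missing prepend creates. The vantage points and paths are invented:

```python
# Sketch: flag inconsistent AS_PATH lengths for the same prefix across
# regions. A forgotten (or accidentally dropped) prepend shows up as one
# region looking "shorter" than the rest, which then attracts traffic.

observed_paths = {
    "lhr": [64511, 64511, 64511],   # 2x prepend, as intended
    "fra": [64511, 64511, 64511],
    "sin": [64511],                 # oops: prepend missing here
}

lengths = {site: len(path) for site, path in observed_paths.items()}
shortest, longest = min(lengths.values()), max(lengths.values())

for site, length in sorted(lengths.items()):
    note = ""
    if length == shortest and shortest < longest:
        note = "  <-- shorter than elsewhere, will attract disproportionate traffic"
    print(f"{site}: AS_PATH length {length}{note}")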
However, as systems get larger and more complex, eventually a mistake is going to be made. In the case of this job, a configuration misunderstanding during maintenance on a router caused it to drop a traffic engineering prepend. This caused a huge traffic shift globally towards that router, almost instantly overloading the site.
This was a regrettable incident, and it became clear that while the traffic engineering prepend was useful in the past, at that point in the network it was more of a liability than a useful tool. So it was time to remove it.
But what if we were to make the same mistake again? This time we are changing a lot more router configuration at once. It’s worth thinking about failure modes here. There are two ways the change could fail:
The first is that the change applies to a large percentage of the routers across the network, but some of them fail. This would cause traffic to mostly shift away from those locations and head to other nearby sites. As long as not too many sites have this issue, this is the best way it can fail.
The nastier way it can fail is that most routers don’t end up being changed but a small percentage of them do.
This would be a repeat of the first incident, except more routers would be involved, and we would be dealing with a lot more routers needing a rescue configuration rollback or hotfix.
The story of how this change was made in a sane way is not mine to tell; however, someone at the front of it did a talk at RIPE NCC’s twice-yearly meeting about how it was done:
The good news is that the change went through fine, and no router got left behind! A small amount of traffic churn happened while routers globally updated their routing tables and informed the other internal routers they were connected to.
During this time, it was observed that not all of the providers accepted this change at the same time: some providers seemed to reconverge almost instantly, while others were noticeably slow.
This begs the question: how long is this sort of thing generally supposed to take? Are some providers better than others? Who is the fastest? Who is the slowest?
However, for this we first need to define what it means for a route to be propagated. There are two valid ways (in my eyes) this could be defined:
The “First Announcement Wins” method is quite literally what it says on the tin: when we see a BGP update message for our prefix, that provider+location combo wins (or, if they are late, loses).
This could be slightly flawed, since some networks might have hard-to-observe mechanisms for quickly sending routing information around inside their network, and those initial internal route updates may not be sensible network paths.
In “First Stable Announcement Wins”, testing is done to ensure that whatever route becomes “stable” (stops changing its internal routing in the provider’s backbone) is declared the winner.
In my eyes this is what most network engineers are looking for; however, it also has a large issue attached to it:
Figuring out what counts as a stable route involves a non-trivial amount of complexity, and no matter how I do it, I don’t think it is measurable down to tens of milliseconds.
For this reason, the experiment uses “First Announcement Wins.”
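In code terms, “First Announcement Wins” is roughly the following (a minimal sketch, assuming you already have a per-collector feed of parsed BGP UPDATE messages; the field names and timestamps are my own):

```python
# "First Announcement Wins": record the first time each provider+location
# combo shows us the test prefix, and ignore every update after that.
# The update feed below is a stand-in for a real parsed BGP session log.

TEST_PREFIX = "192.0.2.0/24"

updates = [
    # (unix timestamp, collector location, upstream ASN, prefix)
    (1554000000.523, "sea", 6453, "192.0.2.0/24"),
    (1554000000.586, "cdg", 6453, "192.0.2.0/24"),
    (1554000000.601, "cdg", 6453, "192.0.2.0/24"),  # later churn, ignored
]

first_seen = {}
for ts, location, asn, prefix in sorted(updates):
    if prefix != TEST_PREFIX:
        continue
    # Only the first sighting counts; later updates are just routing churn.
    first_seen.setdefault((asn, location), ts)

for (asn, location), ts in sorted(first_seen.items(), key=lambda kv: kv[1]):
    print(f"AS{asn} @ {location}: first seen at {ts:.3f}")
```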
The propagation race works like so:
The high precision timestamps are important here, and they are a detail that actually ended up being slightly devastating for the first few runs due to the inaccuracy of system clocks.
You see, I now have a stronger respect (in that I now actually believe they have worth) for the PPS and 10MHz clock inputs on a lot of high-end carrier routers, since time syncing is actually incredibly hard once you go beyond two systems. Locking all systems to a stable clock source is immensely nice, and before you ask, NTP does not really get that close in real-life situations with a wide range of geographically separated targets.
After a lot of time syncing and timestamp offset correction, I ended up with a linear list of announcements by server location (airport codes to signify where they are, since that’s generally what the networking industry seems to use).
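The post-processing is roughly the following (a reconstruction/sketch with hypothetical clock offsets, not my exact pipeline): apply a measured clock offset per collector, sort everything onto one timeline anchored at the announcement time, and emit the delta-since-previous-update and wall-time columns you see in the tables below:

```python
# Sketch of the timestamp correction + timeline build. The per-collector
# clock offsets and raw sightings are invented; in reality the offsets came
# out of a lot of painful time syncing.

announce_time = 1554000000.000  # when the prefix was announced at the origin

clock_offset = {"sea": +0.012, "cdg": -0.007, "fra": +0.001}  # seconds, hypothetical

raw_sightings = [  # (collector, uncorrected unix timestamp)
    ("sea", 1554000000.511),
    ("cdg", 1554000000.593),
    ("fra", 1554000000.592),
]

# Correct each timestamp for its collector's clock error, then rebase onto
# a timeline that starts at the moment of announcement.
corrected = sorted(
    (ts - clock_offset[loc] - announce_time, loc) for loc, ts in raw_sightings
)

prev = 0.0
print("Time since last update | Wall time (s) | Location")
for wall, loc in corrected:
    delta_ms = (wall - prev) * 1000
    print(f"{delta_ms:.1f}ms | {wall:.3f} | {loc}")
    prev = wall
```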
For AS6453 (Tata Communications), times looked decent to start with. Given that none of the BGP route update collector nodes had Tata as a direct provider, this is basically racing how fast Tata’s peers can send routes around.
It’s interesting that it seems to have a 500ms-ish minimum, but after that routes start to move around the globe very fast, with the exception of EWR (New York area), which is likely an outlier.
Time since last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
0 | 0 | Origin | 6453 | No |
523.2ms | 0.523 | sea | 6453 | No |
62.5ms | 0.586 | cdg | 6453 | No |
6.8ms | 0.593 | fra | 6453 | No |
25ms | 0.618 | lax | 6453 | No |
107.5ms | 0.725 | yyz | 6453 | No |
92.8ms | 0.818 | nrt | 6453 | No |
12.9ms | 0.831 | sjc | 6453 | No |
9ms | 0.84 | mia | 6453 | No |
5.4ms | 0.845 | ams | 6453 | No |
58.4ms | 0.903 | lhr | 6453 | No |
22.9ms | 0.926 | dfw | 6453 | No |
68.4ms | 0.994 | ord | 6453 | No |
26.5ms | 1.021 | fra | 6453 | No |
45ms | 1.066 | sin | 6453 | No |
455.3ms | 1.521 | lhr | 6453 | No |
191.2ms | 1.712 | syd | 6453 | No |
19764.8ms | 21.477 | dfw | 6453 | No |
171947.8ms | 193.425 | ewr | 6453 | No |
For AS174 (Cogent Communications), propagation seems to take a little longer: due to policy on the upstream ISPs used for route collection, Cogent routes were only imported via other carriers, so there is a similar effect to Tata here. However, it is odd that Toronto (YYZ) sees the route first after announcement, since the announcement is done in London (LHR).
This is likely the impact of a route reflector or something similar inside the network.
Time since last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
0 | 0 | Origin | 174 | No |
2425.1ms | 2.425 | yyz | 174 | No |
2202.1ms | 4.627 | yyz | 174 | No |
3328ms | 7.955 | yyz | 174 | No |
1141.3ms | 9.096 | sea | 174 | No |
85.1ms | 9.181 | lhr | 174 | No |
158ms | 9.339 | syd | 174 | No |
230.7ms | 9.57 | dfw | 174 | No |
56.2ms | 9.626 | ewr | 174 | No |
65.8ms | 9.692 | nrt | 174 | No |
107.7ms | 9.8 | fra | 174 | No |
18.7ms | 9.819 | lax | 174 | No |
49.8ms | 9.869 | mia | 174 | No |
33.3ms | 9.902 | cdg | 174 | No |
18.2ms | 9.92 | ams | 174 | No |
74.2ms | 9.994 | sjc | 174 | No |
4ms | 9.998 | sin | 174 | No |
16898.5ms | 26.897 | ord | 174 | No |
531.6ms | 27.429 | dfw | 174 | No |
For AS3257 (GTT) we are finally seeing some timing data based on providers we are locally connected to. GTT does seem to send things around the world reasonably fast, at a shiny 1.9 seconds (apart from EWR again, supporting the idea that the EWR result is more of a data point error than anything else).
Time since last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
0ms | 0 | Origin | 3257 | No |
721.1ms | 0.721 | yyz | 3257 | No |
64.2ms | 0.785 | lax | 3257 | Yes |
9.4ms | 0.794 | dfw | 3257 | Yes |
52.9ms | 0.847 | sea | 3257 | Yes |
21.4ms | 0.868 | sjc | 3257 | Yes |
44.9ms | 0.913 | mia | 3257 | Yes |
15.8ms | 0.929 | nrt | 3257 | No |
82.9ms | 1.012 | fra | 3257 | No |
20.1ms | 1.032 | cdg | 3257 | Yes |
36.6ms | 1.069 | sin | 3257 | No |
19.7ms | 1.089 | ams | 3257 | No |
19.1ms | 1.108 | lhr | 3257 | No |
256.1ms | 1.364 | syd | 3257 | No |
19.1ms | 1.383 | ord | 3257 | Yes |
208.5ms | 1.592 | lhr | 3257 | No |
281.2ms | 1.873 | fra | 3257 | Yes |
88.7ms | 1.962 | nrt | 3257 | No |
114745ms | 116.708 | ewr | 3257 | No |
AS1299 (Telia) has more logical timing: 0.6 seconds after we announce in London it appears in Paris and Frankfurt directly, and it is fully propagated to all nodes less than 2 seconds after that. However, other carriers beat Telia to their own route! If you look at ORD (Chicago) and MIA (Miami), you can see other carriers pick up the route from Telia at another location and hand it to our provider, before the route arrives as a direct route some 20 seconds later.
Time since last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
0 | 0 | Origin | 1299 | No |
632.3ms | 0.632 | cdg | 1299 | Yes |
11.3ms | 0.643 | fra | 1299 | Yes |
107.1ms | 0.75 | sea | 1299 | No |
76.8ms | 0.827 | ams | 1299 | Yes |
17.6ms | 0.845 | lhr | 1299 | No |
40.5ms | 0.886 | yyz | 1299 | No |
27.6ms | 0.914 | mia | 1299 | No |
59.9ms | 0.974 | sjc | 1299 | No |
8.2ms | 0.982 | dfw | 1299 | Yes |
9.4ms | 0.991 | lax | 1299 | No |
10.9ms | 1.002 | ewr | 1299 | Yes |
5.9ms | 1.008 | yyz | 1299 | Yes |
61.5ms | 1.07 | lhr | 1299 | Yes |
137.1ms | 1.207 | ord | 1299 | No |
211.8ms | 1.419 | nrt | 1299 | No |
12.3ms | 1.431 | sin | 1299 | No |
499ms | 1.93 | nrt | 1299 | No |
198.6ms | 2.129 | syd | 1299 | No |
21135.5ms | 23.265 | mia | 1299 | Yes |
3196.4ms | 26.461 | ord | 1299 | Yes |
Level 3 (AS3356) does by far the worst in this test, taking 18 seconds from announcing the test prefix until it appears anywhere on the internet, and it appears in SEA (Seattle) of all places; from there, other carriers pick up that route and propagate it faster than Level 3 itself. Some 30 seconds later Level 3 has caught up and the route is seen in all places with Level 3 peering.
Time since last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
0 | 0 | Origin | 3356 | No |
18508.1ms | 18.508 | sea | 3356 | Yes |
365.6ms | 18.874 | yyz | 3356 | Yes |
241.9ms | 19.116 | lhr | 3356 | No |
251.2ms | 19.367 | cdg | 3356 | No |
174.5ms | 19.541 | mia | 3356 | No |
87.3ms | 19.628 | fra | 3356 | No |
6.1ms | 19.634 | sin | 3356 | No |
6.9ms | 19.641 | ewr | 3356 | No |
53.3ms | 19.694 | ams | 3356 | No |
212.3ms | 19.906 | dfw | 3356 | No |
72.1ms | 19.978 | lax | 3356 | No |
0.9ms | 19.979 | nrt | 3356 | No |
187.5ms | 20.166 | sjc | 3356 | No |
194.1ms | 20.36 | syd | 3356 | No |
10094.8ms | 30.455 | ewr | 3356 | Yes |
3963.6ms | 34.419 | mia | 3356 | Yes |
1207.5ms | 35.627 | ord | 3356 | No |
1684.1ms | 37.311 | sjc | 3356 | Yes |
476.2ms | 37.787 | fra | 3356 | Yes |
434.5ms | 38.222 | dfw | 3356 | Yes |
1264.5ms | 39.487 | ord | 3356 | No |
5106.7ms | 44.594 | ams | 3356 | Yes |
1426.2ms | 46.02 | ord | 3356 | Yes |
695.5ms | 46.715 | lhr | 3356 | Yes |
3801.6ms | 50.517 | cdg | 3356 | Yes |
Last but not least is AS2914 (NTT Communications), who, while not the fastest at sending routes globally, did appear to be the smoothest and most consistent.
Time since last update | Wall time (s) | Location | Upstream AS | Collector directly connected |
0 | 0 | Origin | 2914 | No |
814.9ms | 0.815 | ewr | 2914 | Yes |
90.6ms | 0.906 | fra | 2914 | Yes |
290.2ms | 1.196 | lax | 2914 | Yes |
0.6ms | 1.197 | cdg | 2914 | Yes |
121.4ms | 1.318 | yyz | 2914 | No |
5.1ms | 1.323 | lhr | 2914 | Yes |
24ms | 1.347 | nrt | 2914 | Yes |
34ms | 1.381 | ams | 2914 | No |
1.1ms | 1.382 | dfw | 2914 | No |
29.6ms | 1.412 | sea | 2914 | No |
81.4ms | 1.493 | sjc | 2914 | No |
68.8ms | 1.562 | sea | 2914 | Yes |
63.4ms | 1.625 | ord | 2914 | No |
57.1ms | 1.682 | mia | 2914 | No |
75.7ms | 1.758 | ams | 2914 | Yes |
108.9ms | 1.867 | sjc | 2914 | Yes |
196.8ms | 2.064 | mia | 2914 | Yes |
75.8ms | 2.14 | syd | 2914 | Yes |
54.9ms | 2.195 | dfw | 2914 | Yes |
0ms | 2.195 | ord | 2914 | Yes |
17.5ms | 2.212 | sin | 2914 | Yes |
Now that we have covered all the carriers you may think that is it; however, there is a different kind of propagation we can observe:
Just as we can race networks in sending out BGP routes, we can also race them in withdrawing them!
This is a test that is harder to see on the routing table itself, so it’s easier (and much more fun) to observe it by simply doing a traceroute to a prefix and then withdrawing it from all providers:
Here you can see the route slowly being released out of all of the carriers, then the carrier backbones, and then the carrier inter-peering relationships. It also exposes some interestingly strange routing as the options to route the prefix begin to run out!
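If you want to reproduce that kind of view yourself, a rough sketch is just to run traceroute in a loop with timestamps while the prefix is being withdrawn. This is my own reconstruction of the demo, and it assumes a Unix-like system with traceroute installed; 192.0.2.1 is a documentation address standing in for the real test IP:

```python
# Repeatedly traceroute towards an address in the (about to be withdrawn)
# test prefix, printing a timestamp before each run, so you can watch the
# path fall apart as carriers drop the route one by one.

import subprocess
import time
from datetime import datetime, timezone

TARGET = "192.0.2.1"

for _ in range(30):  # ~30 runs is plenty to watch a withdrawal propagate
    print(f"--- {datetime.now(timezone.utc).isoformat()} ---")
    # -n: no DNS, -w 1: 1s per-hop wait, -q 1: one probe per hop
    subprocess.run(["traceroute", "-n", "-w", "1", "-q", "1", TARGET])
    time.sleep(2)
```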
Anyway, as I said to the audience, we have had the fast bit, now we can have the furious part! If you generally like this kind of post, I aim to post once a month on various (mostly networking related) matters. If you want to stay up to date with that, you can either use my blog’s RSS feed or follow me on Twitter for updates when the next post happens.
I would like to thank AS57782 / Cynthia Revstrom for lending some IPv4 space for this post, and for helping out with the traceroute demo you see above.
If you do have any questions about this talk, please feel free to reach out on the email that is on the slide above! Until next time!