
May 4 2019

The speed of BGP network propagation

At the end of March 2019 I gave a talk at INEX's (Ireland's biggest internet exchange point) Annual General Meeting. I was supposed to record it, but in a brief panic over HDMI not working on my laptop I forgot to start the recording. Since people found it interesting, I figured I would turn it into a blog post instead:

Intro slide that is a rehash of a Fast and Furious poster with JunOS placed on a car

I spent so much time on this opening slide that I feel the need to include it even if I'm not doing an introduction this time.

a long time ago... in a job a year back

So a long time ago… in a job fa- well, a year back, we were dealing with the lovely routing design of anycast.

Anycast

For those who need a quick primer, it's a routing design that allows more natural region-based load balancing: you can put server clusters in different regions and serve traffic locally to those regions without having to play tricks with DNS:

Many nodes on the map

This works by having all participating nodes announce the same IP prefixes globally. With some careful routing tuning (mainly careful selection of upstream transit/peering providers) you can get good load balancing and latency results, since traffic is served closer to the visitor's region.
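To make the idea concrete, here is a toy Python model of how "announce the same prefix everywhere" turns into regional load balancing. The site names and hop counts are entirely made up for illustration:

```python
# Toy anycast model: every site announces the same prefix, and each client
# network simply picks the announcement with the shortest AS_PATH, so
# traffic stays regional. Sites and hop counts here are hypothetical.
SITES = ["lhr", "sjc", "sin"]

# Hypothetical AS-hop distance from each client region to each site.
HOPS = {
    "europe":  {"lhr": 2, "sjc": 5, "sin": 6},
    "us-west": {"lhr": 5, "sjc": 2, "sin": 5},
    "asia":    {"lhr": 6, "sjc": 4, "sin": 3},
}

def chosen_site(region):
    # BGP best-path selection reduced to "shortest AS_PATH wins"
    return min(SITES, key=lambda site: HOPS[region][site])

for region in HOPS:
    print(region, "->", chosen_site(region))
```

In this simplified world every region lands on its nearest cluster; the rest of the post is about what happens when reality does not cooperate.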

Many nodes announcing the same IP prefix

Sadly, a lot of networks struggle to get consistent network announcements to work right, often resulting in totally backwards-from-logic routing:

Cat bowl based anycast routing

However, even for networks that get most regions right, regions like Asia are much harder to route correctly, partly because local ISPs are either dealing with overloaded links or have links that don't always follow logical geographic paths.

Two cats eating from the same bowl, when one cat really should be eating from the other

The crux of the problem is ensuring your routing announcements are consistent across all regions and across almost all major interconnection ISPs (Tier 1s).

Routing providers changing, but the AS_PATH staying the same length

In simple setups, this really just means you need to keep your AS_PATHs as close to the same length as possible in all the regions and carriers you want routing control over.

all countries announcing the same AS_PATH

This basically just means you should be attempting to use the same providers and traffic engineering parameters in all regions, AS_PATH prepending being one of the more basic ones.
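A quick sketch of what prepending does, using hypothetical ASNs (64496/64497/64500 are from the ranges reserved for documentation): repeating your own ASN makes a path look longer, nudging remote networks toward the announcements you did not prepend.

```python
# AS_PATH prepending sketch with a hypothetical origin ASN: repeating our
# own ASN makes a path look longer, so remote networks prefer the paths
# we did NOT prepend.
ORIGIN_AS = 64500  # hypothetical ASN

def announce(upstream_path, prepend=0):
    # The path a remote network sees: the upstream's ASes followed by our
    # ASN once, plus any extra prepends.
    return upstream_path + [ORIGIN_AS] * (1 + prepend)

plain = announce([64496], prepend=0)      # via one hypothetical upstream
prepended = announce([64497], prepend=2)  # via another, prepended twice

# All else being equal, BGP prefers the shorter AS_PATH:
best = min(plain, prepended, key=len)
print(best)  # the un-prepended route wins
```

Drop a prepend on one router and its path suddenly becomes the shortest one, which is exactly the failure described next.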

one country announcing a shorter path than the rest, then bursting into flames

However, as systems get larger and more complex, eventually a mistake is going to be made. In the case of the job, a configuration misunderstanding during maintenance of a router caused it to drop a traffic engineering prepend. This caused a huge traffic shift globally towards that router, almost instantly overloading the site.

This was a regrettable incident, and it became clear that while the traffic engineering prepend was useful in the past, at that point in the network it was more of a liability than a useful tool. So it was time to remove it.

But what if we were to make the same mistake again? This time we are changing a lot more router configuration at once. It’s worth thinking about failure modes here. There are two ways the change could fail:

one country announcing a longer AS_PATH than the rest, causing traffic to move away

The first is that the change applies to a large percentage of the routers across the network, but some of them fail to apply it. This would cause traffic to mostly shift away from those locations and head to other nearby sites. As long as not too many sites have this issue, this is the best way it can fail.

one country on fire due to announcing a too short of a AS_PATH

The nastier way it can fail is that most routers don't end up being changed, but a small percentage of them do.

This would be a repeat of the first incident, except more routers would be involved, and we would be dealing with a lot more routers needing a rescue configuration rollback or hotfix.

The story of how they changed this in a sane way is not mine to tell; however, someone at the front of this change did a talk at RIPE NCC's bi-annual event about how it was done:

click here to go to that talk

The good news is the change went through fine, and no router got left behind! A small amount of traffic churn happened while routers globally updated their routing tables and informed the other internal routers they were connected to.

During this time, it was observed that not all of the providers accepted this change at the same time: some providers seemed to reconverge almost instantly, but others were noticeably slow.

This begs the question: how long is this sort of thing generally supposed to take? Are some providers better than others? Who is the fastest, and who is the slowest?

However, for this we need to define what it means for a route to be propagated. There are two valid ways (in my eyes) this could be defined:

First Announcement wins diagram

The “First Announcement Wins” method is quite literally what it says on the tin. When we see a BGP update message for our prefix, that provider+location combo wins (or, if they are late, loses).

This could be slightly flawed, since some networks might have hard-to-observe mechanisms for quickly sending routing information inside their network, but those initial internal route updates may not be sensible network paths.

First Stable Announcement wins diagram

In “First Stable Announcement Wins”, testing is done to ensure that whatever route becomes “stable” (stops changing its internal routing in the provider backbone) is declared the winner.

In my eyes this is what most network engineers are looking for; however, it also has a large issue attached to it:

Internals of networks are unknown and often magic

Figuring out what counts as a stable route involves a non-trivial amount of complexity, and no matter how I do it, I don't think it is measurable down to tens of milliseconds.

For this reason, the experiment we are doing is using “First Announcement Wins.”
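The bookkeeping for "First Announcement Wins" is pleasingly simple; a minimal sketch (timestamps and locations made up for illustration):

```python
# "First Announcement Wins": for every (provider, location) pair, only the
# earliest BGP UPDATE seen for our prefix counts. The timestamps and
# locations here are made up for illustration.
updates = [
    (0.586, "cdg", 6453),
    (0.523, "sea", 6453),
    (0.601, "sea", 6453),  # later churn for sea: ignored
    (0.593, "fra", 6453),
]

def first_announcements(stream):
    winners = {}
    for ts, location, provider in sorted(stream):
        # setdefault keeps only the first (earliest) timestamp per combo
        winners.setdefault((provider, location), ts)
    return winners

winners = first_announcements(updates)
print(winners[(6453, "sea")])  # 0.523, not the later churn at 0.601
```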

The propagation race works like so:

slide containing the setup of the bgp race

The high precision timestamps are important here, and a detail that actually ended up being slightly devastating for the first few runs due to the inaccuracy of system clocks.

Juniper supervisor card with the 10MHz and PPS ports highlighted

You see, I now have a stronger respect (in that I now actually believe they have worth) for the PPS and 10MHz clock inputs on a lot of high-end carrier routers, since time syncing is actually incredibly hard once you go beyond two systems. Locking all systems to a stable clock source is immensely nice, and before you ask, NTP does not really get that close in real-life situations with a wide range of geographically separated targets.

After a lot of time syncing and timestamp offset correction, I ended up with a linear list of announcements by server location (airport codes to signify where they are, since that's generally what the networking industry seems to use):

a listing of all announcements in tsv format
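The tables below show a "Last update" column alongside wall time; that column is just the gap between consecutive announcements. A quick sketch of how it's derived, using the rounded wall times from the Tata table (so the deltas differ slightly from the published millisecond values):

```python
# Derive the "Last update" deltas from wall-clock arrival times (seconds
# since the origin announcement). Inputs are the rounded wall times from
# the Tata table, so the deltas differ slightly from the published ones.
wall_times = [0.0, 0.523, 0.586, 0.593]

deltas_ms = [
    round((later - earlier) * 1000, 1)
    for earlier, later in zip(wall_times, wall_times[1:])
]
print(deltas_ms)  # [523.0, 63.0, 7.0]
```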

For AS6453 (Tata Communications) times looked decent to start with. Given that none of the BGP route update collector nodes had Tata as a direct provider, this is basically racing how fast Tata's peers can send routes around.

It's interesting that there seems to be a ~500ms minimum, but after that routes start to move around the globe very fast, with the exception of EWR (New York area), which is likely an outlier.

| Last update | Wall time (s) | Location | Upstreaming AS | Collector directly connected |
| --- | --- | --- | --- | --- |
| 0 | 0 | Origin | 6453 | No |
| 523.2ms | 0.523 | sea | 6453 | No |
| 62.5ms | 0.586 | cdg | 6453 | No |
| 6.8ms | 0.593 | fra | 6453 | No |
| 25ms | 0.618 | lax | 6453 | No |
| 107.5ms | 0.725 | yyz | 6453 | No |
| 92.8ms | 0.818 | nrt | 6453 | No |
| 12.9ms | 0.831 | sjc | 6453 | No |
| 9ms | 0.84 | mia | 6453 | No |
| 5.4ms | 0.845 | ams | 6453 | No |
| 58.4ms | 0.903 | lhr | 6453 | No |
| 22.9ms | 0.926 | dfw | 6453 | No |
| 68.4ms | 0.994 | ord | 6453 | No |
| 26.5ms | 1.021 | fra | 6453 | No |
| 45ms | 1.066 | sin | 6453 | No |
| 455.3ms | 1.521 | lhr | 6453 | No |
| 191.2ms | 1.712 | syd | 6453 | No |
| 19764.8ms | 21.477 | dfw | 6453 | No |
| 171947.8ms | 193.425 | ewr | 6453 | No |

For AS174 (Cogent Communications) it seems that propagation takes a little longer: due to policy on the upstream ISPs used for route collection, Cogent was only imported via other carriers, so there is a similar effect to Tata here. However, it is odd that Toronto (YYZ) sees the route first after announcement, since the announcement is done in London (LHR).

This is likely the impact of a route reflector or something similar inside the network.

| Last update | Wall time (s) | Location | Upstreaming AS | Collector directly connected |
| --- | --- | --- | --- | --- |
| 0 | 0 | Origin | 174 | No |
| 2425.1ms | 2.425 | yyz | 174 | No |
| 2202.1ms | 4.627 | yyz | 174 | No |
| 3328ms | 7.955 | yyz | 174 | No |
| 1141.3ms | 9.096 | sea | 174 | No |
| 85.1ms | 9.181 | lhr | 174 | No |
| 158ms | 9.339 | syd | 174 | No |
| 230.7ms | 9.57 | dfw | 174 | No |
| 56.2ms | 9.626 | ewr | 174 | No |
| 65.8ms | 9.692 | nrt | 174 | No |
| 107.7ms | 9.8 | fra | 174 | No |
| 18.7ms | 9.819 | lax | 174 | No |
| 49.8ms | 9.869 | mia | 174 | No |
| 33.3ms | 9.902 | cdg | 174 | No |
| 18.2ms | 9.92 | ams | 174 | No |
| 74.2ms | 9.994 | sjc | 174 | No |
| 4ms | 9.998 | sin | 174 | No |
| 16898.5ms | 26.897 | ord | 174 | No |
| 531.6ms | 27.429 | dfw | 174 | No |

For AS3257 (GTT) we are finally seeing some timing data based on providers we are directly connected to. GTT does seem to send things around the world reasonably fast, fully propagating in a shiny 1.9 seconds (apart from EWR, supporting the idea that EWR is more of a data point error than anything else).

| Last update | Wall time (s) | Location | Upstreaming AS | Collector directly connected |
| --- | --- | --- | --- | --- |
| 0ms | 0 | Origin | 3257 | No |
| 721.1ms | 0.721 | yyz | 3257 | No |
| 64.2ms | 0.785 | lax | 3257 | Yes |
| 9.4ms | 0.794 | dfw | 3257 | Yes |
| 52.9ms | 0.847 | sea | 3257 | Yes |
| 21.4ms | 0.868 | sjc | 3257 | Yes |
| 44.9ms | 0.913 | mia | 3257 | Yes |
| 15.8ms | 0.929 | nrt | 3257 | No |
| 82.9ms | 1.012 | fra | 3257 | No |
| 20.1ms | 1.032 | cdg | 3257 | Yes |
| 36.6ms | 1.069 | sin | 3257 | No |
| 19.7ms | 1.089 | ams | 3257 | No |
| 19.1ms | 1.108 | lhr | 3257 | No |
| 256.1ms | 1.364 | syd | 3257 | No |
| 19.1ms | 1.383 | ord | 3257 | Yes |
| 208.5ms | 1.592 | lhr | 3257 | No |
| 281.2ms | 1.873 | fra | 3257 | Yes |
| 88.7ms | 1.962 | nrt | 3257 | No |
| 114745ms | 116.708 | ewr | 3257 | No |

AS1299 (Telia) has more logical timing: 0.6 seconds after we announce in London it appears directly in Paris and Frankfurt, and it is fully propagated to all nodes less than 2 seconds after that. However, other carriers beat Telia to their own route! If you look at ORD (Chicago) and MIA (Miami), you can see other carriers pick up the route from Telia at another location and hand it to our provider before, 20 seconds later, it arrives as a direct route.

| Last update | Wall time (s) | Location | Upstreaming AS | Collector directly connected |
| --- | --- | --- | --- | --- |
| 0 | 0 | Origin | 1299 | No |
| 632.3ms | 0.632 | cdg | 1299 | Yes |
| 11.3ms | 0.643 | fra | 1299 | Yes |
| 107.1ms | 0.75 | sea | 1299 | No |
| 76.8ms | 0.827 | ams | 1299 | Yes |
| 17.6ms | 0.845 | lhr | 1299 | No |
| 40.5ms | 0.886 | yyz | 1299 | No |
| 27.6ms | 0.914 | mia | 1299 | No |
| 59.9ms | 0.974 | sjc | 1299 | No |
| 8.2ms | 0.982 | dfw | 1299 | Yes |
| 9.4ms | 0.991 | lax | 1299 | No |
| 10.9ms | 1.002 | ewr | 1299 | Yes |
| 5.9ms | 1.008 | yyz | 1299 | Yes |
| 61.5ms | 1.07 | lhr | 1299 | Yes |
| 137.1ms | 1.207 | ord | 1299 | No |
| 211.8ms | 1.419 | nrt | 1299 | No |
| 12.3ms | 1.431 | sin | 1299 | No |
| 499ms | 1.93 | nrt | 1299 | No |
| 198.6ms | 2.129 | syd | 1299 | No |
| 21135.5ms | 23.265 | mia | 1299 | Yes |
| 3196.4ms | 26.461 | ord | 1299 | Yes |

Level 3 (AS3356) does by far the worst in this test, taking 18 seconds from announcing the test prefix until it appears anywhere on the internet, and it appears in SEA (Seattle) of all places. From there, other carriers pick up the route and propagate it faster than Level 3 itself. Some 30 seconds later Level 3 has caught up and the route is seen in all places with Level 3 peering.

| Last update | Wall time (s) | Location | Upstreaming AS | Collector directly connected |
| --- | --- | --- | --- | --- |
| 0 | 0 | Origin | 3356 | No |
| 18508.1ms | 18.508 | sea | 3356 | Yes |
| 365.6ms | 18.874 | yyz | 3356 | Yes |
| 241.9ms | 19.116 | lhr | 3356 | No |
| 251.2ms | 19.367 | cdg | 3356 | No |
| 174.5ms | 19.541 | mia | 3356 | No |
| 87.3ms | 19.628 | fra | 3356 | No |
| 6.1ms | 19.634 | sin | 3356 | No |
| 6.9ms | 19.641 | ewr | 3356 | No |
| 53.3ms | 19.694 | ams | 3356 | No |
| 212.3ms | 19.906 | dfw | 3356 | No |
| 72.1ms | 19.978 | lax | 3356 | No |
| 0.9ms | 19.979 | nrt | 3356 | No |
| 187.5ms | 20.166 | sjc | 3356 | No |
| 194.1ms | 20.36 | syd | 3356 | No |
| 10094.8ms | 30.455 | ewr | 3356 | Yes |
| 3963.6ms | 34.419 | mia | 3356 | Yes |
| 1207.5ms | 35.627 | ord | 3356 | No |
| 1684.1ms | 37.311 | sjc | 3356 | Yes |
| 476.2ms | 37.787 | fra | 3356 | Yes |
| 434.5ms | 38.222 | dfw | 3356 | Yes |
| 1264.5ms | 39.487 | ord | 3356 | No |
| 5106.7ms | 44.594 | ams | 3356 | Yes |
| 1426.2ms | 46.02 | ord | 3356 | Yes |
| 695.5ms | 46.715 | lhr | 3356 | Yes |
| 3801.6ms | 50.517 | cdg | 3356 | Yes |

Last but not least is AS2914 (NTT Communications). While they are not the fastest at sending routes globally, they did appear to be the smoothest and most consistent.

| Last update | Wall time (s) | Location | Upstreaming AS | Collector directly connected |
| --- | --- | --- | --- | --- |
| 0 | 0 | Origin | 2914 | No |
| 814.9ms | 0.815 | ewr | 2914 | Yes |
| 90.6ms | 0.906 | fra | 2914 | Yes |
| 290.2ms | 1.196 | lax | 2914 | Yes |
| 0.6ms | 1.197 | cdg | 2914 | Yes |
| 121.4ms | 1.318 | yyz | 2914 | No |
| 5.1ms | 1.323 | lhr | 2914 | Yes |
| 24ms | 1.347 | nrt | 2914 | Yes |
| 34ms | 1.381 | ams | 2914 | No |
| 1.1ms | 1.382 | dfw | 2914 | No |
| 29.6ms | 1.412 | sea | 2914 | No |
| 81.4ms | 1.493 | sjc | 2914 | No |
| 68.8ms | 1.562 | sea | 2914 | Yes |
| 63.4ms | 1.625 | ord | 2914 | No |
| 57.1ms | 1.682 | mia | 2914 | No |
| 75.7ms | 1.758 | ams | 2914 | Yes |
| 108.9ms | 1.867 | sjc | 2914 | Yes |
| 196.8ms | 2.064 | mia | 2914 | Yes |
| 75.8ms | 2.14 | syd | 2914 | Yes |
| 54.9ms | 2.195 | dfw | 2914 | Yes |
| 0ms | 2.195 | ord | 2914 | Yes |
| 17.5ms | 2.212 | sin | 2914 | Yes |

Now that we have covered all the carriers you may think that is it; however, there is a different kind of propagation we can observe:

the dying breaths of a route

Just as we can race networks in sending out BGP routes, we can also race them in withdrawing them!

This is a test that is harder to see on the routing table itself, so it’s easier (and much more fun) to observe it by simply doing a traceroute to a prefix and then withdrawing it from all providers:

a gif showing a route slowly draining out of the routing table of many carriers

Here you can see the route slowly being released out of all of the carriers, then the carrier backbones, and then the carrier inter-peering relationships. It also exposes some interestingly strange routing as options for routing the prefix begin to run out!

the ending/questions slide

Anyway, as I said to the audience: we have had the fast part, now we can have the furious part! If you generally like this kind of post, I aim to post once a month on various (mostly networking related) matters. If you want to stay up to date, you can either use my blog's RSS feed or follow me on Twitter for updates when the next post happens.

I would like to thank AS57782 / Cynthia Revstrom for lending some IPv4 space for this post, and for helping out on the traceroute demo you see above.

If you do have any questions about this talk, please feel free to reach out on the email that is on the slide above! Until next time!