
Apr 30 2020

Stressing the network when it’s already down

A rough demo of what it feels like when your customers make more load as a result of you being overloaded. Impact by Eric Wienke, edits by Ben Cartwright-Cox ;) / CC BY-NC-SA

A few days ago, shortly after 17:16 BST, a handful of networks owned by Liberty Global (with the biggest impact being to their UK network, known as Virgin Media) started having issues reaching the rest of the internet. The exact cause is, at the time of writing, only rumor. However, the outage had an interesting pattern to it: the network would fail almost every hour at 17 minutes past (17:17 / 18:17 / 19:17 / 21:17 / 23:17 / 00:17), producing this interesting graph of systems going offline, generated from a system I use to monitor my network's reachability to the rest of the internet:

Grafana graph showing dips in reachability
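
The monitoring behind a graph like this can be as simple as repeatedly probing a set of remote hosts and recording what fraction respond. A minimal sketch (the targets and TCP-connect approach here are illustrative assumptions, not the actual monitoring system behind the graph):

```python
import socket


def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def reachability(targets: list[tuple[str, int]]) -> float:
    """Fraction of targets currently reachable, from 0.0 to 1.0."""
    return sum(reachable(h, p) for h, p in targets) / len(targets)


# e.g. log reachability([("192.0.2.1", 443), ("192.0.2.2", 443)]) once a
# minute and graph it; dips show up exactly like the Grafana chart above
```

A real setup would probe many vantage points in parallel, but the shape of the output is the same: a time series that dips when paths fail.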

Other providers have also posted their point of view of the outages, showing bandwidth drops instead:

Twitter screenshot: Yesterday doesn't look like it was much fun for network, operations and support folks at Virgin Broadband.

One interesting thing about the outages is that they all started at the same time, and took a similar amount of time to resolve.

A possible cause might be a destructive cron job firing, since the hourly crontab folder runs at exactly 17 minutes past the hour:

[20:38:43] ben@metropolis:~$ grep hourly /etc/crontab
17 *	* * *	root    cd / && run-parts --report /etc/cron.hourly

I guess we will have to wait for a Reason For Outage (RFO) report from them to know for sure.

Meanwhile, during the outage, people in backchannels noticed traffic pickups towards Virgin Media, prompting speculation that it was attack traffic. However, on deeper inspection, this turned out to be traffic to their network's Speedtest.net servers! These graphs, generated from data given to me by Jump Networks, show speedtest server traffic going to/from Virgin Media with the following profile:

Jump net graph

It would appear that every time that Virgin Media dropped off, people en masse flocked to speedtest services to confirm that their internet connection was having problems.

If you go down to the NetFlow level, you can even see very clearly when services were restored for customers:

Netflow connection graph

Time SpeedTest.net flows
2020-04-27T16:16:55 2
2020-04-27T16:16:56 5
2020-04-27T16:16:57 2
2020-04-27T16:16:58 1
2020-04-27T16:17:00 2
2020-04-27T16:17:01 0
2020-04-27T16:17:02 0
(no traffic continues)
2020-04-27T16:20:16 0
2020-04-27T16:20:17 47
2020-04-27T16:20:18 62
2020-04-27T16:20:19 46
2020-04-27T16:20:20 70
2020-04-27T16:20:21 137
2020-04-27T16:20:22 111
2020-04-27T16:20:23 99
2020-04-27T16:20:24 73
2020-04-27T16:20:25 79
2020-04-27T16:20:26 122
2020-04-27T16:20:27 78
2020-04-27T16:20:28 115
2020-04-27T16:20:29 82
2020-04-27T16:20:30 56
2020-04-27T16:20:31 122
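
A per-second count like the one above is straightforward to produce from raw flow records. A hedged sketch, assuming the flow start timestamps have already been exported as ISO 8601 strings (the input format and function name are illustrative, not how Jump Networks generated their data):

```python
from collections import Counter
from datetime import datetime


def flows_per_second(start_times: list[str]) -> list[tuple[str, int]]:
    """Bucket ISO 8601 flow start timestamps into per-second counts,
    producing the same shape as the table above."""
    counts = Counter(
        # truncate sub-second precision so flows group by whole second
        datetime.fromisoformat(ts).replace(microsecond=0).isoformat()
        for ts in start_times
    )
    return sorted(counts.items())
```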

This kind of collective behaviour is fascinating to me. It also presents an interesting customer-driven positive feedback loop for networks having temporary congestion problems: the people verifying that the network is congested are themselves adding more congestion to it.

Or, put in the form of a drawing:

The speed test feedback loop
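
The loop in the drawing can be sketched as a toy simulation: while the link is over capacity, some fraction of affected users run speed tests, and each test adds its own load on top of the baseline. All the numbers and the model itself are made up for illustration:

```python
def simulate(capacity: float, base_load: float, users: int,
             test_fraction: float, test_load: float, steps: int) -> list[float]:
    """Toy model of the speed test feedback loop: whenever the link is
    over capacity, a fraction of users run speed tests, each adding load."""
    load, history = base_load, []
    for _ in range(steps):
        # speed-test traffic only appears while users perceive congestion
        extra = users * test_fraction * test_load if load > capacity else 0.0
        load = base_load + extra
        history.append(load)
    return history
```

With a baseline safely under capacity, nothing happens; push the baseline just over capacity and the speed-test traffic makes the congestion markedly worse than the original overload alone.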

This is not to say that this was the issue that caused the Virgin Media outages, since these spikes started after connectivity was restored, not during the outage window.

Some of this behaviour reminds me of that one time Apple's captive.apple.com had reachability issues, causing cell networks to almost instantly run out of capacity as iPhones collectively concluded that their local WiFi connections were broken and switched to cellular data to mitigate imaginary connection problems. Or when someone accidentally caused some devices out in the field to all call home at exactly the same time, overwhelming the local cellular network.

Actions that may seem harmless when done by one person can quickly become harmful when automated or done in sync by a large group of people, and while it's easy to fix the automated ones, it's much harder to fix people.
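
For the automated cases, the usual fix is to desynchronise the fleet: add random jitter (typically combined with exponential backoff) to retry and call-home timers, so devices never all fire in the same second. A minimal sketch of the "full jitter" variant (the parameter values are illustrative):

```python
import random


def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Full-jitter exponential backoff: wait a uniformly random delay in
    [0, min(cap, base * 2**attempt)]. The randomness spreads a fleet's
    retries across the window instead of letting every device reconnect
    at the same instant and re-overwhelm the network."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```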

I would like to thank James Rice from Jump Networks for the data that backed this blog post.

If this is your kind of stuff, you may find other bits you like on the rest of the blog. If you want to stay up to date with my ramblings or projects, you can use my blog's RSS Feed or you can follow me on Twitter.