Jul 3 2015

Propagation slow? Sound the alarms!

This is a blog post that I wrote for my employer, CloudFlare. You can find the full link here.

CloudFlare operates a huge global network of servers that proxy our customers’ web sites, operate as caches, inspect requests to ensure they are not malicious, deflect DDoS attacks and handle one of the largest authoritative DNS systems in the world. And where there’s software there’s configuration information.

CloudFlare is highly customisable. Each customer has a unique configuration consisting of DNS records, all manner of settings (such as minification, image recompression, IP-based blocking, which individual WAF rules to execute) and per-URL rules. And the configuration changes constantly.

And we offer almost instant configuration changes. If a user adds a DNS record it should be globally resolvable in seconds. If a user enables the CloudFlare WAF it should happen very, very fast to protect a site. This presents a challenge because those configuration changes need to be pushed across the globe very quickly.

We’ve written in the past about the underlying technology we use: Kyoto Tycoon and how we secured it from eavesdroppers. We also monitor its performance.

DNS records are currently changing at a rate of around 40 per second, 24 hours a day. All those changes need to be propagated in seconds.

So we take propagation times very seriously.

For this we need to keep a close eye on how long it takes a change to reach every one of our data centers. While we have in-depth metrics for our operations team to look at and an entire alerting system, it’s sometimes useful (and fun) to have something more visceral.

We also want developers and operations people to be equally aware of some critical metrics, but developers don’t spend their time watching the metrics and alerts aimed at operations.

On some rare occasions, perhaps due to routing problems on the wider Internet, we may find that we can’t push changes at the required speed. To ensure that we know about this as soon as possible and know when to take action, we’ve built a custom alert system that everyone in the office can see.

From an external global collection of machines we monitor propagation time for DNS records and trigger an alert if propagation time exceeds a pre-set threshold. The alert comes in the form of a blue rotating ‘police light’.
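As a rough illustration of that threshold check, here is a minimal Python sketch. The monitor names, the 5-second threshold and the choice to alert on the slowest monitor (rather than, say, an average) are all assumptions made for the example; the real values and aggregation rule aren't described here.

```python
from typing import Mapping

# Hypothetical threshold: the real pre-set value is not given in this post.
THRESHOLD_SECONDS = 5.0


def slowest_propagation(times_by_monitor: Mapping[str, float]) -> float:
    """Return the largest propagation time reported by any monitor."""
    if not times_by_monitor:
        raise ValueError("no measurements available")
    return max(times_by_monitor.values())


def should_alert(times_by_monitor: Mapping[str, float],
                 threshold: float = THRESHOLD_SECONDS) -> bool:
    """True when the slowest monitor exceeds the threshold (assumed rule)."""
    return slowest_propagation(times_by_monitor) > threshold


# Made-up measurements, in seconds, keyed by a made-up monitor name.
measurements = {"monitor-a": 1.2, "monitor-b": 0.9, "monitor-c": 7.5}
print(should_alert(measurements))  # True: monitor-c is over the threshold
```

Alerting on the worst case means a single slow location is enough to trigger the light, which suits a change that has to reach every data center.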

We had joked about having a “red alert” alarm when we fall behind on propagation and so I turned that joke into reality.


A Raspberry Pi hidden in an old hard drive case connects to our global monitors and obtains the current propagation time (as measured from outside our network). The Pi is connected (via a transistor acting as a switch) to a cheap mini police light that’s visible throughout the office.
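The loop running on the Pi could look something like the Python sketch below. This is not the actual code: `fetch_seconds` and `set_light` are hypothetical callables standing in for "ask the monitors for the current propagation time" and "switch the GPIO pin that drives the transistor", and the threshold and poll interval are made up.

```python
import time
from typing import Callable, Optional


def run_alarm_loop(fetch_seconds: Callable[[], float],
                   set_light: Callable[[bool], None],
                   threshold: float,
                   poll_interval: float = 10.0,
                   max_polls: Optional[int] = None,
                   sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll the propagation time and switch the light on while it is too slow.

    fetch_seconds and set_light are injected so the loop can be tried
    without the real monitors or a real Pi. max_polls=None runs forever.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        # Light on only while the measured time is above the threshold.
        set_light(fetch_seconds() > threshold)
        polls += 1
        sleep(poll_interval)


# Dry run with fake readings and a list recording the light state.
fake_readings = iter([1.0, 9.0, 2.0])
light_states = []
run_alarm_loop(lambda: next(fake_readings), light_states.append,
               threshold=5.0, poll_interval=0.0, max_polls=3,
               sleep=lambda seconds: None)
print(light_states)  # [False, True, False]
```

On real hardware, `set_light` would set an output pin high or low (for example through a GPIO library), and the transistor would switch the light's supply. A production version would also need to decide what the light does when the monitors can't be reached.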