I’m sure this has been done before, but you do hear stories from time to time of someone either dropping or raising their DNS TTLs and seeing either a massive difference, or none at all.
A lot of providers in the wild set their DNS TTLs to anywhere from 5 minutes to an hour. Setting the TTL too low is considered bad because it causes unneeded fetches, but setting it too high can hurt you in the case of an outage, and if your DNS provider bills you per query, it can end up costing you money too.
My own thinking was: if you set a record to have a TTL of a day, what is the chance it will even last that long? Given that most DNS resolvers are tiny home routers with very little RAM, and only a small chunk of that assigned to the DNS query cache, my hypothesis was that cached entries barely last an hour in 95% of cases.
To test this, I set up a script that updates a TXT record on one of my domains every 10 seconds with the current unix timestamp, while serving that record with a TTL of one day. I then used the RIPE Atlas system (you can learn more about that here) to have 100 probes from different parts of the world query that DNS record every 88 seconds (I could not poll faster due to limitations in RIPE’s own system and the rate at which you can spend credits, which is very, very annoying).
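The updating side of this does not need to be anything clever. As a rough sketch (assuming a Cloudflare-style v4 API; the zone ID, record ID and token below are placeholders, and this is not the exact script I used), it is essentially:

package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"time"
)

const (
	zoneID   = "ZONE_ID_PLACEHOLDER"   // placeholder: the DNS zone ID
	recordID = "RECORD_ID_PLACEHOLDER" // placeholder: the TXT record's ID
	apiToken = "API_TOKEN_PLACEHOLDER" // placeholder: a token with DNS edit rights
)

func main() {
	url := fmt.Sprintf("https://api.cloudflare.com/client/v4/zones/%s/dns_records/%s", zoneID, recordID)
	for {
		// Overwrite the TXT record with the current unix timestamp and a one day TTL.
		body := fmt.Sprintf(`{"type":"TXT","name":"delay-long.flm.me.uk","content":"%d","ttl":86400}`,
			time.Now().Unix())
		req, err := http.NewRequest("PUT", url, bytes.NewBufferString(body))
		if err != nil {
			log.Fatal(err)
		}
		req.Header.Set("Authorization", "Bearer "+apiToken)
		req.Header.Set("Content-Type", "application/json")

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			log.Println("update failed:", err)
		} else {
			resp.Body.Close()
		}
		time.Sleep(10 * time.Second)
	}
}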
To keep the measurements representative, I set the probe selection to be spread evenly around the world.
I set the measurement to last 25 hours and let it go. After that I was left with a JSON line per DNS result that looked like this:
{
  "from": "xxx.xxx.xxx.xxx",
  "fw": 4740,
  "group_id": 7803606,
  "lts": 14,
  "msm_id": 7803606,
  "msm_name": "Tdig",
  "prb_id": xxxxx,
  "resultset": [
    {
      "af": 4,
      "dst_addr": "xxx.xxx.xxx.xxx",
      "lts": 14,
      "proto": "UDP",
      "result": {
        "ANCOUNT": 1,
        "ARCOUNT": 0,
        "ID": 42714,
        "NSCOUNT": 2,
        "QDCOUNT": 1,
        "abuf": "=",
        "answers": [
          {
            "NAME": "delay-long.flm.me.uk",
            "RDATA": [
              "1486139691"
            ],
            "TYPE": "TXT"
          }
        ],
        "rt": 0.895,
        "size": 116
      },
      "src_addr": "xxx.xxx.xxx.xxx",
      "subid": 1,
      "submax": 2,
      "time": 1486140529
    },
    {
      "af": 4,
      "dst_addr": "194.149.131.2",
      "lts": 15,
      "proto": "UDP",
      "result": {
        "ANCOUNT": 1,
        "ARCOUNT": 4,
        "ID": 43549,
        "NSCOUNT": 2,
        "QDCOUNT": 1,
        "abuf": "=",
        "answers": [
          {
            "NAME": "delay-long.flm.me.uk",
            "RDATA": [
              "1486139691"
            ],
            "TYPE": "TXT"
          }
        ],
        "rt": 1.132,
        "size": 204
      },
      "src_addr": "xxx.xxx.xxx.xxx",
      "subid": 2,
      "submax": 2,
      "time": 1486140530
    }
  ],
  "timestamp": 1486140529,
  "type": "dns"
}
Thankfully, RIPE Atlas exposes basically all of the information you could want here, but we are only really interested in the RDATA inside the answers section of each line.
After waiting a whole 25 hours, I downloaded the dataset and ran it through a small Go program that checked each probe to see how long its DNS resolver actually held the item in cache:
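The program boils down to something like this sketch (simplified, not the exact code I ran; the JSON field names are taken from the sample above, and the output matches the CSV below):

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strconv"
)

// measurement holds only the fields we care about from each JSON line.
type measurement struct {
	PrbID     int   `json:"prb_id"`
	Timestamp int64 `json:"timestamp"`
	ResultSet []struct {
		DstAddr string `json:"dst_addr"`
		Result  struct {
			Answers []struct {
				RData []string `json:"RDATA"`
			} `json:"answers"`
		} `json:"result"`
	} `json:"resultset"`
}

type probeStats struct {
	maxAgeMinutes int64          // longest time a cached answer was observed for
	resolvers     map[string]int // resolver address -> number of queries sent to it
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: analyse <results.json>")
		os.Exit(1)
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	probes := map[int]*probeStats{}

	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024) // Atlas result lines can be long
	for scanner.Scan() {
		var m measurement
		if err := json.Unmarshal(scanner.Bytes(), &m); err != nil {
			continue // skip malformed lines
		}
		ps, ok := probes[m.PrbID]
		if !ok {
			ps = &probeStats{resolvers: map[string]int{}}
			probes[m.PrbID] = ps
		}
		for _, r := range m.ResultSet {
			ps.resolvers[r.DstAddr]++
			for _, a := range r.Result.Answers {
				if len(a.RData) == 0 {
					continue
				}
				// The TXT record holds the unix time it was generated at, so
				// (query time - record time) is how stale the cached copy was.
				born, err := strconv.ParseInt(a.RData[0], 10, 64)
				if err != nil {
					continue
				}
				if age := (m.Timestamp - born) / 60; age > ps.maxAgeMinutes {
					ps.maxAgeMinutes = age
				}
			}
		}
	}

	fmt.Println("timeheld,probeid,resolvers")
	for id, ps := range probes {
		fmt.Printf("%d,%d,%v\n", ps.maxAgeMinutes, id, ps.resolvers)
	}
}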
With some added sort(1) we get this (the first number on each line is the maximum number of minutes that probe’s resolver held onto an item in cache):
$ ./RIPE-Blogpost -csv=true RIPE-Atlas-measurement-7803606.json | sort -n
timeheld,probeid,resolvers
22,13910,map[192.168.6.193:1018]
25,13623,map[198.18.1.1:1018]
34,24913,map[202.70.64.33:999]
44,11712,map[192.168.1.1:1004]
59,21035,map[212.77.128.131:1018]
117,14691,map[192.168.0.1:1018]
176,25253,map[192.168.10.5:121]
204,19135,map[8.8.8.8:1018]
247,21824,map[8.8.8.8:1017]
274,20534,map[8.8.8.8:1018]
323,10262,map[192.168.2.1:1018]
362,23337,map[8.8.8.8:1014]
394,13536,map[208.67.220.220:1017]
406,14237,map[192.168.1.1:1017]
437,10287,map[8.8.8.8:1007]
465,19428,map[8.8.8.8:1017]
474,14788,map[8.8.8.8:1017]
576,12336,map[8.8.4.4:1017]
600,28177,map[8.8.8.8:1018]
719,28560,map[172.22.255.1:1017]
771,14680,map[192.168.20.1:610]
936,24639,map[208.67.220.220:1018]
1043,24425,map[192.168.11.1:707]
1282,14511,map[192.168.50.1:1017]
1283,24180,map[192.168.1.1:1018]
1295,29095,map[192.168.1.1:1018]
1296,22300,map[192.168.0.1:1018]
1351,12927,map[190.160.0.11:1017]
1358,14810,map[61.207.11.153:1006]
1367,13672,map[192.168.222.1:968]
1370,11049,map[192.168.1.254:1017]
1372,26836,map[208.67.222.222:1018]
1386,17841,map[8.8.4.4:1017]
1415,14641,map[10.0.1.1:1018]
1417,19921,map[192.168.1.1:1018]
1424,11633,map[192.168.1.1:1018]
1425,12207,map[168.210.2.2:1017]
1426,10113,map[194.149.131.50:1017]
1426,14724,map[192.168.100.11:1018]
1426,18466,map[195.202.138.3:1013]
1426,19315,map[10.0.15.2:1018]
1426,20478,map[213.163.76.185:1018]
1426,21101,map[202.159.32.2:1018]
1426,22635,map[100.64.99.9:1017]
1426,22760,map[202.59.4.2:1018]
1426,23222,map[89.105.161.1:1018]
1426,23452,map[192.168.1.1:1016]
1426,25197,map[208.67.220.220:1017]
1427,19019,map[172.16.69.1:1018]
1428,15874,map[137.226.13.45:951]
1430,18499,map[210.220.163.82:1018]
1430,27299,map[192.168.200.1:100]
1435,19131,map[10.0.0.1:1018]
1436,20829,map[61.41.153.2:1018]
1439,10238,map[192.168.0.193:1017]
1439,10247,map[192.168.1.1:1018]
1439,11040,map[192.168.0.1:1018]
1439,11056,map[192.168.1.195:1007]
1439,12873,map[194.186.111.229:1016]
1439,12956,map[192.168.1.1:1018]
1439,13122,map[132.205.96.93:1018]
1439,13239,map[41.222.192.9:1018]
1439,13799,map[41.76.0.110:1018]
1439,13804,map[168.210.2.2:1018]
1439,13806,map[168.210.2.2:1017]
1439,14384,map[192.168.20.1:1017]
1439,14558,map[203.169.24.25:1018]
1439,14950,map[63.221.246.95:1018]
1439,18131,map[168.210.2.2:1013]
1439,18369,map[192.168.100.1:1018]
1439,18491,map[172.27.129.4:1017]
1439,18514,map[197.148.74.18:1017]
1439,18525,map[168.126.63.1:1018]
1439,19591,map[41.207.188.10:1017]
1439,19649,map[192.168.200.1:1017]
1439,19725,map[200.75.0.4:1018]
1439,19779,map[192.168.100.1:1018]
1439,20092,map[10.1.1.1:1018]
1439,20495,map[192.168.254.1:1018]
1439,20794,map[172.30.172.1:1018]
1439,21001,map[137.82.1.2:1018]
1439,21682,map[192.168.2.1:1018]
1439,22252,map[41.203.18.183:1013]
1439,22789,map[10.0.1.2:1017]
1439,22804,map[183.81.133.150:1017]
1439,23002,map[194.214.253.247:1018]
1439,23016,map[192.168.15.1:1016]
1439,23706,map[10.32.11.34:1018]
1439,25296,map[2c0f:feb0::1:1018]
1439,28205,map[27.131.58.11:1018]
1439,28662,map[192.168.1.1:1018]
1439,28935,map[192.168.50.1:1017]
1439,31868,map[192.168.179.1:1017]
35% of probes can’t hold on to DNS entries in cache for 24 hours.
What I think is most interesting is that the lowest performers include Google’s public DNS, 8.8.8.8 and 8.8.4.4. I suspect this comes down to the difficulty of sharing caches at that scale. A lot of the resolvers were in LAN address space, and I suspect those cache correctly for 24 hours while their upstreams do not.
Now that I know that 65% of resolvers can keep hold of things for 24 hours, how mad can we go? A week?
As it happens, you can. Cloudflare (for sane reasons) does not allow you to set a TTL of more than 24 hours, so I ran the same test again with a custom nameserver and waited a week.
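The custom nameserver only has to answer TXT queries for the test name with the current unix timestamp and a one-week TTL. A minimal sketch of that (using the miekg/dns library; not the exact server I ran, and the listen address is just an example) looks like:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/miekg/dns"
)

func main() {
	dns.HandleFunc("delay-long.flm.me.uk.", func(w dns.ResponseWriter, r *dns.Msg) {
		if len(r.Question) == 0 {
			return
		}
		m := new(dns.Msg)
		m.SetReply(r)
		m.Authoritative = true
		// Answer with the current unix timestamp and a one-week TTL (604800 seconds).
		m.Answer = append(m.Answer, &dns.TXT{
			Hdr: dns.RR_Header{
				Name:   r.Question[0].Name,
				Rrtype: dns.TypeTXT,
				Class:  dns.ClassINET,
				Ttl:    604800,
			},
			Txt: []string{fmt.Sprintf("%d", time.Now().Unix())},
		})
		w.WriteMsg(m)
	})

	// Serve as an authoritative nameserver on UDP port 53.
	log.Fatal((&dns.Server{Addr: ":53", Net: "udp"}).ListenAndServe())
}

A week later, this is what came back as a result: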
timeheld,probeid,resolvers
3,13910,map[192.168.6.193:6867]
60,21035,map[212.77.128.131:6872]
522,13536,map[208.67.220.220:6859]
961,13804,map[168.210.2.2:6863]
961,13806,map[168.210.2.2:6853]
1003,13623,map[198.18.1.1:6873]
1273,24639,map[208.67.220.220:6864]
1297,22300,map[192.168.0.1:6871]
1436,13672,map[192.168.222.1:6767]
1437,19779,map[192.168.100.1:3976]
1437,23337,map[8.8.8.8:6835]
1439,11040,map[192.168.0.1:6869]
1439,11049,map[192.168.1.254:6873]
1439,12207,map[168.210.2.2:6866]
1439,14237,map[192.168.1.1:6873]
1439,14384,map[192.168.20.1:6866]
1439,14511,map[192.168.50.1:6871]
1439,14724,map[192.168.100.11:6873]
1439,14788,map[8.8.8.8:6873]
1439,14950,map[63.221.246.95:6864]
1439,18369,map[192.168.100.1:6870]
1439,18514,map[197.148.74.18:6868]
1439,19019,map[172.16.69.1:2846]
1439,19131,map[10.0.0.1:6841]
1439,19428,map[8.8.8.8:6872]
1439,19649,map[192.168.1.1:4020]
1439,20092,map[10.1.1.1:6873]
1439,20478,map[213.163.76.185:3929]
1439,20534,map[8.8.8.8:6872]
1439,21101,map[202.159.32.2:6865]
1439,21682,map[192.168.2.1:6866]
1439,21824,map[8.8.8.8:6873]
1439,21849,map[192.168.1.170:25 8.8.8.8:6355]
1439,22221,map[41.203.18.183:6815]
1439,22252,map[41.203.18.183:6802]
1439,22635,map[100.64.99.9:6873]
1439,23016,map[192.168.15.1:6868]
1439,25253,map[192.168.10.5:2945 172.16.32.5:687]
1439,28177,map[8.8.8.8:6859]
1440,28662,map[192.168.1.1:6494]
1448,12336,map[8.8.4.4:6800]
1449,17841,map[8.8.4.4:6864]
1452,19135,map[8.8.8.8:6872]
1459,14691,map[192.168.0.1:6865]
1484,10262,map[192.168.2.1:6813]
1527,10287,map[8.8.8.8:6869]
1640,19921,map[2001:470:28:4ea::1:6836 192.168.1.1:4]
1794,11633,map[192.168.1.1:6867]
1958,13799,map[41.76.0.110:6864]
2090,13239,map[41.222.192.9:5819]
2203,24180,map[192.168.1.1:6873]
2477,14680,map[192.168.20.1:6837]
3859,11056,map[192.168.1.195:6390]
4319,14641,map[10.0.1.1:6872]
5536,12956,map[192.168.1.1:6866]
6185,23222,map[89.105.161.1:6872]
6515,28205,map[27.131.58.11:6870]
6574,19591,map[41.207.188.10:6871]
6678,23706,map[10.32.11.34:6871 8.8.8.8:2]
7075,14558,map[203.169.24.25:6873]
7574,18525,map[168.126.63.1:6872]
7578,10238,map[192.168.0.193:6873]
7788,20794,map[172.30.172.1:6873]
9215,14810,map[61.207.11.153:6871]
9231,25296,map[2c0f:feb0::1:6866]
9451,12927,map[190.160.0.11:6854]
9529,18491,map[172.27.129.4:6873]
9599,13122,map[132.205.96.93:6610]
10032,20829,map[61.41.153.2:6323]
10064,18499,map[210.220.163.82:6829]
10078,18466,map[195.202.138.3:6860]
10079,10113,map[194.149.131.50:6873]
10079,12873,map[194.186.111.229:6873]
10079,18131,map[168.210.2.2:6845]
10079,19725,map[200.75.0.4:6873]
10079,20495,map[192.168.254.1:6873]
10079,21001,map[137.82.1.2:6872]
10079,22789,map[10.0.1.2:6873]
10079,22804,map[183.81.133.150:6873]
10079,23002,map[194.214.253.247:6873]
10079,23452,map[192.168.1.1:6871]
10079,25197,map[208.67.220.220:6860]
10079,31868,map[192.168.179.1:6867]
Okay, so, interesting: 12 of the 83 probes that made it (the same nodes as above, to keep the comparison fair) held a DNS record in cache for the full week!
Is there something you can do with the probes that do keep things in cache for 24+ hours? Perhaps some awful pingfs-style file storage? I can see another project coming soon :)