
Sep 5 2015

robots.txt usage over the Alexa million

If you have ever had to deal with bots while running a site, you will at some point have looked into robots.txt, a system that has never really been formally defined but is still used by most sites. Google has published a bunch of guidelines on what they will accept, and so have other search engines (Yandex, Bing).

Most of those guides are aimed at making sure the search engine bot does not crawl things that will either cause issues or that you do not want indexed for one reason or another (duplicate content, admin pages, system status pages, etc.).

However, by listing some of these assets you are making it very clear what your site is built on or, in some cases, where the admin portal is (and thus making it slightly easier to find for anyone who wants to attack it). You can also find censored content in there: paths that a company no longer wants indexed by a search engine for legal reasons, such as a court order.

There are also plenty of bots on the internet that do things other than search engine indexing: price indexers, social media embedding bots (Facebook, Twitter, Flipboard), SEO assessment services, and archiving systems (Archive.org being the best example). Site owners may want to keep these away, either out of concern for resource usage or because they do not want their content downloaded and processed by those services. robots.txt (when the bot respects it, of course; it is entirely optional) gives an easy way to tell a bot to avoid certain areas, or to not even try.
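For example, a site that wanted to keep the Internet Archive's crawler (ia_archiver) out entirely, while only hiding one directory from everyone else, could publish something like this (a made-up example, not taken from the dataset):

User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/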

The obvious question then is "why not crawl anyway?". Generally, giving site owners a way to opt out of your bot's traffic through robots.txt ends far better than having them drop all packets from your servers, or having your system blocked at a larger scale (by a service like CloudFlare/Incapsula/Akamai Kona, or by a web host) due to complaints.

The average robots.txt file

Before we dive into the data run I did, let's look at a few example sites to get an idea of what robots.txt files normally look like:

$ curl http://xkcd.com/robots.txt;echo
User-agent: *
Disallow: /personal/

Generally people use robots.txt just to blacklist paths, rather than to tell search engines where they can go (this makes sense, because crawlers commonly assume they can crawl whatever they want unless something tells them they can't).

In XKCD’s case, it looks like Randall just wants to ensure that search engines do not bother crawling any path that starts with /personal/.

$ curl http://a-n.co.uk/robots.txt
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

Most systems set up rules to blacklist their own admin directories automatically, to avoid search engines stumbling onto admin or login pages (as WordPress does above).

Aggregate stats

I used my crawler with the Alexa top 1 million sites list (zip file) to grab the robots.txt file from each of those sites, then processed them into a SQL table that I could run queries on. I found that 675,621 sites (67.5%) have robots.txt files.
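The fetch side of this is simple enough to sketch in a few lines of shell. This is not my actual crawler (which does things like rate limiting, parallelism and error handling), just a rough illustration of the idea:

mkdir -p robots
# top-1m.csv has one "rank,domain" entry per line
cut -d, -f2 top-1m.csv | while read -r domain; do
    curl -s --max-time 10 -o "robots/$domain" "http://$domain/robots.txt"
done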

Unsurprisingly, for files whose whole purpose is to block traffic, there are far more Disallow rules than Allow rules:

+----------+----------+
| Rule     | Count    |
+----------+----------+
| Disallow | 10925614 |
| Allow    |   616947 |
+----------+----------+
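If you grab the tarball of raw files linked at the end of this post, you can get roughly the same numbers with nothing more than grep (assuming the files are extracted into a robots/ directory; the counts will not match exactly, since my parser also throws away malformed lines):

grep -rhi '^disallow:' robots/ | wc -l
grep -rhi '^allow:' robots/ | wc -l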

The majority of rules are directed at all search engines:

+-----------------------+---------+
| ua                    | Amount  |
+-----------------------+---------+
| *                     | 9331170 |
| googlebot             |  337979 |
| yandex                |  310900 |
| Global                |  105807 |
| msnbot                |   62109 |
| slurp                 |   55242 |
| bingbot               |   50903 |
| mediapartners-google  |   49907 |
| adsbot-google         |   41621 |
| googlebot-image       |   39319 |
| googlebot-mobile      |   36409 |
| baiduspider           |   28116 |
| ia_archiver           |   25225 |
| ahrefsbot             |   16542 |
| google                |   14753 |
| mediapartners-google* |   14508 |
| mj12bot               |   12582 |
| mediapartners -google |   12433 |
| yahoo pipes 1.0       |   11481 |
| teoma                 |    9173 |
| mail.ru               |    8847 |
| yandexbot             |    7498 |
| psbot                 |    7261 |
| yahoo! slurp          |    7095 |
| wget                  |    7090 |
+-----------------------+---------+

Here is a graph of the same data with the Global rules (those not directed at anything, not even *) removed:

It is interesting to see that 2% of the user-agent targeted rules are aimed at the Internet Archive (ia_archiver).
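The user-agent breakdown can also be approximated straight from the raw files, by pulling out every User-agent line, normalising the case and whitespace, and counting the distinct values:

grep -rhi '^user-agent:' robots/ \
    | cut -d: -f2- \
    | tr '[:upper:]' '[:lower:]' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | sort | uniq -c | sort -rn | head -25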

There are also some really big robots.txt files out there, the biggest belonging to an Iranian bookstore (or at least that's what it looks like):

+-------------------+--------+
| Domain            | Count  |
+-------------------+--------+
| icnc.ir           | 436961 |
| castorama.fr      | 210840 |
| protegez-vous.ca  | 171282 |
| tik-tak.co.il     | 131610 |
| riigiteataja.ee   | 103608 |
| enetural.com      |  95332 |
| fameonme.de       |  35857 |
| takdin.co.il      |  35705 |
| dreamitalive.com  |  34383 |
| norfolk.police.uk |  33689 |
+-------------------+--------+

Most of these are either blacklisting a huge number of (dynamically generated) URLs, or trying to use robots.txt as a sitemap, which is something robots.txt is not designed for, nor something that will work with the major crawlers.

Finally, here are the top paths that are blacklisted from being crawled; unsurprisingly, WordPress finds its way in here quite a lot:

+------------------+--------+
| path             | Amount |
+------------------+--------+
| /                | 826666 |
| /wp-admin/       | 156793 |
| /includes/       |  64132 |
| /admin/          |  63442 |
| /modules/        |  53374 |
| /cgi-bin/        |  51728 |
| /search/         |  40754 |
| /images/         |  35790 |
| /wp-includes/    |  34356 |
| /cache/          |  33734 |
| /search          |  33530 |
| /xmlrpc.php      |  33263 |
| /tmp/            |  31262 |
| /templates/      |  31048 |
| /scripts/        |  30694 |
| /cron.php        |  29663 |
| /language/       |  29340 |
| /license.txt     |  29253 |
| /plugins/        |  28365 |
| /install.php     |  28000 |
| /components/     |  27611 |
| /administrator/  |  26962 |
| /themes/         |  26067 |
| /media/          |  25864 |
+------------------+--------+

The strange ones

Amusingly, Twitter's t.co allows Twitter's own bot to crawl it, but no one else:

User-agent: twitterbot
Disallow:

User-agent: *
Disallow: /

There are also a large number of very strange user agents being blocked.

backdoorbot/1.0 appears quite a lot. A large number of robots.txt files mention this bot (~4,000 of them), even though I could find zero information about it. Possibly a case of "well, he has it, so I should have it too"?

It is also interesting that, at the time of writing, it has been less than 120 days since Applebot was publicly announced, yet 75 sites already have ~586 rules for it (list here).

Two sites have robots.txt files that reference the user-agent "undefined". A type casting error while generating the robots.txt files? Or perhaps a bot actually called undefined? (One is whitelisting it, one is blacklisting it; I'm going to apply Hanlon's razor here and assume it is just a type casting error.)

Quite a lot of sites also block UAs to do with Zeus. I cannot tell if this is the Zeus malware, but if it is, they are not going to get any benefit from blacklisting the whole UA string (list here).

Common mistakes

As mentioned above, a very large number of people try to block an exact user agent string, rather than the token the bot would actually be matching against (assuming it reads robots.txt at all). Some people even try to blacklist versions of IE (for example mozilla/4.0 (compatible; msie 4.0; windows 2000)) from their sites using robots.txt, clearly in vain, since web browsers never read robots.txt.

Because robots.txt is optional, anyone who sets out to do something nasty is most likely not going to read your robots.txt and politely stop crawling your site.

Many of these problems come from the fact that robots.txt has never been defined particularly well: different crawlers do not behave the same way, and there is no authoritative place that says things like "please don't put your sitemap in your robots.txt".

Data to download

I've put up the MySQL database here: http://cache.benjojo.co.uk/DataStore/blogpoststuff/domrobots.sql.bz2

If you want to actually read the files themselves, or do more data mining on them, you can find a tarball of the 675k robots.txt files here: http://cache.benjojo.co.uk/DataStore/blogpoststuff/robots.txt.tar.bz2
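If you want to load the dump into a local MySQL instance and poke at it yourself, something along these lines should do it (the database name "robots" is just a placeholder; depending on your MySQL setup you may also need -u/-p flags on the mysqladmin and mysql commands):

# grab and import the dump
curl -O http://cache.benjojo.co.uk/DataStore/blogpoststuff/domrobots.sql.bz2
mysqladmin create robots
bunzip2 -c domrobots.sql.bz2 | mysql robots

# grab and unpack the raw files for the grep examples earlier in the post
# (adjust paths depending on how the tarball is laid out internally)
curl -O http://cache.benjojo.co.uk/DataStore/blogpoststuff/robots.txt.tar.bz2
mkdir -p robots && tar -xjf robots.txt.tar.bz2 -C robots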