Crawlers Bias Server Log Stats

Search engines regularly visit web sites to keep their index up to date. The shorter the time between two visits, the more accurate the index is.

Of course, there are side effects. On the one hand, crawling increases the load and the traffic of a web server; most modern crawlers are therefore not too aggressive, to avoid overloading servers or congesting the network. On the other hand, web servers log all requests, so the log data interleaves requests from real users with requests from web crawlers. If the log files are used to create access statistics, they will be biased by the log entries of the bots.

Web Log Measures

I studied the log files of four different sites. This is a first result, derived from approximately 1.5 million log entries. But first, let’s clarify a few terms used in web log analysis. These logs can be used to extract a lot of information; the most common measures are “hit count”, “page count”, and “visits”.

The hit count is one of the oldest measures and is rather useless today. It simply counts the number of log entries within a given time frame, i.e. the number of requests per time frame. It is useless because it heavily depends on the content of a page: a page containing many pictures, for example, produces a high hit count even if only a single page was viewed.

The page count is more meaningful because it ignores the embedded elements of a page (such as images). While it is an interesting measure of how many pages have been viewed, it cannot be used for a comparison between bot access and real user access: a bot typically follows every link on a web site, whereas a typical user does not. A human reads one or a few articles, but never all of them.

Thus, the visit count is used in this case. A visit is a continuous stream of page requests from a single source. Whether a user views three pages one after the other or a bot walks through all pages of a site, each case is counted as just a single visit.
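As an illustration only, here is a minimal sketch of how a visit count could be derived from a combined-format access log. It assumes gawk (for mktime), a chronologically ordered log file named access_log, and a 30-minute session timeout; all of these are my own assumptions, not necessarily the method used for the numbers below.

gawk '
BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
    for (i = 1; i <= 12; i++) month[names[i]] = i
    timeout = 1800                     # assumed session gap: 30 minutes
}
{
    ip = $1                            # client address in the combined log format
    gsub(/\[/, "", $4)                 # $4 looks like [10/Oct/2013:13:55:36
    split($4, t, "[/:]")
    ts = mktime(t[3] " " month[t[2]] " " t[1] " " t[4] " " t[5] " " t[6])
    if (!(ip in last) || ts - last[ip] > timeout)
        visits++                       # first request or a long pause starts a new visit
    last[ip] = ts
}
END { print visits, "visits" }
' access_log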

Result Stats

The figure above shows the percentage of real user visits (blue) compared to crawler visits (red) for four different web sites. The purple shades in between show the variation within the observation period. Between 52 % and 83 % of the visits come from real users, which in turn means that 17 % to 48 % of the visits are made by bots. This is a pretty high number!

Separating Bot Requests

Thus, to get an unbiased result from a web log analysis, the bot requests should be separated from the user requests. The easiest way to distinguish between bots and users is to match on the User-Agent string. On the Apache web server this can easily be done with the SetEnvIf directive. First, put the following statement into the configuration:

SetEnvIf User-agent \
"(AhrefsBot|YandexBot|Sosospider|Ezooms|Googlebot|msnbot|Spider|crawl|slurp|Jeeves|Mediapartners|FeedBurner)" \
bot-req

This directive matches the regular expression against the User-Agent string found in the HTTP request header and sets the environment variable bot-req whenever it matches. As you can see, the pattern is simply a list of well-known and not-so-well-known bots. Of course, you can modify it to fit your needs.

Now this environment variable has to be applied to the log directives of the site, which looks like this:

CustomLog /var/log/apache22/http-access_botlog combined env=bot-req
CustomLog /var/log/apache22/http-access_log combined env=!bot-req
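After changing the configuration, check the syntax and reload Apache so that the split takes effect (the exact apachectl invocation may differ on your system):

apachectl configtest && apachectl graceful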

Of course, the separation of bot entries and user entries based on the User-Agent string can also be done afterwards on the command line, for example with grep or perl.
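As a rough sketch of that approach, the pattern from the SetEnvIf line above can be reused with grep -E on an unsplit log. The file names here are placeholders, and -i makes the match case-insensitive, which is slightly broader than the SetEnvIf rule:

BOTS='AhrefsBot|YandexBot|Sosospider|Ezooms|Googlebot|msnbot|Spider|crawl|slurp|Jeeves|Mediapartners|FeedBurner'
grep -Ei  "$BOTS" access_log > access_botlog     # requests that look like bots
grep -Eiv "$BOTS" access_log > access_userlog    # everything else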

IPv4 vs. IPv6

As an additional result of this analysis, I compared IPv4 and IPv6 access. That is possible because these servers are reachable over IPv6 as well. 4.1 % of all real user visits and “even” 5.4 % of the bot visits were made over IPv6.
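One rough way to get such a split from a log file (my own sketch, not necessarily how the numbers above were computed) is to check the client address for a colon, which only IPv6 addresses contain. Note that this counts log entries rather than visits:

awk '$1 ~ /:/ { v6++; next } { v4++ } END { printf "IPv4: %d  IPv6: %d\n", v4, v6 }' access_log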

Have fun watching your log files!