Monday, March 12, 2007

A New Visualization for Web Server Logs

There are well over a hundred web server log analyzers (Google Directory for Log Analysis) or web statistics tools ranging from commercial offerings such as WebTrends to open source ones such as AWStats. These take web server logfiles and display numbers such as page views, visits, and visitors, as well as graphs over various time ranges. This article presents the same data in those logfiles in a very different way: as a 3D plot. By the end of this article, I hope you will agree with me that the visualization described herein is a novel and useful way to view the content of logfiles.

The logfiles of web servers record information on each HTTP request they receive, such as the time, the sender's IP address, the request URL, and the status code. The items in each request are fairly orthogonal to one another. The IP address of a client has no relation to the URL that it requests, nor does the status code of the request to the time of the request. If that is the case, what could be a better way to display these n columns from the logfiles than an n-dimensional plot?

When an administrator observes anomalous behavior on a web server, she reaches out for web statistics reports, as they are usually all there is as a record of past activity. These often prove fruitless, mainly because web statistics is primarily a marketing-oriented view of web server activity. The next step is to take the raw logfiles apart with ad hoc scripts. The sheer mass of data makes it difficult to reduce it to a few numbers that reveal the cause of the problem. Another complication is that you may not quite know what you are looking for other than that it is abnormal behavior. The path this article takes is to provide a visualization of raw data such that cause or causes make themselves visible. This comes from the real-life experience of a client, where crippling performance problems appeared out of nowhere.

The Plot

The scatter plot in Figure 1 shows more than half a million HTTP page requests (each request is a dot) in 3D space. The axes are:

  • X, the time axis--a full day from midnight to midnight of November 16.
  • Y, the requester's IP address, with the conventional dotted decimal format sorted and given an ordinal number between 1 and 120,000, representing the number of clients that accessed the web server.
  • Z, the URL (or content) sorted by popularity. Of the approximately 60,000 distinct pages on the site, the most popular URLs are near the zero point of the Z-axis and the least popular ones at the top.

3D scatter plot of a good day
Figure 1. Scatter plot showing HTTP requests

If the plotted parameters were truly orthogonal, you could expect a random distribution: a flat featureless plot. The parameters, however, are not completely independent of one another. For example, the IP ranges for Italy may prefer the Italian pages on the website. Therefore instead of a random plot, there are clusters in the 3D space. If you think about it, that does not seem unreasonable: the home page is probably the most visited page on a website. Studies (especially Jakob Nielsen on website popularity and Jakob Nielsen on traffic log patterns) argue convincingly that popularity closely follows Zipf's law: a log curve with a long tail. Hence the dense horizontal layer at the bottom in Figure 1. The vertical rectangular planes are search crawlers. They request pages over the whole content space from a small number of IP addresses and do that over the whole day. Therefore, clustering along each of the three dimensions is common.

The Case Study

The website of a client grew inexplicably sluggish one day. Since the web server, CMS, and auxiliary servers had run well for the preceding months, the only rational explanation pointed to an unusual request pattern. The web log-analysis reports showed nothing out of the ordinary. Command-line scripts (with awk, sort, grep, and friends) running over the logfiles also revealed no anomalies. I used Gnuplot to graph the requests in 3D space. (See also an excellent Gnuplot introduction) Some time later, the 3D plot made the culprit evident.

3D scatter plot of a bad day
Figure 2. Scatter plot of a bad day.

The thick pillar in the plot stands out like a sore thumb. This is a dense set of requests in a short time (about 100 minutes on the X-axis, which represents 24 hours) from a single IP address (Y-axis) and going over the whole content space (Z-axis). Why should it cause trouble? Large-scale CMS servers generate content on-the-fly from a database. Caches usually handle most requests, so only the small number of requests that are not currently in the cache should require database activity. On this particular CMS, the caches keep content for 15 minutes. When the client requested all of the pages in a short time, the high number of cache misses placed a heavy load on the database. This resulted in deteriorated performance. Search crawlers such as Yahoo Slurp and Googlebot do pretty much the same thing, but they spread the load over a much longer period.

The Process

Now that you have seen the output, here's how to generate it. The input is, of course, an access logfile that has lines of data, one per HTTP request. A typical line from an Apache server conforms to the NCSA combined access logfile standard. (See the Combined Log Format description at apache.org.) Note that I've wrapped the long line:

83.214.62.64 - - [15/Jan/2006:21:12:29 +0100] "GET
/index.php?level=2 HTTP/1.1" 200 5854 "http://www.sphere.li/index.php"
"Mozilla/5.0 (X11; U; Linux i6 86; en-US; rv:1.7.3)
Gecko/20040914"

The Perl script at the end of the article takes sequences of these lines and condenses them to just what Gnuplot needs. Run it with an access logfile and redirect it to an output file, such as gnuplot.input, from the command line:

$ perl prepare-for-gnuplot.pl access_log > gnuplot.input

The output will be a series of lines matching those of the access logfile. For the previous line from the access log, the corresponding output is:

15/Jan/2006:21:12:29 906 41 200

The fields in gnuplot.input, the output file of the Perl script, are date/time, ip rank (906), url rank (41), and status code.

To display the sequence of lines in Gnuplot, give it the commands:

$ gnuplot
set style data dots
set xdata time
set timefmt "%d/%b/%Y:%H:%M:%S"
set zlabel "Content"
set ylabel "IP address"
splot "gnuplot.input" using 1:2:3

Variations



If the plot is too dense--as was the case for me--thin it down by
telling Gnuplot to only use every nth data point. For example, I
thinned Figure 1 by plotting every tenth point with the Gnuplot splot command:



splot "gnuplot.input" using 1:2:3 every 10


Figure 3 shows the corresponding scatter plot.



Thinned 3D scatter plot of a good day

Figure 3. Thinned scatter plot



Gnuplot makes it easy to focus on a part of the plot by setting the
axes ranges. Figure 4 shows a small part of the Y- and Z-axes. The
almost continuous lines that run parallel to the time axis are
monitoring probes that regularly request the same page. Four of them
should be clearly visible. In addition, I changed the eye position.



Monitoring probes visible after reducing the Y and Z ranges.

Figure 4. Reduced Y and Z ranges showing monitoring probes



Because real people need sleep, it should be possible to make out
the diurnal rhythms that rule our lives. This is evident in Figure 4.
The requests are denser from 08:00 to about 17:00 and quite sparse in
the early hours of the morning.



Changing the viewing angle can give you a new point of view. Gnuplot lets you do it in one of two ways: with the command line set view or interactively with a click and drag of the mouse.



The Pièce de Résistance



Because a display of 3D plots is difficult to see in three
dimensions without stereoscopic glasses, I used a few more
manipulations to "jitter" the image such that the depth in the picture
is visible. The plot in Figure 5 is an example of this. It was easy to
generate with more Gnuplot commands followed by GIF animation with ImageMagick.



An animated scatter plot

Figure 5. A animated GIF of the scatter plot that hints at the 3D structure



Further Work



With Gnuplot 4.2, which is still in beta, it is now possible to draw
scatter plots in glorious color. Initial tests show that using color
for the status code dimension makes the plots even more informative.
Stay tuned.



Conclusion



Though the 3D plots present no hard numbers or trend lines, the
scatter plot as described and illustrated above may give a more
intuitive view of web server requests. Especially when diagnosing
problems, this alternative way of presenting logfile data can be more
useful than the charts and reports of a standard log analyzer tool.



Code Listings



The Perl script:



#
# prepare-for-gnuplot.pl: convert access log files to gnuplot input
# Raju Varghese. 2007-02-03

use strict;

my $tempFilename = "/tmp/temp.dat";
my $ipListFilename = "/tmp/iplist.dat";
my $urlListFilename = "/tmp/urllist.dat";

my (%ipList, %urlList);

sub ip2int {
my ($ip) = @_;
my @ipOctet = split (/\./, $ip);
my $n = 0;
foreach (@ipOctet) {
$n = $n*256 + $_;
}
return $n;
}

# prepare temp file to store log lines temporarily
open (TEMP, ">$tempFilename");

# reads log lines from stdin or files specified on command line

while (<>) {
chomp;
my ($ip, undef, undef, $time, undef, undef, $url, undef) = split;
$time =~ s/\[//;
next if ($url =~ /(gif|jpg|png|js|css)$/);
print TEMP "$time $ip $url $sc\n";
$ipList{$ip}++;
$urlList{$url}++;
}

# process IP addresses

my @sortedIpList = sort {ip2int($a) <=> ip2int($b)} keys %ipList;
my $n = 0;
open (IPLIST, ">$ipListFilename");
foreach (@sortedIpList) {
++$n;
print IPLIST "$n $ipList{$_} $_\n";
$ipList{$_} = $n;
}
close (IPLIST);

# process URLs

my @sortedUrlList = sort {$urlList {$b} <=> $urlList {$a}} keys %urlList;
$n = 0;
open (URLLIST, ">$urlListFilename");
foreach (@sortedUrlList) {
++$n;
print URLLIST "$n $urlList{$_} $_\n";
$urlList{$_} = $n;
}
close (URLLIST);

close (TEMP); open (TEMP, $tempFilename);
while () {
chomp;
my ($time, $ip, $url, $sc) = split;
print "$time $ipList{$ip} $urlList{$url} $sc\n";
}
close (TEMP);

http://www.oreillynet.com/pub/a/sysadmin/2007/02/02/3d-logfile-visualization.html