Thursday, December 7, 2006

IBM WebSphere Developer Technical Journal: Using portal analytics with open-source reporting tools

Introduction

What is portal analytics?

Successful organizations spend a large amount of time planning and developing their initial portal release. Although this work is critical, it covers only a portion of the total planning effort required over the lifetime of a portal. You also need to maintain, monitor, and adapt your portal to new usage patterns that surface only after going live.

When planning a portal project, groups usually do sizings based on assumptions, experience, and expectations. Over time, other questions arise, such as "Will our portal be able to deal with evolving user needs?" and "What do our users really do with the portal?" These questions and others can be answered using portal analytics, which is the process of collecting, processing, and reporting usage data.

WebSphere Portal writes usage records to a dedicated log file. Because the format of the log follows an industry standard ("NCSA Combined"), you can integrate portal usage data with your preferred reporting and analytics tools. Portal analytics includes trend analysis techniques that can help you predict future demand on your portal. A portal operator can therefore plan proactively for the community's needs, instead of being hit without warning after a threshold is reached.

WebSphere Portal provides comprehensive instrumentation capabilities. In this article, you will see how reports and analytics information can be derived based on the data provided by the instrumentation. The example involves end-to-end reporting for typical statistics reports and shows how to use the logs for portal analytics using open source reporting tools.

Why care about portal usage?

A portal usually serves one or more specific purposes. For example, it can facilitate access to information, it might enable its users to work more effectively and more efficiently, and it might also be used to integrate technically separate information systems "on the glass". Organizations typically have a very clear picture of what purpose a portal should serve. Measuring the success of those efforts is critical. Knowing how well the portal supports its users also helps your group justify the investment in the portal.

What the logs report

WebSphere Portal logs the following user activities and makes them available for analytics:

* Page management (creating, reading, updating, deleting a page)
* Requests of a certain page by users (including contained portlets)
* Session activities (login, logout, timed out, login failed)
* User management actions (creating, reading, updating, deleting users and groups).

For more information on what data WebSphere Portal logs can create, view the site analysis topic in the WebSphere Portal Information Center (see Resources).

Elements of portal analytics

In a recent report from the Patricia Seybold Group, the following criteria for portal analytics were defined:

Instrumentation

Portal technology platforms should collect the information that represents their usage and performance.

Real-time analysis

You need to know how well many aspects of your customer portal are performing in real time. You would prefer not to (or can't wait to) move performance data to a data warehouse and run queries and analytics against it.

Reports

Portal technology platforms should package reports that present how customers are interacting with their content and how well their facilities are performing. These reports should present instrumented information in easily-understood formats.

Analytics

Sometimes reports aren't enough, and analytic processing is required in order to understand customer behaviour and customer portal performance.

This article focuses on instrumentation, and briefly discusses how reports can be derived from the data that is created by WebSphere Portal.

The long-term vision for portal analytics is a self-optimizing portal that automatically responds to the demand on the site, including both functional and non-functional aspects. If the portal site was referenced by a popular news service (the "Slashdot effect") and the access rate rose high above normal, you could enable spare servers to be automatically deployed with the current configuration and to be added to the existing cluster. After the demand slows down again, the spare servers are freed up for other purposes.

A functional example would be that a portal's navigation only shows the most-wanted pages upon the user's arrival. As the user digs deeper into the site, more navigation elements display.

What portal analytics is not

Portal analytics instrumentation is not a replacement for other logs such as audit, performance, or system event logs. The audiences for those kinds of logs are different:

  • Audit logging is usually used in security-sensitive environments where changes made to the portal's run time configuration are recorded. Auditing is primarily part of the administrative function in WebSphere Portal. Portal analytics, on the other hand, focuses on that part of the portal that end users see as well as how they use the portal.
  • Performance logs are used to find out how well a particular part of the whole portal performs in the real production environment. They enable the operator to determine to what extent the portal consumes valuable resources (such as CPU or memory) while adhering to the defined Service Level Agreements (SLAs).
  • System event logs help the system administrator to understand what issues occur while running the portal. Typically, records in the system event log contain Java™ exception traces that enable an administrator to take appropriate action.

Alternative solutions

Not covered in this article are techniques that use external log services like IBM SurfAid (now owned by Coremetrics) or Google Analytics. These services usually trace user activities by placing a certain piece of content (typically some inline JavaScript or a remote image) that is pointing to a service provider in the page that is delivered to the user. After rendering the page, the browser retrieves those content items from the service provider. The service provider analyzes its access logs and provides its customer with reports about what was requested and when. This technique is also referred to as a "Web beacon", "Web bug", or other similar names.

Using portal analytics

WebSphere Portal provides analytics logging by writing events to a dedicated log, similar to the logging for analyzing a server delivering static pages. The portal analytics file is called sa.log (sa for site analytics) and you can typically find it in the $WP_HOME/log/ directory.

Each line in sa.log represents a specific event that was fired through a request against the portal. A single request (for example, for a certain page) can result in multiple lines written to sa.log. You can customize the type and amount of logging by configuring the appropriate logger.

WebSphere Portal defines the following site analytics loggers:
Logger                               Purpose
SiteAnalyzerSessionLogger            Logs session events such as login or logout
SiteAnalyzerUserManagementLogger     Logs user and group management events such as creating or deleting users and groups
SiteAnalyzerPageLogger               Logs page render events
SiteAnalyzerPortletLogger            Logs portlet render events
SiteAnalyzerPortletActionLogger      Logs actions that occurred in a portlet
SiteAnalyzerApplicationActionLogger  Logs actions that occurred in a portlet application
SiteAnalyzerErrorLogger              Logs any errors

For each interaction a user makes with the portal, the appropriate logger (if it is configured) creates a new entry in the site analytics log file. For each activity, the general format of the log record follows the definition given in the section "Examining the log file format". The difference is in the request URI, in which specific data is recorded for each activity.

For example, the page logger records the name of the page the user just requested, the name of the parent page (if it is a derived page), and some other data that is specific to the pages. Likewise, the session logger creates a record whenever a user logs in or out of the portal. In the case of a log-out, the session logger logs the reason for the logout (for example, timed out) as the URI parameter.

The specific data for each logger goes into the request URI, either as path information or as URI parameters.

Using the loggers to record events

Let's look at the loggers in more detail, and examine the format of the associated request URI.

SiteAnalyzerSessionLogger is responsible for recording login and logout user activity. When a user logs into the portal, the request URI is /Command/Login and the user ID is logged in the USER_ID part of the log record. If the login attempt fails, a query string of ErrorCode=x (where x is the error code value) is appended to the request URI. The status code is set to the appropriate HTTP status code.

After the user logs out, whether explicitly or through a timeout, a log record with a request URI of /Command/Logout is created. If the reason for the logout is a session timeout, a Reason=SessionTimedOut query string is appended to the request URI.
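The URI conventions of the session logger can be illustrated with a short Python sketch. The function name classify_session_event and the labels it returns are assumptions made for this example; they are not part of WebSphere Portal.

```python
from urllib.parse import parse_qs


def classify_session_event(request_uri):
    """Map a SiteAnalyzerSessionLogger request URI to a simple label.

    The labels returned here are hypothetical names chosen for this sketch.
    """
    path, _, query = request_uri.partition("?")
    params = parse_qs(query)
    if path == "/Command/Login":
        # A failed login carries an ErrorCode=x query string.
        return "login_failed" if "ErrorCode" in params else "login"
    if path == "/Command/Logout":
        # A timed-out session carries Reason=SessionTimedOut.
        if params.get("Reason") == ["SessionTimedOut"]:
            return "logout_timeout"
        return "logout"
    return "other"
```

A reporting script could call this function for every request URI in the log and count the labels, for example to compare explicit logouts with session timeouts.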

Logging user management events

You can use the SiteAnalyzerUserManagementLogger to log any user management events that are made through the portal's administrative interface. When a new user is created, the logger records the event in the request URI as /Command/UserManagement/CreateUser. Deleting a user results in a request URI of /Command/UserManagement/DeleteUser.

You use the same mechanism for recording events associated with creating, modifying, and deleting groups. The respective URIs are /Command/UserManagement/CreateGroup, /Command/UserManagement/ModifyGroup, and /Command/UserManagement/DeleteGroup.

No further detail about the user or group that was subject to an operation is currently logged. Important: Logging for Modify events is not available in releases before WebSphere Portal V6.

Logging page events

Whenever a page is displayed to a requester, SiteAnalyzerPageLogger creates a corresponding log entry. The request URI starts with /Page/, followed by the unique object ID and the name of the page.

SiteAnalyzerPageLogger also records any page management. The request URI is either /Command/Customizer/CreatePage, /Command/Customizer/EditPage, or /Command/Customizer/DeletePage. The unique object ID and the name of the managed page are appended to the request URI as parameters of the query string. For example:

/Command/Customizer/EditPage?Page=6_0_5RH_[CONTENT_NODE:6001]&PageName=Products

In this example, a page whose object ID is 6_0_5RH and whose name is Products was edited.
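As a minimal sketch, a page management URI such as the one above could be taken apart in Python as follows. The function name parse_page_command is hypothetical, and the rule that the object ID is the part before the "_[" suffix is an assumption based only on this one example.

```python
from urllib.parse import parse_qs


def parse_page_command(request_uri):
    """Split a page management request URI into action, object ID, and name."""
    path, _, query = request_uri.partition("?")
    params = parse_qs(query)
    page = params.get("Page", [""])[0]
    return {
        # Last path segment, for example "EditPage".
        "action": path.rsplit("/", 1)[-1],
        # Assumption: the object ID is the part before the "_[" suffix.
        "object_id": page.split("_[", 1)[0],
        "page_name": params.get("PageName", [""])[0],
    }


event = parse_page_command(
    "/Command/Customizer/EditPage?Page=6_0_5RH_[CONTENT_NODE:6001]&PageName=Products"
)
```

For the sample URI, event holds the action EditPage, the object ID 6_0_5RH, and the page name Products.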

Logging portlet events

If the SiteAnalyzerPortletLogger is configured, it creates a log record for each portlet that is rendered, no matter which page contains the portlet.

The request URI of the log record starts with /Portlet/. The unique object ID of the portlet being rendered is appended, as well as the portlet name. The query string contains the unique object ID of the portlet, along with the current portlet mode {View, Edit, Configure, Help} and state {Normal, Minimized, Maximized}.

If both SiteAnalyzerPageLogger and SiteAnalyzerPortletLogger are turned on, rendering a single page can lead to multiple log records in the sa.log (one line for the page and one line for each portlet on that page).
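The structure of a portlet request URI can be sketched in Python along the same lines. The function name parse_portlet_uri is hypothetical; the query parameter names PortletMode and PortletState are taken from the sample record in the Examples section of this article.

```python
from urllib.parse import parse_qs


def parse_portlet_uri(request_uri):
    """Extract portlet name, mode, and state from a /Portlet/ request URI."""
    path, _, query = request_uri.partition("?")
    if not path.startswith("/Portlet/"):
        return None
    # Path layout: /Portlet/<object ID>/<portlet name>
    object_id, _, name = path[len("/Portlet/"):].partition("/")
    params = parse_qs(query)
    return {
        "object_id": object_id,
        "name": name,
        "mode": params.get("PortletMode", [None])[0],
        "state": params.get("PortletState", [None])[0],
    }
```

Records whose URI does not start with /Portlet/ return None, so the same loop can skip page and session records when both loggers are turned on.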

Logging portlet actions

SiteAnalyzerPortletActionLogger is not invoked by the general event infrastructure in WebSphere Portal. It writes records only when you invoke it explicitly; the request URI of those records starts with /PortletAction/.

Logging errors

If an error occurs while rendering a page or portlet, the SiteAnalyzerErrorLogger creates a corresponding record in the analytics log. This is not a replacement for the system event logs (wps_*.log); instead, it lets you record errors from a business perspective, such as the number of failing portlets within a certain timeframe.

If there are errors to report, the corresponding log records start with /Error/Portlet or /Error/Page.

Application action logger

This logger is reserved for future use. The intention is to enable portlets to contribute to the site analytics log. However, at the time of writing this paper, using this logger is not yet supported.

If the logger writes anything, the request URI will start with /ApplicationAction/.

Examining the log file format

The format of the log records follows the NCSA Combined log format. Using the Extended Backus-Naur Form (EBNF), it can be formally specified as follows:

STRING = ? any sequence of printable characters except the quote sign ?;
NUMBER = ? a decimal integer, optionally preceded by a minus sign ?;
HOST = STRING;
CLIENT_ID = STRING;
USER_ID = STRING;
USER_AGENT = STRING;
COOKIES = STRING;
HYPHEN = "-";
QUOTE = '"';
SIGN = "+" | "-";
SPACE = " ";
SLASH = "/";
COLON = ":";
DAY = ? "01" .. "31" ?;
MONTH = "Jan" | "Feb" | "Mar" | "Apr" | "May" | "Jun" | "Jul" | "Aug" | "Sep" | "Oct" | "Nov" | "Dec";
YEAR = ? four digits ?;
HOURS = ? "00" .. "23" ?;
MINUTES = ? "00" .. "59" ?;
SECONDS = ? "00" .. "59" ?;
RFC822TIMEZONE = SIGN HOURS MINUTES;
TIMESTAMP = DAY SLASH MONTH SLASH YEAR COLON HOURS COLON MINUTES COLON SECONDS SPACE
RFC822TIMEZONE;
PORT = ? 0 .. 65535 ?;
URI = ("http" | "https") COLON SLASH SLASH HOST [COLON PORT] STRING;
STATUS_CODE = ? NUMBER that is a legal HTTP status code ?;
BYTES = NUMBER;
REFERER = [URI];
REQUEST_URI = SLASH STRING;
REQUEST = ("GET" | "POST" | "PUT" | "DELETE") SPACE REQUEST_URI SPACE ("HTTP/1.0" | "HTTP/1.1");
SA_LINE = HOST SPACE (CLIENT_ID | HYPHEN) SPACE (USER_ID | HYPHEN) SPACE "[" TIMESTAMP "]"
SPACE QUOTE REQUEST QUOTE SPACE STATUS_CODE SPACE BYTES SPACE QUOTE REFERER QUOTE SPACE QUOTE
USER_AGENT QUOTE SPACE QUOTE COOKIES QUOTE;



Tip:
The BYTES value is usually -1, meaning that the size of the returned markup is unknown, because of the dynamic nature of a portal page.

The request URI is artificial and cannot be called from a browser. Its only purpose is to carry the relevant logging information in a way that is compatible with the NCSA Combined format. The URI is also independent of changes you make to your site's structure and content. You can use page names or page IDs to structure your reports in a way that suits your needs.

Correlating information

While you probably want to know which pages are requested by users, you might also want to understand the relationships among requests. Because HTTP is a stateless protocol, there is no inherent information about the page a user looked at previously before he or she looks at another one. This problem is typically solved by employing a user session, and this is what WebSphere Portal and the underlying WebSphere Application Server use.

WebSphere Portal tracks sessions using cookies, and the default cookie is typically called JSESSIONID. Whenever a user logs in, WebSphere Portal creates a new session and stores the key to the session in the browser as the cookie's value. From then on, the browser sends the cookie with each request to the server the cookie came from. By reading the cookie value, the server can correlate a specific request with a specific session and with previous requests.

The user's session can also be used to find related requests in the site analytics log. The requests can be grouped by session to gather additional information. For example, to find the most common click trails through a portal, you could group all requests by their session and then derive the click trail for each session. By counting the number of similar trails, you could determine the "most used" click trails through that portal.
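The grouping described above can be sketched in a few lines of Python. The function name most_common_trails and the sample data are made up for illustration; the sketch assumes that a separate parsing step has already turned the log into (session ID, page name) pairs in log order.

```python
from collections import Counter, defaultdict


def most_common_trails(records, top=3):
    """Count identical click trails across sessions.

    `records` is an iterable of (session_id, page_name) pairs in log order;
    producing those pairs from sa.log is left to a separate parsing step.
    """
    trails = defaultdict(list)
    for session_id, page_name in records:
        trails[session_id].append(page_name)
    # A tuple of page names is hashable, so it can serve as a Counter key.
    counts = Counter(tuple(pages) for pages in trails.values())
    return counts.most_common(top)


# Hypothetical sample: three sessions, two of which follow the same trail.
sample = [
    ("s1", "Welcome"), ("s2", "Welcome"), ("s1", "Products"),
    ("s3", "Welcome"), ("s2", "Products"), ("s3", "Support"),
]
```

Calling most_common_trails(sample) ranks the trail Welcome, Products (two sessions) ahead of Welcome, Support (one session).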

As an alternative correlation mechanism, you could send a dedicated "tracing" cookie to the user's browser. That cookie is written to sa.log in the same manner as any other cookie. The portlet programmer can choose the name and value of the cookie, and can thereby log arbitrary data to sa.log without imposing much overhead.

Examples

Logging page requests

When a user requests a page (and PageLogger is turned on), WebSphere Portal creates a log entry for the page:

localhost - wpsadmin [15/Jun/2006:23:42:10 +0200] "GET /Page/6_0_4D_[CONTENT_NODE:141]/Welcome HTTP/1.1" 200 -1 "" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 (CK-IBM) Firefox/1.5.0.4" "JSESSIONID=0000c9fiQ6Q14XUbsp3QFQHFkq9:-1"


You can derive the following data from the log entry:

  • The request was made from localhost.
  • The authenticated user for this request was wpsadmin (short name of the user).
  • Handling the request was finished at 15/Jun/2006:23:42:10, GMT +0200.
  • The page with title "Welcome" was requested.
  • The request was successful (HTTP response code 200).
  • The size of the returned markup is unknown to the logger (-1).
  • The request was made by a browser which identifies itself as "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 (CK-IBM) Firefox/1.5.0.4", which is actually Firefox 1.5.0.4 on Windows® XP.
  • The request was made within a session, identified by the session id "0000c9fiQ6Q14XUbsp3QFQHFkq9".
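The fields listed above can be extracted programmatically. The following Python sketch uses a regular expression that was written for this example against the sample record; it is an assumption about the record layout, not a pattern shipped with WebSphere Portal, and the names SA_LINE and parse_sa_line are hypothetical.

```python
import re
from datetime import datetime

# Assumed pattern for one sa.log record in NCSA Combined format plus cookies.
SA_LINE = re.compile(
    r'(?P<host>\S+) (?P<client_id>\S+) (?P<user_id>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>.+?) (?P<protocol>HTTP/[\d.]+)" '
    r'(?P<status>\d{3}) (?P<bytes>-?\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)" "(?P<cookies>[^"]*)"'
)


def parse_sa_line(line):
    """Return the fields of one log record as a dict, or None if no match."""
    match = SA_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Example timestamp: 15/Jun/2006:23:42:10 +0200
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    record["status"] = int(record["status"])
    record["bytes"] = int(record["bytes"])
    return record


line = (
    'localhost - wpsadmin [15/Jun/2006:23:42:10 +0200] '
    '"GET /Page/6_0_4D_[CONTENT_NODE:141]/Welcome HTTP/1.1" 200 -1 "" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) '
    'Gecko/20060508 (CK-IBM) Firefox/1.5.0.4" '
    '"JSESSIONID=0000c9fiQ6Q14XUbsp3QFQHFkq9:-1"'
)
record = parse_sa_line(line)
```

The month abbreviation is parsed with %b, which depends on the process locale; the sketch assumes an English or C locale.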

Logging page and portlet requests

When a user requests a page and PageLogger and PortletLogger are both turned on, log entries are created for the page and for each portlet on that page. The first log entry and the information it contains will be similar to the one mentioned above:

localhost - wpsadmin [15/Jun/2006:23:42:10 +0200] "GET /Page/6_0_4D_[CONTENT_NODE:141]/Welcome HTTP/1.1" 200 -1 "" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 (CK-IBM) Firefox/1.5.0.4" "JSESSIONID=0000c9fiQ6Q14XUbsp3QFQHFkq9:-1"


Because PortletLogger is now turned on, an additional entry is created for each portlet that resides on the page; for example:

localhost - wpsadmin [15/Jun/2006:23:42:16 +0200] "GET /Portlet/5_0_49_[PORTLET_ENTITY:137]/Welcome_to_WebSphere_Portal?PortletPID=5_0_49_[PORTLET_ENTITY:137]&PortletMode=View&PortletState=Normal HTTP/1.1" 200 -1 "http://localhost/Page/6_0_4D_[CONTENT_NODE:141]/Welcome" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.4) Gecko/20060508 (CK-IBM) Firefox/1.5.0.4" "JSESSIONID=0000c9fiQ6Q14XUbsp3QFQHFkq9:-1"


The additional information in this line is:

* A portlet with the title "Welcome to WebSphere Portal" was rendered (/Portlet/5_0_49_[PORTLET_ENTITY:137]/Welcome_to_WebSphere_Portal).
* It was rendered in View mode and its state was Normal, that is, not minimized, maximized, or solo (PortletMode=View&PortletState=Normal).
* It is located on the page with name "Welcome" (referrer, http://localhost/Page/6_0_4D_[CONTENT_NODE:141]/Welcome).

Logging which portlet was requested seems redundant at first; the portal's static configuration should tell exactly which portlets are deployed to a certain page. However, every user who is in an Editor role on a page can remove and replace portlets at will. Logging both page and portlets lets you determine which portlets a user requested.

Similarly, PortletLogger records the page that includes the portlet in the referrer field. If PageLogger is turned off, this field lets you determine which page rendered the portlet.
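As a small illustration, the containing page can be recovered from the referrer field of a portlet record. The Python function below is a sketch with a hypothetical name; it assumes that the referrer ends in /Page/<object ID>/<page name>, as in the sample record shown earlier.

```python
def page_from_referer(referer):
    """Return the page name from a PortletLogger referrer, or None.

    Assumes the referrer ends in /Page/<object ID>/<page name>.
    """
    marker = "/Page/"
    index = referer.find(marker)
    if index == -1:
        return None
    # Everything after the marker is "<object ID>/<page name>".
    _object_id, _, page_name = referer[index + len(marker):].partition("/")
    return page_name or None
```

For the referrer in the sample portlet record, the function returns Welcome; for an empty referrer it returns None.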

Information that can be derived from the log

Based on this data, the log analytics package can create reports such as:

  • Hosts and domains of the users visiting the portal (by counting and aggregating the HOST elements).
  • Different logins corresponding to authenticated users (by counting and aggregating the USER_ID elements).
  • Robots and browser details (by counting and aggregating the USER_AGENT elements).
  • Pages that were requested but not found (STATUS_CODE).
  • Search engines, key phrases and key words (REFERER).
  • Different operating systems as reported by the browser (USER_AGENT).
  • The referring page a user came from (REFERER).
  • Entry and exit URLs (the most common first and last pages requested within unique sessions).

There are many more potential reports and aggregations than listed above. Information such as common click trails of users, or the response in relation to a marketing campaign, start to touch the area of data warehousing. This information is typically a subject for larger, distinct projects. Simple log reporting and analytics cannot answer those advanced questions.

Reporting with open source tools

There is a wide range of commercial and open source site analysis and reporting tools available. The example report below uses the popular Open Source AWStats package.

Introducing the example

The example creates one simple report called This Month's Top Pages. This report should provide a graph of the pages that have been requested within a particular time period. It will show the name of the page and the access count (number of times that page has been requested).
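To make the goal of the report concrete, the following Python sketch computes the same kind of ranking by hand. It only approximates what such a report shows and says nothing about how AWStats computes it internally; the function name top_pages and the sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime


def top_pages(records, year, month, limit=10):
    """Count /Page/ requests per page name for one calendar month.

    `records` is an iterable of (timestamp, request_uri) pairs, where
    timestamp is a datetime; how they are read from sa.log is out of scope.
    """
    counts = Counter()
    for timestamp, uri in records:
        if (timestamp.year, timestamp.month) != (year, month):
            continue
        if not uri.startswith("/Page/"):
            continue
        # URI layout: /Page/<object ID>/<page name>
        page_name = uri.split("?", 1)[0].rsplit("/", 1)[-1]
        counts[page_name] += 1
    return counts.most_common(limit)


# Hypothetical sample: two page hits in June, one portlet record, one July hit.
sample = [
    (datetime(2006, 6, 15, 23, 42), "/Page/6_0_4D_[CONTENT_NODE:141]/Welcome"),
    (datetime(2006, 6, 16, 9, 0), "/Page/6_0_4D_[CONTENT_NODE:141]/Welcome"),
    (datetime(2006, 6, 16, 9, 1),
     "/Portlet/5_0_49_[PORTLET_ENTITY:137]/Welcome_to_WebSphere_Portal"),
    (datetime(2006, 7, 1, 8, 0), "/Page/6_0_4D_[CONTENT_NODE:141]/Welcome"),
]
```

With this sample, top_pages(sample, 2006, 6) reports two hits for the Welcome page; the portlet record and the July request are ignored.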

Turning on the instrumentation

Start by enabling the relevant instrumentation in WebSphere Portal:

Enable the loggers by setting the following values in the WebSphere Application Server administrative console:

SiteAnalyzerPageLogger.isLogging=true
SiteAnalyzerPortletLogger.isLogging=true

Usually the settings are already there; they just need to be set to the new values. (For the details of this procedure, see the Setting configuration properties topic in the WebSphere Portal V6 Information Center.)

Restart WebSphere Portal.

Make a few requests against the portal and then check that sa.log contains new records.

Installing and configuring AWStats

AWStats comes with extensive information about installing it. For the sake of our example, we will only use the offline analysis. Therefore, it is enough to:

1. Download AWStats.
2. Unzip the package into a directory.
3. Configure a site according to the documentation. For example, create awstats.localhost.conf (with localhost being the site name) and change the following configuration statements:

LogFile="C:\Program Files\IBM\Portal51UTE\PortalServer\log\sa.log"
LogFormat=1
LogSeparator=" "
SiteDomain="localhost"
DNSLookup=1
AllowToUpdateStatsFromBrowser=0


4. Create the initial statistics database:

perl awstats.pl -config=localhost -update

5. Create the overview report as shown in Figure 1:

perl awstats.pl -config=localhost -output -staticlinks > awstats.localhost.html

Testing the report

To see how different reports relate to different request schemes, you create a suite of test pages, test users, and test requests. Then you can see how AWStats analyzes and reports different request schemes.

Creating a test hierarchy of pages

First, set up a small, automated test that makes a number of requests against the portal. The report created from those calls should show exactly the requests we made.

To create the requests for this example, you use Apache's JMeter tool. To keep things simple, create a hierarchy of predefined portal pages to be called with JMeter.

Again, to keep the test simple, you can derive each page from a common template page. The download of the supporting files for this article contains an XMLAccess file (createPortalAnalysisTree.xml) that you can use to create the complete sample hierarchy.

Assign a URI mapping context to each page. This will ease the setup of JMeter quite a bit.

Creating a group of test users

In addition to showing requests to different pages, the example report should also show requests by different, authenticated users. XMLAccess supports the creation of users and groups as long as the user registry connection is configured for read/write access. The code archive in the download file contains an XMLAccess script (createPortalAnalyticsTestUsers.xml) that creates a number of test users and a group containing them.

The script that creates the sample page hierarchy assumes that group "Portal Analytics Test-Users" exists. The createPortalAnalyticsTestUsers.xml script creates that group, too.

Installing and configuring JMeter

JMeter is a simple load test tool in Apache's Jakarta project. It's a pure Java tool that tests server performance using various protocols. In this example, we use its HTTP connection to create a series of requests against the portal. Then we can manually correlate the requests made by JMeter with the records that portal analytics writes to sa.log.

The examples in this article were created using JMeter 2.2, which you can download. After you have met all the prerequisites, you can start it from the bin subdirectory in the JMeter directory. The tool's main window will look similar to Figure

The initial configuration primarily follows JMeter's user manual. You build the test plan to log in as one of the test users (analyticsTest001 .. analyticsTest009) and then request the home page. The subsequent requests, issued without think time, ask for the products, services, support, and account pages. The iteration stops without logging out.

Without going into too much detail regarding how we configured JMeter, here are the main config items that are used to conduct the test:

  • Thread Group "Analytics Test Users" runs with 10 users (threads).
  • HTTP Request Defaults are set to point to http://localhost:9081.
  • An HTTP Cookie Manager is used to keep track of cookies, but clears cookies for each iteration.
  • An HTTP Header Manager sends a custom User-Agent header. You can use this technique in a production environment to distinguish artificial JMeter requests from real users.
  • Six HTTP requests follow (the login and five page requests):
1. Log in. We use the following well-known login URL and pass a randomly selected user ID.

/wps/portal/cxml/04_SD9ePMtCP1I800I_KydQvyHFUBADPmuQy?userid= &password=
2. /analytics/home
3. /analytics/products
4. /analytics/services
5. /analytics/support
6. /analytics/account
  • Finally, a View Results Tree shows the results of the test plan.

Running the test

Running the test is simple. After all pieces are in place (test users, page hierarchy, and test plan), you invoke the test by selecting Run => Start in the JMeter tool. The tool creates ten threads, each modelling an individual test user. Each user logs on to the portal and selects the defined sequence of pages.

Results

Once the plan execution stops, update the AWStats statistics database with the new log records (the -update command from step 4 above), and then regenerate the report (see Resources):

perl awstats.pl -config=localhost -output -staticlinks > awstats.localhost.html

The results, shown in Figure 5, indicate that the most popular page for our sample test plan is the home page. However, it shows 20 hits. Didn't we simulate just ten users with a single iteration each? This is really no surprise when you think about the way WebSphere Portal and J2EE™ work. The first request we made was the login request. If this is successful, WebSphere Portal displays the very first page on which a user has view rights. In this case, if there is nothing underneath "My Portal" besides our test hierarchy, the first page is the home page. But we request the home page again in our next call after login. Therefore, we get twice the number of hits for the home page.
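The arithmetic behind the 20 hits can be reproduced with a small Python model of the test plan. This is only a sketch: the function name simulate_page_hits is made up, and the model assumes that every login succeeds and renders the landing page exactly once.

```python
from collections import Counter


def simulate_page_hits(users, requested_pages, landing_page):
    """Count page renders for a test plan in which every user logs in once.

    Assumption: a successful login renders `landing_page` once, and each
    explicitly requested page is rendered once per user.
    """
    hits = Counter()
    for _ in range(users):
        hits[landing_page] += 1      # rendered as a result of the login
        for page in requested_pages:
            hits[page] += 1          # rendered by the explicit request
    return hits


hits = simulate_page_hits(
    users=10,
    requested_pages=["home", "products", "services", "support", "account"],
    landing_page="home",
)
```

In this model the home page receives 10 hits from the logins plus 10 from the explicit requests, which gives 20, while every other page receives 10.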

The most popular portlet is the "Information Portlet". Again, no surprise here because we had only one portlet on our single page template.

In a real-world scenario this report would, of course, look much more complex. However, the small and simple setup in this example gives you a clear understanding of how portal analytics works.

Setting configuration parameters in WebSphere Portal V5.1

Configuring WebSphere Portal V5.1 for portal analytics is slightly different from the procedure for WebSphere Portal V6. The main difference is that in WebSphere Portal V5.1, all configuration settings are made through property files, whereas WebSphere Portal V6 manages its configuration with the help of the Resource Environment Provider facility of WebSphere Application Server V6.

To enable analytics instrumentation in WebSphere Portal V5.1, turn on the logger in /shared/app/config/services/SiteAnalyzerLogService.properties by setting:

SiteAnalyzerPageLogger.isLogging=true
SiteAnalyzerPortletLogger.isLogging=true

Usually the lines are already there; they are just commented out. In this case it is enough to un-comment those lines. All other elements are similar to those for WebSphere Portal V6.

Conclusion

WebSphere Portal's analytics log provides all the necessary data for portal analytics. Although the recorded URLs are not real, clickable URLs, they still provide the relevant information to find out which pages the users of the portal looked at.

This paper explained the data collected by WebSphere Portal in the analytics log file, and showed how to use that data with the AWStats Open Source Web site analysis package.