Amithalal Caldera and Yogesh Deshpande
Web Usage Mining (WUM) is the discovery of interesting knowledge from Web server logs. The access log files of a Web server record users’ on-site behaviour in considerable detail. The validity of WUM depends on the accurate identification of the user sessions implicitly recorded in these logs. In some applications, a user may be explicitly identified through user authentication. In general, however, Web logs do not contain a user id, and separate user sessions have to be inferred through heuristics. This inference is difficult because of several complicating factors, such as Web caching, the existence of proxy servers and the stateless service model of the HTTP protocol. Several heuristics exist to address these problems. By their nature, heuristics yield inexact and variable results. It is therefore crucial to analyse and understand how well a particular heuristic is likely to perform in a given environment. This paper reports on an investigation into the performance of a composite heuristic, based on three published heuristics from the literature, for identifying sessions from Web logs. We use the logs of a university Web server that records user ids for administrative reasons, which allows us to evaluate the heuristics against concrete knowledge of the actual user sessions. Based on this evaluation, the paper also proposes a strategy for future log analyses and makes recommendations for further work.
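To illustrate what a session-identification heuristic looks like in practice, the following is a minimal sketch of one widely used kind, an inactivity timeout. It is not the composite heuristic investigated in this paper, whose three component heuristics are not named in the abstract. The function name `sessionize`, the record layout, the use of the (IP address, user agent) pair as a stand-in for a visitor, and the 30-minute threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity threshold; not a value taken from the paper.
TIMEOUT = timedelta(minutes=30)


def sessionize(records, timeout=TIMEOUT):
    """Split log records into sessions with an inactivity-timeout heuristic.

    records: iterable of (ip, user_agent, timestamp, url) tuples
             (a hypothetical, already-parsed log format).
    Returns a list of (visitor, hits) pairs, where visitor is the
    (ip, user_agent) pair and hits is a time-ordered list of (timestamp, url).
    """
    # Without a user id, approximate a visitor by IP address and user agent.
    by_visitor = defaultdict(list)
    for ip, agent, ts, url in records:
        by_visitor[(ip, agent)].append((ts, url))

    sessions = []
    for visitor, hits in by_visitor.items():
        hits.sort(key=lambda h: h[0])
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            # A gap longer than the timeout starts a new session.
            if hit[0] - prev[0] > timeout:
                sessions.append((visitor, current))
                current = []
            current.append(hit)
        sessions.append((visitor, current))
    return sessions


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 0)
    demo = [
        ("10.0.0.1", "UA1", t0, "/a"),
        ("10.0.0.1", "UA1", t0 + timedelta(minutes=10), "/b"),
        ("10.0.0.1", "UA1", t0 + timedelta(hours=2), "/c"),
    ]
    for visitor, hits in sessionize(demo):
        print(visitor, [url for _, url in hits])
```

Where the log also carries an authenticated user id, as in the university logs used here, sessions inferred this way could be compared with sessions built from the recorded ids. The sketch also shows why proxies and caching cause trouble: several users behind one proxy share an IP address and would be merged into a single visitor.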