Traces in the Clickstream: Early Work on a Management Information Repository at the University of Pennsylvania
For the past three years, the University of Pennsylvania (Penn) library has been building a data repository and developing computer functionality to support management information needs. This article traces the origin and evolution of Penn’s management information system (MIS) program, known as the data farm. It addresses problems pertaining to the collection, storage, anonymization, and normalization of data, and looks at current work on a database-driven model for future MIS functions.
In recent years, the range of academic library services powered by relational databases and servers has increased substantially. Our customers access electronic journals and search article indexes on the Web. They wand barcodes to charge out books, use debit cards to copy or print documents, swipe ID badges to enter library buildings . . . and the list goes on. These interactions with the library give rise to a flood of transaction data recorded by the databases and computers that drive service. The machines register when and where services are used and contain information about individuals and the resources people access. Occasionally they even capture clues about an individual’s work environment: the traces of a cookie or a referring URL can provide such indicators, and the location of a photocopier may offer others. The powerful tools of library service can support equally powerful tools of library assessment. In the right organizational culture, those assessment tools can inform and thus improve planning and decision-making. They can serve less benign purposes as well, absent policies and procedures designed to safeguard identities.
Three years ago, motivated by the need to improve the measurement of electronic resource use, the Penn library began working with various kinds of transaction data, and in particular Web server logs, as sources of management information. Though narrowly defined in its early stage, this effort at log analysis is beginning to resemble the form and function of an MIS. The following is a brief survey of the embryonic MIS at the Penn library, an aggregation of data files and databases, form interfaces, and Web pages called the library data farm.
Initial Challenges
By 1998, the Penn library offered a large collection of online article indexes, full-text databases, and electronic journals through its Web site. As this quickly evolving, costly set of services expanded, frustrations mounted over the difficulty of measuring its use and impact. The lack of good management information, if only in the form of frequencies or other descriptive statistics, was viewed as a serious shortcoming, one that could hamper the library’s accountability to the university’s schools and impede planning and budgeting.
For several years, the library staff has worked diligently to compile the few statistics that vendors provide and trace the spiral of Web activity based on Web-site visits and page counts. But these attempts at measurement have had obvious drawbacks. Vendor statistics have been meager, erratic, poorly defined, and incompatible across products. Web measures have provided information about machine load, but have revealed nothing about resource use. Together, the external and internal sources have contributed little clarity to the picture of digital information use. In addition, the compilation of these crudest of measures has proved too time consuming and labor intensive to carry out with any regularity.
Given the absence of third-party solutions or working models, any approach to the management information problem would require a lot of experimentation with local data sources and systems. The approach would have to be independent of information providers and sufficiently robust to generate useful statistics with a modicum of labor. In short, it would have to provide a means of:
- increasing the resolution of library statistics, especially with regard to demographics and cost analysis;
- counting with reasonable accuracy and sustainable methods;
- reducing the level of effort involved in harvesting data and converting them into information; and
- improving the consistency and reliability of data collection and administration.
The key to meeting these objectives lies in the self-monitoring capabilities of the library management system (the Penn LMS is the Voyager product from Endeavor Information Systems) and other components of the information technology (IT) infrastructure, specifically their standardized and configurable logging processes.
Prior to the data farm, the library viewed logs as a byproduct rather than an end of service, and thus afforded logging low operational priority. At Penn—and surely elsewhere—logs were often discarded or lost whenever servers ran short on storage, or machines were rebuilt or redeployed. Even seemingly trivial attempts to count the number of HTML pages served often fell short for want of data. If logs were to provide the basis for measurement, then logging processes would have to be routinized to sustain and ensure the availability of log inputs. The data farm traces its origin to these narrow objectives.
Initial Steps
In winter 2000, work on log analysis began in earnest, focusing first on logs generated by the servers supporting open and proxied Web access. Server logs are enormous. As a consequence, they have high administrative and storage demands. In their raw form, they contain only a small percentage of useful information, and require substantial refinement before analysis can take place. To address the handling and storage issues, staff devised Perl scripts that roll log data nightly from production servers into a two-hundred-gigabyte networked storage system established for assessment purposes. Other Perl modules parse the huge, nightly dumps, then extract and save the outputs into smaller, content-rich files. The networked storage array has three partitions: one with folders accepting the massive raw logs, another with folders for processed log extracts, and a third with folders for reports.
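The extraction step can be illustrated with a minimal sketch. The article's scripts were written in Perl and are not reproduced here; the Python below is a stand-in that assumes logs in the Apache "combined" format, and the function names and sample line are hypothetical.

```python
import re

# Assumed input: Apache "combined" log format
# (host, identity, user, time, request, status, bytes, referrer, agent).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Fields kept in the smaller extract; everything else is discarded.
KEPT_FIELDS = ("host", "time", "request", "referrer")


def extract_fields(line):
    """Return the kept fields from one raw log line, or None if it does not parse."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {key: match.group(key) for key in KEPT_FIELDS}


def extract_log(lines):
    """Yield compact records from raw log lines, skipping unparseable ones."""
    for line in lines:
        record = extract_fields(line)
        if record is not None:
            yield record


# Hypothetical sample line, using documentation-reserved addresses and hosts.
sample = ('192.0.2.10 - - [12/Feb/2000:09:15:02 -0500] '
          '"GET /index.html HTTP/1.0" 200 5120 '
          '"http://www.example.edu/start.html" "Mozilla/4.7"')
```

Because `extract_log` is a generator, a nightly dump can be streamed through it line by line without holding the whole file in memory.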
Improving Data Resolution
Resource Identifiers
The pieces of extracted content stored in processed folders include the resolved host names of incoming IP addresses, date and time stamps, and referring URLs. Each element provides a different key to sharpening the resolution of the usage data. The most important key is a certain referring URL designed to log each time a user requests a licensed resource, such as ABI/Inform. This counting referrer captures resource names before routing visitors to a vendor or another remote site considered statistically important. In this way, a channel was opened in the log stream for marking and later counting events that would otherwise evade our scrutiny (see note 1).
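As a rough illustration of how a counting URL can carry a resource name into the log, the sketch below pulls the name from a logged request line. The article does not give the actual URL layout, so the `/redirect` path and the `resource` parameter are assumptions made only for this example.

```python
from urllib.parse import urlsplit, parse_qs

COUNTER_PATH = "/redirect"    # hypothetical path of the counting URL
RESOURCE_PARAM = "resource"   # hypothetical query parameter holding the name


def resource_from_request(request_line):
    """Return the resource name carried by a counting request, else None."""
    parts = request_line.split()
    if len(parts) < 2:
        return None
    url = urlsplit(parts[1])  # parts[1] is the requested path plus query string
    if url.path != COUNTER_PATH:
        return None
    values = parse_qs(url.query).get(RESOURCE_PARAM)
    return values[0] if values else None
```

Requests that do not pass through the counting URL return `None`, so they can be dropped from the resource-use extract.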
Demographic Information
The second key to higher resolution data involves identifying the whereabouts of visitors to the library Web site and harvesting demographic information from these location sources. Network IP addresses make it possible to trace all on-campus Web activity to offices, labs, and residential facilities associated with Penn’s twelve schools and other budget authorities. To demographically analyze the on-campus segment of traffic, library staff constructed a detailed map of host names derived from numerical IP addresses and the campus assignments database. All off-campus requests for licensed services require proxy authentication with Penn credentials, and library staff has recourse to good demographic information for this portion of Web activity by linking the Penn credential to school and status descriptors contained in the library’s patron database. As nightly log dumps from the main Web server and proxy server are processed, scripts match IP information to the campus IP map, and credentials to library patron records. The scripts then execute processes that replace IP data and Penn IDs with demographic markers for schools or campus centers. With this step, staff is able to attribute nearly all traffic involving licensed resources to a demographic source and anonymize library data files.
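The substitution of demographic markers for identifying data might look something like the following sketch. It matches on IP networks rather than on the host-name map the article describes, and the network ranges, patron IDs, and school labels are all invented for illustration.

```python
import ipaddress

# Hypothetical campus map: network ranges and the marker each one stands for.
CAMPUS_MAP = [
    (ipaddress.ip_network("192.0.2.0/25"), "School A"),
    (ipaddress.ip_network("192.0.2.128/25"), "School B"),
]

# Hypothetical patron lookup used for proxied (off-campus) sessions.
PATRON_SCHOOL = {"patron-001": "School C"}


def demographic_marker(ip, patron_id=None):
    """Map a patron ID (if authenticated) or an IP address to a marker."""
    if patron_id is not None:
        return PATRON_SCHOOL.get(patron_id, "unknown")
    address = ipaddress.ip_address(ip)
    for network, marker in CAMPUS_MAP:
        if address in network:
            return marker
    return "unknown"


def anonymize(record):
    """Return a copy of the record with host and patron ID replaced by a marker."""
    marker = demographic_marker(record["host"], record.get("patron_id"))
    cleaned = {k: v for k, v in record.items() if k not in ("host", "patron_id")}
    cleaned["group"] = marker
    return cleaned
```

Only the marker survives in the output record, so the stored extract no longer contains the address or credential it was derived from.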
Normalized Inputs
The process of cleansing and normalizing anomalies in the raw log files presented significant problems. The names of resources parsed out of the referring URL often appear in variant forms (four or five different instances of ABI/Inform, for example) or with control characters introduced by browsers. The variability of title entries caused sorting errors and garbled the first log outputs. To deal with this, resource names taken from the raw log are rendered into continuous strings of alphabetic characters and matched against lists of similarly rendered title variations and normalized equivalents. A script substitutes normalized versions for aberrant entries, with about 98 percent accuracy; in the end only a handful of titles appear in data farm reports in more than one form. The normalization list also provides publisher name and information about formats (for example, that a resource contained abstracts, full text, PDFs, or other features).
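The matching technique can be sketched as follows. Reducing each name to alphabetic characters follows the description above; lowercasing is an added assumption, and the variant list is a made-up example rather than the library's actual normalization list.

```python
import re


def match_key(raw_name):
    """Render a name as a continuous lowercase string of alphabetic characters."""
    return re.sub(r"[^a-z]", "", raw_name.lower())


# Hypothetical list of observed variants and their normalized equivalent.
VARIANTS = {
    match_key(variant): "ABI/Inform"
    for variant in ("ABI/Inform", "abi inform", "ABI-INFORM", "Abi_Inform\r")
}


def normalize(raw_name):
    """Return the normalized title, or the stripped input if no variant matches."""
    return VARIANTS.get(match_key(raw_name), raw_name.strip())
```

Spacing, punctuation, and control characters all vanish from the key, so every variant collapses to a single lookup entry. The trade-off is that titles differing only in digits or punctuation would also collide, which may account for part of the residual error rate.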
The final file extracts used to build statistical reports comprise fields with demographic variables derived from the IP map and patron database, date and time stamps, normalized resource names, publishers, and format codes.
Reports
Within a few weeks, these processes generated a large, growing data repository, ready for input into SAS software for detailed analysis. This enabled library staff to count the number of times individuals logged on to licensed resources, with breakdowns by day, time, and location. It was possible to cross tabulate the use of specific resources by school, and to distinguish for the first time on-campus from off-campus access. Using fund information extracted from the library’s LMS, staff could identify per-login costs and begin gauging the cost efficiency of licensed products.
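The analysis itself was done in SAS; the following Python sketch only illustrates the two calculations mentioned, a cross tabulation of logins by resource and school and a per-login cost. The record layout and the figures in the usage note are hypothetical.

```python
from collections import Counter


def cross_tabulate(records):
    """Count logins for each (resource, group) pair."""
    return Counter((r["resource"], r["group"]) for r in records)


def logins_per_resource(records):
    """Count logins for each resource across all groups."""
    return Counter(r["resource"] for r in records)


def cost_per_login(annual_cost, logins):
    """Return cost divided by logins, or None when there were no logins."""
    return annual_cost / logins if logins else None
```

For example, a product costing 1,200 per year with 300 recorded logins works out to 4.00 per login under this calculation.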
The repository was partitioned into monthly reports, each produced in HTML format and in a delimited text form for downloading. The entire data set was then mounted on a library server for easy access by staff. By the end of this phase, the time required to extract, process, and distribute data had been reduced from weeks to hours, and the resolution of library statistics had improved dramatically.
Expanding Functions
The data farm operation quickly began developing Perl modules to extract and parse data from a variety of transaction files and to spawn statistics for mounting on the Web. These additional transaction data cover the use of both digital and physical resources. Since 2000, the data farm has been producing anonymized monthly reports following the procedures previously described, or refined versions of them. Demographic information (school, in some cases department, or administrative center) and, wherever possible, the service location are standard features of each report. A list of reports appears in figure 1.
| Report | Metrics |
|---|---|
| Database & e-journal use (non-proxy) | Logins sorted by title and frequency (in HTML and text format) |
| Database & e-journal use (proxy) | Logins sorted by title and frequency (in HTML and text format) |
| OPAC activity | No. of known item searches, browses, redirects, help use, sessions, page hits, frequency of search parameters, searches by IP domain |
| Circulation (book checkouts) | Checkouts by status, by subject class and status, by subject class and IP domain |
| Photocopier/printer use | Pages copied and printed by library location (down to the floor) and IP domain |
| E-book use (Oxford & Cambridge / Mellon-funded Project) | Titles viewed, titles downloaded by IP domain |
| Gate counts (based on ID card swipes) | Count of visitors cross tabulated by visitor status and school. Cross tabulations of visits by day and hour |
| Image collection use | Frequency of image views, search types, views by IP domain. Summary data plus report builder |
| Browser analysis | Frequency of unique IP addresses hitting the main Web site and proxy server by browser type and version |
| Consultative & instructional session | Frequency of consultative & instructional sessions systemwide, with demographic breakdowns, mode of session (e.g., in-person, chat), measures of session length and length of prep time, locations. In summary and session-by-session form. |
| Annual data collection | Metrics used by ARL and IPEDS statistical reports |
| Visits to www.library.upenn.edu* | Frequency of unique hosts hitting the library Web |
| Online books page use* | Pages served |
| Blackboard courseware use* | Pages served |

*These reports are generated by WebTrends software and are the only uses of a third-party application in the data farm.

Figure 1. Examples of Data Farm Reports
From Static to Dynamic/Integrative Measurement
Until recently, data-farm reports appeared solely as tables in static HTML format and as delimited text files downloadable from the Web. In January 2003, however, library staff began logging transactions, both asynchronously and in real time, into Oracle tables, thus constructing a set of queryable databases of usage information. The databases continue to produce summary statistics in static monthly snapshots, but dynamic report-building capabilities now supplement the automated summaries. The report builders offer staff the option of generating more granular and focused outputs than the summaries provide. Report-building functions are based on a set of templated SQL (Structured Query Language) queries that are made against the underlying Oracle structures through Web forms. Thus, staff can build reports in the data farm without having to know SQL programming. So far, the data farm uses Oracle databases to generate measures for the library’s five image collections and to track reference/instructional activity. A third Oracle-based application, presently in development, will enable staff to generate reports detailing the use of all electronic resources originating from the library Web, whether licensed or freely available.
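A templated query of this kind might be sketched as follows. The data farm's forms are served by PHP against Oracle; this example uses Python's built-in `sqlite3` module as a stand-in, and the table name, column names, and sample rows are hypothetical.

```python
import sqlite3

# Template with one structural slot (the grouping column) and two bound values.
REPORT_TEMPLATE = (
    "SELECT {dimension}, COUNT(*) AS uses FROM usage_events "
    "WHERE event_date BETWEEN ? AND ? "
    "GROUP BY {dimension} ORDER BY uses DESC"
)

# Form choices are limited to known columns, so no free text reaches the SQL.
ALLOWED_DIMENSIONS = {"school", "resource"}


def build_report(conn, dimension, start, end):
    """Run the templated query for one grouping column and a date range."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unsupported dimension: {dimension}")
    sql = REPORT_TEMPLATE.format(dimension=dimension)
    return conn.execute(sql, (start, end)).fetchall()


def demo_connection():
    """Create an in-memory database with a few made-up usage rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE usage_events (event_date TEXT, school TEXT, resource TEXT)"
    )
    conn.executemany(
        "INSERT INTO usage_events VALUES (?, ?, ?)",
        [
            ("2003-01-05", "School A", "Image Collection 1"),
            ("2003-01-09", "School A", "Image Collection 2"),
            ("2003-01-20", "School B", "Image Collection 1"),
            ("2003-02-02", "School B", "Image Collection 1"),
        ],
    )
    return conn
```

A form would supply only the grouping choice and the date range. The dates are passed as bound parameters and the grouping column is checked against a fixed list, which is one way to let staff vary a report without writing or injecting SQL.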
In summer 2003, Penn will migrate the data farm to a Sun-Solaris server and a dedicated Oracle installation. The goal will be to develop table space integrating all the transaction data available to the data farm, enabling staff to move from simple, descriptive statistics to more interesting forms of data mining. In addition to integration, the Oracle migration will provide a number of operational benefits, including:
- greatly reduced data storage demands;
- faster development and processing of descriptive statistics than log-mining supports;
- the enabling of links to external data sources, such as LMS, for fund information and various kinds of bibliographic variables;
- streamlined processes to anonymize log data; and
- a more robust IP and ID mapping system for demographic resolution (see note 2).
Moving beyond Transaction Data
The data farm is gradually expanding to include quantitative information generated by staff rather than by machines. An example is the Library Reference/Instructional database previously mentioned. Here a set of interactive forms provides a mechanism for librarians to record information about reference consultations and various kinds of instructional contacts. Input variables describe the type of consultation or classroom contact, the length of preparation and session time, demographic parameters, location information, and outcome indicators. The forms also contain free-text sections that librarians can use to store information about sources and search strategies they may use in future reference contacts. The database provides a downloadable text file containing staff inputs that can be opened and manipulated in Excel for individual staff reporting needs. The database also spawns summary statistics, which are updated daily, for managers to review system-wide. A similar form-based system has been constructed for collecting annual operational data, including those used in reports to the Association of Research Libraries (ARL) and the Integrated Postsecondary Education Data System (IPEDS).
In the coming year, the data farm will attempt to mount a library staff census: a human resources database describing library positions and employees. The census will provide a means of tracking changes in staff deployment and library organization over time, and it may feed content to other databases for creating directories or Web pages.
Finally, the data farm acts as a repository of knowledge derived from survey research, ad hoc statistical projects—such as the North American Title Count—and various types of program assessments conducted by staff. A longer-term goal for the data farm is to index these knowledge sources and link them through a basic search structure.
Staffing and Resources
Graduate interns from the department of computer and information science of Penn’s School of Engineering and Applied Science perform most of the Perl and SQL programming used by the data farm. In a typical year, the library employs approximately 0.5 full-time equivalent of intern assistance. The assessment, planning, and publications librarian devotes about 15 percent of his time to project planning and management. The interns work under his direction, and he reports to the director for finance and administration.
In addition to Perl and SQL, the library has made limited use of Java and XML for certain recurring functions, such as form generation, Penn ID lookups, and IP mapping. PHP is presently the language of choice for serving forms used to query the data farm’s Oracle database.
Conclusion
The long-term future of the data-farm project is contingent on several factors. The most important is the project’s capacity to inform planning and management with useful data. Thus far, data-farm resources have figured in Web and OPAC redesign, facilities planning, and collection development considerations involving e-resource selection. Just as important, the project has provided data that have been extremely valuable in establishing the library’s credibility among its funders: the deans and the finance and provostial staff of the university.
The future of the project also hinges on library staff’s success at creating a robust set of MIS functions. This will require the further integration of data sources and the creation of functionality that enables a range of staff to quickly and intelligently pose questions to data. Dynamic linkages will have to be built between the data farm and operational systems that support fiscal activities, document delivery/interlibrary loan, cataloging, and circulation functions. Finally, data-farm capacities will have to be built in ways that do not negatively impact the library budget or staff workload. This constraint highlights the importance of building assessment capabilities into the engines of service. If machines take on the heavy lifting involved in data capture and processing, then staff can be free for analysis and knowledge creation. In large part the data farm project owes its existence to decisions that have enabled the library to re-purpose its service delivery technologies for measurement. In the end, Penn's MIS will mature only if the library continues to view measurement as an aspect of service delivery and leverages the architecture that supports service delivery to accommodate management information needs.
Notes
1. The technique described here to capture and count database and e-journal sessions has been modified over time. But whatever the mechanism, the strategy has specific limitations: It does not allow us to capture sessions originating from bookmarks, and it inhibits demographic analysis when sessions originate in computer labs or other locations where machines are pooled. It has specific benefits as well. It provides uniform and known counting methods. It can be applied to all e-resources, locally created and licensed, and it is amenable to time series analysis. We have found a high degree of positive correlation between our internally-generated counts of database and e-journal use and the measures provided by vendors. Further research on this topic should be available later in 2003.
2. Oracle logging will lessen storage requirements by an order of magnitude. A 20 MB Oracle table contains the equivalent of a month of raw Apache logs that presently consume 200 MB or more.
Joe Zucca (zucca@pobox.upenn.edu) is the Assessment, Planning, and Publications Librarian, University of Pennsylvania Library, Philadelphia.
ITAL Vol. 22, No. 4