Evaluation Methodologies for Information Management Systems

D-Lib Magazine
September 2002

Volume 8 Number 9

ISSN 1082-9873

Emile L. Morse
National Institute of Standards & Technology
100 Bureau Drive, Stop 8940
Gaithersburg, MD 20899-8940
[email protected]

Abstract

The projects developed under the auspices of the Defense Advanced Research Projects Agency (DARPA) Information Management (IM) program are innovative approaches to tackle the hard problems associated with delivering critical information in a timely fashion to decision makers. To the extent that each of the information management systems interfaces with users, these systems must undergo testing with actual humans. The DARPA IM Evaluation project has developed an evaluation methodology that can assist system developers in assessing the usability and utility of their systems. The key components of an evaluation plan are data, users, tasks and metrics.

The DARPA IM Evaluation project involved six IM project Principal Investigators (PIs) who devoted a year's effort toward developing a method for getting beyond exploring and implementing systems to actually planning and performing structured, hypothesis-based evaluations of those systems. Five IM projects participated in this effort while a sixth IM project was integrated into and evaluated within a larger effort. This article describes the component systems and their evaluation.

Introduction

The DARPA IM Program was created to

"...address the traditional, and still vexing, challenge of getting critical information to those who need it in a sufficiently timely fashion that it can contribute to the quality of the decisions they make. This problem is made more complex given the accelerating rate of scientific and technical discovery, typified by the ever-shortening time period for the doubling of information (currently estimated at 18 months). The objectives of the IM program are intended to explore and evaluate potentially radical alternatives to traditional approaches to information management." — DARPA IM Program description.

From the above program description, it is clear that exploration and evaluation are complementary activities. However, it is also a fact that system developers expend more effort on the former than the latter. Exploring new IM approaches and developing systems to express these new ideas must precede evaluation of the systems, but it is our contention that developing an evaluation plan can and should occur in parallel with system development. This article discusses the experiences of the DARPA IM Evaluation project team in designing and evaluating the systems described below.

During the development of information management systems, or any other type of complex application, most of the effort is spent on getting the system to run, incorporating novel features, and allocating resources to accomplish project goals in a timely fashion. Evaluation of these systems is often viewed as something that can be postponed until the end of the process, but all too frequently there is no time left to do the needed testing then; at other times, evaluation is not even factored into the goals of the development effort. If convenient, easy-to-use methods were available in an environment in which evaluation was being fostered, then evaluation might become an activity that serves as an end-point for development.

By providing:

flexible, well-known data collections,

profiles of user populations,

a classification scheme for IM systems,

collections of representative tasks based on system type, and

metrics for measuring effectiveness, efficiency, and satisfaction,

we envision that evaluation could become as integral to IM system development as is documentation or any other well-accepted facet of the software development cycle.

IM project PIs are experts at developing innovative systems, but they are not necessarily experts at performing usability tests or other types of evaluations. If this project succeeds in identifying and standardizing the components of a good evaluation methodology, investigators in the future will be able to:

select appropriate data sets and associated sets of tasks that can be accomplished with the data;

determine quickly what user characteristics are important when deciding on a test population;

choose metrics that have been shown to have the greatest degree of utility;

know how much time has been required in other studies to perform tests similar to the ones they are contemplating.

The availability of these resources should make the process of evaluating systems more manageable. The benefit to DARPA and other funding agencies is that those systems for which the agencies have contracted will have been evaluated to determine one or more specific benefits of using the system. Standard sets of test components have the potential to produce a win-win situation for both developer and funder. Of course, there is the potential problem of developers designing to meet the implied criteria, but if the criteria are well chosen, this will add to the quality of the project products.

To summarize, the goal of the DARPA IM Evaluation Project was not evaluation of systems. The goal was to:

document the complexities of evaluation for IM projects,

provide road maps and warnings for future evaluators, and

put the evaluation of DARPA funded IM projects on a sound basis.

The IM Component Projects

Six Principal Investigators for IM projects already in progress were recruited to participate in the DARPA IM Evaluation project. No attempt was made to choose the PIs based on particular systems; willingness to participate and interest in the topic of evaluation were the sole criteria for project participant selection. Initial brainstorming sessions were dedicated to developing a structure for the evaluation. The participants found that their projects logically fell into three categories: resource location, collaborative filtering, and sense-making. This categorization was purely ad hoc and was not constrained by any pre-existing taxonomy. After the pairings were established, one of the projects was enlisted to participate in another larger evaluation effort, leaving five groups in the evaluation project described here. Each of the three category groupings is described below along with a brief description of the component projects of each category. In the remainder of this article, the system names for the component projects and the names of the investigators will be used interchangeably (Note 1).

Resource Location

Information seeking is often viewed as a cyclical process. The first step is to identify which collections are likely to contain the answers to the user's current query. The work of French (1-3) and of Gey & Buckland (4) targets this phase of the process.

PIE (French)

"The Personalized Information Environment or PIE is a framework within which users may build and conduct highly customized searches on a distributed document collection of their own choosing. There are four driving principles behind the PIE: Customizability, Efficient and Effective Search, Controlled Sharability, and Privacy and Security.

In contrast to a typical Internet search of multiple information resources, where control of which resources are searched is in the search engine's hands, a PIE places the control in the user's hands. In the PIE formulation, descriptions of resources are made available to users who decide which resources to include in a search. The process of resource selection is highly interactive and might involve sample searches and then selection or de-selection of resources from the user's current personalized collection. Regardless of the degree of interactivity, efficient and effective search is provided within whatever context the current collection of resources defines. Since a user may spend considerable effort customizing a personal resource collection, it makes sense to allow sharing of that collection in constrained ways or using pre-defined policies while maintaining whatever privacy or security constraints might be placed on particular resources or users." (2)

Search Support for Unfamiliar Metadata Vocabulary (Gey & Buckland)

Fred Gey at the University of California at Berkeley contributed to this evaluation project using his work with Buckland on aligning metadata vocabularies. The basic idea is that different collections use different terms in their indexing schemes even if they refer to items or properties that a searcher would deem to be the same. For example, one source might provide 'car' as a keyword, while another referred to the concept as 'automobile'. In addition, the term used to index a document may be different from the term used in the underlying document. The approach taken by this project is to mine existing electronic library catalogs to create statistical mappings between vocabularies. Software modules called EVM's (Entry Vocabulary Modules) are then used to enhance search by mapping from the users' ordinary language to the metadata of the digital resource.
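The mapping idea can be illustrated with a minimal sketch. It assumes a simple co-occurrence count between words in a record's free text and the metadata terms assigned to that record; the function names, the scoring rule, and the tiny catalog are hypothetical and are not taken from the Entry Vocabulary Module software itself.

```python
from collections import Counter, defaultdict

def build_entry_vocabulary(records):
    """Count how often each free-text word co-occurs with each metadata term.

    records: iterable of (free_text, metadata_terms) pairs from a catalog.
    """
    cooccur = defaultdict(Counter)
    for text, terms in records:
        for word in set(text.lower().split()):
            for term in terms:
                cooccur[word][term] += 1
    return cooccur

def suggest_terms(cooccur, query, top_n=3):
    """Rank metadata terms by their summed relative frequency per query word."""
    scores = Counter()
    for word in query.lower().split():
        counts = cooccur.get(word)
        if not counts:
            continue  # word never seen in the catalog
        total = sum(counts.values())
        for term, n in counts.items():
            scores[term] += n / total
    return [term for term, _ in scores.most_common(top_n)]

# Hypothetical catalog records, invented for illustration only
catalog = [
    ("used car repair manual", ["Automobiles"]),
    ("car engines explained", ["Automobiles", "Engines"]),
    ("jet engines overview", ["Engines"]),
]
evm = build_entry_vocabulary(catalog)
suggestions = suggest_terms(evm, "car")
```

Here a searcher who types the ordinary word 'car' is pointed toward the catalog's own heading for the concept. A production mapping would presumably use a more robust statistical association measure than raw relative frequency.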

Collaborative Filtering

The systems developed by Kantor (5) at Rutgers and Daily & Payton (6) at HRL Laboratories depend on other information seekers to enrich the data available to a current questioner. Many applications currently being developed have this characteristic. Whether the goal is to provide subsequent searchers with the relevance ratings of prior investigators, or to put people in touch with others with similar interests, it seems advantageous to leverage prior information seeking work so that the subsequent searchers have the opportunity to explore enriched environments.

AntWorld (Kantor)

AntWorld has been described in a number of reports (7). Briefly, if a user of AntWorld searches the Web, the AntWorld system invites the user to provide judgments on pages he or she finds. The combined collection of those judgments and the text of the pages becomes a representation of the user's Quest. The AntWorld system then computes the similarity between the current user's Quest and the stored representations of previous users' Quests. In a two-step process, AntWorld finds Quests most similar to the current Quest, and then finds pages that were highly scored by the owners of those Quests. This information is integrated to provide a composite ranking of candidate pages on the Web. Using the computed similarity, the AntWorld system then permits the current user to jump directly to those pages that received the highest collective recommendations from users whose Quests were similar.
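The two-step process described above can be sketched as follows. This is an illustration under stated assumptions, not AntWorld's actual implementation: Quests are represented as bags of words, similarity is cosine similarity, and a page's composite score is the similarity-weighted sum of prior ratings. The data layout, function names, and example URLs are all hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(current_text, stored_quests, k=2):
    """Step 1: find the k most similar stored Quests.
    Step 2: rank pages by similarity-weighted ratings from those Quests."""
    current = Counter(current_text.lower().split())
    scored = [(cosine(current, Counter(q["text"].lower().split())), q) for q in stored_quests]
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    totals = Counter()
    for sim, quest in nearest:
        for url, rating in quest["judgments"].items():
            totals[url] += sim * rating
    return [url for url, score in totals.most_common() if score > 0]

# Hypothetical stored Quests, invented for illustration only
quests = [
    {"text": "anthrax detection treatment",
     "judgments": {"a.example/anthrax": 5, "b.example/vaccines": 3}},
    {"text": "jazz trumpet history",
     "judgments": {"c.example/armstrong": 5}},
]
ranking = recommend("anthrax treatment options", quests, k=1)
```

With k=1 only the anthrax Quest contributes, so its two judged pages are returned in order of their ratings.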

PackHunter (Daily & Payton)

The PackHunter collaborative tool is based on the idea that people who browse the same information spaces are likely to share common interests. Once the trails are captured, they can be analyzed to help potential collaborators find each other. In addition to Collaborator Discovery, people who already know others with whom they need to collaborate can use a feature called Collaborative Browsing (CB). CB is mediated through a visualization interface that depicts a user trail as a network of nodes. During a collaborative session, the interface will highlight current user locations on paths, mark pages for others as 'interesting', allow jumping to pages pointed out by others, and enable user paths to overlap at common pages or allow the paths to be viewed independently. Although the PackHunter system has other features, the investigators decided that during the DARPA IM Evaluation project they wanted to restrict evaluation to the Collaborator Discovery and Collaborative Browsing features.
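As a rough illustration of how captured trails might be analyzed for Collaborator Discovery, the sketch below treats each trail as a set of visited pages and scores pairs of users by Jaccard overlap. The overlap measure, the threshold, and all names are assumptions made for this example; the source does not specify the analysis PackHunter performs.

```python
def jaccard(a, b):
    """Share of pages two trails have in common, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def discover_collaborators(user, trails, threshold=0.3):
    """Return (other_user, overlap) pairs at or above the threshold, best first."""
    mine = trails[user]
    matches = [(other, jaccard(mine, trail))
               for other, trail in trails.items() if other != user]
    matches = [m for m in matches if m[1] >= threshold]
    return sorted(matches, key=lambda m: m[1], reverse=True)

# Hypothetical browsing trails, invented for illustration only
trails = {
    "ana": ["p1", "p2", "p3"],
    "ben": ["p2", "p3", "p4"],
    "cy": ["p9"],
}
candidates = discover_collaborators("ana", trails)
```

In this toy data, "ana" and "ben" share two of four distinct pages, so "ben" is proposed as a potential collaborator while "cy" is not.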

Sense-making

The fifth project, Genre, is targeted for people to use during active information seeking. Genre relies on both structured and semi-structured data collections. The category into which Genre fits was created based on its similarity to other projects from the DARPA IM Program.

Genre (Sankar)

"Genre supports situation understanding by supporting exploration of the information space that is relevant to the situation being analyzed. Genre supports exploration by helping the users relax or refine queries based on user access patterns and based on the task model. The query modulation is based on classifications and clusters that are learned by monitoring users' actions in sessions and based on query performance over the WAN. Furthermore, users can assign their own semantic categorization to these sessions. Events that happen within the context of the modulated semantics of the queries are sent to the user by the system. This mode of the system sending events to the user while the user is in the midst of a query modulation session is what we call mixed-initiative exploration." (8)

Key Issues and Implementation Decisions

In devising a structure that could be applied to all the component projects, we started off with the idea that

each of the investigators would perform an independent evaluation;

each evaluation would entail the use of human subjects;

the subjects for each evaluation would be domain experts;

the tasks would be realistic in terms of the target users of the system; and

the design of each evaluation would be based on hypothesis testing rather than on alternatives that are primarily qualitative.

In practice, however, the projects deviated significantly from most of these initial goals. For example:

While all investigators produced evaluations of their systems, some chose to work in pairs.

One pair of testers (Resource Location) devised an ingenious alternative to human subjects.

One set of human subjects did not have domain expertise; they were college students.

These and other variations from the overall plan seem not to have had a significant impact on the results of the study. The mere existence of a plan forced the investigators to develop a rationale for their modification(s), and the project group provided a forum where the changes could be debated. The following sections will present the rationale for constraining each of the factors considered in this project. Each section will also detail the modifications made by the project participants.

Experimental Designs

All experimenters were expected to develop hypotheses and test them. This precluded participants performing iterative rounds of formative usability testing. The goal was to produce summative data, i.e., data that can be described by a measure of central tendency and variation. The comparisons the teams made differed, based mainly on the system that they were testing and the features of the system the teams deemed most critical to demonstrate.
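To make "a measure of central tendency and variation" concrete, the following minimal sketch summarizes two experimental conditions and computes a test statistic for comparing them. The scores are invented for illustration, and Welch's t statistic is an assumption chosen for this example; the article does not state which statistical tests the teams applied.

```python
import math
import statistics

def summarize(scores):
    """Return sample size, mean, and sample standard deviation."""
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
    }

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a / len(a) + var_b / len(b))

# Hypothetical scores for two conditions, invented for illustration only
condition_a = [22, 25, 27, 24]
condition_b = [18, 21, 20, 19]

summary_a = summarize(condition_a)
summary_b = summarize(condition_b)
t_stat = welch_t(condition_a, condition_b)
```

A hypothesis-based design would fix the comparison and the measure before data collection, then judge the statistic against a significance threshold chosen in advance.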

Since the Resource Location groups (Gey & French) worked as a team, their hypotheses were tested together. The fundamental question they were addressing was whether augmented queries had utility for collection selection and/or document retrieval. The output from one system provided input to the other, and the ultimate results provided evidence for both information activities.

Both Collaborative Filtering groups decided to test whether their collaborative methods were superior to a similar system that did not employ collaboration. Essentially each Collaborative Filtering group developed a defeatured version of their system to use as a basis of comparison.

Finally, Genre, the Sense-making system, was compared with a system that was currently in place. The rest of the design employed actual analysts attached to various defense agencies as subjects, and the tasks were the things that the analysts did routinely. This comparison, on its face, is straightforward. However, working with busy analysts in demanding, real-world environments proved to be a high-risk challenge.

In summary, each participant developed testable hypotheses that addressed the key features of their systems. Most of the early group discussions centered on determining how to develop worthy comparisons and the effort resulted in a variety of valid experimental designs.

Subjects

It was assumed that the evaluations would be user-centered. Problems related to algorithm implementation, system efficiency, accuracy of data sources, etc., were not what needed to be tested. The desired evaluations would answer questions like: 'Can people use this system?' or 'Does the system help people do their jobs better?'. 'Better' could mean improved in terms of efficiency, productivity, satisfaction, timeliness, or any of a set of similar qualities.

Information analysts are envisioned as the users of DARPA-funded Information Management systems. An analyst's time is valuable and access to analysts was limited. The teams and individual projects solved this problem in a variety of ways. The French & Gey team resorted to an approach that required no subjects at all. They devised a testing paradigm that utilized an 'Oracle' which acted as a pseudo-subject (9). The parts of the resource selection activity performed by the system alone were tested in situ, while the parts that would normally be performed by a human user were simulated statistically. The Kantor project recruited not only professional information searchers, but also retired NSA analysts. This is the same approach used in the Text REtrieval Conference (TREC) studies (10). The Daily/Payton project used college students. Although studies using non-experts might seem to stray significantly from the goals of mimicking the target users, the investigators used the study to determine the utility of various metrics in their collaborative environment. In this case, it appeared less likely that expert/novice differences would exist. Lastly, the Genre project planned to use actual analysts. These people were made available through personal contact with a DARPA customer.

A sub-issue with regard to the use of human subjects is the requirement of obtaining Institutional Review Board (IRB) permission to perform the studies. Each of the teams using human subjects prepared the required documents and received approval. This process was foreign to several of the groups, and the existence of a project group that did have experience in preparing the documents was helpful. The experienced group provided templates that could be used to produce the submission materials for a variety of organizations.

In summary, of the five teams that developed experimental designs, two used simulated humans, one used college students, one used not only professional information searchers, but also retired analysts, and one used currently employed analysts.

Collections

One of the goals of this meta-study was to determine if a single data collection could be used for all evaluation efforts. It was hoped such a collection could form a core resource that could be used by later investigators. However, initial discussions showed that this goal would overly constrain the testing of the various systems. No single collection or Internet resource appeared to be able to suit the strengths of all of the IM projects. Therefore, the DARPA IM Evaluation project group considered alternatives that matched the needs of their particular systems. Each team kept in mind that small or tailored data sets would be less desirable than larger, more flexible ones.

The Collaborative Filtering teams had particularly interesting problems in selecting and conditioning data sets. Since both teams intended to compare the effect of prior user interaction with the data vs. no prior value-added activity, it was necessary to consider how such a data set could be created and maintained. Since each test subject would need to see the data in precisely the same state as each other test subject, static collections would have to be generated. The situation was like seeing a map of all the paths that people have taken through a landscape; during certain phases the emphasis was on who laid down footprints and where the footprints were placed, while at other times, subjects saw only the final map of tracks.

The teams solved the problem of data set/collection in the following ways. The French/Gey team used the OHSUMED (11) data set. They found advantages to using a relatively large set of collections from this medical literature. The indexing was performed by applying the MeSH (Medical Subject Headings) indexing scheme. The systems of the Collaborative Filtering group most naturally address the Web as a whole. However, the need to produce conditioned (pre-tracked) trails required the Collaborative Filtering groups to select portions of the collected documents. They prepared subsets by using a two-step process; in the first pass, one set of subjects laid down tracks and, in the second, the subjects were restricted to following those paths. The Genre project managed to gain access to the actual data used by their intended analyst subjects.

In summary, all the projects used large collections appropriate for the tasks the subjects/systems would be required to perform. In all cases, the participants felt that their systems were being overly constrained and that they could deal with significantly larger problem spaces.

Tasks

The a priori vision was that the tasks subjects/systems would be asked to perform would be realistic - realistic in terms of data set and realistic in terms of subjects/target users.

The members of the Resource Location team took advantage of the fact that the OHSUMED collection contained queries in addition to documents. Once again, it is pertinent to note the similarity with the method used in TREC (10). The Collaborative Filtering teams used two distinctly different approaches based on their prior choice of subject population and the goal of the experimental design of their projects. Kantor's design required subjects to produce summary documents similar in content to actual analyst reports. The subjects were asked to prepare a report on one of the following topics:

Anthrax: detection, prophylaxis, treatment and use as a weapon;

Terrorism on the Web: overtly terrorist sites; sites that covertly support or link to overt sites, under guise of charities; sites that seem to be endorsed by either of the other two kinds of sites; or

Development of nuclear weapons by non-governmental organizations: reports of loss of nuclear raw materials; reports on capabilities for making weapons; issues of transporting nuclear weapons to the target locations.

Daily & Payton were interested in developing valid metrics for evaluating systems. Since their subjects were students, they chose topics with which the subjects would feel comfortable (jazz/Louis Armstrong, sports/Babe Ruth, French impressionism/Edouard Manet, film/Charlie Chaplin). They prepared questionnaires that would probe their subjects' knowledge of collected material. The methods used by their subjects were enforced by the system and its interface, and would be the same even in the hands of domain experts. It seems reasonable to assume that similar results would be obtained if experts were tested in more realistic environments. Finally, the Sense-making system, Genre, employed not only real data and real analysts but also the actual tasks the analysts were called upon to perform with their current, non-Genre tool.

Results

The results of this meta-study are not the results of the individual projects but rather descriptions of the various studies that were devised and, in four of the five instances, performed. The following table shows how the five project studies were structured.

**Table 1: Summary of Evaluation Components**

| Project | Experimental Design | Subjects | Collections | Tasks | Measures |
|---|---|---|---|---|---|
| French | Retrieval with base and augmented queries | -- | OHSUMED | OHSUMED queries | Precision; merit |
| Gey | Free text vs. augmented queries | -- | OHSUMED | OHSUMED queries | Precision; merit |
| Kantor | AntWorld vs. null system | Retired intel analysts and reference librarians | Web | Prepare reports on timely topics | Subjects rate reports of others on multiple criteria |
| Daily/Payton | PackHunter with and without collaboration feature | College students | Web | Collect documents relevant to a topic area; answer questions in an open-book format | Performance on 30-question test |
| Sankar | Current system vs. Genre | U.S. Pacific Command (PACOM)/Joint Operation Planning and Execution System (JOPES) personnel | Time-Phased Force and Deployment Data (TPFDD) | TPFDD query and modification | Subject ratings of 'ease of use'; time to completion |

The experimental designs, subjects, collections and tasks have been described in the previous sections of this article. The measures shown in the final column in the table above were not controlled by the study's design but were chosen by each investigator based on other variables. The measures were selected to test the specific hypotheses and were shown to be sensitive in the comparisons that were made.

French & Gey used the classical precision metric for assessing retrieval performance. The use of merit to evaluate collection selection has been discussed previously (12).
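For readers unfamiliar with the metric, precision is the fraction of retrieved documents that are relevant. The sketch below shows the standard calculation, with an optional rank cutoff; the function name and the document identifiers are hypothetical. The merit measure is not sketched here because its definition is given in reference (12), not in this article.

```python
def precision(retrieved, relevant, k=None):
    """Fraction of retrieved documents that are relevant.

    retrieved: ranked list of document ids returned by the system.
    relevant:  collection of ids judged relevant for the query.
    k:         optional cutoff; if given, only the top k results are scored.
    """
    ranked = list(retrieved) if k is None else list(retrieved)[:k]
    if not ranked:
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for doc in ranked if doc in relevant)
    return hits / len(ranked)

# Hypothetical result list and relevance judgments, for illustration only
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d7"}

overall = precision(retrieved, relevant)
at_one = precision(retrieved, relevant, k=1)
```

In this example two of the four retrieved documents are relevant, and the top-ranked document is relevant.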

Kantor's goal was to determine if subjects using AntWorld would produce better reports than those who used the control analogue. He chose to have his subjects rate each other's reports. Another alternative would have been to find independent raters. However, the analysts who prepared the reports were undoubtedly more knowledgeable about the topics than a board of independent raters would have been.

As mentioned previously, Daily & Payton used this study as an opportunity to evaluate metrics. They asked their subjects to collect pages that the subjects believed were useful for their assigned topic. Subjects gathered materials using either PackHunter's collaborative features or the control interface that did not provide collaboration. Later, the investigators tested their subjects' knowledge of the topic by administering a 30-question test; subjects were permitted to refer to their collected materials. Subjects who used PackHunter scored higher than subjects who did not. The test results showed that the measure was sensitive enough for their comparison.

Sankar planned to use classic usability metrics — efficiency and satisfaction.

Of the six projects initially recruited, the five described here developed full-scale evaluation plans. All but the Genre project went on to perform the study detailed in the plan. The results of the Resource Location team have been published (9). A detailed description of Kantor's study is available (13). The Genre study was aborted due to the events of Sept 11, 2001. The personnel who were scheduled to take part in the study received new, high-priority orders that precluded their participation. The goal of this meta-study was to encourage and support the development of evaluation protocols. We believe that all five projects succeeded with respect to the meta-study.

Conclusions

The key observation of this meta-study is that evaluation of complex information management systems is not only possible but also feasible. Further, evaluations can be performed by people with widely divergent backgrounds in designing experimental protocols. The evidence for these observations and the following conclusions is based on information provided by the investigators of the component projects.

Why was it so easy to get to the evaluation phase? The best explanation is that the investigators were given sufficient resources to devote significant effort to evaluation activities. It is highly likely that unless pressure is brought to bear on system developers to perform serious testing, there will never be enough time or money in the budget to arrive at a final assessment of the usefulness, utility or usability of systems. Perhaps the best advice for Program Managers is that if they are truly interested in having systems evaluated, they should require a plan for system evaluation as a separately funded project stage and then require the proof to be delivered. A new balance needs to be achieved between "explore and evaluate" as stated in the quote in the Introduction.

Contributing to the success of this project were the use of a team approach and the formation of initial pairings. Collaboration within the larger group and smaller teams kept the level of discussion high. The investigators shared what they knew to the advantage of all project participants. The project environment was non-threatening, and less experienced members asked questions easily. The groups provided a forum that fostered creativity but could be tough on approving design modifications. My personal impression is that we worked in much the same way that a doctoral seminar group works — by being critical yet supportive.

The project groups made progress around obstacles that sometimes can kill the best intentions. One example was in the templates for IRB forms. The participants who were accustomed to filling out the many pages of required documentation offered samples for others to tailor. Consequently, reviews were handled expeditiously. Networking with the TREC researchers at NIST to access the retired analysts used in the Kantor study provides another example of how an obstacle was overcome.

Although the individual projects were successful, we did not discover a magic bullet that will solve all the problems in 'getting to evaluation'. Somewhat contrary to expectations, toolkits of interchangeable data sets, user profiles, study designs, task collections, and metrics were not developed. Instances of each of these are available by contacting the author and/or the individual PIs. The interdependence of the factors makes it hard to envision the performance of only 'clean' studies, i.e., designs composed of large, recent, well-defined data sets tested with highly motivated domain experts using timely, significant tasks and measured with numerous, high-quality metrics. What we know from doing this study is: it isn't necessary to perform a perfect test, and high-quality testing is within the capabilities of the research teams who develop the systems. With proper management, motivation, and support, Program Managers can ensure that effective evaluation will be a part of any project for which it is appropriate.

Acknowledgments

This work was supported by DARPA Agreement #K928. The conclusions are not necessarily those of DARPA.

The author is indebted to the investigators of the IM Projects tested in this study. My thanks to Michael Buckland, Mike Daily, Jim French, Fred Gey, Paul Kantor, Dave Payton, and Sankar Virdhagriswaran for thoughtful discussions, their boundless enthusiasm and lots of hard work.

References

[1] Personalized Information Environments, <http://www.cs.virginia.edu/~cyberia/PIE/>.

[2] Personalized Information Environments, explanation, <http://www.cs.virginia.edu/posters/pie.pdf>.

[3] J. C. French and C. L. Viles. "Personalized Information Environments: An Architecture for Customizable Access to Distributed Digital Libraries," D-Lib Magazine 5(6), June 1999, <http://www.dlib.org/dlib/june99/french/06french.html>.

[4] SIMS Metadata Research Program, <http://metadata.sims.berkeley.edu/GrantSupported/unfamiliar.html>.

[5] AntWorld Papers, <http://aplab.rutgers.edu/ant/papers/>.

[6] PackHunter, <http://www.hrl.com/TECHLABS/isl/FeaturedResearch/PackHunter/packhunter.htm>.

[7] How the AntWorld Works, <http://aplab.rutgers.edu/ant>.

[8] Genre presentation, <http://www.dyncorp-is.com/darpa/meetings/im98oct/Files/crystaliz/genre-v3.ppt>.

[9] J. C. French, A. L. Powell, F. Gey, and N. Perelman, "Exploiting A Controlled Vocabulary to Improve Collection Selection and Retrieval Effectiveness," Tenth International Conference on Information and Knowledge Management (CIKM 2001), Nov. 2001, pp. 199-206.

[10] E. M. Voorhees and D. Harman. "Overview of TREC 2001," <http://trec.nist.gov/pubs/trec10/papers/overview_10.pdf>.

[11] W. Hersh, C. Buckley, T. J. Leone, and D. Hickam. "OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research." In Proc. ACM SIGIR '94, pp 192-201, 1994.

[12] J. C. French and A. L. Powell. "Metrics for Evaluating Database Selection Techniques." World Wide Web, 3(3), 2000.

[13] P. B. Kantor, Y. Sun, and R. Rittman. "Prototype for Evaluating a Complex Collaborative Information Finding System for the World-Wide Web: Evaluation of The Antworld System Final Report," <http://scils.rutgers.edu/antspace/FinalReport/APLabTR-02-01AntWorld.doc>.

Note

[Note 1] PIE = French; Search Support for Unfamiliar Metadata Vocabulary = Gey; AntWorld = Kantor; PackHunter = Daily & Payton; Genre = Sankar.

(30 September 2002, the following corrections have been made to this article: A URL for reference 3 was added, and in reference 12, the first initial (J) was added to J.C. French's name.)

DOI: 10.1045/september2002-morse
