Arms: D-Lib Metrics Position Paper

Replication of Results
and the Need for Test Suites

William Y. Arms
CNRI
[email protected]

Preliminary Draft
January 2, 1998

The overall objective

The current phase of digital library research is highly empirical. A researcher who is developing a new concept implements software that incorporates the concept, demonstrates it with some trial set of data, reports observations on the results, and encourages others to build on the work. This is an effective method of working during the early stages of an experimental field, but as the field matures, we need a more systematic methodology.

For example, three of the current DLI projects are doing work in image recognition. Each is tackling a different aspect of the same problem: searching collections for images that match specific criteria. However, the three projects are applying their work to different applications and testing it with different data. Because there is no common data or common measure, any comparison of the three approaches is highly subjective.

There are two closely related needs:

Replication of results: It should be possible for other researchers to repeat experiments, with different data and different implementations, and to replicate the basic results.
Measurements: Results should be evaluated against relevant, repeatable criteria, so that the strengths and weaknesses of alternative approaches can be compared and improvements measured.
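
As an illustration of what a repeatable criterion might look like, the following minimal sketch scores two hypothetical image search approaches against the same set of relevance judgments. Precision and recall are used here only as familiar examples of such criteria; this paper does not prescribe them. The function name, the image identifiers, and the result sets are all invented for the illustration.

```python
from typing import Dict, Set


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Dict[str, float]:
    """Score one search result against a shared set of relevance judgments."""
    hits = len(retrieved & relevant)
    # Guard against empty sets so the measure is always defined.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical shared test data: images judged relevant to one query.
relevant = {"img01", "img02", "img03", "img04"}

# Hypothetical results returned by two different approaches for that query.
results_a = {"img01", "img02", "img09"}
results_b = {"img01", "img02", "img03", "img07", "img08"}

score_a = precision_recall(results_a, relevant)
score_b = precision_recall(results_b, relevant)

print("Approach A:", score_a)
print("Approach B:", score_b)
```

Because both approaches are scored against the same data and the same judgments, another researcher can rerun the comparison and should obtain the same figures, which is the property that the two needs above call for.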

The need for test suites

I hope that the D-Lib Metrics working group will help to develop ways of measuring the effectiveness of various aspects of digital library research. The next requirement is standard test data that researchers can use to evaluate their work.

I envisage a test suite that consists of a group of standard sets of test data that represent the major categories of material in digital libraries. The requirements for the test suite are demanding:

Each data set needs to be quite large for worthwhile measurements to be made.
The data sets must be kept on-line for measurements that involve humans in the loop or other interactions. This requires careful selection of interfaces and permissions from rights holders.
There must be data sets that represent the wide range of formats and genres of material, for example SGML journals, photographs, maps, video clips, and HTTP Web sites.
For experiments in distributed digital libraries it is highly desirable to have independent data sets covering similar formats and subjects.
Changes in the data sets must be carried out in a systematic manner that provides continuity in the results.

wya
January 2, 1998
