The D-Lib Test Suite: Testbeds for Digital Libraries Research, University of Illinois, Urbana-Champaign

The University of Illinois at Urbana-Champaign:
Full-text scientific and engineering journal articles

The collections

The Illinois testbed of full-text scientific and engineering journal articles was developed as part of a Digital Library Initiative (DLI) grant awarded to the University of Illinois at Urbana-Champaign (UIUC) in 1994. Work on the Illinois DLI grant was carried out by a multi-departmental research team composed of individuals from the university's Graduate School of Library and Information Science, the University Library, the National Center for Supercomputing Applications (NCSA), and the Department of Computer Science.

The testbed is constructed from source text journal articles contributed by several professional society publishers. The Illinois DLI testbed presently consists of full-text articles in SGML format, the associated article metadata, and bit-mapped images of figures for 63 journal titles containing over 60,000 articles from six scholarly professional societies in physics and engineering. The full-text articles for the testbed have been contributed by: the American Institute of Physics (AIP), the American Physical Society (APS), the American Society of Civil Engineers (ASCE), the Institute of Electrical and Electronics Engineers Computer Society (IEEE CS), and the Institution of Electrical Engineers (IEE).

DeLIver

The testbed team has implemented a large-scale web-based testbed of full-text journal articles featuring enhanced access and display capabilities. The web-based retrieval system developed by the DLI Testbed and Evaluation teams is called DeLIver (Desktop Link to Virtual Engineering Resources). The DeLIver client, which replaced a Microsoft Windows-based custom client in use for the first two years of the project, has been in operation since September of 1997 and is being used by over 1,200 registered UIUC students and faculty, as well as designated outside researchers. Sample pages from a DeLIver search session are shown in Figure 1 below. Detailed transaction log data of user search sessions (gathered and merged from both database and web servers) are being kept and a preliminary analysis of user search patterns from some 4,200 search sessions has been performed.

The overarching focus of the DLI testbed team has been on the design, development, and evaluation of mechanisms that can provide effective access to full-text engineering and physics journal articles within an Internet environment. The primary goals of the Illinois testbed have been:

construction and testing of a multi-publisher SGML-based full-text testbed employing flexible search and rendering capabilities and offering rich links to internal and external resources;
integration of the testbed (and full-text in general) as a resource for users into the continuum of information resources offered to end-users by the Library system;
determination of the efficacy of full-text article searching vis-à-vis document surrogate searching, and exploration of end-user full-text searching behavior in an attempt to identify user-searching needs;
identification of models for effective publishing and retrieval of full-text articles within an Internet environment and employment of these models in the testbed design and development.

Retrieval capabilities

To support effective retrieval in the testbed, the Illinois DLI Testbed and Evaluation teams have also carried out studies of end-user searching behavior in an attempt to identify user-searching needs. One requirement specified by the testbed team from the outset of the project has been that the testbed, as a resource for users, must be integrated into the continuum of information resources offered by the Library system. This has been accomplished primarily in two ways: by making the testbed a search option within the Library public terminal top-level menu; and by linking testbed full-text records from the short entry displays within the Ovid Compendex and Inspec periodical index databases. Additional simultaneous search mechanisms are being explored.

The cornerstones of the testbed, in terms of retrieval capabilities, are the effective utilization of the article content and structure revealed by SGML and the production of the associated article-level metadata, which serves to normalize the heterogeneous SGML and provide short-entry display capability. The metadata also contains links to internal and external data in the form of forward and backward links to other testbed articles and links to A & I Service databases (particularly Ovid INSPEC and Compendex databases) and other full-text repositories, such as American Institute of Physics, the American Physical Society, and Elsevier. The metadata and index files, which contain pointers to the full-text data, can be stored independently of and separately from the full-text.

Current research

The testbed team is in the process of converting the SGML publisher data into well-formed XML (eXtensible Markup Language). The XML data can then be rendered natively in a web browser and/or converted to HTML to be rendered using emerging web technologies such as Cascading Style Sheets (CSS) and Dynamic HTML (DHTML). It is clear that a rich markup format such as XML, which is a simplified subset of SGML, will become the language of open document systems, to be used in web environments for document representation and delivery. XML and SGML permit documents to be treated as objects to be viewed, manipulated, and output. The major strength of these markup languages, in terms of their retrieval capabilities, is their ability to reveal the deep content and structure of a document. While SGML and XML are becoming ubiquitous in the publishing world, they are still, for the most part, being generated by publishers as a byproduct for archiving, rather than serving as an integral part of the production process.

The Document Type Definition (DTD) accompanying an individual publisher's SGML is the instrument that actually specifies the semantics and syntax of the tags to be used in the document markup. The DTD also specifies the rules that describe the manner in which the SGML tags may be applied to the documents. One of the major roadblocks in the successful deployment of the testbed has been the processing involved with the heterogeneous DTDs of the publishers. In the process of creating a viable testbed, the Illinois testbed team developed a number of techniques to address problems and normalize SGML processing, indexing, storage, retrieval, and rendering.
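
The kind of cross-DTD normalization described above can be illustrated with a minimal sketch. It is not the testbed's actual processing: the publisher names, element paths, and sample articles below are all hypothetical, and the sample markup is written as XML because Python's standard library parses XML rather than SGML. The idea shown is only that a per-publisher mapping table lets differently tagged articles yield metadata records with the same fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings from a normalized field name to the element path
# used in each publisher's markup. Real publisher DTDs differ from these.
FIELD_PATHS = {
    "publisher_a": {
        "title": "front/title",
        "author": "front/authgrp/author",
        "abstract": "front/abstract",
    },
    "publisher_b": {
        "title": "header/atl",
        "author": "header/aug/au",
        "abstract": "header/abs",
    },
}


def element_text(element):
    """Return the element's text content with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def normalize(publisher, xml_text):
    """Extract a publisher-independent metadata record from one article."""
    root = ET.fromstring(xml_text)
    record = {"publisher": publisher}
    for field, path in FIELD_PATHS[publisher].items():
        values = [element_text(el) for el in root.findall(path)]
        if field == "author":
            # Authors are repeatable, so keep every match.
            record[field] = values
        else:
            # Other fields are treated as single-valued in this sketch.
            record[field] = values[0] if values else ""
    return record


# Two invented articles that tag the same information differently.
article_a = (
    "<article><front><title>Thin Films</title>"
    "<authgrp><author>A. Smith</author><author>B. Jones</author></authgrp>"
    "<abstract>We study films.</abstract></front></article>"
)
article_b = (
    "<art><header><atl>Bridge Loads</atl>"
    "<aug><au>C. Lee</au></aug>"
    "<abs>Loads on bridges.</abs></header></art>"
)

record_a = normalize("publisher_a", article_a)
record_b = normalize("publisher_b", article_b)
```

Both records end up with the same keys, so indexing and short-entry display code can be written once against the normalized form rather than once per DTD.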

The testbed team has also studied the issues connected with the proper rendering of mathematics in web-based scientific and engineering articles. The team has explored several techniques for properly rendering mathematics, and this will continue to be a major issue connected with scientific publishing.

An important concern of the testbed group has been in exploring effective retrieval models for a web-based electronic journal publishing system. The retrieval and display of full-text journal literature in an Internet environment poses a number of issues for both publishers and libraries. It has now become commonplace for both major and small-scale publishers to provide Internet (web-based) access to their publications, particularly journal issues and articles. For libraries and information providers, support for the online journal environment necessitates changes in collection policy, user access mechanisms, equipment provision, etc.

The testbed team has been examining the issues involved in the switch from a print-based journal environment to the Internet-based model, with a special eye toward providing retrieval mechanisms to optimize user access to full-text journals. To support this, the testbed team has proposed a distributed repository model that "federates" or connects the individual publisher repositories of full-text documents. In the DLI testbed model, these distributed repositories are federated by the extraction of normalized metadata, index, and link data from the heterogeneous full-text of the different publishers. This model addresses the challenge of providing standardized and consistent search capabilities across these distributed and disparate repositories.

The testbed team has succeeded in demonstrating the efficacy of the distributed repository model by producing cross-DTD metadata, providing parallel database querying and distributed retrieval techniques across a distinguished subset of the full-text repositories, and by setting up and employing an off-site repository at the site of a publisher.
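
A rough sketch of parallel querying across federated repositories follows. It assumes each repository exposes normalized metadata that can be searched by title; the repository names, records, and functions are hypothetical stand-ins held in memory, not the testbed's actual databases or protocols.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for publisher repositories. Each holds
# normalized metadata records; the full text would stay with the publisher.
REPOSITORIES = {
    "repo_a": [
        {"id": "a1", "title": "Thin film deposition", "url": "repo_a/a1"},
        {"id": "a2", "title": "Laser cooling of atoms", "url": "repo_a/a2"},
    ],
    "repo_b": [
        {"id": "b1", "title": "Thin shell structures", "url": "repo_b/b1"},
    ],
}


def search_repository(name, term):
    """Query one repository's metadata and tag each hit with its source."""
    term = term.lower()
    return [
        dict(record, source=name)
        for record in REPOSITORIES[name]
        if term in record["title"].lower()
    ]


def federated_search(term):
    """Send the same query to every repository in parallel and merge hits."""
    names = sorted(REPOSITORIES)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        # map() yields results in the order of `names`, so output is stable.
        per_repository = pool.map(lambda n: search_repository(n, term), names)
        return [hit for hits in per_repository for hit in hits]


hits = federated_search("thin")
```

Because every repository returns records in the same normalized shape, the merged result list can be displayed uniformly, with each hit still pointing back to the repository that holds the full text.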

Particularly relevant for any electronic journal publishing and retrieval model is the prominent role being played by the professional societies and commercial publishers. Electronic publishing of scientific articles is coalescing around the current professional society and commercial publisher model that dominates today's print-centric world. The testbed team's distributed repository retrieval model provides a mechanism for retrieval across subsets of the full-text publisher repositories without the requirement of going to each publisher site to perform individual searches.

The testbed team has recently converted the testbed metadata into an RDF (Resource Description Framework) XML format as a preliminary step to studying compatibility issues with the metadata structure of the Dublin Core.
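
As an illustration of what article metadata expressed in RDF/XML with Dublin Core elements might look like, the sketch below serializes one record. The testbed's actual RDF schema is not described here, so the record fields, the sample values, and the choice of Dublin Core elements are assumptions; the namespace URIs are the standard RDF syntax and Dublin Core element set namespaces.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Ask ElementTree to use the conventional prefixes when serializing.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def to_rdf(record):
    """Serialize one metadata record (a hypothetical dict) as RDF/XML."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": record["url"]},
    )
    ET.SubElement(description, f"{{{DC_NS}}}title").text = record["title"]
    for author in record["authors"]:
        # One dc:creator element per author.
        ET.SubElement(description, f"{{{DC_NS}}}creator").text = author
    ET.SubElement(description, f"{{{DC_NS}}}publisher").text = record["publisher"]
    return ET.tostring(root, encoding="unicode")


# An invented record; the URL is a placeholder, not a real article address.
sample_record = {
    "url": "http://example.org/articles/a1",
    "title": "Thin Films",
    "authors": ["A. Smith", "B. Jones"],
    "publisher": "Example Society",
}

rdf_xml = to_rdf(sample_record)
```

Mapping local fields onto a shared element set in this way is what would make compatibility with other Dublin Core based collections something that can be tested field by field.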

In the next three years, with DARPA support, the testbed team expects to continue work on issues connected with full-text article indexing, retrieval, and rendering. In addition, a Collaborating Publishing Partners program has been instituted to provide additional support for the testbed. The testbed team is looking forward to collaborating with research partners on exploring extended testbed functionality. Testbed team members expect to focus additional work in the following areas:

further implementing the distributed repository model by connecting the Illinois testbed to other (off-site) publisher full-text repositories;
adding additional full-text journals to the testbed;
providing interoperability between the D-Lib suite repositories by converting their metadata to a compatible RDF format and providing a cross-repository metadata retrieval system;
exploring the use of Digital Object Identifiers (DOIs) and other standard journal article identifiers;
incorporating enhanced dynamic linking;
simultaneous searching and linking of A & I Service databases, collaborating partner databases, online catalogs, and full-text repositories;
using Cascading Style Sheets (CSS-I and CSS-II), Dynamic HTML, and distributed font sets in article rendering;
further exploring the use of XML in article publishing;
using the Mathematical Markup Language (MathML) specification within XML;
migrating retrieval technologies developed for the custom client to the web client, e.g., dynamic retrieval mechanisms such as author and term word wheels and dictionaries within a web browser;
expanding the provision of simultaneous searching of remote information sources, using Z39.50 and web-based technologies;
expanding user studies;
adding advanced user navigation and gateway techniques, including the work on co-occurrence matrices and automatic construction of search concept spaces.
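
The last item mentions co-occurrence matrices. As a minimal sketch of the general idea, and not of the testbed's planned implementation, the code below counts how often pairs of index terms appear in the same document and uses those counts to suggest related search terms. The documents and terms are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(documents):
    """Count, for each unordered term pair, the documents containing both."""
    counts = Counter()
    for terms in documents:
        # Deduplicate and sort so each pair is counted once per document
        # and always stored in the same (alphabetical) order.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts


def related_terms(counts, term):
    """List terms that co-occur with `term`, most frequent first."""
    related = Counter()
    for (first, second), n in counts.items():
        if first == term:
            related[second] += n
        elif second == term:
            related[first] += n
    return [other for other, _ in related.most_common()]


# Invented index terms for three documents.
documents = [
    ["laser", "cooling", "atoms"],
    ["laser", "optics"],
    ["laser", "cooling"],
]

counts = cooccurrence(documents)
suggestions = related_terms(counts, "laser")
```

A concept space built this way could offer a searcher who enters one term a ranked list of terms that tend to appear alongside it in the collection.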

Further information

For general information about this testbed, see the web site: http://dli.grainger.uiuc.edu/.

Researchers with a serious interest in using the testbed should contact: Timothy Cole, [email protected].
