Access and Discovery

Issues and Choices in the Design of DIFWICS

Jeremy Hylton
Corporation for National Research Initiatives
[email protected]

D-Lib Magazine, March 1996

ISSN 1082-9873

Introduction: Finding Papers on the Web

The World Wide Web is increasingly used as a vehicle for scholarly communication and, in particular, many authors make pre-prints and papers available from their Web pages. I came to realize how widespread this practice was in computer science when the preliminary program for the prestigious ACM Symposium on Operating Systems Principles was announced last summer: the conference's Web page [url 1] had pointers to only two of the 24 accepted papers. But using the Lycos search engine, I found many authors' home pages and there found links to almost all of the papers. According to the conference chair, I was not alone: 75 percent of the conference attendees had read at least one paper before the conference.

My experience is becoming typical: By the time a conference's accepted paper list or a journal's upcoming titles are announced, many of those papers are freely available. The strategy I used to find SOSP papers works well enough for a known item search, where I can be fairly certain that I have found the item I was looking for. But it is not particularly useful for more general resource discovery, where I have less clearly defined goals, e.g. finding all the papers on a particular subject or presented at a particular conference series.

In this article, I compare several systems that have attempted to bring greater organization to the on-line computer science literature. I focus on a system I designed, DIFWICS, and describe some of the design decisions and how they differ from those made in other systems.

DIFWICS: Automating the Discovery Process

The Digital Index for Works in Computer Science (DIFWICS) [url 2] is a prototype system that integrates bibliographic information, in multiple formats and from multiple sources, to provide a single, centralized index for the computer science literature. (The design and implementation of DIFWICS is described in my master's thesis [4].)

Part of the motivation for DIFWICS was to put to better use the great quantity of bibliographic data produced by the readers and writers of the computer science community. For many years, these researchers have used tools like BibTeX and refer to organize citation lists for papers. The tools produce simple, flexible, machine-readable bibliographic records that can be used to construct citation lists for publication. (A sample BibTeX record is shown below. For a full description of the format see Lamport's book on LaTeX [6].)

     
@TechReport{hylton96,     
  author =      "Jeremy A. Hylton",     
  title =       "Identifying and Merging Related Bibliographic Records",     
  institution = "M.I.T. Lab for Computer Science",     
  year =        1996,     
  number =      "MIT/LCS/TR-678",     
  address =     "Cambridge, MA",     
  note =        "Master of Engineering Thesis, E.E.C.S. Department",     
  url =         "http://ltt-www.lcs.mit.edu/ltt-www/People/jeremy/thesis/",     
  abstract =    "Bibliographic records freely available on the Internet..."     
}

Figure: An example of a BibTeX record

The collective effort of the computer science community has produced hundreds of thousands of bibliographic records. The records are of mixed quality -- typos and incorrect citations are not uncommon -- and there is substantial overlap -- a frequently-cited paper may appear 10 times. Nevertheless the records have tremendous value as a source of information for a discovery service. The author and title fields contain enough information to support searches for a particular author's works or works on a particular topic. Many optional fields, including abstract, keywords, and annote, provide much richer sources of information.
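To make the preceding point concrete, here is a minimal Python sketch of how the author and title fields of a simple BibTeX record could support a search. It is an illustration only, not the DIFWICS parser: the names `parse_record` and `matches` are hypothetical, and the regular expression handles only double-quoted or bare values like those in the figure, not brace-delimited values or string concatenation.

```python
import re

SAMPLE = '''@TechReport{hylton96,
  author =      "Jeremy A. Hylton",
  title =       "Identifying and Merging Related Bibliographic Records",
  institution = "M.I.T. Lab for Computer Science",
  year =        1996,
  number =      "MIT/LCS/TR-678"
}'''

# Matches `name = "quoted value"` or `name = bareword`.
# Assumption: values never contain a double quote or an equals sign.
FIELD = re.compile(r'(\w+)\s*=\s*(?:"([^"]*)"|(\w+))')


def parse_record(text):
    """Return (entry_type, cite_key, fields) for one simple BibTeX entry."""
    head = re.match(r'\s*@(\w+)\s*\{\s*([^,\s]+)\s*,', text)
    if head is None:
        raise ValueError("not a BibTeX entry")
    fields = {}
    for name, quoted, bare in FIELD.findall(text[head.end():]):
        fields[name.lower()] = quoted if quoted else bare
    return head.group(1).lower(), head.group(2), fields


def matches(fields, word):
    """True if `word` occurs in the author or title field, ignoring case."""
    haystack = fields.get("author", "") + " " + fields.get("title", "")
    return word.lower() in haystack.lower()
```

Calling `parse_record(SAMPLE)` yields the entry type, the citation key, and a dictionary of fields; `matches` then answers a one-word author or title query against that dictionary.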

DIFWICS organizes these citations, identifying duplicate and related records, and provides an automated way to turn citations into queries for Web search services and technical report indexes. One of the primary goals of the system was to eliminate the need for authors or information providers to do anything special to make their documents easily accessible on the Internet. To make a paper accessible via DIFWICS, an author need only put the paper on the Web and submit a machine-parseable bibliographic record for it (in any standard format). DIFWICS will incorporate the bib record and the Web search engines will index the page. (Although the prototype implementation requires human intervention to ingest new records, the system's design allows for automatic monitoring of mail and ftp submissions.)

In addition to its simplified model for disseminating information, DIFWICS differs from other on-line systems for finding digital documents in two substantial ways.

Unlike other systems, DIFWICS was designed to take advantage of diversity and heterogeneity in bibliographic records rather than avoid it. Duplicate records are merged into a single composite record of hopefully higher quality. (This view of diverse bibliographic material as an asset rather than a liability was introduced by Buckland et al. [2].)
It also identifies records that describe different versions of the same work, for example a technical report and a journal article, and presents them together. This feature is an example of what cataloging librarians call the collocation function -- showing what a collection has on a particular subject or by a particular author or showing the various editions and translations of a work. (This function is described better in an essay by Patrick Wilson [9], where he argues persuasively that collocation will be increasingly important for on-line catalogs.)
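The idea of merging duplicates into a composite record can be sketched in a few lines of Python. This is a deliberately naive illustration under stated assumptions, not the algorithm from the thesis [4]: records are assumed to be plain dictionaries, two records count as duplicates only if their normalized title, year, and type agree, and the longer of two competing field values is assumed to be the more complete one. The function names are hypothetical.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation so trivially different strings compare equal."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def merge_duplicates(records):
    """Merge records with the same normalized title, year, and type."""
    composites = {}
    for rec in records:
        key = (normalize(rec.get("title", "")), rec.get("year"), rec.get("type"))
        merged = composites.setdefault(key, {})
        for field, value in rec.items():
            # Assumption: the longer value is the more complete one.
            if len(str(value)) > len(str(merged.get(field, ""))):
                merged[field] = value
    return list(composites.values())
```

Because the record type is part of the key, a technical report and a journal article with the same title stay separate here; presenting such related records together is the second, distinct function described above.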

Other Systems for On-line Access to CS Literature

DIFWICS addresses a few issues in network-based resource discovery, but there are many more problems to be solved and many more people working on them. Indeed, there are several other projects working specifically on improving access to the computer science literature.

Two of the best-known services are the Networked Computer Science Technical Report Library (NCSTRL) [url 3] and the Unified Computer Science Technical Report Index (UCSTRI) [url 4].

NCSTRL (also described by Davis [3]) is a collaborative effort among several public agencies, universities, and research organizations. It seeks to provide a publicly-accessible open architecture for a network-based system of digital collections of computer science technical reports and new technologies. In its current form, NCSTRL uses the Dienst 4.0 [url 5] system developed at Cornell University.

Dienst allows users to search and view technical reports from about 40 institutions; the reports themselves are stored and indexed by the "publishing institution," but Dienst provides common interfaces for searching the distributed indexes and for viewing and down-loading reports.

UCSTRI, described by Van Heyningen [8], is a precursor to Dienst and still a valuable service in its own right. It automatically indexes technical reports stored on 185 different ftp servers, but unlike Dienst, it has only a single, centralized index.

There are also several large collections of bibliographic records on the Internet. The most comprehensive one is Alf-Christian Achilles's collection [url 6] of 450,000 BibTeX citations, which brings together 720 individual bibliographies in one place. These citations vary widely in quality: Some contain a minimal citation, while others contain an abstract, a URL, and detailed annotations. (DIFWICS incorporates a little more than half the records in this collection.)

Two other sites, David Jones's Hypertext Bibliography Project [url 7] and Michael Ley's database and logic programming bibliography [url 8], work with smaller collections but provide more elaborate interfaces organized around journal issues and conference proceedings.

Web indexes, like Alta Vista [url 9], Excite [url 10], and Lycos [url 11], are not specifically focused on computer science or library-like services, but they provide an essential service: full-text searches of millions of Web pages. Alta Vista is the largest index and has the most powerful search interface; it indexes the full text of 22 million Web pages, while the next largest service, Lycos, indexes the full text of almost 5 million pages and condensed versions of 7 million other pages.

Alta Vista's advanced queries allow you to search for strings of text, which makes it ideal for locating papers on the Web. Consider Jerome H. Saltzer's paper "Technology, Networks, and the Library of the Year 2000" [7]. The title consists entirely of common words, but an Alta Vista search for the quoted title string [url 12] returns only the paper and several citations for it.
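Turning a citation's title into such a quoted-string query is mechanical. The Python sketch below rebuilds the query URL listed as [url 12] from the title alone. The base address and the parameter names (`pg`, `what`, `fmt`, `q`) are copied from that URL; what the first three mean is not documented in this article, so treat them as opaque. The function name `title_query_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address and parameters as they appear in [url 12].
ALTAVISTA_QUERY = "http://altavista.digital.com/cgi-bin/query"


def title_query_url(title):
    """Build a search URL for the exact, double-quoted title string."""
    params = [
        ("pg", "q"),
        ("what", "web"),
        ("fmt", "."),
        ("q", '"%s"' % title),  # the quotes request a phrase match
    ]
    return ALTAVISTA_QUERY + "?" + urlencode(params)
```

The same function could be pointed at another search service by changing the base address and parameter list, which is why a citation can serve as the starting point for a query rather than storing a fixed location.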

In the long run, Web indexes may not be able to keep up with the Web's rapid growth. To build the index, Alta Vista must visit each page, index it, and examine the page for URLs of new pages to load. The network bandwidth, compute resources, and storage space required are substantial and will increase as the size of the Web does.
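The visit, index, and follow-links cycle can be sketched as a short Python loop. This is a toy model and not a description of how Alta Vista is built: `fetch` is a hypothetical caller-supplied function that returns a `(text, links)` pair for a URL, or `None` if the page cannot be loaded, and the "index" is simply a mapping from each word to the set of URLs containing it.

```python
from collections import defaultdict, deque


def crawl(fetch, start_urls):
    """Visit every reachable page once and build a word -> URLs index."""
    index = defaultdict(set)
    seen = set(start_urls)
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:  # unreachable page: nothing to index
            continue
        text, links = page
        for word in text.lower().split():
            index[word].add(url)
        for link in links:  # examine the page for new pages to load
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index
```

Every page costs one fetch plus storage for its words, so the work grows at least linearly with the number of pages, which is the scaling concern raised above.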

The Harvest research group [url 13] proposes a different paradigm for creating and sharing indexing information. It uses "gatherers," run by individual Web sites, to create indexing information and "brokers," which collect the indexing information from gatherers and provide a query interface. Bowman, et al. [1] describe Harvest in detail. A demonstration Harvest broker [url 14] was created to index computer science technical reports. Harvest addresses the problems of distribution and scale in indexing, but doesn't deal with many of the library-specific problems that DIFWICS and Dienst do.

Design Decisions Illustrated

1. Who should drive the system?

Dienst is aimed at information providers. An institution that issues technical reports can create a Dienst repository, containing individual reports and bib records. Dienst provides tools to automate that process and makes it fairly straightforward for a new site to plug into the Dienst network. UCSTRI also uses institution-provided resources to build its index.

DIFWICS is user-driven. It can incorporate bibliographic records from any source -- Achilles's collection, the records used by Dienst, or the citation list of a single paper. This model allows users of the system to contribute records that improve the scope and quality of the collection.

DIFWICS tolerates diversity and overlap in the bibliographic records. It makes few requirements of the creators of bibliographic records, while Dienst requires some specific formats and specific tools.

It's hard to say if one design will produce better results than the other. As long as writers continue to produce machine-readable bibliographic records, DIFWICS should have a steady, free supply of data.

2. Providing human-friendly abstractions

DIFWICS and Dienst each offer abstractions that give users a more intuitive interface. Dienst hides the complexity of multiple document formats behind a uniform document interface, and DIFWICS treats multiple, related bibliographic citations as examples of the same conceptual work.

One of the strongest features of Dienst is its user interface, which provides access to documents, and not to files. There is a recognition that a document can be represented in many ways -- an HTML page, a Postscript file (or several Postscript files), or a series of scanned images -- and that users do not want to know what format the document is in so much as they want to know what they can use that format for. A typical Dienst page offers a "structural overview" of a report, individual page images, and an option to down-load a printable copy of the report. There is room for improvement, though; Kass [5] describes a richer set of abstractions for browsing digital documents.

DIFWICS's focus on "works" also provides an abstraction that hides some of the complexity of the heterogeneous source records from the user. The query interface returns a list of works, where each entry in the list may consist of several documents that were published separately but share the same title and author. The interface eliminates redundancy in query results. It also identifies relationships that may improve access: If you know that a particular work was published as a journal article and issued as a technical report, you have twice as many places to look for it; if your library doesn't subscribe to the journal, you can look for the technical report.
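Grouping separately published documents into works can be sketched as follows. This Python fragment is a simplified stand-in for the real matching logic, which must cope with typos and variant citations: it assumes records are dictionaries with `title` and `author` keys, that authors are written "First Last" and joined with BibTeX's " and ", and that an exact match on normalized title and author surnames defines a work. The names `work_key` and `group_into_works` are hypothetical.

```python
import re
from collections import defaultdict


def work_key(record):
    """Key a record by its normalized title and the sorted author surnames."""
    title = " ".join(re.findall(r"[a-z0-9]+", record["title"].lower()))
    surnames = tuple(sorted(
        name.split()[-1].lower()
        for name in record["author"].split(" and ")
        if name.strip()
    ))
    return title, surnames


def group_into_works(records):
    """Return a list of works; each work is a list of the records it contains."""
    works = defaultdict(list)
    for rec in records:
        works[work_key(rec)].append(rec)
    return list(works.values())
```

A query result built from `group_into_works` lists each work once, with the technical report and the journal article side by side as alternative places to find it.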

These abstractions are complementary. A user of DIFWICS would benefit from the addition of a Dienst-like interface for browsing documents once they are located, and Dienst would benefit from identifying duplicates and related items in search results. (Although its focus on technical reports limits the number of duplicates, there are some technical reports that have been issued by more than one institution.)

Both of these abstractions are a significant improvement over the basic Harvest interface, which does create different indexing information for different kinds of documents (e.g. Postscript and text) but doesn't identify any relationships between objects. A Harvest broker that indexes several different files that represent the same technical report, as well as a paper that represents the same work, would treat each as a wholly separate document. With Dienst, the many files for the TR would be presented as a single object; with DIFWICS, all the items would be presented as a single work.

Though the Harvest system doesn't provide these abstractions, it would be possible to use Harvest brokers and gatherers as part of the underlying infrastructure for a library service that provides the kinds of abstractions used by DIFWICS and Dienst.

3. Integrating distributed and heterogeneous data

Each system has a slightly different approach to how data -- both documents and bibliographic records -- from different sites is integrated into the system and where the indexing information resides. Choices here affect how the system copes with failures (of computers, networks, programs, etc.), the semantics of searches, and how the system scales to large and decentralized collections.

In UCSTRI, the central indexing site collects bibliographic information from each of the publishers and keeps a central index of that information. Building the index is difficult because the information providers don't share any common access methods. They store their records in different formats and arrange them for access differently. The UCSTRI indexer looks for several different patterns that allow it to associate a bibliographic record with a document; it doesn't, however, make an attempt to understand the record or identify particular fields within it.

Integrating data is simpler with Dienst, because each site uses an agreed-upon format for storing bibliographic information and uses the same protocol for sharing it. Each publisher builds an index for its own reports, and Dienst provides a global search interface that allows users to query each publisher in parallel.

The primary difference between UCSTRI and Dienst, then, is that UCSTRI has a single index and Dienst has many distributed indexes. As a result, a Dienst search returns documents that match a query from all the servers that were available at the time of the search, while a UCSTRI query returns all documents that match, regardless of whether the source site is available. (Dienst actually maintains backup copies of the distributed indexes to be used when failures occur.)
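The difference in search semantics shows up clearly in a small simulation. The Python sketch below models each publisher as a dictionary with an `up` flag and a local index; it is an assumption-laden toy, not the Dienst protocol or the UCSTRI indexer, and it ignores the backup index copies mentioned above. All names (`query_site`, `distributed_search`, `build_central_index`, `centralized_search`, `Unavailable`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class Unavailable(Exception):
    """Raised when a publisher's server cannot be reached."""


def query_site(site, word):
    """Search one publisher's own index; fail if the site is down."""
    if not site["up"]:
        raise Unavailable(site["name"])
    return [doc for doc, words in site["index"].items() if word in words]


def distributed_search(sites, word):
    """Query every publisher in parallel; unreachable sites contribute nothing."""
    hits = []
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(query_site, site, word) for site in sites]
        for future in futures:
            try:
                hits.extend(future.result())
            except Unavailable:
                pass
    return sorted(hits)


def build_central_index(sites):
    """Done ahead of time: copy every publisher's records into one index."""
    central = {}
    for site in sites:
        central.update(site["index"])
    return central


def centralized_search(central, word):
    """Search the single central index; site availability no longer matters."""
    return sorted(doc for doc, words in central.items() if word in words)
```

With one of two sites down, `distributed_search` returns only the reachable site's matches, while `centralized_search` over a previously built index still returns both, at the cost of possibly pointing to a document that cannot currently be fetched.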

DIFWICS integrates information from distributed sources, but keeps a centralized collection of records for indexing and integration. The system automatically loads new records into its database, indexes them, and searches for related records among the already-indexed records. The centralized index stores all the bibliographic information, but unlike UCSTRI it does not store information about where the cited document is located. Instead, DIFWICS relies on external Web indexes to locate documents. (A DIFWICS citation can be viewed as a name, which requires a name resolution service to locate.)

Acknowledgments

This article does not necessarily reflect the position or the policy of the Corporation for National Research Initiatives (CNRI) or sponsoring parties, and no official endorsement should be inferred. The design and implementation of DIFWICS was performed while I was with the Library 2000 group at the M.I.T. Lab for Computer Science. That work was supported in part by the IBM Corporation, in part by the Digital Equipment Corporation, and in part by CNRI, using funds from the Advanced Research Projects Agency of the United States Department of Defense under grant MDA972-92-J1029.

URLs for cited Web pages

[url 1] 15th Symposium on Operating Systems Principles preliminary program
http://www.ubiq.com/SOSP/program.html
[url 2] Digital Index for Works in Computer Science
http://cslib.lcs.mit.edu/cslib/
[url 3] Networked Computer Science Technical Reports Library: A Brief Description of NCSTRL
http://www.ncstrl.org/Dienst/htdocs/Info/ncstrl.html
[url 4] Unified Computer Science Tech Report Interface
http://www.cs.indiana.edu/cstr/search
[url 5] Dienst interface for NCSTRL
http://cs-tr.cs.cornell.edu/
[url 6] The Collection of Computer Science Bibliographies
http://liinwww.ira.uka.de/bibliography/index.html
[url 7] The Hypertext Bibliography Project
http://theory.lcs.mit.edu/~dmjones/hbp/
[url 8] Databases & Logic Programming, a bibliography server
http://www.informatik.uni-trier.de/~ley/db/index.html
[url 9] Alta Vista
http://altavista.digital.com/
[url 10] Excite
http://www.excite.com/
[url 11] Lycos
http://www.lycos.com/
[url 12] Alta Vista search for "Technology, Networks, and the Library of the Year 2000"
http://altavista.digital.com/cgi-bin/query?pg=q&what=web&fmt=.&q=%22Technology%2C+Networks%2C+and+the+Library+of+the+Year+2000%22
[url 13] The Harvest Information Discovery and Access System
http://harvest.cs.colorado.edu
[url 14] Query Interface to the CS Tech Report Broker
http://harvest.cs.colorado.edu/Harvest/brokers/cstech/
Over the last few months, the broker has been available only sporadically.

References

[1] C. Mic Bowman, Peter B. Danzig, Darren R. Hardy, Udi Manber, and Michael F. Schwartz. The harvest information discovery and access system. In Proceedings of the 2nd International World Wide Web Conference. Oct. 1995.
URL: ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Harvest.Conf.ps.Z
[2] Michael K. Buckland, Mark H. Butler, Barbara A. Norhard, and Christian Plaunt. Union records and dossiers: Extended bibliographic information objects. In Navigating the Networks: Proceedings of the ASIS Mid-Year Meeting, Andersen, et al. (editors). Learned Information, Medford, NJ, 1994. pp. 42-57.
URL: http://bliss.berkeley.edu/lis-tr/mid-year94/mid-year94.txt
[3] James R. Davis. Creating a networked computer science technical report library. D-Lib Magazine, September 1995.
URL: http://www.dlib.org/dlib/september95/09davis.html
[4] Jeremy A. Hylton. Identifying and Merging Related Bibliographic Records. Master of Engineering thesis, M. I. T. Department of EECS, June, 1996 (actually completed in February, 1996). To appear as LCS Technical Report MIT/LCS/TR-678.
URL: http://ltt-www.lcs.mit.edu/ltt-www/People/jeremy/thesis/
[5] Andrew Jonathan Kass. An Interchange Standard and System for Browsing Digital Documents. Master of Engineering thesis, M. I. T. Department of EECS, May, 1995. Available as LCS Technical Report MIT/LCS/TR-653, June, 1995.
URL: http://ltt-www.lcs.mit.edu/ltt-www/People/delphi/thesis/
[6] Leslie Lamport. LaTeX: A Document Preparation System. Addison-Wesley, Reading, Mass., 1985. Second Edition (Updated for LaTeX 2e, 1993).
[7] Jerome H. Saltzer. Technology, Networks, and the Library of the Year 2000. In Future Tendencies in Computer Science, Control, and Applied Mathematics, Lecture Notes in Computer Science 653, Bensoussan and Verjus, editors. Springer-Verlag, New York, 1992, pp. 51-67. (Proceedings of the International Conference on the Occasion of the 25th Anniversary of Institut National de Recherche en Informatique et Automatique (INRIA), Paris, France, December 8-11, 1992.)
URL: http://ltt-www.lcs.mit.edu/ltt-www/Papers/inria.html
[8] Marc Van Heyningen. The unified computer science technical report index: Lessons in indexing diverse resources. In Proceedings of the 2nd International World Wide Web Conference. Oct. 1995, pp. 535-543.
URL: http://www.cs.indiana.edu/ucstri/paper/paper.html
[9] Patrick Wilson. The second objective. pages 5-16 in The Conceptual Foundations of Descriptive Cataloging. Elaine Svenonius, editor. Academic Press, San Diego, Calif., 1989.