This paper explores how resource discovery in the networked (or digital) environment presents new opportunities for the use of surrogates. Our argument has several elements. We show that resource discovery is a complex process that includes multiple phases, is iterative, and is highly dynamic. In agreement with others, we argue that resource discovery involves the manipulation of object surrogates, which are vital because of their "order-making" role. Resource discovery in the physical realm -- for example, searches in the traditional library -- has been constrained by the use of static surrogates. In the traditional library these are the card catalog and its digital analog, the OPAC. These static surrogates have been quite successful as resource discovery aids, albeit frequently with the mediation of trained reference librarians. We argue that significant improvements to resource discovery in the networked realm can be made using techniques that match surrogate semantics to the instance-specific requirements of the resource discovery process. This can be accomplished most easily through architectures, such as the Warwick Framework, that allow the association of multiple surrogates with objects, or more ambitiously through methods that construct derived, or dynamic, surrogates that respond to current resource discovery needs.
Our focus here is the character of the process as it proceeds from initiation to realization of goal. By understanding this process we will be better equipped to formulate architectures to facilitate networked resource discovery. For a broader perspective on networked information discovery and retrieval (NIDR), we refer readers to the draft of a White Paper on Networked Information Discovery and Retrieval [CNI], being prepared for the Coalition for Networked Information (CNI). This white paper, while unfinished, is the clearest exploration to date of many of the issues relevant to discovery and retrieval in a networked environment.
Resource discovery is a long-term, multi-threaded, and iterative process with complex and dynamic requirements. We can characterize it as having a number of dimensions, whose relationships range from completely orthogonal to highly interdependent. We briefly describe some of these dimensions below, with the realization that a full examination of the process is the subject of a much more in-depth study.
An interesting perspective on the role of surrogates comes from David Levy [LEVY] of Xerox Palo Alto Research Center. Levy depicts cataloging as a method for creating an illusion of order (a schema) over the chaotic information universe. In his words "...it [cataloging] is a set of practices which quite literally put a library's collections in order and provide access through a set of systematically organized surrogates...". Cataloging and the surrogates it produces allow us, as information seekers, to assume that resources have a common set of attributes, such as title, author, subject, and the like. In fact, these attributes may not actually exist, but are derived from and associated with information objects as the result of professional cataloging. We use these attributes as the basis for formulating search criteria and as a means of conceptualizing and examining the results of the searches.
As argued by Lynch, Michelson, et al., it is tempting to believe that in the age of networked information and digital libraries, the significance of surrogates to resource discovery will diminish. After all, if the full content of the object is available in digital form, why not use it as the basis for resource discovery in preference to some substitute object? They list a number of reasons why this logic is flawed, including:
The librarian, in the traditional library setting, plays a vital role in this translation from current user requirements to the static surrogates. Terry Smith [SMITH], in what he calls the "meta-information environment" of libraries, describes this as an iterative mapping of cognitive models. The cognitive model of the user is based on her background, depth of knowledge of the subject, familiarity with the library, and current information requirements (manifested in the "query" for which she would like an "answer"). The cognitive model of the library builds on the information resources currently available and the meta-information (e.g., descriptive cataloging records and indexes) that acts as surrogates for those resources and related resources outside the library. Bringing these sometimes divergent models together is the raison d'être of a good reference librarian; and, as a side effect, the librarian hopes to advance the ability of the patron to perform this mapping with less assistance in the future.
We should note before continuing that the traditional library scenario described above, and the role of the reference librarian, has been enormously successful for resource discovery. We do not suggest that the purpose of digital libraries should be to entirely replace this interaction, nor do we think that it is possible for the foreseeable future. We do recognize, however, that the traditional library model is unduly restrictive in many instances. Face-to-face interaction between patron and librarian is possible only when the library is open and the library is nearby. The proximity problem is somewhat alleviated by telephone contact and may be further reduced in the future by video-conferencing technology and attendant improvements in networking technology. More problematic is the cost of the traditional library model, both in terms of professional cataloging and of public and reference services. Ideally, we would like to continue to provide this high-cost service when necessary, but use digital library technology to provide a more lightweight solution for resource discovery. This lightweight solution may be sufficient for the sake of convenience (7-by-24 service), as an alternative for more savvy or experienced researchers, or for more informal resource discovery.
In this paper, we focus on opportunities for improving resource discovery for information in digital form; the Web and Internet being current realizations of that form. The fact is that while the quantity of resources on the Internet continues to expand at an explosive rate, there is not a commensurate advance in the tools for finding those resources. The current tool-set for networked resource discovery uses a model that has evolved little since the Archie [ARCHIE] tool for finding FTP resources. This model, known as "web-crawlers" or "web-indexers" and characterized by services such as Digital's Alta-Vista, relies on periodic global scans of the directed, cyclic graph that is web-space, using hyperlinks as the guide. The crawler uses the HTTP GET request to download each resource and then uses a variety of IR techniques to index the contents of that resource. Users then submit full-text requests to this centralized service (which may be distributed among many servers and replicated at multiple sites).
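The crawl-and-index model described above can be sketched in a few lines. This is a toy illustration, not any service's actual implementation: the in-memory `PAGES` mapping stands in for HTTP GET responses, and all URLs and page content are invented for the example.

```python
import re
from collections import defaultdict

# A toy "web": URL -> HTML content, standing in for HTTP GET responses.
# All URLs and page content here are illustrative.
PAGES = {
    "http://a.example/": '<a href="http://b.example/">maps</a> geospatial data',
    "http://b.example/": 'digital library metadata <a href="http://a.example/">home</a>',
}

def crawl_and_index(seed):
    """Traverse the link graph from `seed`, building a full-text inverted index."""
    index = defaultdict(set)          # token -> set of URLs containing it
    seen, frontier = set(), [seed]
    while frontier:
        url = frontier.pop()
        if url in seen:               # web-space is a cyclic graph; skip revisits
            continue
        seen.add(url)
        body = PAGES.get(url, "")     # stands in for an HTTP GET of the resource
        for link in re.findall(r'href="([^"]+)"', body):
            frontier.append(link)     # hyperlinks guide the traversal
        text = re.sub(r"<[^>]+>", " ", body)   # strip markup before tokenizing
        for token in text.lower().split():
            index[token].add(url)
    return index

index = crawl_and_index("http://a.example/")
```

A user query against such a service then reduces to set operations over the inverted index, e.g. `index["metadata"]` yields the URLs containing that token.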
The technology behind these indexers has four notable problems.
What distinguishes current networked resource discovery from the traditional library model discussed in the previous section is the limited, indeed almost non-existent, role of surrogates in the process. Using Kunze's two-phase model as the basis for examination, we find that these web indexing services make no use of surrogates during the location phase. The notion of a common cataloging record format and a means of associating it with HTML documents are still in the development stages. These network resource discovery tools treat resources and queries as unstructured collections of tokens. In the process of query/resource matching, attempts are made to improve precision and recall using heuristics that attempt to interpret which tokens are of greater semantic relevance. As an aid during the examination phase, these services generally construct an informal surrogate to display search hits. This surrogate generally consists of the URL, the title of the Web page, and some summary text that is derived with the aid of heuristics.
We do not doubt that there is value in resource discovery tools that operate without the aid of structured surrogates. They allow automatic indexing and location of any resource. They are suited to "needle in a haystack" resource discovery tasks. However, they cannot be viewed as the solution for networked resource discovery, but rather as a complement to more structured methods that make use of surrogates.
In the final section of this paper, which follows, we describe the potential for moving beyond the surrogate abstraction in the networked realm. Through the use of composite object architectures and processing techniques we can create a more adaptive surrogate abstraction and, as a result, more powerful resource discovery tools.
The effort to develop a method of associating descriptive metadata with HTML documents has most recently centered on PICS [W3C]. PICS was originally designed as an infrastructure for associating ratings with networked content. One targeted use of PICS was to enable parents to control what children access on the Internet. The PICS-NG [PICSWG] effort extends this infrastructure to make it possible to attach any descriptive labels, or metadata, to content.
Creating new descriptive surrogate standards for networked objects is essential, but not sufficient. We argued earlier in this paper that resource discovery is notable for its instance-specific set of requirements. In other words, no single descriptive standard is sufficient for the wide range of needs (specific to role, granularity, phase, and the like) that overlap throughout the resource discovery process. In fact, any attempt to formulate an all-purpose descriptive standard for networked objects is in danger of revisiting territory already explored by the AACR2 and MARC community.
We argue that the more useful alternative is to consider an information object and resource discovery architecture that allows more complete matching of the semantics of the resource surrogate to the current resource discovery requirements. In the remainder of this section we suggest how this might be done. We propose a relatively simple mechanism that makes use of technology for associating multiple static surrogates with networked objects. We then suggest more complex mechanisms that make use of derived surrogates.
In an effort to provide some scope to the Dublin Core effort we developed the Warwick Framework [WF], a container mechanism for associating multiple sets, or packages, of metadata with intellectual objects. The Warwick Framework concept grew out of the recognition that there are many different forms of metadata that one might associate with networked objects. The information architecture should allow each form to be created, administered, and accessed independently. Finally, it should allow sharing and distribution of individual packages associated with a container.
An important application of the Warwick Framework is to encapsulate semantically distinct metadata forms, such as terms and conditions, provenance, and administrative metadata. We focus here on the capability of the Warwick Framework to encapsulate semantically overlapping metadata packages, in particular multiple descriptive surrogates for intellectual objects. For example, using the Warwick Framework, we can associate content objects with general descriptive forms such as the Dublin Core and MARC, and domain-specific descriptive forms such as that encoded in the Content Standard for Digital Geospatial Metadata (CSDGM) [FGDC]. Each descriptive form can provide data appropriate for a relatively specific niche in the resource discovery process. The Dublin Core is appropriate for coarse-granularity, domain-independent resource discovery. The MARC record is more appropriate for the finer granularity stages of resource discovery. Finally, the CSDGM package enables fine-granularity, domain-specific resource discovery. We expect that more metadata forms and extensions of forms will develop, which will provide other targeted semantics. As a side effect, we hope that evolution and use of the Warwick Framework will provide an incentive for metadata developers and standards efforts to maintain a targeted focus.
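A Warwick Framework container holding these three descriptive forms, and the matching of a form to the current discovery need, can be sketched as follows. The package contents and the preference orderings are invented for illustration; only the niche assignments (Dublin Core for coarse-granularity discovery, MARC for finer granularity, CSDGM for domain-specific geospatial discovery) come from the discussion above.

```python
# A container associating multiple descriptive surrogates with one object.
# All field values below are invented examples, not real records.
container = {
    "dc":    {"title": "Soil Survey of Tompkins County", "subject": "soils"},
    "marc":  {"245a": "Soil Survey of Tompkins County", "650a": "Soil surveys"},
    "csdgm": {"bounding_box": (-76.7, 42.3, -76.2, 42.6), "theme": "soils"},
}

def select_surrogate(container, need):
    """Pick the package whose semantics best match the current discovery need."""
    preference = {
        "coarse":     ["dc"],                    # domain-independent discovery
        "fine":       ["marc", "dc"],            # finer-granularity stages
        "geospatial": ["csdgm", "marc", "dc"],   # domain-specific discovery
    }
    for pkg in preference[need]:
        if pkg in container:                     # fall back when a form is absent
            return pkg, container[pkg]
    raise KeyError("no suitable surrogate package in container")

pkg, surrogate = select_surrogate(container, "geospatial")
```

The fallback ordering reflects the design intent that a container need not carry every form: an object packaged only with Dublin Core still participates, at coarser granularity, in all three kinds of search.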
The mechanism for associating these multiple descriptive forms with network objects is the subject of the PICS-NG effort. We recognize, however, that no technology exists at the client side for automatically selecting among the descriptive surrogate formats. However, a simple manual solution would plausibly make a significant improvement in the current state of network resource discovery. Assume, for example, that over the near term standards like Dublin Core and PICS-NG become widely adopted. We can then envision that the corpus of networked objects on the World Wide Web evolves to a mixture of "high-integrity" objects, which are packaged with one or more descriptive surrogates, and "low-integrity" objects, which are only stand-alone content. Search service providers, such as Alta-Vista, might then add selectable options to their interfaces that allow the user to fine-tune their searches. One option might be to search only high-integrity objects; i.e., those with associated surrogates. Another option might be to search only high-integrity objects in a coarse-granularity search; i.e., use Dublin Core-type metadata as the surrogate for the search processing. This selectivity option will take some experimentation over time, but has the potential for being quite effective.
We believe, however, that the greatest potential for improvement to networked resource discovery lies in the use of dynamic, or derived, surrogates. Lynch, Michelson, et al. refer to this capability with the comment "...it is important to recognize that the networked information environment offers new opportunities to derive (by extraction or computation) a much richer and more diverse set of surrogates from networked objects than the surrogates that were typically found in the print world."
We distinguish the notion of deriving surrogates from the essentially surrogate-free resource discovery tools (e.g., Alta-Vista) described earlier. Our intention is to preserve the order-making capacity of surrogates by developing a set of logical surrogate templates that model user requirements in the different stages of resource discovery. Research in this area can proceed with detailed user behavioral studies, both in the physical and networked realm. Research of this type is being undertaken as part of the NSF/ARPA/NASA DLI Projects [BISHOP] and in other venues [PAYETTE]. As a result of these studies we can enumerate key stages in the resource discovery process and develop detailed profiles of these stages. These profiles can then be used to develop a finite number of surrogate templates. With more experience these profiles can be refined and their number increased to allow finer granularity in surrogate/requirements matching.
One way of thinking of this approach is as an extension of the Warwick Framework. The original description of the Framework, in the Lagoze, Lynch and Daniel paper [WF], presented it as a container of physically distinct metadata packages. It is entirely consistent with the semantics of the Framework to move from physical metadata packages to logical metadata packages. In fact, this is by-and-large an implementation detail hidden behind a common interface that makes a set of metadata packages, either static or derived, available to a client.
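The shift from physical to logical packages can be sketched as two package types behind one interface: the client asks a package for its metadata without knowing whether that metadata is stored or computed on demand. The class names, the toy summarizer, and the record contents are all illustrative assumptions.

```python
class StaticPackage:
    """A physically stored metadata record (e.g., a cataloged surrogate)."""
    def __init__(self, record):
        self._record = record
    def metadata(self):
        return self._record

class DerivedPackage:
    """A logical package: the surrogate is computed from content at access time."""
    def __init__(self, source_text, deriver):
        self._source, self._deriver = source_text, deriver
    def metadata(self):
        return self._deriver(self._source)

def first_sentence_summary(text):
    # Stand-in for a real summarization technique; purely illustrative.
    return {"summary": text.split(".")[0] + "."}

packages = {
    "dc": StaticPackage({"title": "On Surrogates"}),
    "summary": DerivedPackage(
        "Surrogates order the information universe. More follows.",
        first_sentence_summary),
}
# The client sees one uniform interface regardless of implementation.
views = {name: pkg.metadata() for name, pkg in packages.items()}
```

Because both package types satisfy the same `metadata()` interface, whether a given surrogate is static or derived remains, as argued above, an implementation detail hidden from the client.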
The mechanisms for deriving these logical surrogates from either source intellectual content or other surrogates are the subject of current and future research. This could be done in a variety of ways. Most simply, it might involve the extraction of structured data based on the DTD and tags within an SGML document. A more interesting possibility is to derive structure [SUMMERS] from documents that are not explicitly tagged. Finally, there is research in both the information retrieval and natural language communities to derive summary information [SALTON] from documents.
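The simplest of these mechanisms, extraction of structured data from an explicitly tagged document, can be sketched as below. The tag names, the field list, and the sample document are invented for illustration; a real implementation would be driven by the document's DTD rather than a fixed field list.

```python
import re

# An invented SGML-like document used only to illustrate tag-based extraction.
doc = """
<article>
  <title>Cataloging in the Digital Order</title>
  <author>David Levy</author>
  <abstract>Cataloging imposes order on the information universe.</abstract>
</article>
"""

def derive_surrogate(sgml, fields=("title", "author", "abstract")):
    """Extract a structured surrogate from tagged content, one field per tag."""
    surrogate = {}
    for field in fields:
        m = re.search(rf"<{field}>(.*?)</{field}>", sgml, re.S)
        if m:
            surrogate[field] = m.group(1).strip()
    return surrogate

surrogate = derive_surrogate(doc)
```

The surrogate produced here is computed entirely from the source object; nothing need be stored, which is what makes it a derived rather than a static package.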
An equally interesting research problem is client- or user-side mapping to the dynamic surrogates. This can be characterized as two problems. First, there is the issue of tracking and modeling the current user requirements, presenting those requirements to a resource discovery tool, and then matching them to the appropriate surrogate template. Issues related to this are being examined as part of the user agent research within the University of Michigan Digital Library Project [UMDL]. Second, there is the issue of providing search user interfaces that adapt to the current requirements of the resource discovery process. One interesting examination of this area is the work in the University of Maryland HCI Laboratory [DOAN]. The increasing ubiquity of Java will enable increased research in both of these areas.
[KUNZE] John Kunze, A Citation Model for Resource Discovery and Retrieval, to appear in D-Lib Magazine
[OCLC] CNI/OCLC Workshop on Metadata for Networked Images - Executive Summary, http://www.oclc.org:5046/research/dublin_core/summary.html
[AACR2] American Library Association, Anglo-American Cataloging Rules, 2nd edition.
[MARC] Library of Congress, MARC Standards, http://lcweb.loc.gov/marc/marc.html
[LEVY] David Levy, Cataloging in the Digital Order, Digital Libraries '95, http://csdl.tamu.edu/DL95/contents.html
[XXX] xxx.lanl.gov e-Print archive, http://xxx.lanl.gov
[STEFIK] Mark Stefik, Letting Loose the Light: Igniting Commerce in Electronic Publication, in Internet Dreams Archetypes, Myths, and Metaphors, MIT Press, 1996
[SMITH] Terence R. Smith, The Meta-Information Environment of Digital Libraries, D-Lib Magazine, July/August 1996, http://www.dlib.org/dlib/july96/new/07smith.html
[ARCHIE] Alan Emtage and Peter Deutsch, Archie -- an Electronic Directory Service for the Internet, USENIX Winter 1992 Technical Conference Proceedings, http://www.bunyip.com/research/papers/1992/archie-usenix.ps
[DC] Stuart L. Weibel and Carl Lagoze, An Element Set to Support Resource Discovery: The State of the Dublin Core, to appear in Journal of Digital Libraries, Draft Copy available at http://www2.cs.cornell.edu/lagoze/papers/jodl.html
[W3C] Jim Miller, Paul Resnick and David Singer, Rating Services and Rating Systems (and their Machine Readable Descriptions), Platform for Internet Content Selection Version 1.1, May 1996, http://www.w3.org/pub/WWW/PICS/services.html
[PICSWG] Bob Schloss and Eric Miller, PICS 1.x Changes to support digital libraries, a talk at PICS WG meeting, January 1997, http://www.w3.org/pub/WWW/PICS/970113/DigiLib/pics970113.htm
[WF] Carl Lagoze, Clifford A. Lynch, and Ron Daniel Jr., The Warwick Framework: A Container Architecture for Aggregating Sets of Metadata, Cornell University Technical Report TR 96-1593
[FGDC] Federal Geographic Data Committee, Content Standards for Digital Geospatial Metadata, http://geochange.er.usgs.gov/pub/tools/metadata/standard/metadata.html
[BISHOP] Ann Peterson Bishop, Working Towards an Understanding of Digital Library Use, D-Lib Magazine, October 1995, http://www.dlib.org/dlib/october95/10bishop.html
[PAYETTE] Sandra D. Payette and Oya Y. Rieger, Supporting Scholarly Inquiry: Incorporating Users in the Design of the Digital Library, to appear in Journal of Academic Libraries
[SUMMERS] Kristen Summers and Daniela Rus, Using Non-Textual Cues for Electronic Document Browsing, in Digital Libraries: Current Issues, Lecture Notes in Computer Science, Springer-Verlag 1995, http://www.cs.cornell.edu/Info/People/summers/segment.html.
[SALTON] Gerard Salton and Amit Singhal, Automatic Text Theme Generation and the Analysis of Text Structure, Cornell Computer Science Technical Report TR94-1438
[UMDL] Michael P. Wellman, Edmund H. Durfee and William P. Birmingham, The Digital Library As Community of Information Agents, to appear in IEEE Expert, June 1996, http://ai.eecs.umich.edu/people/wellman/pubs/expert96.html
[DOAN] Khoa Doan, Catherine Plaisant, and Ben Shneiderman, Query Previews in Networked Information Systems, Technical Report CAR-TR-788, University of Maryland, September 1995, ftp://ftp.cs.umd.edu/pub/papers/papers/3524/3524.ps.Z