The Dublin Core and Warwick Framework

A Review of the Literature, March 1995 - September 1997

Harold Thiele
Department of Library and Information Science
School of Information Sciences
University of Pittsburgh
Pittsburgh, Pennsylvania


The purpose of this essay is to identify and explore the dynamics of the literature associated with the Dublin Core Workshop Series. The essay opens by identifying the problems that the Dublin Core Workshop Series is addressing, the status of the Internet at the time of the first workshop, and the contributions each workshop has made to the ongoing discussion. The body of the essay describes the characteristics of the literature, highlights key documents, and identifies the major researchers. The essay closes with an evaluation of the literary trends and a consideration of future research directions. It concludes that a shift from a descriptive emphasis to a more empirical form of literature is about to take place. Future research questions are identified in three areas: satisfying searcher needs, the impact of surrogate descriptions on search engine performance, and the effectiveness of surrogate descriptions in authenticating Internet resources.


The literature associated with the Dublin Core Workshop Series is of recent origin, having started in 1995. It focuses on the development and promotion of metadata elements that facilitate the discovery of both textual and non-textual resources in a networked environment and that support heterogeneous metadata interoperability. The object is to develop a simple metadata set and associated syntax that information producers and providers will use to describe their networked resources, thereby improving the chance that those resources are discovered.

Background: Information Discovery, Search, and Retrieval

The Internet's rapid expansion following the introduction of the World Wide Web (WWW) and the Mosaic WWW client/browser in 1993 (Poulter, 1997) was not accompanied by an equally radical change in the way the net was searched. Rather, like Gopher before it, the WWW depends upon two main classes of Internet search engines: keyword and subject directory. The development of robot programs that copy the contents of webpages back to a central site for indexing is magnifying the retrieval problems, because HTML does not have mandatory resource description sections. Underscoring the magnitude of the resource description problem, Bray's (1996) November 1995 survey and quantitative description of the web identified 11,366,121 unique gopher, ftp, and http URLs with an average page size of 6500 bytes. Bray observed that page size has remained consistent since the start of the Open Text Index, but the number of pages is increasing dramatically. Using a different experimental design, Woodruff and others (1996), in their examination of 2.6 million HTML (http) documents retrieved by the Inktomi Web crawler in November 1995, reported an average size of 4400 bytes for their sample. In contrast to Bray's observations, which included the more mature gopher and ftp sites, Woodruff and others commented that the properties of the HTML documents were changing exceptionally quickly, especially in increasing page size and in the URLs' inability to persist for an extended length of time. They agreed with Bray that the number of pages is increasing dramatically, finding that the size of the Internet seemed to double between October and November 1995, going from 1.3 million to 2.6 million HTML documents.
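To put the two surveys on a common footing, the reported counts and average sizes can be multiplied out. The sketch below is only a back-of-envelope estimate: neither study is stated here to have reported a total corpus size, so the products are an assumption of this illustration, and the function name is hypothetical.

```python
# Rough corpus-size estimates from the survey figures quoted above.
# Multiplying a document count by an average size is an assumption of this
# sketch, not a calculation attributed to either study.

def corpus_bytes(documents: int, avg_bytes: int) -> int:
    """Estimated total size: number of documents times average size."""
    return documents * avg_bytes


bray_total = corpus_bytes(11_366_121, 6_500)     # Bray (1996) figures
woodruff_total = corpus_bytes(2_600_000, 4_400)  # Woodruff and others (1996)

# Growth in HTML documents reported for October to November 1995.
growth_factor = 2_600_000 / 1_300_000

print(f"Bray estimate:     {bray_total / 1e9:.1f} GB")
print(f"Woodruff estimate: {woodruff_total / 1e9:.1f} GB")
print(f"One-month growth:  x{growth_factor:.1f}")
```

Even on these rough numbers, a robot that copies every page back to a central site moves tens of gigabytes per pass, which is the scale against which a compact resource description becomes attractive.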

Growth in size and heterogeneity represents one set of challenges for designers of search systems. A second set of challenges arises from searchers' behavior. Recent studies have shown that users have difficulty in finding the resources they are seeking. Using log file analysis, Catledge and Pitkow (1995) found that users typically did not know the location of the documents they sought and used various heuristics to navigate the Internet, with the use of hyperlinks being the most popular method. They also found that users rarely cross more than two layers in a hypertext structure before returning to their entry point.

The Dublin Core Workshops

It is against this background that the Dublin Core Workshop Series has provided the catalyst and direction for the development of the literature. Employing a multidiscipline approach and focus group methodology to develop consensus, each of the workshops has contributed to the refinement of the arguments and redirection of the research as described below.

Characteristics of the Literature

In the two and one-half years since DC-1, a vast and highly varied literature has developed around the ideas and concepts emanating from this ongoing series of workshops. The literature is primarily available through online sources. Where a print source is available, there is usually an electronic counterpart. Papers published as part of conference proceedings or transactions are generally available in both electronic and print format. This emphasis on electronic publication is both an advantage and a disadvantage, because the articles are subject to updates at frequent intervals.

The most popular outlet for publishing information about the Dublin Core and Warwick Framework is D-Lib Magazine (ISSN: 1082-9873). The second most popular outlet is Ariadne (ISSN: 1361-3200). The central resource page for workshop publications and web pages is The Dublin Core Metadata Element Set Home Page. This site maintains current links to workshop homepages, resources, and products, which is very important considering the emphasis on electronic publishing associated with this research area. Additional electronic resources are located at the various Digital Library and Metadata project websites that are experimenting with the use of the Dublin Core and Warwick Framework. See Hakala (1997) for a listing of recent project reports.

The literature easily separates into several distinct clusters. The first cluster, which forms the central literary core, contains the proceedings and reports from the various Dublin Core workshops. This literature grows from the description of the basic elemental metadata set (Weibel and others, 1995) and the container architecture and syntax (Dempsey and Weibel, 1996) to grappling with extension development, and international and multidiscipline implementation (Weibel, Iannella, and Cathro, 1997). This cluster presents the findings from the various workshops, the consensus reached and areas yet unresolved, as well as the direction of future efforts. At this point in time, most of the literature from these workshops can be characterized as descriptive or broadly conceptual in nature.

The second literary cluster focuses on crosswalks, that is, on mapping the Dublin Core Metadata Elements to other metadata systems. While much of this literature consists of attempts to make simple one-to-one correlations without discussion of the problems involved, a few efforts have provided more insight. Caplan and Guenther (1996) explored the difficulties in mapping a syntax-independent format (Dublin Core) against a highly syntax-dependent format (USMARC) and concluded that for successful machine mapping either generic fields will have to be added to the MARC scheme or the Dublin Core will have to be made more complex. As part of a project to identify a minimal searchable metadata set, an extensive crosswalk matrix was prepared for the Federal Geographic Data Committee (FGDC) by the MITRE Corporation. The Dublin Core was used as the reference set for comparison with eight different metadata systems and identification of equivalent and non-equivalent metadata elements (MITRE Corporation, 1996). The Library of Congress (LOC) developed a crosswalk between the Dublin Core, MARC, and GILS that provides very specific MARC record formatting information (Library of Congress, 1997a). As part of this crosswalk, the LOC provided a detailed rationale for its decisions. Where different mappings were possible, the options were fully explored so that accurate conversions can be made as the use of the Dublin Core becomes more widespread.

A third literary cluster revolves around how key standards-making organizations are reacting to the Dublin Core and incorporating it into their standards. As with the workshops, the methodology used is consensus building. The two major standards organizations concerned with Dublin Core issues at this time are the Library of Congress and the Internet Engineering Task Force (IETF). As part of its traditional methodology of gathering information and building consensus, the Library of Congress produced a series of discussion papers dealing with the integration of the Dublin Core Metadata Elements into the MARC record format. MARBI Discussion Paper No. 88 (Library of Congress, 1995) grappled with the problem of defining a generic author field that corresponded with the Dublin Core author element. This exercise resulted in formal approval of the MARC record 720 field in 1996. MARBI Discussion Paper No. 99 (Library of Congress, 1997b) builds on this earlier experience and the new revisions to the MARC record to provide a revised crosswalk and commentary on resolving ongoing mapping difficulties.
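A crosswalk of the kind discussed above is, at its simplest, a lookup table from one element set to another. The following is a minimal sketch under stated assumptions: only the Author to 720 pairing comes from the text, the other MARC tags are placeholders chosen for illustration and are not presented as the Library of Congress mapping, and the names `DC_TO_MARC` and `crosswalk` are hypothetical.

```python
# Illustrative crosswalk from unqualified Dublin Core elements to MARC tags.
# Only Author -> 720 is taken from the text; every other tag is an assumption
# made for illustration, not the published Library of Congress crosswalk.
DC_TO_MARC = {
    "Author": "720",       # generic author field approved in 1996 (from the text)
    "Title": "245",        # assumed
    "Subject": "653",      # assumed
    "Description": "520",  # assumed
}


def crosswalk(dc_record: dict) -> tuple:
    """Map a Dublin Core record to a list of (MARC tag, value) pairs.

    Also returns the names of elements with no entry in the table, echoing
    the equivalent / non-equivalent split that a crosswalk matrix records.
    """
    mapped, unmapped = [], []
    for element, values in dc_record.items():
        tag = DC_TO_MARC.get(element)
        if tag is None:
            unmapped.append(element)
            continue
        # Each element is held as a list so that it can repeat.
        for value in values:
            mapped.append((tag, value))
    return mapped, unmapped


# Hypothetical sample record describing this essay.
record = {
    "Title": ["The Dublin Core and Warwick Framework"],
    "Author": ["Thiele, Harold"],
    "Coverage": ["March 1995 - September 1997"],
}
fields, leftovers = crosswalk(record)
print(fields)     # mapped (tag, value) pairs
print(leftovers)  # elements with no equivalent in the table
```

A one-to-one table like this is exactly what breaks down in the harder cases: it has no way to express indicators, subfields, or a choice between several candidate target fields, which is why the more careful crosswalks document their rationale for each decision.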

Two internet-drafts, the first stage in the IETF standards process, have been produced for consideration by the IETF. Building on their workshop experience and feedback from the various stakeholder communities, Weibel, Kunze, and Lagoze (1997) provide a description of the Dublin Core and its semantics. Drawing on their previous work in directory-related metadata research, Hamilton, Iannella, and Knight (1997) provide a mapping from the Dublin Core to the X.500 and [C]LDAP directory service protocols by treating the Dublin Core elements as an object class of X.500/[C]LDAP attribute/value pairs. (Both X.500 and [C]LDAP define protocols for accessing information directories. X.500 is designed to deal with all forms of telecommunication systems. [C]LDAP, while based on X.500, supports TCP/IP and was developed for Internet use (Kille, 1996; Shuh, 1997).) Two additional internet-drafts (Musella, 1997; Daviel, 1997) provide examples of how the META tag may be used in HTML documents to provide cataloging and resource identification information.
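The idea of carrying a resource description in HTML META tags can be sketched in a few lines. The `DC.element` naming pattern and the function name `dc_meta_tags` are assumptions made for this illustration; the sketch is not taken from any of the internet-drafts cited above.

```python
from html import escape


def dc_meta_tags(record: dict, prefix: str = "DC") -> str:
    """Render a Dublin Core record as HTML META tags, one tag per value.

    The NAME pattern "<prefix>.<element>" is an assumed convention used
    here for illustration only.
    """
    lines = []
    for element, values in record.items():
        for value in values:
            name = escape(f"{prefix}.{element}", quote=True)
            content = escape(value, quote=True)  # protect quotes and markup
            lines.append(f'<META NAME="{name}" CONTENT="{content}">')
    return "\n".join(lines)


# Hypothetical sample record describing this essay.
sample = {
    "title": ["The Dublin Core and Warwick Framework"],
    "author": ["Thiele, Harold"],
}
print(dc_meta_tags(sample))
```

The same element/value pairs could equally be written out as directory attributes, which is the sense in which the Dublin Core is described as independent of any one syntax.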

A fourth literary cluster is generated by ongoing Digital Library and/or Metadata projects around the world that are examining or actively incorporating the Dublin Core into their activities. The general formats used are technical reports or white papers that describe the conceptual approaches being used. At present, the few empirical results reported are generated by prototypes and testbed exercises. Miller (1996) discusses an extension to the Dublin Core that describes archaeological resources collected by the Archaeology Data Service (ADS). The expansion of the Dublin Core to describe non-textual information and the inclusion of the <LINK> tag to identify every metadata item with its reference description are key parts of this report. Godby and Miller (1997) describe a tool, the Spectrum Cataloging Markup Language (SCML), that can be used to extract information from structured records, implement extensions to the Dublin Core element set, and generate Dublin Core records.

The fifth literary cluster comprises articles and papers published in journals and conference proceedings that don't fit into the other four categories. It is in this cluster that many of the articles and studies using comparative and experimental research methodologies are reported, as well as the descriptions of smaller projects incorporating the Dublin Core. Desai (1997) compares the Dublin Core Elements List and the Semantic Header, and concludes that the Semantic Header is better suited for resource discovery on the Internet because it supports a more systematic approach to indexing information. Karttunen, Holmlund, and Nowotny (1996) describe how the Internet Pilot to Physics (TIPTOP) incorporated the Dublin Core as a critical component of this uniform and open information infrastructure for physics research and education.

Key Documents

A few documents stand out from the rest of the literary field because of the completeness of the reporting or the importance of the concepts being discussed. First, there are two excellent descriptive studies that review the many metadata formats currently available. Heery (1996a) compared five metadata formats (IAFA/Whois++, MARC, Text Encoding Initiative, Dublin Core, and Uniform Resource Characteristics) against a set of five criteria in the context of bibliographic control. She provides an excellent description of the criteria and applies them consistently in her comparison of the five formats. She also provides historical background, an example of a completed record where possible, and a detailed description under each of the criteria for each of the formats. Heery concludes that while the constituencies promoting the various metadata formats acknowledge the need for simplified records, there has been little progress towards rule simplification for content or towards consensus on the degree of semantic structural complexity required. She closes by stating that any successful metadata format will need to incorporate flexible change procedures and also have the ability to deal with legacy systems.

The second review of metadata formats is a product of the Development of a European Service for Information on Research and Education (DESIRE) project funded by the European Union. Dempsey and others (1997) examined 22 metadata formats from the point of view of metadata for information resources. The formats were distributed amongst three typological bands based on formats, standards, and other characteristics. Extensive commentary is provided for each metadata format. The object of the study was to provide background information on each of the formats so that the implications of their use could be assessed. The authors argued that the Dublin Core should remain optimized for its target use as a simple resource description format linked with the Warwick Framework that is used to aggregate metadata objects and facilitate their interchange. Based on the information in this report, Heery (1996b) included the Dublin Core as one of the four metadata formats recommended for additional investigation by the DESIRE project.

A second set of documents deals with the development of the metadata container architecture referred to as the Warwick Framework. This set of documents provides one of the few illustrations of the progression from theory to implementation in this body of literature. Kahn and Wilensky (1995) provided the theoretical structure for the development of the Warwick Framework. The concept of the distributed digital object is defined and a method for naming, identifying, and invoking the digital object is described. The concepts are presented in a very conceptual and abstract fashion, and the content-based aspects of the infrastructure are purposefully not addressed.

Lagoze, Lynch, and Daniel (1996), building on Kahn and Wilensky's theoretical work, describe the container architecture of the Warwick Framework that is to be used to aggregate logically distinct packages of metadata. The Warwick Framework has two distinct components: the container, which aggregates the metadata sets, and the metadata sets themselves, i.e., the packages. This modular architecture allows the aggregation of containers within other containers, where they are treated as packages. Among the issues to be resolved are: the semantic overlap between packages; the need for a metadata type registry; the requirement for some form of interactive container syntax; the efficiency of distributed architecture; and repository access protocols.
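The container and package relationship described above is a recursive data structure, and a small sketch may make it concrete. The class names, the `metadata_type` strings, and the methods below are hypothetical choices for this illustration and are not drawn from the Warwick Framework documents themselves.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Package:
    """A logically distinct metadata set, such as one Dublin Core record."""
    metadata_type: str  # free-text type name; a real system would need a registry
    content: dict


@dataclass
class Container:
    """Aggregates packages; a container can itself be held as a package."""
    packages: List[Union["Package", "Container"]] = field(default_factory=list)

    def add(self, item):
        """Append a package or a nested container and return self for chaining."""
        self.packages.append(item)
        return self

    def walk(self):
        """Yield every Package, descending into nested containers in order."""
        for item in self.packages:
            if isinstance(item, Container):
                yield from item.walk()
            else:
                yield item


# Hypothetical usage: a descriptive package nested one level down,
# alongside a second package of a different (assumed) type.
inner = Container().add(Package("dublin-core", {"Title": "An example"}))
outer = Container().add(inner).add(Package("terms-and-conditions", {"Access": "open"}))
print([p.metadata_type for p in outer.walk()])
```

The sketch also shows where the open issues arise: nothing here prevents two packages from describing the same property differently, and the type names are unchecked strings, which is the gap a metadata type registry would fill.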

Building on this description, Knight and Hamilton (1997a) describe implementations of the Warwick Framework using the Multipurpose Internet Mail Extensions (MIME) [Internet RFC-1522]. They conclude that MIME is suitable for the encapsulation and transport of metadata as well as data, and meets all the requirements of the Warwick Framework. MIME already has in place a large body of code and implementation experience and a central type registry, and it is being updated to allow the use of Unicode [ISO 10646] character sets.
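The fit between MIME and the container architecture can be illustrated with Python's standard `email` package, where a multipart message plays the role of a container and each body part plays the role of a package. This is a sketch under stated assumptions, not the encoding Knight and Hamilton describe: the `X-Metadata-Type` header, the plain-text package bodies, and the helper names are all hypothetical.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def make_package(metadata_type: str, body: str) -> MIMEText:
    """Wrap one metadata set as a MIME body part."""
    part = MIMEText(body, "plain", "utf-8")
    part["X-Metadata-Type"] = metadata_type  # hypothetical header, for illustration
    return part


def make_container(*parts) -> MIMEMultipart:
    """Aggregate packages (or nested containers) in a multipart entity."""
    container = MIMEMultipart("mixed")
    for part in parts:
        container.attach(part)
    return container


# A container nested inside another container, plus a second package.
inner = make_container(make_package("dublin-core", "Title: An example"))
outer = make_container(inner, make_package("terms-and-conditions", "Access: open"))

# Serialize for transport, then parse the text back into a message tree.
wire_text = outer.as_string()
parsed = message_from_string(wire_text)

for part in parsed.walk():
    if part.is_multipart():
        continue  # skip the containers themselves
    body = part.get_payload(decode=True).decode("utf-8")
    print(part["X-Metadata-Type"], "->", body)
```

Nothing beyond the standard library is needed to build, serialize, and re-read the nested structure, which reflects the point about existing code and implementation experience.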

A third set of documents addresses the Platform for Internet Content Selection (PICS) efforts to demonstrate how the Dublin Core and Warwick Framework can be integrated into this proposed Internet standard. The growing interest in the Dublin Core and Warwick Framework as a mechanism for transporting resource descriptive information is illustrated by the efforts from both the Dublin Core research group and the PICS research group to include PICS content descriptive values in the Dublin Core Metadata set. Iannella (1997) showed how the Dublin Core could be extended to accommodate the PICS extension mechanism by defining the rat-inherit and sub-label extensions.

Braun, König, and Wichmann (1997), building on their work with the PICS-SE standard, propose a slightly different approach that does not require introducing changes in the PICS syntax. This is accomplished by defining a set of PICS-SE classes for the Dublin Core by making use of the Knight and Hamilton (1997b) Dublin Core qualifiers.

Key Researchers

Most of the research and developmental work in this area is associated with a small number of researchers whose names always seem to appear whenever the Dublin Core or Warwick Framework are mentioned. The most prominent name that surfaces whenever the Dublin Core is mentioned is that of Stuart L. Weibel, a Senior Research Scientist with the Office of Research and Special Projects at OCLC. A name that often appears when the Warwick Framework or container architecture is mentioned is that of Carl Lagoze, head of the Cornell Digital Library Research Group at Cornell University. A third name that also appears quite often in the literature is that of Renato Iannella, a Senior Research Scientist at the Distributed Systems Technology Centre in Brisbane, Australia.

Future Considerations

Reflection on the current state of the literature and the general trends developing in the related research areas indicates that a crucial period is approaching. Most of the literature up to this point has been of a descriptive nature. Now that efforts are turning towards implementation of the Dublin Core and Warwick Framework, the emphasis needs to shift to a more empirical form of research. The proposed research falls into three general topics: behavioral, technical, and sociological.

On the behavioral side, research needs to be pursued through expanded user studies of how effective the Dublin Core actually is, in comparison with other metadata concepts, in satisfying searcher needs. A second question that user studies should address is how effective or efficient surrogate descriptions are in improving precision ratios for searchers retrieving resources in a very large networked environment.
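The precision ratio referred to here is the standard retrieval measure: the fraction of retrieved items that are relevant. A minimal sketch follows; the document identifiers and the two result lists are invented purely to show the calculation and do not represent any measured outcome.

```python
def precision(retrieved: list, relevant: set) -> float:
    """Fraction of retrieved items that are relevant (0.0 if nothing retrieved)."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)


# Hypothetical relevance judgments and result lists, for illustration only.
relevant_docs = {"d1", "d2", "d3"}
full_text_results = ["d1", "d7", "d8", "d2", "d9"]  # imagined full-text search
surrogate_results = ["d1", "d2", "d3", "d7"]        # imagined surrogate-based search

print(precision(full_text_results, relevant_docs))
print(precision(surrogate_results, relevant_docs))
```

A user study of the kind proposed would replace these invented lists with judged results from real searchers and compare the two ratios across many queries.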

On the technical side, research should be done into what effect surrogate descriptions like the Dublin Core have on improving cache performance in the search process and on reducing the bandwidth problems associated with the indexing of the Internet. A second technical question that needs study is whether or not surrogate descriptions like the Dublin Core favor centralized indexing search engines like AltaVista over non-centralized indexing engines like Harvest.

On the broader sociological issues, one long-range question that should be considered is whether or not this form of creator-based surrogate descriptive indexing will split the resources available on the Internet into two distinct groups. One group of resources will be associated with the traditional academic and research paradigms that will employ the surrogate descriptions. A second group of resources will be generated by individuals who are not part of the academic and research paradigm, and surrogate descriptions will not be employed. Related to this is the question of whether or not surrogate descriptions will act as authenticating or validating mechanisms for Internet resources.

In conclusion, the research in this area seems poised to switch its orientation from a predominantly descriptive approach towards a more empirical one. The implementation of the Dublin Core on a wider scale provides opportunities for new questions to be asked and additional research methodologies to be employed.


