D-Lib Magazine
February 2000

Volume 6 Number 2

ISSN 1082-9873

The Santa Fe Convention of the Open Archives Initiative

Herbert Van de Sompel
Los Alamos National Laboratory - Research Library, New Mexico, US, and
Automation Department of the Central Library of the University of Ghent, Belgium

Carl Lagoze
Department of Computer Science
Cornell University

The Open Archives initiative (OAi) promotes and encourages the development of author self-archiving solutions (also commonly called e-print systems) through the development of technical mechanisms and organizational structures to support interoperability of e-print archives. Such interoperability can stimulate the transition of e-print systems into genuine building blocks of a transformed scholarly communication model. This paper describes the Santa Fe Convention of the OAi. This is a set of relatively simple but potentially quite powerful interoperability agreements that facilitate the creation of mediator services. These services combine and process information from individual archives and offer increased functionality to support discovery, presentation and analysis of data originating from compliant archives.


In July 1999, Paul Ginsparg, Rick Luce and Herbert Van de Sompel sent out a Call for Participation (Ginsparg, Luce, and Van de Sompel 1999a) to a meeting exploring cooperation among scholarly e-print archives. The meeting, held in October 1999 in Santa Fe, and originally called the Universal Preprint Service meeting, led to the establishment of the Open Archives initiative (OAi) (Ginsparg, Luce, and Van de Sompel 1999b). The goal of the OAi is to contribute in a concrete manner to the transformation of scholarly communication. The proposed vehicle for this transformation is the definition of technical and supporting organizational aspects of an open scholarly publication framework on which both free and commercial layers can be established.

This paper describes the origins of the OAi and work heretofore in defining this framework: the Santa Fe Convention. This convention is a combination of organizational principles and technical specifications to facilitate a minimal but potentially highly functional level of interoperability among scholarly e-print archives. The convention gives data providers -- individual archives -- relatively easy-to-implement mechanisms for making information in their archives externally available. This external availability then makes it possible for service providers to build higher levels of functionality, mediator services, using the information made available from scholarly archives that adopt the convention.

The growth of e-print archives

The origins of the Open Archives initiative lie in the growing number of electronic preprint (e-print) archives. While several of these began as informal vehicles for the dissemination of preliminary results and non-peer reviewed "gray literature", a number of them have evolved into an essential medium for sharing research results among the colleagues in a field.

These archives demonstrate a shift in the traditional scholarly communication model, which has relied on formally published scholarly journals. There is a growing consensus that the scholarly journal system is facing significant challenges:

  • The explosive growth of the Internet has given scholars almost universal access to a communication medium that facilitates immediate sharing of results.
  • The rapidity of advances in most scholarly fields has made the slow turn-around of the traditional publishing model an impediment to collegial sharing.
  • The full transfer of rights from author to publisher often acts as an impediment to the scholarly author whose main concern is the widest dissemination of results.
  • The current implementation of peer-review -- an essential feature of scholarly communication -- is too rigid and sometimes acts to suppress new ideas, favor articles from prestigious institutions, and cause undue publication delays.
  • The imbalance between skyrocketing subscription prices and shrinking or, at best, stable library budgets is creating an economic crisis for research libraries.

The e-print archives exemplify a more equitable and efficient model for disseminating research results. An important challenge is to increase the impact of the e-print archives by layering on top of them services -- such as peer review -- deemed essential to scholarly communication. This is the focus of the Open Archives initiative.

An exhaustive review of existing e-print archives is out of the scope of this paper. An interesting list of initiatives is available at the Office of Scientific and Technical Information. A brief review of some of the notable efforts is illustrative of the scope of these initiatives:

  • arXiv, hosted by Los Alamos National Laboratory, is considered the premier example of e-print archives. The archive was started in 1991 by Paul Ginsparg, who is internationally recognized as one of the leaders in the area of scholarly publishing alternatives. Over the past decade, the arXiv archive has evolved towards a global repository for non-peer-reviewed research papers in a variety of physics research areas. arXiv has also incorporated mathematics, non-linear sciences and computer science.
  • CogPrints, hosted by the University of Southampton in the U.K., is modeled on arXiv and focuses mainly on papers in Psychology, Linguistics and Neuroscience.
  • NCSTRL (Networked Computer Science Technical Reference Library) is an international collection of computer science research reports. NCSTRL is based on a distributed model. Documents are stored in distributed archives and are made available through distributed services that communicate via the Dienst protocol.
  • NDLTD aims at building a digital library of electronic theses and dissertations (ETD) authored by students of member institutions. In ongoing research, NDLTD addresses issues such as the creation of a workflow to submit ETDs, the development of an XML DTD for ETDs and the support of a digital library for ETDs.
  • RePEc, an initiative in economics, also operates on a distributed model. It provides authors with the option to submit working papers to a departmental archive or -- if one does not exist -- to the EconWPA archive at Washington University. These archives support the so-called Guildford protocol that guarantees interoperability between the RePEc archives and has enabled the creation of a variety of end-user services.

There are indications that a growing number of disciplines and organizations are inspired by this pioneering work and are investigating alternative models for scholarly communication:

  • The NIH e-biomed proposal (Varmus 1999) for a more effective communication system for research reports in the life sciences demonstrates the innovative thinking inspired by initiatives like arXiv. While the PubMed Central environment (Anonymous 1999) (the system being developed as the outcome of the proposal) is more conservative than e-biomed, it remains faithful to the original desire to provide barrier-free access to primary reports in the life sciences.
  • The British Medical Journal and HighWire Press recently launched Clinical Medicine Netprints (Delamothe, et al. 1999), an e-print site for studies, research, and articles in Clinical Medicine.
  • Under the umbrella of the eScholarship project, the California Digital Library is working on University ePub (Lucier and Ober 1999), a set of disciplinary e-print servers and services whose overall aim is to lead and support innovations in the production and dissemination of scholarship. The project received one of the grants from SPARC in the context of its Scientific Communities Initiative, which called for proposals introducing alternative communication methods as a way to address the serials crisis.
  • MIT plans to build a digital repository whose public e-prints will all be available to the whole e-print community.
  • Caltech's Scholar's Forum (Buck, Flagan, and Coles 1999) describes an alternative conceptual model for scholarly communication.

From individual archives to an interoperable fabric

The aim of the archive initiatives described above is to try to create a more effective scholarly communication mechanism that addresses problems that exist in the established system. The approaches that are taken by individual archives differ in a number of ways. Some initiatives build on a centralized model, others on a distributed departmental, or by extension, institutional model. Some deal with gray (non-peer reviewed) literature only, others incorporate metadata of peer-reviewed papers or try to establish some form of peer-review outside of the established system. Some deal with metadata only, others with both metadata and full content. Yet all share the attribute of offering scholars a vehicle to conveniently and immediately disseminate research results to peers.

The reason for launching the Open Archives initiative is the belief that interoperability among archives is key to increasing their impact and establishing them as viable alternatives to the existing scholarly communication model. This conviction is expressed in the official mission statement of the initiative:

The Open Archives initiative has been set up to create a forum to discuss and solve matters of interoperability between author self-archiving solutions (also commonly referred to as e-print systems), as a way to promote their global acceptance.

Interoperability is a broad term, touching many diverse aspects of archive initiatives, including their metadata formats, their underlying architecture, their openness to the creation of third-party digital library services, their integration with the established mechanism of scholarly communication, their usability in a cross-disciplinary context, their ability to contribute to a collective metrics system for usage and citation, etc.

Interoperability among archives offers substantial benefits to the scholars that use them. An important attribute of the traditional research library as an information provider is its role as a common entry point for a variety of information resources, not necessarily divided along disciplinary or institutional boundaries. The move from physical to digital sources should not be accompanied by the breakup of this entry point into a collection of fragmented archives. An increasing number of scholars move fluidly in their research across domain boundaries; the technology for delivering digital information should facilitate rather than hinder such fluidity. Mechanisms for interoperability offer the potential for discovery tools and virtual collections (Lagoze, 1998) that extend across the contents of multiple archives. Authors also benefit from such archive spanning tools, since their works will be accessible by a wider audience.

Interoperability is also beneficial to the archive and service provider. Rather than having to provide an entire suite of services for its users, individual archives can instead establish a well-defined interface on which external providers can build enhanced services. A variety of such services can be envisioned, including those that facilitate discovery, linking, and reviewing. An intriguing and essential set of services would be those that provide metrics to assist in the evaluation of the impact of certain scholarship and aid in tenure review and promotion decisions.

The Santa Fe Convention of the OAi represents a pragmatic, incremental, and collaborative approach towards interoperability. The initiators of the Open Archives initiative hope that this practical approach will be a catalyst for significant changes in the mechanisms for scholarly communication. The need for such change has been the subject of numerous papers, workshops, and Internet discussion groups. Yet, the existing system has proven somewhat resistant to change, no doubt due to the complex socio-political and economic forces that support it. For example, the current system of academic promotion and tenure is closely linked to the traditional journal system (Wilson 1942). This acts as an important factor sustaining the existing communication model (Schauder 1994). Understandably, scholars are hesitant to support alternative models that are not yet linked to their evaluation and promotion. While such issues will continue to support the current system, the development of practical technical and organizational solutions, such as the Santa Fe Convention, builds a framework for changes that will inevitably occur and may encourage the implementation of those changes.

Agreeing on interoperability: the Santa Fe meeting of the Open Archives initiative

A successful first meeting of the initiative was held on October 21-22, 1999, in Santa Fe, New Mexico. The meeting was sponsored by the Council on Library and Information Resources (CLIR), the Digital Library Federation (DLF), the Scholarly Publishing & Academic Resources Coalition (SPARC), the Association of Research Libraries (ARL) and the Los Alamos National Laboratory (LANL). The participants were computer scientists and digital librarians. There were also representatives of existing and emerging e-print systems, of scholarly publishers and of the sponsors. All but one of the invited institutions sent a representative. This was considered to be a firm indication of the perceived importance of the initiative.

The central theme of the first meeting was the establishment of recommendations and mechanisms to facilitate cross-archive value-added services. Such services could combine information derived from cooperating archives, process that information to produce some value-added information, and make that enhanced information available to users, agents, or other services. Examples of such services include cross-archive search engines, current awareness services, linking systems, and peer-review services.

Achieving progress on this goal required agreement among the participants on the issue of interoperability. Although interoperability has been a watchword for a variety of efforts in digital libraries and networked information (Paepcke, Chang, et al. 1998), the actual meaning of it and the implementation thereof has often proven elusive. Like many meetings intended to reach agreement on standards, attendees at the Santa Fe meeting arrived with a variety of pre-conceived notions on what was required to reach interoperability. It is instructive to review how these differing notions converged into a well-defined agreement that provides the foundation for cross-archive exchange of information.

The meeting began with a rather expansive example of interoperability, illustrated through the UPS Prototype project coordinated by Herbert Van de Sompel, Thomas Krichel, and Michael Nelson. This project and its results are described at length in the companion paper (Van de Sompel, Krichel, Nelson, et al. 2000). Briefly summarized, the prototype demonstrated the integrated operation of a variety of services operating over data originating from a set of archives. Each of those services provided a reasonably rich level of functionality (implemented through a set of protocols).

There was general agreement among the participants at the meeting that the Prototype was an extremely useful demonstration of potential. There was also agreement, however, that trying to reach consensus on the full functionality of the Prototype was "aiming too high" and that a more modest first step was in order. The Prototype team, based on their insights gained during implementation of the UPS prototype, also reached a similar conclusion. This is described more fully in "Recommendations made to the Open Archive group" of (Van de Sompel, Krichel, Nelson, et al. 2000).

The remainder of the meeting was devoted to determining the proper degree of modesty, balancing the need for adequate functionality against the requirement that the cost of entry for participating archives be sufficiently low. This is a question that has bedeviled other efforts at interoperability; for example, buy-in to the highly functional Z39.50 protocol has largely been limited to libraries, due to the costs of complexity (Stubley 1999). An important step towards establishing the cost/functionality balance was reached by the beginning of the second day with agreement among the participants on a tiered model of interoperability. This model is illustrated in Figure 1, showing the following layers:

  • Document Models - that address document structure and allow the specification of multiple disseminations (e.g., in multiple formats or of various structural decompositions) of a document instance. One example that addresses this level of interoperability is the Dienst repository protocol.
  • Metadata Harvesting - that enables the extraction of descriptive surrogates for documents. This approach was effectively demonstrated by the Harvest project (Bowman 1995) several years ago.
  • Mediator Services - that describe the nature of services that use and enhance information available from archives. The UPS Prototype (Van de Sompel, Krichel, Nelson, et al. 2000) demonstrated a number of these services (e.g., linking) that build on top of the metadata harvesting layer. This service layer is also described in the digital library service model of (Leiner 1998).

Figure 1: a tiered model of interoperability

Framing the problem of interoperability with this model quickly led to the decision to restrict the Santa Fe recommendations to interoperability at the level of metadata harvesting. The mechanisms for establishing this interoperability, described in full detail in the Santa Fe Convention and summarized in the remainder of this paper, are three-fold:

  1. The definition of a set of simple metadata elements -- the Open Archives Metadata Set (OAMS) -- for the sole purpose of enabling coarse granularity document discovery among archives;
  2. The agreement to use a common syntax, XML, for representing and transporting both the OAMS and archive-specific metadata sets;
  3. The definition of a common protocol -- the Open Archives Dienst Subset -- to enable extraction of OAMS and archive-specific metadata from participating archives.

This agreement treats documents as black-boxes; archives can have idiosyncratic document representations with the Santa Fe Convention only specifying a URL entry point to the archives' individual document models. The question and functionality of common mediator services are left open to implementers who wish to exploit the Santa Fe Convention and build mechanisms based on it.

The Santa Fe Convention


The Santa Fe Convention presents a technical and organizational framework designed to facilitate the discovery of content stored in distributed e-print archives. It makes easy-to-implement technical recommendations for archives that -- when implemented -- will allow data from e-print archives to become widely available via its inclusion in a variety of end-user services such as search engines, recommendation services and systems for interlinking documents. In addition, the convention introduces an organizational framework for making information available about archives that adhere to the technical recommendations of the convention -- the data providers -- and about trusted parties that build end-user services for data originating from such archives -- the service providers. As such it provides a communication mechanism between providers of data and providers of services and creates a community of open archives.

Definitions and Concepts

The Santa Fe Convention builds on a number of definitions and concepts that are essential for its understanding.

Open and managed e-print archives

The Convention considers the following to be crucial components of an e-print archive:

  • A submission mechanism;
  • A long-term storage system;
  • A management policy with regard to submission of documents and their preservation;
  • An open machine interface, that enables third parties to collect data from the archive.

The last item is crucial for enabling third parties to create services that support the discovery, presentation and analysis of data in the archive. Most e-print archives will also provide native end-user services. However, facilitating the broad dissemination of archive data through third party services is a crucial feature of an e-print archive. Therefore, the open interface is a key part of the Santa Fe Convention.

Data providers and service providers

Consistent with the objective of the Santa Fe Convention and the identification of the crucial functions of an e-print archive, there is a distinction between two participants in the convention:

  • A data provider is the manager of an e-print archive, acting on behalf of the authors submitting documents to the archive. As pointed out above, the data provider of an open archive will, at least, provide a submission mechanism, a long-term storage system and a mechanism that enables third parties to collect data from the archive;
  • A service provider is a third party, creating end-user services based on data stored in e-print archives. For instance, a service provider could implement a search engine for mathematical e-prints stored in archives worldwide.

Data in an e-print archive

The convention uses the notion of a record in an archive. Some archives may store metadata that describes full content without storing the full content itself. In this case, the metadata is a record. Other archives may also store full content. However, the convention assumes that if full content is stored, there will always be associated metadata stored in the archive as well as a mechanism to tie metadata and content together. In this case the combination of metadata and full content is a record.
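The two kinds of record described above can be pictured with a small data structure. The following is only an illustrative sketch in Python, not part of the convention: the class name, the field names and the sample values are all hypothetical, chosen to show that metadata is always present while full content is optional.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """A record in an archive: metadata, optionally tied to full content.

    All names here are hypothetical; the convention does not prescribe them.
    """
    identifier: str                   # identifier of the record within the archive
    metadata: dict                    # descriptive metadata, always required
    content: Optional[bytes] = None   # full content, if the archive stores it

    def __post_init__(self):
        # The convention assumes stored content always has associated metadata,
        # so a record without metadata is rejected.
        if not self.metadata:
            raise ValueError("a record must carry metadata")

    @property
    def is_metadata_only(self) -> bool:
        """True when the archive stores metadata without the full content."""
        return self.content is None


# A metadata-only record and a record that ties metadata to full content.
metadata_only = Record("rec-1", {"title": "An example e-print"})
with_content = Record("rec-2", {"title": "Another e-print"}, content=b"full text")
```

In both cases the metadata is what a third party would collect; the content field merely records whether the archive holds the full text as well.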

Technical Components of the Santa Fe Convention

The complete details of the technical components of the Santa Fe Convention and instructions for participating are available via the core document. Organizations considering participation should refer to that document. This section summarizes the information for the purpose of an overview.

Open Archives Metadata Set

The Open Archives Metadata Set (OAMS) is a collection of nine metadata elements intended to facilitate coarse granularity resource discovery among the records in distributed and dissimilar archives. The semantics of this set have purposely been kept simple in the interest of easy creation and widest applicability. There is no provision for qualification or extension of the nine elements. The expectation is that individual archives will maintain metadata with more expressive semantics and the Open Archives Dienst Subset provides the mechanism for retrieval of this richer metadata.

Open Archives Dienst Subset

The Open Archives Dienst Subset is a set of protocol requests that are delivered via HTTP. This protocol is a subset of the full Dienst protocol. The protocol requests in the subset provide the following functionality:

  • List the full identifiers for records stored in an archive. An optional argument permits the client to specify that the list should only include records added after a specific date. Another optional argument allows the client to specify that the records should be accompanied by the metadata associated with the identifier.
  • Return the metadata for a specific record in a requested format.
  • Return the list of metadata formats supported by an archive.
  • Return the list of metadata formats available for a specific record.
  • Return the structure of the partitions by which an archive is organized.

All responses to these requests are formatted in XML.
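To make the five kinds of request more concrete, the sketch below models a toy archive in Python that answers each of them with an XML string. It is not the Dienst protocol: the class, the method names, the XML element names and the sample identifiers are assumptions made for illustration, and the HTTP layer is left out entirely. The core document remains the authority for the real request syntax.

```python
from datetime import date
from xml.etree import ElementTree as ET


class ToyArchive:
    """In-memory stand-in for an archive; every name here is hypothetical."""

    def __init__(self, partitions):
        self.partitions = partitions   # e.g. ["physics", "math"]
        self.records = {}              # full identifier -> date added + metadata

    def add(self, full_id, added, metadata_by_format):
        self.records[full_id] = {"added": added, "metadata": metadata_by_format}

    def list_identifiers(self, added_after=None, with_format=None):
        """List full identifiers, optionally only those added after a date,
        optionally accompanied by metadata in the requested format."""
        root = ET.Element("records")
        for full_id, rec in sorted(self.records.items()):
            if added_after is not None and rec["added"] <= added_after:
                continue
            node = ET.SubElement(root, "record", id=full_id)
            if with_format is not None and with_format in rec["metadata"]:
                self._append_metadata(node, with_format, rec["metadata"][with_format])
        return ET.tostring(root, encoding="unicode")

    def get_metadata(self, full_id, fmt):
        """Return the metadata of one record in the requested format."""
        root = ET.Element("record", id=full_id)
        self._append_metadata(root, fmt, self.records[full_id]["metadata"][fmt])
        return ET.tostring(root, encoding="unicode")

    def list_formats(self, full_id=None):
        """List formats supported by the archive, or available for one record."""
        if full_id is None:
            formats = {f for r in self.records.values() for f in r["metadata"]}
        else:
            formats = set(self.records[full_id]["metadata"])
        root = ET.Element("formats")
        for fmt in sorted(formats):
            ET.SubElement(root, "format").text = fmt
        return ET.tostring(root, encoding="unicode")

    def list_partitions(self):
        """Return the partitions by which the archive is organized."""
        root = ET.Element("partitions")
        for name in self.partitions:
            ET.SubElement(root, "partition").text = name
        return ET.tostring(root, encoding="unicode")

    @staticmethod
    def _append_metadata(parent, fmt, fields):
        meta = ET.SubElement(parent, "metadata", format=fmt)
        for key, value in fields.items():
            ET.SubElement(meta, key).text = value


# Usage: two records, one of which also has a richer archive-specific format.
archive = ToyArchive(["physics", "math"])
archive.add("toy:0001", date(1999, 10, 1), {"oams": {"title": "First paper"}})
archive.add("toy:0002", date(2000, 1, 15),
            {"oams": {"title": "Second paper"},
             "local": {"title": "Second paper", "note": "richer description"}})
new_records = archive.list_identifiers(added_after=date(1999, 12, 31))
```

The date argument is what makes incremental harvesting cheap: a client that remembers when it last visited only has to ask for what was added since.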

Organizational aspects of the Santa Fe Convention

The convention also introduces an organizational framework to facilitate its implementation and to establish a communication mechanism between data providers and service providers. An understanding of this framework can be obtained from an exploration of the core document of the Santa Fe Convention that gives a step by step approach for making an e-print archive or a service comply with the Santa Fe Convention.

For the data providers, some of these steps are directly related to the implementation of the technical recommendations of the convention, as summarized in the previous section. In addition, the core document introduces the following important organizational elements:

  • The notion of a unique archive identifier to unambiguously represent each e-print archive, as well as the notion of a full identifier of a record in an archive. Since this full identifier is a concatenation of the unique archive identifier and the unique persistent identifier of a record in an archive, this full identifier will be persistent and globally unique.
  • The recommendation to document metadata formats other than the OAMS, as well as a facility to share this documentation with data providers and service providers.
  • A facility to register an e-print archive as being compliant with the Santa Fe Convention, by means of the provision of a filled-out version of a data provider template that describes crucial characteristics of the archive. Important information to be provided in this template is the unique archive identifier, the metadata formats implemented by the archive and URLs of the Dienst interface of the archive. The template also provides a means to provide information on the content of an archive, its submission policy and contact addresses. In addition, the template gives data providers a means to express the terms and conditions of usage of archive data.
  • A list of e-print archives that comply with the Santa Fe Convention, from which links are available to the documents describing their crucial characteristics.
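The construction of a full identifier from its two parts can be sketched in a few lines of Python. This is an illustration under stated assumptions: the separator character, the function names and the sample identifiers are hypothetical, since this overview does not state how the two parts are joined; the core document should be consulted for the actual syntax.

```python
SEPARATOR = ":"  # assumption for illustration; not taken from the convention


def make_full_identifier(archive_id: str, record_id: str) -> str:
    """Concatenate the unique archive identifier and the record identifier."""
    if not archive_id or not record_id:
        raise ValueError("both parts of a full identifier are required")
    if SEPARATOR in archive_id:
        # Keeping the separator out of the archive identifier makes the
        # full identifier unambiguous to split again.
        raise ValueError("archive identifier must not contain the separator")
    return f"{archive_id}{SEPARATOR}{record_id}"


def split_full_identifier(full_id: str) -> tuple:
    """Recover (archive identifier, record identifier) from a full identifier."""
    archive_id, sep, record_id = full_id.partition(SEPARATOR)
    if not sep or not archive_id or not record_id:
        raise ValueError(f"not a full identifier: {full_id!r}")
    return archive_id, record_id


full_id = make_full_identifier("toyarchive", "physics/0001")
```

Because the archive identifier is unique across archives and the record identifier is unique and persistent within its archive, the combined string inherits both properties, which is the reasoning given in the first item above.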

For the service providers, the steps described in the core document of the Santa Fe Convention introduce the following:

  • The request to maintain the original full identifiers of the records harvested by the service provider.
  • The request to comply with the terms and conditions that data providers have brought forward in their filled-out data provider template.
  • A facility to register a service as being compliant with the Santa Fe Convention, by means of the provision of a filled-out version of a service provider template that describes aspects of the service. Amongst others, the service provider must mention from which archives information is being harvested as well as the fact that harvesting is compliant with the terms and conditions expressed by the data providers.
  • A list of services that comply with the Santa Fe Convention, from which links are available to the filled-out templates that describe them.
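As an illustration of what a service provider might build on these agreements, the sketch below harvests titles from two toy data providers into a cross-archive word index while keeping the original full identifiers unchanged. Everything in it is hypothetical: the function names, the tuple shape of a harvested record, the identifiers and the sample data. A real service would obtain records through the protocol requests and would also have to honor each data provider's terms and conditions.

```python
from collections import defaultdict
from datetime import date


def harvest(providers, since, index):
    """Pull records added after `since` from every provider into `index`.

    providers: list of functions; each takes a date and returns
               (full_identifier, date_added, title) tuples (hypothetical shape).
    Returns the latest date seen, to be used as `since` on the next run.
    """
    latest = since
    for list_records in providers:
        for full_id, added, title in list_records(since):
            for word in title.lower().split():
                index[word].add(full_id)   # original full identifier is kept as-is
            latest = max(latest, added)
    return latest


def search(index, word):
    """Return the full identifiers of all harvested records matching a word."""
    return sorted(index.get(word.lower(), set()))


def provider(records):
    """Wrap a static list of records as a toy data provider."""
    return lambda since: [r for r in records if r[1] > since]


# Two toy archives from different disciplines.
ARCHIVE_A = [("a:1", date(1999, 11, 2), "Quantum gravity notes"),
             ("a:2", date(2000, 1, 20), "Gravity waves")]
ARCHIVE_B = [("b:7", date(2000, 1, 5), "Economics of gravity models")]

index = defaultdict(set)
providers = [provider(ARCHIVE_A), provider(ARCHIVE_B)]
checkpoint = harvest(providers, date(1999, 1, 1), index)
hits = search(index, "gravity")
```

Keeping the full identifiers intact means a hit in the cross-archive index can always be traced back to the archive that holds the record.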

Conclusions and future plans

The technical results of the Santa Fe meeting may be perceived as quite modest, and indeed they are. However, the technical moderation should be viewed in a broader context. First, it played an important role in bringing the Santa Fe meeting to a successful conclusion, with agreement among diverse parties. This agreement amongst a core group is an important step towards the development of a broader e-print community with a strong focus on cooperation and interoperability. The organizational framework provided by the Santa Fe Convention is intended to actively contribute to the creation and extension of such a community. Second, the limited nature of the technological requirements lowers the cost of entry for new participants, and hopefully builds momentum for the development of scholarly publishing alternatives. This momentum will provide a basis for future agreements that may extend and enhance the current Santa Fe Convention.

If successful, the Convention will attract early adoption by existing archives and encourage the establishment of new scholarly archives that will support the mechanisms defined by the Convention. The former, early adoption, seems to be occurring, with participants at the meeting representing arXiv, the California Digital Library, clinmed, CogPrints, RePEc and NCSTRL stating their intention to comply with the Santa Fe Convention in the near future. The CogPrints team at Southampton is also working on free software for e-print archives that will comply with the Santa Fe Convention (Harnad 1999). Based on the number of inquiries received since the Santa Fe meeting, there are reasons to be optimistic regarding adoption by other existing and planned archives. Positive feedback has been received from representatives of German mathematical and physical e-print archives. In addition, several commercial and non-commercial parties have expressed interest in creating mediator services once archives have implemented the convention.

The current challenge for the Open Archives initiative is to maintain a focus on the successful dissemination and implementation of the Santa Fe Convention. Before considering whether it is necessary or appropriate to expand the nature of the interoperability agreements, it is essential that the mechanisms described in the current convention be widely implemented and tested in practice. Without such proof of concept, the initiative may find itself increasing the complexity (and cost of implementation) of the interoperability mechanisms without discovering if, in fact, the level of interoperability defined by the existing Santa Fe Convention is sufficient and practical. Any future work to expand the scope of the OAi should understand that the success of any interoperability standard must be measured relative to both its functionality and its cost of adoption (Arms 2000).

The near-term plans for the Open Archives initiative include public dissemination of the Santa Fe Convention scheduled for February 15, 2000, and meetings to review progress and chart future activities. This paper represents the initial public dissemination and the Open Archives web site will serve as a persistent and official record of the convention. The next meeting will take place at ACM Digital Libraries 2000 in San Antonio, Texas, in June 2000. The exact dates and place of this meeting will be posted on the Open Archives web site nearer to the June date. A European meeting is tentatively planned in conjunction with ECDL 2000 in Lisbon, Portugal, in September 2000.


The authors wish to thank:

  • All the participants at the Open Archives meeting in Santa Fe in October, 1999. The hard work of all of these people made the results described here possible.
  • Clifford Lynch and Don Waters for effectively chairing the Santa Fe meeting.
  • Caroline Arms, Mark Doyle, Ed Fox, Paul Ginsparg, Thomas Krichel and Michael Nelson for their contributions to the Santa Fe Convention.
  • CLIR, SPARC, ARL, and the LANL Research Library for financial and moral support without which the Santa Fe meeting would not have been possible.
  • Donna Berg of the LANL Research Library for the perfect organization of the meeting.

Herbert Van de Sompel wishes to thank the Belgian Science Foundation for a special Ph.D. grant.

Work on Dienst and the Open Archives Dienst Subset is supported by the National Science Foundation Grant No. IIS-9817416 and Defense Advanced Research Projects Agency Grant No. N66001-98-1-8908, with the Corporation for National Research Initiatives.

Copyright 2000 Herbert Van de Sompel and Carl Lagoze

(The link for clinmed was corrected to <> on 2/22/00 at the author's request.)

DOI: 10.1045/february2000-vandesompel-oai