Repository Interoperability Workshop

Towards a Repository Reference Model

William L. Scherlis
Carnegie Mellon University
Pittsburgh, Pennsylvania

D-Lib Magazine, October 1996

ISSN 1082-9873

Repository Interoperability

The rapid proliferation of digital libraries and digital information resources creates greater challenges for digital library users in locating and accessing particular library assets. How can a user gain "transparency of access" to a broad range of digital information resources? This proliferation also creates significant challenges for intellectual property owners, who are entrusting increasingly valuable information assets to networked information resources such as digital libraries. How can a library user interact through a single interface with an aggregate of diverse digital library facilities, in a manner that assures that the rights-in-data of intellectual property owners are protected?

Interoperation is a principal challenge in digital library research. The workshop reported here focuses on an aspect of interoperation that is beginning to receive increasing attention: the management of actual digital library information assets. Other dimensions of interoperability include bibliographic meta-data, support for browsing, management of payments and related terms and conditions, access to task-specific meta-data, and annotation. The July/August issue of D-Lib Magazine, for example, included several articles devoted to the difficulties of identifying areas of commonality among an increasingly diverse array of meta-data. These commonalities provide a means to support effective resource description and search.

A sign of maturation of digital library technology and its application is the greater value of information assets being managed. In addition, there is increasing diversity and complexity of kinds of assets. This maturation suggests new priorities for digital library interoperation technology development. For example:

Effective asset monitoring and protection. Intellectual property owners require assurances that rights-in-data are being respected by the libraries. Libraries have certain rights to use an asset, but they rarely own the intellectual property embodied in it. Because many of the relevant definitions remain in flux (for example, the notion of "fair use"), managing these rights demands flexibility. Also, as library and commercial roles become more intertwined, a greater variety of rights is appearing.

Multiple modes of presentation. A particular information asset can present many faces to users. For example, the hardware and software platform for presentation may determine how a particular multimedia asset will need to be encoded for a user to have useful access. These characteristics, together with properties of the communication channel and the privileges and resources of the user, may determine the resolution, precision, and other characteristics of the presentation of an image, video, or other asset.

Repositories

These changes in the environment motivate explicit consideration of the most effective means for digital libraries to interoperate at the level of asset management. This kind of interoperation is referred to as "repository-level interoperation" to distinguish it from, for example, efforts focused primarily on meta-data. Repository interoperation thus deals primarily in actual digital library information assets, and so may initially seem straightforward, certainly as compared with the interoperation challenges faced at the level of domain-specific meta-data. But there are considerable challenges. Here are a few examples:

Trust. The commonalities that are established must assure that rights-in-data of the actual intellectual property are protected, and that responsible libraries have some means to verify this. How can this be accomplished in a way that scales up to large numbers of cooperating libraries? (This challenge of trust motivates the use of the term "repository," but it should be clear that repositories are dynamic and interactive, not passive and static.)

Definition of an asset. Digital assets can be richly structured, containing many subsidiary assets and linked to other assets or parts of them. Assets can be structured documents that include software components, changing objects, internal and external links, as well as more traditional kinds of passive data. They also can have many distinct representations, tailored to specific usage environments or degrees of access by users. In some cases, assets could be ephemeral. What is an appropriate concept of asset?

Performance. While an asset may itself be persistent, the use of the asset can be limited and ephemeral. The terms and conditions associated with the rental of a commercial videotape, for example, may permit playing that tape at home for non-commercial purposes. But they usually do not permit copying, commercial presentation, or other kinds of uses. In addition, the videotape medium allows certain operations to be performed easily (corresponding to VCR controls), but others are more difficult (search, restructuring, and summarization). Associated with the concept of information asset, therefore, is a concept of performance, which concerns how the asset may be used.

Naming. What are appropriate means for assigning persistent names to objects (or portions of objects) within libraries, so they can be referred to externally and in meta-data? How, for example, are different performances or representations of an asset named? How are portions of an asset or aggregations of assets given names? How can multiple names for a given asset (e.g., in multiple repositories) be reconciled? How can names be managed when assets are moving, ephemeral, or mutating?

Types and interpretation. Text documents may be encoded as objects managed using commercial word processing programs. They can be part of structured documents that include computational elements (such as software applets that implement interactive graphics). Type information associated with an asset indicates how an otherwise-opaque bit stream that represents the asset is to be interpreted. More types are emerging, and they are controlled by diverse entities, with their structure often proprietary. How can this type information be managed in a distributed and persistent manner?

Many of these problems are familiar to information technologists in other contexts, suggesting that existing research and solutions can be exploited to support digital library applications. For example, distributed object mechanisms such as CORBA and OLE provide means for distributed management of objects and type information. With respect to naming, most digital library researchers are aware of the ongoing discussion in the World Wide Web community about URLs and URNs. The issue for the digital library community is how to assimilate these emerging solutions and match them to the particular challenge of managing digital library information objects. An appropriate repository framework could enable exploitation of these technologies while providing a scalable approach to digital library interoperation at the level of information objects.

The CNRI Exploratory Workshop

To understand this challenge, the Corporation for National Research Initiatives (CNRI), as part of the D-Lib program, convened an exploratory Repository Interoperability Workshop in March 1996. This workshop, held in Reston, Virginia, brought together a group of about 20 researchers, with the intent of better defining the issue and understanding the challenges associated with it. The workshop consisted of four parts: a review of related research efforts, an identification of issues, three separate working groups to explore the repository concept and consider approaches to interoperability, and a plenary discussion to coalesce results and assess consensus. Preliminary results were presented at the March 1996 ACM Digital Library Conference. The sections below summarize some of the points raised at the workshop and the resulting conclusions. However, this report has not been coordinated with all attendees.

A Note on Interoperability, Reference Models, and Architectures

As noted above, there are many dimensions of digital library interoperation, including user interaction, search and presentation, more general meta-data, and asset management. Achieving a workable engineering approach to interoperation entails understanding the specific utility to be provided by the aggregate service. This leads to an identification of specific points of commonality (for example, the Dublin Core metadata elements) and diversity (task-specific metadata elements not covered in the Dublin Core).

To this end, much of the workshop activity was focused on identifying some specific elements of a reference model for repository function. A reference model is a conceptual framework that identifies characteristics of repositories that need to be common in order to achieve interoperation, but which does not specify a particular implementation approach or a system interface. At a later stage, there may be agreement (or not) on the details of specific system interface elements or overall architecture, but note that this is not as important for interoperation as the reference model itself: Consistency with the reference model assures feasibility of interoperation, though potentially elaborate wrappers and mediators may be required in an actual implementation.

Summary of Results

Repository Service Interface. There was a strong consensus at the workshop that the concept of a digital library repository should be defined in terms of a Repository Service Interface. That is, the repository function is defined in terms of requirements on the protocol for interaction with a client, rather than in terms that are more prescriptive and implementation-oriented. At this early stage of technology development, it is unacceptable for repository interoperability requirements to overly constrain the range of digital library implementation choices. The Repository Service Interface concept enables a separation of decisions regarding repository architecture and implementation from decisions concerning base functionality, and thus permits accommodation of a variety of new and legacy approaches to repository implementation.

A Layered Model. The next issue is where to place the Repository Service Interface in the "hierarchy" of function from raw storage management to "full library function." There was agreement on two points. First, the Repository Service Interface operates above the level of the privileged storage management operations that store, retrieve, and delete individual objects. These low level operations need to be privileged since they provide full capability to alter and access the contents of a collection. A Repository Service Interface would require a client to present explicit access tickets before a "performance" of an object can be delivered. This implies that the repository must itself be a trusted entity, operating at a layer higher than that of raw storage management. Thus, the Repository Service Interface represents a kind of "fiduciary interface" that encapsulates the core of trust that asset holders place in a digital library (or other network information system), separating that core from other value-added services.

The second point of agreement concerned meta-data and traditional library functions. A repository should interpret only those meta-data elements that directly pertain to functions such as object storage, object performance, rights-in-data, and terms and conditions for access. For example, meta-data relating to cataloging, search, location, and other traditional library functions is not interpreted at the repository layer. That is, this latter kind of meta-data is stored in a repository as another class of object. This separates repository function, concerning management of access to and performance of objects, from higher level library functions relating to cataloging, search, browsing, and so on. The meta-data that supports these functions is managed in repositories, but as independent objects with their own separate access- and performance-related meta-data. Thus, the repository operates at a lower level than library functions supporting search and presentation.

These considerations lead to a functional model with four layers: (1) A bottom layer that supports storage management and operates in a privileged mode. This layer would provide the usual features associated with data management systems, such as support for availability, reliability, persistence, versioning, and so on. (2) A trusted repository layer that supports client access to objects based on permission tickets and service requests. (3) A library functional layer or layers that support search and other library functions (for example, Z39.50 service, though Z39.50 also includes some elements of layer (2)). (4) A user or client layer corresponding, for example, to a Z39.50 client. This model identifies the repository as a well-defined island of managed information, potentially corresponding to a legal entity with respect to its job of assuring respect for the rights-in-data of the intellectual property owners associated with the managed objects.
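The four layers can be sketched in a few lines of Python. This is only an illustration of the layering, not a design that came out of the workshop: the class names, the plain-string tickets, the JSON catalog record, and the "hdl:example/..." handles are all assumptions made for the example. The point it shows is that the catalog record is held in the repository as just another object with its own ticket, and is interpreted only at the library layer.

```python
import json


class StorageLayer:
    """Layer 1: privileged store, retrieve, and delete with no access checks."""

    def __init__(self):
        self._blobs = {}

    def store(self, handle, data):
        self._blobs[handle] = data

    def retrieve(self, handle):
        return self._blobs[handle]

    def delete(self, handle):
        del self._blobs[handle]


class RepositoryLayer:
    """Layer 2: trusted; releases an object only against an accepted ticket."""

    def __init__(self, storage):
        self._storage = storage
        self._accepted = {}  # handle -> set of accepted ticket strings (hypothetical)

    def deposit(self, handle, data, tickets):
        self._storage.store(handle, data)
        self._accepted[handle] = set(tickets)

    def perform(self, handle, ticket):
        if ticket not in self._accepted.get(handle, set()):
            raise PermissionError(f"ticket not accepted for {handle}")
        return self._storage.retrieve(handle)


class LibraryLayer:
    """Layer 3: search over catalog records held as ordinary repository objects."""

    def __init__(self, repository, catalog_handles, catalog_ticket):
        self._repository = repository
        self._catalog_handles = list(catalog_handles)
        self._catalog_ticket = catalog_ticket

    def search(self, word):
        hits = []
        for catalog_handle in self._catalog_handles:
            raw = self._repository.perform(catalog_handle, self._catalog_ticket)
            # The record is interpreted here, not at the repository layer.
            record = json.loads(raw.decode("utf-8"))
            if word.lower() in record["title"].lower():
                hits.append(record["describes"])
        return hits


def client_fetch(library, repository, word, ticket):
    """Layer 4: a client searches, then requests each hit with its own ticket."""
    return [repository.perform(handle, ticket) for handle in library.search(word)]


# Usage with hypothetical handles and tickets.
repository = RepositoryLayer(StorageLayer())
repository.deposit("hdl:example/doc1", b"full text", {"reader-ticket"})
catalog_record = {"title": "Repository Models", "describes": "hdl:example/doc1"}
repository.deposit("hdl:example/doc1.catalog",
                   json.dumps(catalog_record).encode("utf-8"),
                   {"catalog-ticket"})
library = LibraryLayer(repository, ["hdl:example/doc1.catalog"], "catalog-ticket")
found = client_fetch(library, repository, "repository", "reader-ticket")
```

In this sketch only RepositoryLayer ever touches StorageLayer, which mirrors the requirement that the storage operations remain privileged.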

Repository functions. There was limited consensus at the workshop concerning the specific operations supported by the Repository Service Interface. Several efforts in this area (Kahn and Wilensky; Lagoze and his collaborators at CNRI, NCSA, and Cornell; Garcia-Molina, Winograd, Paepcke, et al.; Arms; among others) are developing abstract models that would contribute to a Repository Service Interface definition. The Kahn and Wilensky work, in particular, draws a sharp distinction between layers (1) and (2) and the outer layers. Workshop results concerning concepts for objects, names (handles), performances, and service requests generally follow the results of these efforts.

A service request to the Repository Service Interface, for example, would include an object handle (a unique opaque identifier), an access ticket (that embeds information about client privileges), and a service request (specifying a particular presentation for an object). The result would be a particular performance (or "dissemination") of the object. Meta-data relating, for example, to content, interpretation, and interlinkage of objects is stored but not interpreted at the repository level.
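The three parts of such a request, and the performance that results, might look as follows in Python. This is a minimal sketch under stated assumptions: the AccessTicket and Performance types, the privilege names, the deny-by-default rule, and the handle "hdl:example/map-1" are hypothetical and are not taken from any of the models cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessTicket:
    """Hypothetical ticket that embeds the client's privileges."""
    holder: str
    privileges: frozenset


@dataclass(frozen=True)
class Performance:
    """What the client receives: one dissemination of the object."""
    handle: str
    presentation: str
    payload: bytes


class Repository:
    def __init__(self):
        # handle -> (presentations, terms, opaque metadata)
        self._objects = {}

    def deposit(self, handle, presentations, terms, opaque_metadata=b""):
        """presentations: name -> bytes; terms: name -> required privilege."""
        self._objects[handle] = (dict(presentations), dict(terms), opaque_metadata)

    def request(self, handle, ticket, presentation):
        presentations, terms, _opaque = self._objects[handle]
        if presentation not in presentations:
            raise LookupError(f"{handle} has no presentation {presentation!r}")
        # Only terms-and-conditions metadata is interpreted here; the opaque
        # metadata (content, interpretation, links) is stored but never read.
        required = terms.get(presentation)
        if required is None or required not in ticket.privileges:
            raise PermissionError(f"ticket does not permit {presentation!r}")
        return Performance(handle, presentation, presentations[presentation])


# Usage with a hypothetical handle and privilege names.
repo = Repository()
repo.deposit(
    "hdl:example/map-1",
    presentations={"thumbnail": b"lo-res", "full": b"hi-res"},
    terms={"thumbnail": "browse", "full": "licensed"},
    opaque_metadata=b"<catalog record, not interpreted>",
)
ticket = AccessTicket(holder="alice", privileges=frozenset({"browse"}))
performance = repo.request("hdl:example/map-1", ticket, "thumbnail")
```

The same handle can yield different performances depending on the ticket and the presentation requested, which is why the sketch returns a Performance rather than the stored object itself.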

It is important to note that agreement on a reference model for the Repository Service Interface does not force definition of a specific repository-level protocol. Indeed, it is possible (though not necessarily desirable) for there to be a multiplicity of repository-level protocols. But agreement on a suitably defined reference model would enable interoperation, even if through a set of wrappers (i.e., repository proxies) and/or mediators and other aggregation points.
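To make the role of wrappers and mediators concrete, the sketch below places two repositories with unrelated native protocols behind one common request shape. Everything in it is assumed for illustration: the two native interfaces, the use of a password or token as the ticket, the handle-to-identifier tables, and the Mediator class are hypothetical and do not describe any existing system.

```python
class LegacyArchive:
    """Hypothetical repository whose native call takes a numeric id and a password."""

    def __init__(self):
        self._docs = {7: "archived text"}

    def fetch(self, doc_id, password):
        if password != "secret":
            raise PermissionError("bad password")
        return self._docs[doc_id]


class ObjectStore:
    """Hypothetical repository that exchanges dictionary-shaped messages."""

    def __init__(self):
        self._items = {"img-1": {"thumbnail": b"small", "full": b"large"}}

    def handle_message(self, message):
        if message.get("token") != "tok":
            return {"status": "denied"}
        body = self._items[message["key"]][message["form"]]
        return {"status": "ok", "body": body}


class LegacyArchiveWrapper:
    """Presents LegacyArchive as request(handle, ticket, presentation)."""

    def __init__(self, archive, ids_by_handle):
        self._archive = archive
        self._ids = ids_by_handle

    def request(self, handle, ticket, presentation):
        if presentation != "text":
            raise LookupError(f"unsupported presentation {presentation!r}")
        # The ticket is passed through as the archive's password.
        return self._archive.fetch(self._ids[handle], ticket).encode("utf-8")


class ObjectStoreWrapper:
    """Presents ObjectStore as request(handle, ticket, presentation)."""

    def __init__(self, store, keys_by_handle):
        self._store = store
        self._keys = keys_by_handle

    def request(self, handle, ticket, presentation):
        reply = self._store.handle_message(
            {"key": self._keys[handle], "form": presentation, "token": ticket}
        )
        if reply["status"] != "ok":
            raise PermissionError(f"request for {handle} denied")
        return reply["body"]


class Mediator:
    """Aggregation point: routes a request to the wrapper that knows the handle."""

    def __init__(self, wrappers_by_handle):
        self._routes = wrappers_by_handle

    def request(self, handle, ticket, presentation):
        return self._routes[handle].request(handle, ticket, presentation)


# Usage with hypothetical handles.
mediator = Mediator({
    "hdl:example/a": LegacyArchiveWrapper(LegacyArchive(), {"hdl:example/a": 7}),
    "hdl:example/b": ObjectStoreWrapper(ObjectStore(), {"hdl:example/b": "img-1"}),
})
text = mediator.request("hdl:example/a", "secret", "text")
image = mediator.request("hdl:example/b", "tok", "thumbnail")
```

Neither native protocol changes; only the wrappers know about both the common concepts (handle, ticket, presentation) and the repository-specific details.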

Recommendations

While many issues were left open in the discussion (such as mutability of objects, details concerning terms and conditions, details of the reference model, and so on), several clear recommendations emerged:

Interoperability. Users, librarians, intellectual asset owners, and other stakeholders all benefit from interoperation among digital libraries.

Layered Model. The interoperation problem has many facets, many of which are already being addressed. But if high-value assets are to be shared, then an approach to interoperation at the repository level needs to be identified. The repository operates as a trusted entity.

Repository Reference Model. Many repositories are already in operation, and there are many architectural and implementation approaches being explored for digital libraries and the repositories they contain. Therefore, repository-level interoperation will not come about in the near term through adherence to a single specific repository-level protocol. A reference model, however, can provide a conceptual basis for describing and comparing repositories and can lead to common concepts.

Repository Service Interface. The most important common concept is the Repository Service Interface, which is a set of requirements for the protocol by which a client interacts with a repository. In this context, the "client" includes the outer layers of the digital library model. Since different repositories may have different realizations of the Repository Service Interface, interoperation would be accomplished through wrappers. (This approach is already being explored in several of the major digital library research projects.)

Experimentation. Existing digital libraries and digital library research efforts may benefit from exploring commonalities that might lead to a common reference model.

Working Group on Terms and Conditions. Repositories interpret meta-data pertaining to rights-in-data, performance, and associated terms and conditions. These meta-data enable a repository, for example, to assess what kinds of performances can be granted to a requester on the basis of the specific access tickets that are presented. Many kinds of information assets are now being managed using digital library technology, with many distinct traditions of management of rights-in-data. A working group should be initiated to assess how terms and conditions associated with information assets can be represented and managed at the repository level and beyond.

hdl:cnri.dlib/october96-scherlis