D-Lib Magazine
January 2000

Volume 6 Number 1

ISSN 1082-9873

DFAS

The Distributed Finding Aid Search System


MacKenzie Smith
Digital Library Projects Manager
Office for Information Systems
Harvard University Library
mackenzie_smith@harvard.edu


Abstract

EAD-encoded finding aids are proving to be a significant part of the metadata strategy of the emerging digital library. During 1998-99, the Digital Library Federation underwrote a project proposed by the University of Michigan and Harvard University to develop an automated system for distributed online searching of EAD-encoded finding aids. The participating institutions were, in addition to Michigan and Harvard, Columbia University, Indiana University, and Oxford University. The project, known as the Distributed Finding Aid Server (DFAS), was completed in July of 1999 with the publication of a final report. This article summarizes the project and discusses some of the major issues that were identified by it.

Background

Finding aids are textual documents that describe archival collections. These documents have existed for hundreds of years, and have taken many forms during that time. With the growing dependence on automated systems for managing access to archival collections, the need for standardizing finding aids across institutions and internationally has become increasingly evident. But historically no such standards existed, and today there is only tentative agreement on what these documents should contain and how they should be arranged. The archival community does not universally subscribe to the use of controlled vocabulary for such things as personal, corporate and geographic names, subject and genre terms, etc. (all things which are taken for granted in the arena of library bibliographic data such as USMARC). Nor is it obvious how use of controlled vocabulary would best be done in relatively unstructured, narrative documents, as opposed to highly structured, fielded cataloging records. ISAD(G) is a relatively recent standard for archival description and appears to be the way in which future standardization lies. But even should such standards be agreed upon now, there would still be a huge number of older finding aids which don't conform to them, and which would be prohibitively expensive to "reprocess" to the new form. At Harvard alone, we estimate that there are 14,000 older printed and manuscript finding aids, many for the most prominent collections at the university, and that conversion of these into a standard format would cost millions of dollars.

The Encoded Archival Description (EAD) SGML Document Type Definition (DTD) was originally developed at the University of California at Berkeley in 1993 as a way of encoding finding aids for searching and display on the Internet [1]. It has quickly evolved into a prominent standard within the archival community for making finding aids available online in a much more effective way than had ever been possible before. The original intention of the DTD was not to proscribe unusual finding aids, but to accommodate whatever they might contain, in whatever order it was presented, within reason. This strategy made it possible for many institutions whose finding aids were organized in fundamentally different ways to try using EAD, contributing to the rapid adoption of this new standard. However, the result of this flexibility was unpredictability: no two finding aids could be guaranteed to contain the same type of information in the same place within the document. At best, one could assume that if there was any descriptive frontmatter (such contextual information as provenance, biographical history or collection scope and content), it would come before the actual collection inventory, whatever that might consist of. Every finding aid has to have at least one item in its inventory, and some basic administrative information at the beginning, but that is about all that is required. It is also important to realize that the EAD DTD allows for the markup of both "structural" aspects of a finding aid (e.g., frontmatter and inventory, sections and paragraphs, etc.) and "semantic" aspects of the finding aid content (e.g., authors, titles, subjects, genres, dates, etc.), both of which are potentially useful but quite different ways of organizing search and display of these documents.

In addition to variation in finding aid content and arrangement, there is the matter of how the EAD is applied to the finding aids (the markup). Diversity in how EAD is used has at least two dimensions: the organization of the markup, and the depth of the markup. In the first case, placement of EAD elements may be done quite differently by two different encoders given the same text to encode. Guidelines for when to use the various EAD elements may eventually improve this situation, and the variation causes no particular harm; it just introduces unpredictability in the location of certain logical parts of the finding aid. In the second case, different encoders may, for either philosophical or pragmatic reasons, make very different decisions as to whether, where, and how often to use the more "semantic" elements of the EAD such as the name, subject and date elements. This will have little impact on display, but can greatly vary results in searching. But despite all this variation, the hope was, and is, that because of the power of SGML to encode documents in a way that makes them "self-describing", we would finally be able to search and view finding aids from collections at institutions all over the world, possibly even as a single virtual collection.

The "alpha" release of the EAD standard occurred in early 1996. Very soon after that, local online systems that supported the use of EAD for finding aids on the Web began to appear at institutions around the United States. By the official release of version 1.0 of EAD in mid-1998, these local systems had evolved to the point where interesting experimentation could begin on the further problems of searching and viewing finding aids beyond institutional boundaries. An immediate question in this regard was whether to tackle the problem with a "union catalog" of finding aids from all these institutions, or whether to pursue the Z39.50 model of distributed searching across many different local catalogs. The Archival Resources database developed by RLG had begun exploring the union catalog approach by combining summary collection descriptions and finding aids (with and without EAD markup) together with a single search and display interface that attempted to accommodate all the various practices. While the RLG system works very well, we felt the model had inherent limitations for accommodating local practice in a scalable way. In order to accommodate institutional variation, the system is forced to handle each new case in both its indexing and display programs. Beyond a few dozen institutions, this could become quite difficult to accomplish and maintain over time.

The DFAS project had as its mission to explore how the alternative model would work: bringing together online finding aid catalogs developed at multiple institutions, and tailored to each institution's finding aid structure and markup practices, returning search and display results from the different systems using mutually agreed upon mappings between the institutions. We looked to the Z39.50 standard for the underlying protocol of the system: we were attempting to define what might become the Z39.50 attribute set for finding aids. The main objective of the project was to see if we could allow each institution to continue doing EAD markup optimized for its own needs, including local indexing decisions, and still produce a useful inter-institutional search and display system.

The DFAS System

The DFAS architecture required that each participating institution run the OpenText version 5© software as the underlying search engine, as well as middleware developed by the University of Michigan, and use the Web for delivering finding aids (in particular, the HTTP transport protocol). Each institution then customized its local system to create indexes of EAD elements it considered useful to its local research population. OpenText allows indexes to be defined as the content of specified SGML elements from the document, both anywhere they occur in the document and just in a particular context. For example, a "names" index can be defined which includes the content of all the various EAD "name" elements (<name>, <persname>, <corpname>, <famname>, <geogname>, etc.) as well as any other name-like elements (e.g., <origination>). Institutions took advantage of this ability and their local EAD markup decisions to define what elements would be included in what indexes. These indexes could also be called whatever the local institution liked in its local Web search interface.
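The idea of defining an index as the content of selected elements can be sketched in a few lines of Python. This is not OpenText syntax and not the Michigan middleware, neither of which is shown in this article. The sample record, the `INDEX_DEFINITIONS` table and the `build_index` helper are hypothetical; only the element names come from EAD.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of an EAD finding aid (illustration only).
SAMPLE_EAD = """
<ead>
  <archdesc>
    <did>
      <origination><persname>Smith, Jane</persname></origination>
      <unittitle>Jane Smith papers</unittitle>
    </did>
    <bioghist><p>Worked with <corpname>Acme Press</corpname> in
      <geogname>Boston</geogname>.</p></bioghist>
  </archdesc>
</ead>
"""

# Hypothetical local index definitions: index name -> EAD elements it covers.
INDEX_DEFINITIONS = {
    "names": ["name", "persname", "corpname", "famname", "geogname", "origination"],
}


def build_index(xml_text, element_names):
    """Collect the text of every listed element, wherever it occurs in the document."""
    root = ET.fromstring(xml_text)
    wanted = set(element_names)
    entries = []
    for element in root.iter():  # document order
        if element.tag in wanted:
            # Join nested text and collapse runs of whitespace.
            text = " ".join("".join(element.itertext()).split())
            # Skip empty strings and repeats (e.g. <persname> inside <origination>).
            if text and text not in entries:
                entries.append(text)
    return entries


names_index = build_index(SAMPLE_EAD, INDEX_DEFINITIONS["names"])
print(names_index)
```

Changing the list under "names" is all it takes for two institutions to end up with different "names" indexes over the same document, which is the kind of local decision described above.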

Technically, distributing the search presented the challenge of handling varying network response times from each institution and of integrating results in a useful way in the event of network problems. Other projects have had serious problems with this architecture and have looked for alternatives to solve problems of network unreliability. Our solution for the DFAS project was to offer participants two options: search other sites remotely, or replicate remote catalogs locally so that functionally the behavior is the same, but whether a local system would actually go to a remote institution via the Internet was a local decision. One practical problem this raised was whether the replicated catalogs should use the local institution's indexing and display rules on the remote institution's finding aids, or use the owning institution's rules, forcing the replication of the remote institution's customized indexing and display routines as well (a considerable maintenance problem over time). Using the local institution's rules would be much simpler, but would violate the goal of allowing each institution to make the important decisions based on their own markup practices and finding aid structures. We concluded that true distribution of searching is preferable from a methodological point of view, but that from an architectural viewpoint, replication has many advantages and should be explored further.
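The choice between searching a site remotely and searching a local replica can be pictured as a per-site switch that does not change what the caller receives. The sketch below is an assumption about how such a switch might look, not the DFAS middleware. The `Site` class, `search_site` and `make_search` are hypothetical names, and plain Python functions stand in for network queries and replicated catalogs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

SearchFn = Callable[[str], List[str]]


@dataclass
class Site:
    name: str
    remote_search: SearchFn                    # stands in for a query sent over the Internet
    replica_search: Optional[SearchFn] = None  # stands in for a locally replicated catalog


def search_site(site: Site, query: str, prefer_replica: bool) -> List[str]:
    """Search one institution; the result has the same shape on either path."""
    if prefer_replica and site.replica_search is not None:
        return site.replica_search(query)
    return site.remote_search(query)


def make_search(catalog: List[str], label: str, calls: List[str]) -> SearchFn:
    """Build a toy search function that records which path was used."""
    def search(query: str) -> List[str]:
        calls.append(label)
        return [title for title in catalog if query.lower() in title.lower()]
    return search


# Hypothetical catalog contents, used for both the remote site and its replica.
CATALOG = ["Whaling logbooks, 1840-1860", "Family letters"]
calls: List[str] = []
site = Site(
    "Example Institution",
    remote_search=make_search(CATALOG, "remote", calls),
    replica_search=make_search(list(CATALOG), "replica", calls),
)

via_network = search_site(site, "whaling", prefer_replica=False)
via_replica = search_site(site, "whaling", prefer_replica=True)
print(via_network == via_replica, calls)
```

The sketch deliberately leaves open the harder question raised above: whether the replica is indexed with the local institution's rules or the owning institution's.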

There were six questions identified in the original proposal that the project hoped to answer:

  1. Is distributed searching a reasonable model for cross-collection searching of SGML-encoded finding aids;
  2. What indexes are minimally useful in a cross-collection search of encoded finding aids;
  3. What are useful approaches to handling indexing across diversely encoded finding aids to ensure reasonable search results;
  4. What are useful approaches to managing intermediate result sets from cross-collection searches using the distributed model;
  5. What are useful approaches to managing extremely large finding aids by presenting them as navigable structures, and the display screens needed to make this effective;
  6. How can displays be optimized across heterogeneously encoded finding aids to ensure consistent results?

The project succeeded in putting forward reasonable proposals for the second through last of these questions, and found that it is indeed possible to implement a distributed search system across diversely encoded finding aids and get meaningful results. The rest of this article discusses the utility of this approach and the major issues that we discovered.

Indexing

Starting with the objective of identifying Z39.50-like attributes for searching finding aids, we began the DFAS project by attempting to identify "Common Access Points", or CAPs. These were to be presented as searchable indexes and given generic, or "synthetic," names which could then be mapped to locally chosen names in the web interface at the local institutions. We examined the existing finding aid catalogs of the participating institutions and immediately discovered that there were no indexes common across all five participants: some had chosen more structural indexes (common in SGML-based systems) such as "frontmatter" and "inventory", while others had chosen more traditional bibliographic indexes such as "author", "title", and "subject". As a compromise, we began with nine CAPs: Names, Dates, Titles, Places, Subjects, Repository, Contents, Summary, and Anywhere. The Anywhere, Contents and Summary CAPs represented the structural aspects of the finding aids, while the other CAPs were the more bibliographic access points. One of the questions we hoped to answer was how well researchers understood the different CAPs. The structural CAPs might not be useful if researchers don't understand the structure of finding aids ahead of time and if finding aids aren't constructed consistently (which they often aren't across institutions). But the semantic CAPs could also be confusing if encoders applied these more subjective tags inconsistently (which they often did) and if the underlying data varied in how much of this kind of information was included in the original text. Finding out how much lack of consistency in data, markup and, therefore, retrieval results affects users is a main objective of future research.

One project goal had been that each institution could define the EAD elements to be included under the various synthetic names (as is done in Z39.50). At the beginning of the project, the system prescribed which EAD elements would be included under each synthetic name, and this raised the interesting problem of differing local definitions for typical indexes. In Harvard's case, for example, it led to the existence of two access points in the local search interface labeled "Name (people and organizations)": one for the local catalog and another for the DFAS system, each containing a different set of EAD elements. In an effort to identify this difference for users, Harvard changed the name of the DFAS access point, but the confusion was difficult to clear up with simple interface changes. Another excellent example of this was the "Title" CAP. At Harvard, we have a Title index defined to include all the <title> elements in the finding aid, which are titles of monographs, poems, articles, etc. held in the collection. Elsewhere, Title indexes were often defined to include the <unittitle> elements of the finding aid: usually the collection title, and also the titles made up by the collection processor for the series, folders, and items in the collection. These are completely different things, but both can legitimately be considered "titles" and thus assigned to the Title CAP, producing wildly different search results. This clearly demonstrates how the model of allowing each institution to define which elements to include in each CAP without any prior agreement would almost certainly result in different institutions defining a given CAP quite differently. The conclusion we reached was that, while some flexibility in this mapping is still desirable to allow for local variation in practice, without some level of prior definition of which EAD elements are intended for each CAP (or Z39.50 attribute), this mapping would be too inconsistent to be useful for searching. Fortunately, within the context of the project, such agreement was possible. Another area for future research is whether the EAD community as a whole can come to similar agreements.
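The Title example can be made concrete with a small Python sketch. It assumes two hypothetical institutions that map the same synthetic "Title" access point to different EAD elements and shows the same query returning different hits from the same document. The sample record, `CAP_MAPPINGS` and `search_cap` are illustrative assumptions, not the project's actual mappings or code.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of an EAD finding aid (illustration only).
SAMPLE_EAD = """
<ead>
  <archdesc>
    <did><unittitle>Jane Smith papers</unittitle></did>
    <dsc>
      <c01><did><unittitle>Correspondence</unittitle></did></c01>
      <c01><did><unittitle>Drafts of <title>Collected Poems</title></unittitle></did></c01>
    </dsc>
  </archdesc>
</ead>
"""

# Hypothetical mappings of one Common Access Point to EAD elements.
CAP_MAPPINGS = {
    "institution_a": {"Title": ["title"]},      # titles of works held in the collection
    "institution_b": {"Title": ["unittitle"]},  # titles assigned by the processor
}


def search_cap(xml_text, mapping, cap, term):
    """Return the text of each element mapped to `cap` that contains `term`."""
    root = ET.fromstring(xml_text)
    hits = []
    for tag in mapping[cap]:
        for element in root.iter(tag):
            text = " ".join("".join(element.itertext()).split())
            if term.lower() in text.lower():
                hits.append(text)
    return hits


for institution, mapping in CAP_MAPPINGS.items():
    print(institution, search_cap(SAMPLE_EAD, mapping, "Title", "papers"))
```

A Title search for "papers" finds the collection under one mapping and nothing under the other, even though the document and the query are identical.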

Display

There were three aspects of display which the project addressed: how to present initial search results from multiple, distributed institutions, how to present an intermediate "table of contents" (TOC) or "key words in context" (KWIC) display of retrieved finding aids, and how to present a single finding aid in its entirety. For the initial results sets, there was unanimous agreement to present the results ordered by institution so that if searching was being done remotely (as opposed to locally on a copy of the remote institution's finding aids), and if the network connection to the institution was slow or unavailable, it would not prevent the system from returning useful results from other institutions.
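Grouping results by institution means that a slow or unreachable site only affects its own group. A minimal Python sketch of that behavior follows. It assumes each site is represented by a plain search function and uses a shared deadline; the site names, `search_all` and the toy search functions are hypothetical and do not reflect how the DFAS middleware was actually written.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def search_all(sites, query, timeout_seconds):
    """Query every site in parallel and return outcomes keyed by institution."""
    results = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(sites)))
    futures = {name: pool.submit(search, query) for name, search in sites.items()}
    deadline = time.monotonic() + timeout_seconds
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[name] = {"status": "ok", "hits": future.result(timeout=remaining)}
        except TimeoutError:
            # The site did not answer before the shared deadline.
            results[name] = {"status": "timed out", "hits": []}
        except Exception as error:
            # The site answered with a failure, e.g. a network error.
            results[name] = {"status": f"error: {error}", "hits": []}
    pool.shutdown(wait=False)  # do not block on sites that are still running
    return results


def fast_site(query):
    return [f"Finding aid mentioning {query}"]


def slow_site(query):
    time.sleep(0.5)  # simulates a slow network connection
    return ["arrives too late"]


def broken_site(query):
    raise ConnectionError("host unreachable")


results = search_all(
    {"Site A": fast_site, "Site B": slow_site, "Site C": broken_site},
    "whaling",
    timeout_seconds=0.1,
)
for name, outcome in results.items():
    print(name, outcome["status"], outcome["hits"])
```

Because each institution has its own entry, the interface can show the hits from Site A immediately and report the other two as unavailable.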

For intermediate finding aid displays, DFAS shows a "table of contents" view of the finding aid broken into different sections, with the hits in each section shown. This allows users to navigate a large finding aid in manageable pieces (and many finding aids are being created in excess of 2 MB, causing extreme distress in current generation web browsers). The system allows local institutions to select which divisions of the finding aids they would like to show users in this "table of contents" display, based on EAD elements which delineate different sections of the finding aid, since not all institutions share the same structural elements in their finding aids.
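A table of contents with hit counts can be sketched as follows, assuming a locally chosen list of section-level EAD elements. The sample record, `TOC_ELEMENTS` and `table_of_contents` are hypothetical, and hits are counted by simple substring matching, which is much cruder than a real search engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of an EAD finding aid (illustration only).
SAMPLE_EAD = """
<ead>
  <archdesc>
    <bioghist><head>Biography</head><p>Jane Smith was a whaling captain's daughter.</p></bioghist>
    <scopecontent><head>Scope and Content</head><p>Letters and diaries.</p></scopecontent>
    <dsc><head>Inventory</head>
      <c01><did><unittitle>Whaling logbooks</unittitle></did></c01>
      <c01><did><unittitle>Letters on whaling voyages</unittitle></did></c01>
    </dsc>
  </archdesc>
</ead>
"""

# Hypothetical local choice of which EAD elements start a section in the TOC.
TOC_ELEMENTS = ["bioghist", "scopecontent", "dsc"]


def table_of_contents(xml_text, toc_elements, term):
    """Return (section label, hit count) pairs in document order."""
    root = ET.fromstring(xml_text)
    wanted = set(toc_elements)
    toc = []
    for section in root.iter():
        if section.tag not in wanted:
            continue
        head = section.find("head")
        # Fall back to the element name when a section has no <head>.
        label = head.text.strip() if head is not None and head.text else section.tag
        text = " ".join(section.itertext()).lower()
        toc.append((label, text.count(term.lower())))
    return toc


for label, hits in table_of_contents(SAMPLE_EAD, TOC_ELEMENTS, "whaling"):
    print(f"{label}: {hits} hit(s)")
```

An institution whose finding aids use different structural elements would change only `TOC_ELEMENTS`, which mirrors the local choice described above.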

As for the display of entire finding aids, DFAS follows the common practice of converting SGML to HTML, and supports XSL stylesheets for institutions wishing to experiment with them. Since the project team assumed that XSL would eventually make the HTML conversion obsolete, we felt it would not be useful to dwell too long on this particular issue.

Summary

The DFAS project implemented an online distributed search system for diversely encoded finding aids and achieved useful results. The architecture allows for local replication of finding aid catalogs in cases where network problems prevent useful results using network distribution. The system allows for local decisions regarding the mapping of EAD elements to indexes and to display attributes. Our research has highlighted the problems caused by the lack of standardization in the application of EAD to finding aids, yet that lack of standardization is not easily overcome given the diversity of the underlying documents.

There are undeniable advantages to distributed search systems over union catalog models. They allow for locally designed system interfaces (which can look like other systems in use at the institution, use familiar index names, contain links to local help resources, provide local printing options, etc.). They also allow for reliable linking to local digital resources beyond the catalog. Moreover, indexes can be defined using knowledge of local practices, rather than generic rules that cannot accommodate variations across institutions.

However, allowing for diversity also introduces a need for greater understanding among participants about the implications of their decisions, and for much greater consensus among participants as to what is being done and how.

Some of the non-technical areas we have identified as requiring further research are:

  1. Community discussion about implications of EAD application for search, retrieval and display of finding aids, particularly in the areas of subject, name, and title searching. With greater consensus about how these types of elements are applied to finding aids, we could hope for better retrieval consistency in catalogs.
  2. Statistical studies on retrieval dependent on different Common Access Point mappings (that is, which EAD elements should be mapped to which access points) from a user's perspective.
  3. Discussion of methods and rules for date normalization; exploration of the possibility of using controlled access terms for dates. Searching on dates is central to archival research and needs to be better supported.

Notes and References

[CAPS] See "DFAS: DLPS White Paper on Common Access Points" at <http://www.umdl.umich.edu/dlps/dfas/capwp.html>.

[DFAS Project Final Report] "Supporting Access to Diverse and Distributed Finding Aids: A Final Report to the Digital Library Federation on the Distributed Finding Aid Server Project." <http://www.umdl.umich.edu/dlps/dfas/dfas-final.html>.

[Digital Library Federation] <http://www.clir.org/diglib/dlfhomepage.htm>.

[Encoded Archival Description] For a recent description of the EAD standard, see Pitti, Daniel, "Encoded Archival Description," in the November 1999 issue of D-Lib Magazine <http://www.dlib.org/dlib/november99/11pitti.html>.

[RLG Archival Resources Database] See the Research Libraries Group (RLG) Archival Resources at <http://www.rlg.org/arr/>.

Copyright 2000 MacKenzie Smith


DOI: 10.1045/january2000-smith