Historical Collections for the National Digital Library

Lessons and Challenges at the Library of Congress

Caroline R. Arms
Information Technology Services & National Digital Library Program
Library of Congress
D-Lib Magazine, April 1996

ISSN 1082-9873

[This is the first of a two-part story. The second part appears in the May issue of this magazine. Editor, November 26, 1996.]

"For the general public, the Congress has endorsed the creation of a National Digital Library through a private-public partnership that will create high-quality content in electronic form and thereby provide remote access to the most interesting and educationally valuable core of the Library's Americana collections. Schools, libraries, and homes will have access to new and important material in their own localities along with the same freedom readers have always had within public reading rooms to interpret, rearrange, and use the material for their own individual needs."
James Billington, Librarian of Congress, Fall 1995

http://lcweb2.loc.gov/ammem/ammemhome.html

Contents

Introduction

Background and Current Progress

Lessons and Challenges -- part 1

Digitizing historical collections is different

Planning, project management, and quality assurance take time

Naming schemes should be established early

Uniformity of description is impractical

Where should descriptive information be stored?

To Be Continued
Footnotes, References, and URLs (in a separate document)
An informal model for access to the NDL historical collections (diagram in a separate document)

Introduction

The National Digital Library Program at the Library of Congress is a five-year program to assemble an initial core of American historical and cultural primary source material in digital form, selected for conversion from the Library's vast holdings of print and non-print materials. Among the Library's treasures are the papers of 23 presidents. Rare books, pamphlets, and papers provide valuable material for the study of historical events, periods, and movements. Millions of photographs, prints, maps, musical scores, sound recordings, and moving images in various formats reflect trends and represent people and places. One aspect of the National Digital Library Program (NDLP) involves selecting and digitizing a heterogeneous array of resources, many of which are fragile -- no routine task in itself. The NDLP must also provide convenient and effective access to the digital archive it assembles, access for a broad constituency of users, from scholarly researchers and professionals to college students and schoolchildren. Although the Library of Congress is not an institution for research and development, the Library is faced, as it has been in the past, with the challenge of applying new technology as it performs its statutory roles of library service to Congress and the protection of intellectual property rights, and its constitutional mandate to "promote the Progress of Science and useful Arts."

In his Fall 1995 Mission and Strategic Priorities of the Library of Congress, James Billington, Librarian of Congress, summarizes the Library's mission as "to make its resources available and useful to the Congress and the American people and to sustain and preserve a universal collection of knowledge and creativity for future generations." In the past, access to this universal collection has been through reference librarians acting as mediators; through reading rooms open to researchers; and through the painstaking cataloging and description of materials and the sharing of catalog resources with other libraries. The Library of Congress has been a leader in the use of automation to share catalog records among libraries and to make its catalog available to support the cataloging activities of other libraries. The Library led the effort starting in the 1960s to develop a standard format for Machine-Readable Cataloging (MARC), and in the 1980s to develop a standard for information search and retrieval over networks (Z39.50). These standards are basic building blocks in today's online library catalogs.

Expectations and the associated technological challenges have grown in recent years. Networks can link the Library's traditional clientele directly to its resources and allow access by new types of user. Content for the "universal collection" is increasingly published or potentially available in digital form.

The National Digital Library Program is one of several projects that address the challenges of providing direct access for users to the Library's resources and services. Another example is the THOMAS service, which supports public access to information on the activities of Congress, including all versions of legislation under consideration and the complete text of the Congressional Record. In the Copyright Office, a prototype system (CORDS) for accepting materials in digital form for copyright registration and deposit is under development. The first digital deposit, an unpublished computer science dissertation by Amy Moormann Zaremski of Carnegie Mellon University, was registered on February 27, 1996. Six other universities will participate in the initial phase of the CORDS program.

As far as possible, the Library is using the same tools and technological framework for these different projects, since activities that are currently treated as special projects must all be integrated into the Library's production systems. Through each of these digital library programs, lessons are learned, and challenges that require research and development are exposed. This article will summarize the current status of the National Digital Library Program, emphasizing issues that experience has shown to be important, whether they have been resolved or are presented to the research community as challenges that must be addressed as an overall architecture for digital libraries is developed.

Background and Current Progress

Announced in November 1994, the National Digital Library Program (NDLP) has identified two hundred of the Americana collections at the Library of Congress as an initial pool of candidates for conversion to digital form. [Libraries and archives traditionally handle many historical and other special materials as collections rather than as individual items, particularly when the items are personal papers or pictures rather than bound volumes or published recordings.] Factors that influence selection for conversion include uniqueness of the materials, synergy with other activities in custodial divisions (such as preservation), the availability of suitable digitizing technology, and the value of the materials for education. These collections will be made accessible to the general public and educational institutions over the Internet and through other means. The NDLP also plans to share the distributed model that it builds for storing and disseminating digital materials with other libraries. To achieve these goals, the Library of Congress is requesting $15 million in appropriated funds over five years and plans to raise $45 million from the private sector. As of April 1996, commitments for $20 million have been made by individuals, corporations, and foundations.

The program builds on the experience from two earlier pilot projects, the Optical Disk Pilot Project and the American Memory program. Between 1982 and 1987, the Optical Disk Pilot Project captured text and images in several custodial divisions, using both analog and digital formats. For example, the Library's Prints & Photographs Division explored the capture of a variety of pictorial material onto videodisc, the description (cataloging) of these images, and the development of a single environment that allows users to search the descriptive records and immediately retrieve and display a selected image from one of the two videodiscs created [1]. Two generations of system using the videodiscs are in use today in the Prints & Photographs Reading Room. The second generation, a single station using six videodiscs, will soon be superseded by a system based on digital images. The Digital One-Box is being developed in close parallel with the National Digital Library Program, but also incorporates catalog records for collections that have not been digitized.

From 1989 to 1994, the American Memory pilot program reproduced selected collections for national dissemination in computerized form. Collections were selected for their value for the study of American history and culture and to explore the problems of working with materials of various types (such as prints, negatives, early motion pictures, recorded sound, and textual documents). American Memory used a combination of digitized representations on CD-ROM and analog forms on videodisc. Prototype presentation software was developed and distributed to test sites. A key discovery from the pilot was the enthusiasm of K-12 teachers for access to these primary source materials. In the final year of the pilot, advantage was taken of the World Wide Web. In June 1994, three American Memory collections, all of photographs, were made available over the Internet.

The National Digital Library Program is built directly on the American Memory experience. The Final Report of the American Memory User Evaluation and three white papers describe various aspects of the earlier project. The issues discussed and approaches presented in the white papers explain how many technical decisions relating to digital conversion were reached. Although some of the technical specifics have changed (for example, MPEG had not yet emerged as a widely adopted standard for video compression and delivery), the issues and general approaches laid out provide the basis for the continuing digital conversion program and the presentation of collections. Staff from the American Memory project form the core of the expanded NDLP staff; several former consultants or contractors have joined the Library. Their experience in planning and executing the earlier projects is framing the development of standard planning tools for managing larger ventures, such as the conversion of pre-1824 congressional journals held by the Law Library.

Today, in April 1996, a dozen historical collections are accessible on the American Memory web pages, including sound recordings of political speeches from around 1920 and early documentary movies as well as textual materials (books, pamphlets, documents, and manuscripts) and over 30,000 photographs. Several collections were adapted from the earlier pilot program. Each collection is an integral whole, but the basis for integrity may vary. Some are coherent archives, such as the papers or photograph collection of a particular individual or organization. Others are collections of items in a special original form, such as daguerreotypes or paper prints of early films. A third type of collection is a thematic "anthology" assembled by a team of scholars from materials in various forms and held in various divisions and collections across the Library. The first anthology collections should be released later this year, one focussing on the history of the conservation movement and another on the transition in the 1920s to a consumer economy. The size of individual collections varies enormously. The Detroit Publishing Company collection has 25,000 photographs; the collection of daguerreotypes has 600. 165 books and pamphlets from the National American Woman Suffrage Association collection comprise around 10,000 pages.

In a new direction for the Library of Congress, an Educational Services Team has been formed to focus on educational outreach. In March a new resource aimed primarily at K-12 teachers and students was added. The NDLP Learning Page offers help with finding materials relating to particular people, places, events, dates, and curricular topics.

The diversity of the archive is enthralling, and the potential offered by digital formats to reach new audiences and provide new function is exciting. Building a flexible technical infrastructure for the long term that allows a wide array of users to locate, retrieve, and use these materials will be an ongoing challenge. The Library is working closely with agencies and organizations involved in research and development projects that can be applied directly to the real problems of storing, managing, indexing, and presenting these collections. Meanwhile, the National Digital Library Program is developing its technical framework step by step, dealing with challenges as they present themselves and taking advantage of new tools, standards, and national infrastructure improvements as they emerge.

Lessons learned and challenges posed:

Digitizing historical collections is different

Archives of historical artifacts pose problems for digitization that are not typical of the emerging industry for scanning text and images for business and government. Production scanners are designed for single sheets of paper or 35mm slides. Most contractors are not experienced at generating good digital images from large-format negatives or from "preservation microfilm," commonly used in archival collections to eliminate handling of fragile or deteriorating originals. Only recently have scanners been developed that allow book pages to be scanned without damaging the binding; compared to production scanners they are expensive and slow. The Library has contracted out most of its digitization and expects that pattern to continue, because of the array of different materials to be handled and the varying production load [2]. Contracting facilitates the use of specialist firms with the most appropriate equipment.

Usually, capture from original materials is performed on site under curatorial supervision. For example, most prints and photographs have been captured on 35mm film for subsequent digitization or, in earlier projects, for transfer to videodisc [3]. Documents and books have been scanned on site and the resulting bit-mapped images used as the basis for generating searchable versions marked up in SGML (using a document type definition based on that developed by the Text Encoding Initiative). To date, all contractors have chosen to meet the 99.95% character-based accuracy specified for searchable texts by re-keying and not by optical character recognition.
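To give a sense of what the 99.95% specification implies, the following Python sketch computes the largest number of character errors a delivered text may contain. The function name and the page size of 2,000 characters are assumptions made for illustration; they are not taken from the Library's contract specifications.

```python
import math
from fractions import Fraction

# Accuracy target quoted in the text: 99.95% of characters correct.
# Fraction avoids floating-point rounding in the subtraction below.
REQUIRED_ACCURACY = Fraction("99.95") / 100


def max_character_errors(char_count: int, accuracy: Fraction = REQUIRED_ACCURACY) -> int:
    """Largest whole number of wrong characters that still meets the target."""
    return math.floor(char_count * (1 - accuracy))


# Assumed page size, for illustration only: 2,000 characters.
errors_per_page = max_character_errors(2_000)
errors_per_10k = max_character_errors(10_000)
```

Under these assumptions, a 2,000-character page may contain at most one wrong character, and a 10,000-character text at most five.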

One issue that cannot be adequately addressed here is an ongoing topic of discussion at the Library: the potential for digital versions to serve as preservation copies. Traditionally, preservation of content has focussed on creating a facsimile, as faithful a copy of the original as feasible, on a long-lasting medium. The most widely accepted method for preserving the information in textual materials is microfilming; for pictorial materials, it is photographic reproduction. One aspect of the discussion relates to the question of when it is appropriate to generate a digital version that attempts to be a faithful copy of an item and when to take advantage of the potential for enhancing access to the content. Should the legibility of a manuscript page be improved by adjusting the contrast? For a photographic print, the faithful copy may be most appropriate. However, if the Library owns the negative but no print, is it appropriate to make digital adjustments to sharpen the image for presentation? Can general principles be developed to guide such decisions?

Planning, project management, and quality assurance take time

Scanning is only one part of a digitization project. Significant staff time and technical expertise are required to develop workflow plans and contract specifications, prepare materials for scanning, monitor progress, and perform quality review. Contractors must be made aware that the Library may expect a different balance between efficiency, quality of digital reproduction, and protection of the original artifact than their other customers. Preparing detailed instructions for contractors and testing the instructions on small batches is vital to prevent unpleasant surprises. Physical preparation of the materials (sorting them into batches and identifying problems before they hold up the scanning process) and quality review are also time-consuming and less predictable than the scanning operation.

A current plan to scan 60,000 pages from early congressional journals in bound volumes calls for three people to prepare materials to keep five scanners (with two people per scanner) busy for twelve weeks. Another three full-time people are expected to take twenty weeks to review scanned page-images and derived text versions marked up with SGML after delivery by the contractor. In some cases, preparation and quality review are performed by members of the NDLP Digital Conversion Team. In others, the NDLP has supported the hiring of staff to be based within the divisions responsible for different types of material (such as Music, Prints & Photographs, or Geography & Maps). Although the NDLP team is developing standardized approaches, the conversion of each collection presents special circumstances, sometimes because of the physical condition of the originals or because scanning must be coordinated with curatorial activities for preservation or description of the collection.
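The staffing figures in the congressional journals plan can be worked through as simple arithmetic. The sketch below is illustrative only: the variable names are invented here, and it assumes that the three preparers work for the same twelve weeks as the scanners, which the plan as described does not state explicitly.

```python
# Figures quoted in the plan described above.
PAGES = 60_000
SCANNERS = 5
OPERATORS_PER_SCANNER = 2
SCAN_WEEKS = 12
PREP_STAFF = 3
REVIEW_STAFF = 3
REVIEW_WEEKS = 20

# Throughput each scanner must sustain to finish on schedule.
pages_per_scanner_week = PAGES // (SCANNERS * SCAN_WEEKS)

# Labor in person-weeks for each stage.
# Assumption: preparation runs for the full twelve-week scanning period.
prep_person_weeks = PREP_STAFF * SCAN_WEEKS
scan_person_weeks = SCANNERS * OPERATORS_PER_SCANNER * SCAN_WEEKS
review_person_weeks = REVIEW_STAFF * REVIEW_WEEKS

# Preparation and review together, compared with scanning itself.
non_scanning_person_weeks = prep_person_weeks + review_person_weeks
total_person_weeks = non_scanning_person_weeks + scan_person_weeks

# Pages each reviewer must check per week.
pages_per_reviewer_week = PAGES // review_person_weeks
```

On these assumptions each scanner handles 1,000 pages a week, and preparation and review account for 96 of the 216 person-weeks in the plan.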

Naming schemes should be established early

It is widely accepted that consistent, organized access to resources on the Internet is inhibited by the lack of a robust, universal naming scheme for networked resources. Unique, persistent identifiers are needed to link catalog records and other descriptive materials to the stored digital resources they describe. American Memory experience demonstrates that it is valuable to establish a naming scheme as part of the initial production plan for digitizing a collection. By establishing names for items early, descriptive materials can be prepared independently of the digitization process. The naming scheme also provides a structure for project control and for defining batches. Names can relate digital representations to the corresponding original items during the production process. The naming scheme can be the basis for monitoring progress, logging problems during the scanning process, and performing quality assurance checks. The details of a naming scheme are often based on the organization of the original material or a previous reproduction. Several of the photographic collections have been named using frame numbers from videodiscs prepared during the Optical Disk Pilot Project. Manuscript collections are often organized into a logical sequence and placed into numbered folders within numbered containers; the numbers can be the basis for the naming scheme. In the longer term, the name will have no semantic "meaning" when used to retrieve the corresponding digital item. However, during production, it helps to use names that bear a clear relationship to the materials being handled.

The Library has developed a general approach to naming based on its experience to date. Each item has a two-part "logical" name that consists of a collection identifier and an identifier for the item (unique within the collection) [4]. Currently, the digital archive is a collection of files in a hierarchical Unix file structure. The logical name can be parsed and used to derive full "physical" path- and filenames for an item following simple rules expressed in a table. If the file hierarchy must be reorganized or a collection moved to another machine, only the locator table must be altered.
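The relationship between logical names and the locator table can be illustrated with a minimal Python sketch. Everything specific in it is hypothetical: the collection identifiers, the directory names, the "collection/item" syntax for the logical name, and the file extension are assumptions chosen for illustration, not the Library's actual rules.

```python
from pathlib import PurePosixPath
from typing import Dict

# Hypothetical locator table: collection identifier -> directory on a Unix server.
locator_table: Dict[str, str] = {
    "coll1": "/archive/disk1/coll1",
    "coll2": "/archive/disk2/coll2",
}


def physical_path(logical_name: str, table: Dict[str, str], extension: str = "tif") -> str:
    """Derive a full path from a two-part logical name such as 'coll1/0001'."""
    collection_id, item_id = logical_name.split("/", 1)
    return str(PurePosixPath(table[collection_id]) / f"{item_id}.{extension}")


before = physical_path("coll1/0001", locator_table)

# Moving the collection changes one table entry; the logical name is untouched.
locator_table["coll1"] = "/archive/disk3/coll1"
after = physical_path("coll1/0001", locator_table)
```

Catalog records and finding aids that cite "coll1/0001" remain valid after the move, because only the table entry has changed.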

In the future, the logical name will be used as a Uniform Resource Name (URN), a globally unique, persistent, location-independent identifier. During 1996, the Library is working with the Corporation for National Research Initiatives (CNRI) on a prototype repository for storing and managing digital items and collections. Items stored in the repository will be given handles (constructed from the existing logical names) using the CNRI scheme for URNs. The CNRI Handle Server will provide resolution for URNs, mapping them into specific, location-dependent identifiers, such as URLs (Uniform Resource Locators, the addresses used today for resources accessible directly through the World Wide Web).

Uniformity of description is impractical

Digital reproductions need descriptions that support searching or browsing so that users can identify resources of interest. The process of organizing and describing unpublished and historical materials is time-consuming. The level of description possible is often severely limited because information is not available without significant research, for example, to identify subjects in portrait photographs. Most archival collections, at LC and elsewhere, are not fully cataloged at the level of each individual item.

Many collections are described in a "finding aid," a document (often unpublished) that describes the scope, contents, and provenance of the collection in general terms and presents its contents as a structured list that corresponds to the physical organization of the collection, which might be chronological, geographical, or thematic. A published finding aid often includes supplementary materials, such as essays, biographies, or chronologies that provide an intellectual context for the collection. Recently, as a result of work started by Daniel Pitti in the Berkeley Finding Aid Project, a draft standard has been developed for an SGML Document Type Definition appropriate for finding aids. The standard will also support direct links to digitized items. Known as the Encoded Archival Description (EAD), the draft standard is under initial test at several institutions, including the Library of Congress, which will act as maintenance agency for the standard after the initial test phase. The NDLP plans to incorporate finding aids marked up according to the EAD standard for some future collections.

Another alternative to item-level cataloging that will be explored is group-level cataloging. Collections can be described by a set of catalog records for logical groupings of items. An example of a grouping is views of the Shenandoah Valley taken by a particular photographer on a particular assignment. There may be no way to distinguish between the images without viewing them; individual catalog records would be identical apart from an identifier. The user should be equally well served by a single catalog record linked to several images.
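As a rough illustration of group-level cataloging, the Python sketch below links one catalog record to several digital items. The class, its fields, and the item names are hypothetical and are not drawn from any Library of Congress record format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GroupRecord:
    """One catalog record describing a logical grouping of items."""
    title: str
    creator: str
    item_names: List[str] = field(default_factory=list)


def index_by_item(records: List[GroupRecord]) -> Dict[str, GroupRecord]:
    """Let a retrieved item be traced back to the record that describes it."""
    return {name: record for record in records for name in record.item_names}


# A single record stands in for what would otherwise be three
# item-level records differing only in their identifiers.
shenandoah = GroupRecord(
    title="Views of the Shenandoah Valley",
    creator="(photographer name)",
    item_names=["example/0001", "example/0002", "example/0003"],
)
lookup = index_by_item([shenandoah])
```

A search that matches the group record leads to all three images, and any one image can be traced back to the shared description.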

To prepare individual full catalog records for all the items in the collections that the Library plans to digitize would be not only futile in many cases (as in the example above) but also totally infeasible. At UC Berkeley, it was estimated that it would take the entire cataloging staff from all Berkeley libraries 400 years to catalog the collection of 3.5 million images [5]. Use must be made of descriptions that already exist. The practices of archival communities that are less expensive than item-level cataloging must be integrated into automated systems. However, archival practices are not uniform, often with good reason. Maps are organized and described differently from photographs. The Digital Image Access Project (sponsored by the Research Libraries Group) demonstrated that approaches to description are not consistent across institutions dealing with the same type of materials [5]. The level of description for a collection depends on institutional priorities and resources available at a point in time. One conclusion is clear: digital archives must be built with a recognition that the level and structure of description (the sources of metadata) will be highly variable.

Where should descriptive information be stored?

The Library of Congress model for networked information retrieval separates the access tools (indexes, catalogs, finding aids) from the digital archive or repository that contains the resources themselves (see diagram). This allows resources to be pointed to directly from comprehensive catalogs, specialized indexes, finding aids, or scholarly works published online. The model also recognizes that the digital version is simply one option for access to the historical materials. Many of the materials being digitized have already been copied (onto media such as microfilm, videodisc, and photographic negatives) to allow access to the content of fragile or valuable items without physical handling. Catalog records will describe the intellectual work and refer to several versions.

As work begins on the design of a repository that will provide better management and access control for the digital resources, there is a need to categorize the various types of descriptive information associated with a digital item. Which information supports intellectual access? Which information is needed to support the retrieval of a selected item in an appropriate digital format? And which is primarily for control and management of the digital archive and the historical collections? Determining which descriptive "fields" should be stored where is a current challenge. Options for storing such metadata include:

in the items themselves, for those file formats that support descriptive headers;
in linked items, for example by scanning for each digitized book or document a "target" page on which details are recorded about the digitization operation, such as the item's logical name, the equipment used, and special instructions for conversion;
in external catalogs or finding aids, to support the identification of relevant items by searching or browsing;
or integrated with the digital object in the repository structure, to support retrieval of an identified item.

Another challenge for repository design is the grouping of digital objects. A repository should clearly accommodate some types of grouping, such as the representation of the same original item in several formats, and collections of related files that must be combined to display a single item. The repository must also have enough information so that a user who "finds" a particular item can be pointed easily to its intellectual context, perhaps by a pointer to the collection (or collections) to which the item belongs.

Other issues related to metadata are not so clearcut. Should the grouping of views of the Shenandoah Valley (mentioned above) be represented in the repository structure or only in an external finding aid which presents the entire collection of photographs in a comprehensive hierarchy by photographer, date, and location? Should a finding aid be used to generate a full set of skeletal item-level bibliographic records, recognizing that many will be essentially identical? Might it be possible to derive a composite "virtual catalog record" or a set of indexing terms for an item from information stored in various places: the item header, with the digital item in the repository, and external descriptions from collection-level or group-level catalog records and headings in finding aids? Should such a record be compiled once and stored permanently or regenerated as part of any indexing or re-indexing procedure?
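One way to picture a composite "virtual catalog record" is as a layered lookup in which more specific sources override more general ones. The Python sketch below is only an illustration of that idea: the field names, the sample values, and the precedence order are assumptions, and the article leaves open whether such a scheme is appropriate.

```python
from collections import ChainMap
from typing import Any, Dict


def virtual_record(
    item_header: Dict[str, Any],
    repository_fields: Dict[str, Any],
    group_record: Dict[str, Any],
    collection_record: Dict[str, Any],
) -> Dict[str, Any]:
    """Combine fields from several sources; earlier arguments take precedence."""
    return dict(ChainMap(item_header, repository_fields, group_record, collection_record))


# Hypothetical metadata held in four different places.
item_header = {"logical_name": "example/0001", "format": "TIFF"}
repository_fields = {"versions": ["TIFF", "GIF"]}
group_record = {"title": "Views of the Shenandoah Valley"}
collection_record = {"title": "Example photograph collection",
                     "collection": "Example photograph collection"}

record = virtual_record(item_header, repository_fields, group_record, collection_record)
```

Calling the function each time an index is built corresponds to regenerating the record; saving its result corresponds to compiling the record once and storing it permanently.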

To Be Continued

The first part of this article has concentrated on the digital conversion process. The second part will address challenges in storing and managing the digital archive, and in providing access and navigation tools to help users select and use items in the historical collections. In the meantime, take some time to explore the Historical Collections from the National Digital Library Program.

hdl://cnri.dlib/april96-c.arms