In October 2006, the National Digital Newspaper Program (NDNP), a partnership between the National Endowment for the Humanities and the Library of Congress (LC), released a beta of Chronicling America to a limited audience. (Release to the general public occurred in March 2007.) The three goals for the program are to support the digitization of historically significant newspapers, facilitate public access via a Web site, and provide for the long-term preservation of these materials by constructing a digital repository. Chronicling America supports the latter two goals. It provides access to more than 226,000 digitized pages of public domain newspapers from California, Florida, Kentucky, New York, Utah, Virginia and the District of Columbia published between 1900 and 1910. While the beta release contained 226,000 pages of newspapers digitized by six awardee1 institutions and LC, the project plan calls for tens of millions of pages to eventually be available from Chronicling America over the 20-year length of the project. The Web site also provides bibliographic records for the 140,000 newspapers published in the United States between 1690 and the present.
In addition to the Web site, Chronicling America also has a digital repository component that houses the digitized newspapers, supporting access and facilitating long-term preservation. Taking on access and preservation in a single system was both a deliberate decision and a deviation from past practices at LC.
The culmination of roughly two years of technical development by LC, the release of the beta is an obvious juncture in the project for the development team to reflect upon the work to date. The purpose of this article is to articulate some "lessons learned" about the long-term preservation of digital content, since during the development phase a number of preservation threats, such as those described in Rosenthal et al.,2 have been actualized.3 Enumerating these threats is of value to the NDNP development team, so that plans can be made to mitigate them. It is hoped that identifying the actualized threats will be of value to other teams that are developing digital repositories, so that they may learn some practical lessons from the preservation challenges encountered by the NDNP development team. Already, the preservation threat lessons learned from NDNP are influencing other development efforts at the Library of Congress.
Before proceeding, the context for this article must be set properly. First, a team that had not previously worked together developed NDNP, using Agile-influenced software development methodologies4 that had never before been practiced at LC. Second, NDNP was hosted by the Repository Development Center (RDC), a new development lab at LC that was being deployed in the midst of the development of NDNP. Thus, the environment in which NDNP was being developed was undergoing constant change and experimentation. Third, in order to provide public access as rapidly as possible, ingest of digitized newspapers into the repository began while the repository was still under development. This is probably common among digital repository development efforts; few have the luxury of completing development before beginning ingest, especially when huge amounts of content are involved or the ingest is very time consuming. Thus, the NDNP development team was working in a development environment that lacked some of the safeguards of a production environment and producing code that was rapidly changing and in various states of testing.
Early in the project, the NDNP development team made the explicit decision not to "trust" the repository until some later point. Thus, prior to ingest of digitized newspapers, the original files delivered by awardees were stored and backed up in a completely separate environment. This allowed us to encounter the sort of problems described below without having to be concerned that the preservation of the content was actually put at risk. At some point in the future, after the NDNP development team has gained confidence that the appropriate preservation risks have been mitigated, the repository will be "trusted" for the storage of the digitized newspapers.
Also, before proceeding with this article, a brief description of the repository architecture may be useful. For the beta release, the back end of the repository was an instance of Fedora, which stored datastreams for all of the content files and the various descriptive, structural, and preservation metadata. The content files included TIFF, JPEG2000, and PDF representations of the images of a newspaper page and ALTO representations of the text of a newspaper page; the metadata was encoded as METS, MODS, MARCXML, PREMIS, and MIX. The team used Apache Cocoon to provide behaviors for the digital objects, including actions such as image manipulation, "stitching back together" METS records that had been disassembled into parts for storage, and extracting data for indexing. The repository was fronted by a facade, also implemented using Cocoon, which exposed web services to support activities such as searching, access, and ingest. Searching was provided by Lucene, which was exposed through the facade with an SRW interface. An access application, also implemented using Cocoon, provided an interface for the end user, taking advantage of the services provided by the facade.
Preservation threats encountered by the NDNP team
The preservation threats that were actualized in the course of the project described in this article fall into four of the preservation threat categories listed in Rosenthal et al.:2 media failure, hardware failure, software failure, and operator error. Note that the emphasis here will be on threats to preservation, as opposed to problems that caused downtime of the preservation system. Unlike real-time medical systems, for example, the NDNP preservation context can generally tolerate some amount of downtime. So, for example, on the one-year anniversary of the RDC, digital certificates began expiring, taking down key parts of the RDC infrastructure, including the LDAP server used for authentication. This is an example of a "Failure of Network Services" threat. However, new digital certificates were issued and the RDC was brought back up. While the software developers were inconvenienced, there was no risk of any data loss.
As is to be expected, various media failures were encountered during the development of the beta release of Chronicling America. In particular, problems were encountered with portable hard drives used to transfer files between awardees and LC. However, fixity checks performed on the files as part of the transfer process readily caught those problems. Procedures followed by awardees required them to maintain local copies of the files until their verification and backup by LC was complete.
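The transfer-verification step described above can be sketched as a simple manifest check. The two-column manifest format and the choice of SHA-1 here are illustrative assumptions for the sketch, not the actual NDNP transfer specification:

```python
import hashlib
from pathlib import Path

def sha1_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-1 digest of a file, reading in chunks so that
    multi-gigabyte newspaper image files are never loaded whole."""
    digest = hashlib.sha1()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest: Path, root: Path) -> list[str]:
    """Return the relative paths whose digests do not match the manifest.

    Assumes a simple two-column manifest: '<sha1>  <relative path>'.
    An empty return value means every listed file passed its fixity check.
    """
    failures = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        if sha1_of(root / rel_path) != expected:
            failures.append(rel_path)
    return failures
```

A transfer would be accepted only when `verify_manifest` returns an empty list; the awardee's local copy is retained until that point, as the procedures above require.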
The RDC storage system also encountered a number of hard drive failures. However, the RDC storage system is composed of multiple hard drive arrays, each configured as RAID 5 with a hot spare. This helps to prevent both data loss and downtime when a single hard drive in an array fails. As a practical matter, the loss of a hard drive often resulted in some minimal service disruption. In addition, in one case a second problem occurred within an array while the storage system was rebuilding onto the hot spare to replace a failed hard drive. This resulted in the loss of a small amount of data from the storage system. Fortunately, the file system diagnostics were able to identify the corrupted files so that they could be restored; this was essential, since Fedora does not support auditing the data under its management for bit-level errors.
Three software failures relevant to preservation occurred during project development. The first two software failures involved repository code written by the NDNP team; the third failure was with file system software.
The first software failure was the failure to successfully validate digital objects created by awardees. As described in Littman,5 the NDNP team has specified a submission information package (SIP) that involves a METS record and various other data files. To guarantee the quality of the digitized newspapers for both access and long-term preservation, the NDNP team has placed heavy emphasis on the validation of the contents of the SIP. This motivated the creation of the NDNP Validation Library, a software application that performs a careful inspection of the METS record and other data files. However, the validation of the METS record has proved to be difficult, as it is necessary not only to validate that the record is a valid METS record, i.e., according to the METS XML schema, but also that it conforms to the appropriate NDNP profile for the METS record. (There are different profiles depending on the type of digital object the METS record represents.) A combination of XML schema validation and Schematron schema validation is used to validate the METS record. Despite these efforts, gaps remained in validation that allowed awardees to submit METS records that passed validation and were ingested into the repository, but that did not conform to the appropriate NDNP profile. For example, consider the following excerpt from an issue METS record produced by an awardee:
<mods:part>
  <mods:extent unit="pages">
    <mods:start>4</mods:start>
  </mods:extent>
  <mods:extent unit="page number">
    <mods:start>4</mods:start>
  </mods:extent>
</mods:part>
The correct MODS would be:
<mods:part>
  <mods:extent unit="pages">
    <mods:start>4</mods:start>
  </mods:extent>
  <mods:detail unit="page number">
    <mods:number>4</mods:number>
  </mods:detail>
</mods:part>
Note that the second <extent> should be a <detail>. Of course, once such gaps in validation have been detected, new validation rules can be implemented to prevent them in the future.
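A targeted rule for this particular gap can be added to the validation suite once the gap is identified. The following sketch shows one way such a check might look; the function name and the use of Python's `xml.etree` are illustrative only, since the actual NDNP Validation Library relies on XML Schema and Schematron validation as described above:

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

def check_page_number_encoding(mods_xml: str) -> list[str]:
    """Flag <mods:extent unit="page number"> elements, which this
    (illustrative) profile rule requires to be encoded as
    <mods:detail unit="page number"> instead."""
    root = ET.fromstring(mods_xml)
    problems = []
    for extent in root.iter(f"{{{MODS_NS}}}extent"):
        if extent.get("unit") == "page number":
            problems.append(
                'page number encoded as <extent>; expected <detail>')
    return problems
```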
The second, and more problematic, software failure was a transformation failure. As part of the ingest process, a SIP-to-archival information package (AIP) transformation is performed, which involves transforming the METS record. This transformation has proven to be complex and error prone. In the short term, transformation errors "break" other parts of the application; in the long term, they negatively impact the understandability of the records. In particular, two transformation errors have been discovered. First (ironically, since it is done for the sake of preservation), the original SIP METS record is stored inline as part of the transformed AIP record. For newspaper title records, it was discovered that the transformation that put the original METS record inline was stripping the XML markup. Thus, this is what was found in the transformed record:
<mets:dmdSec ADMID="premismarcXmlBib" ID="marcXmlBib">
  <mets:mdWrap LABEL="MarcXml bibliographic record" MDTYPE="MARC">
    <mets:xmlData>
      01059cas 2200301 a 4500 ocm09948037 OCoLC 20041011133654.0
      830926d189319uudcuwr ne 0 a0eng d .... Vol. 11, no. 23 (Nov. 12, 1904) LIC
    </mets:xmlData>
  </mets:mdWrap>
</mets:dmdSec>
Second, it was discovered that the SIP-to-AIP transformation was producing invalid METS records for newspaper issues. This is an outline of what was found in one transformed record:
<mets>
  <metsHdr>....</metsHdr>
  <dmdSec>....</dmdSec>
  <amdSec>....</amdSec>
  <dmdSec>....</dmdSec>
</mets>
Notice that the second <dmdSec> occurs after an <amdSec>, while the METS XML schema requires that all instances of <dmdSec> occur before any <amdSec>.
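A simple structural check can catch this class of transformation error before the invalid AIP record is stored. The sketch below is illustrative only; in practice, validating the transformed record against the METS XML schema would catch this problem and many others:

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

def dmdsec_before_amdsec(mets_xml: str) -> bool:
    """Return True if every <dmdSec> child of <mets> precedes the first
    <amdSec>, as the METS schema's content model requires."""
    root = ET.fromstring(mets_xml)
    seen_amdsec = False
    for child in root:
        if child.tag == f"{{{METS_NS}}}amdSec":
            seen_amdsec = True
        elif child.tag == f"{{{METS_NS}}}dmdSec" and seen_amdsec:
            # A <dmdSec> after an <amdSec> violates the content model.
            return False
    return True
```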
The third software failure occurred when the XFS file system was corrupted, resulting in the loss of some data. This failure reinforced the difficulties of working with file systems that are huge and storage systems that are vastly complex. In the RDC's storage system, each file system is 8 terabytes, at the limits of what the file system is intended to support. At this scale, the file system itself and the tools that support it are not heavily tested. Furthermore, in a storage system as complex as the one used in the RDC, there are many layers. In particular, the actual file system is layered on top of a logical volume, which is provided by a volume manager; the logical volume is layered on top of physical volumes, provided by RAID; and the physical volumes are layered on top of the actual physical devices. The result of both the file system size and storage system complexity is that it is more difficult to ensure that nothing goes wrong and to diagnose problems when they do occur. In the case of our third software failure, the problem was initially reported as a hardware failure, though subsequent investigation led to the conclusion that the XFS file system was corrupted.
The most significant threats to preservation during Chronicling America development occurred as a result of operator errors. The first operator error involved the deletion of a large number of files from a section of a file system managed by Fedora. (The author humbly takes responsibility for this error.) The reason that the operator was manually deleting files from the Fedora-managed file system was to compensate for the difficulty of performing cleanup of Fedora when an error occurs in the midst of a lengthy ingest. At that time, standard procedure for rolling back an ingest was to delete the new files from the file system and replace the Fedora database with an earlier, checkpointed version. The deletion was completed despite file-system-level restrictions on the files because the operator had used "super user" privileges to delete them. The deletion of files was not detected until the application began reporting errors when trying to find files it expected to exist. Fedora's lack of auditing capabilities contributed to this problem.
The second operator error that occurred during development involved mistakes performed during ingest. For NDNP, ingests occur in "batches", which are groups of digitized newspapers delivered by awardees. In practice, the size of a batch is roughly 300 GB, approximately the amount that fits on an external hard drive for delivery to LC. For the beta release, 32 batches were ingested, a small number in the scope of the overall project but large enough to create a management headache. At some point, the operator either ingested the same batch multiple times or perhaps failed to successfully purge a partially ingested batch before re-ingesting it. Unfortunately, the error was not discovered until much later, making the cleanup of the duplicates significantly more difficult.
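One way to guard against this class of error is to keep a persistent ledger of batch states that is consulted before any ingest begins, so that a repeat ingest, or a re-ingest over an unpurged partial ingest, is refused up front. The `BatchLedger` class below is a hypothetical safeguard sketched for illustration, not the actual NDNP implementation:

```python
import json
from pathlib import Path

class BatchLedger:
    """Track the ingest state of each batch in a small JSON file.

    Hypothetical safeguard: refuses to start an ingest for a batch that
    is already ingested, or that has an unfinished (unpurged) ingest.
    """

    def __init__(self, ledger_path: Path):
        self.path = ledger_path
        self.batches = (json.loads(self.path.read_text())
                        if self.path.exists() else {})

    def start_ingest(self, batch_id: str) -> None:
        state = self.batches.get(batch_id)
        if state == "ingested":
            raise RuntimeError(f"batch {batch_id} was already ingested")
        if state == "in-progress":
            raise RuntimeError(
                f"batch {batch_id} has an unfinished ingest; purge it first")
        self.batches[batch_id] = "in-progress"
        self._save()

    def finish_ingest(self, batch_id: str) -> None:
        self.batches[batch_id] = "ingested"
        self._save()

    def _save(self) -> None:
        # Persist after every state change so the ledger survives crashes.
        self.path.write_text(json.dumps(self.batches))
```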
Though strategies for mitigating these risks are not the focus of this article, it is worth noting that the NDNP development team has already implemented some significant architectural changes to address them. In particular, the system has been modified to perform what is referred to as "ingest-in-place," whereby the SIP is stored in its original form. Fedora no longer ingests and stores the datastreams for the content files; rather, it references the datastreams in their location in the SIP. This guarantees that any errors in the SIP-to-AIP transformation can be corrected and allows the use of the existing tools for verifying the bit-level integrity of the files. It also significantly improves the speed of ingest (from 12.8 seconds per newspaper page to 0.7 seconds per page).
Relative to the long-term obligation to preserve the digitized content, the beta development phase took place over an extremely short period of time. Nonetheless, within that period a broad range of preservation threats was experienced. Admitting to errors is a humbling experience, especially when the most significant threats to digital preservation have been caused by one's own actions. However, it underscores that developing digital repository systems is an extremely difficult and complex process. It is hoped that by reflecting upon and sharing its practical experiences, the NDNP development team will validate the approach advocated by Rosenthal et al. of focusing on threats to digital preservation, and that doing so will enable the team and others to create more robust digital repositories.
Notes and References
1. An awardee has received a grant from NEH to select and digitize newspapers to NDNP specifications.
2. Rosenthal, David S. H., Thomas Robertson, Tom Lipkis, Vicky Reich, and Seth Morabito. "Requirements for Digital Preservation Systems: A Bottom-Up Approach." D-Lib Magazine (November 2005). <doi:10.1045/november2005-rosenthal>.
3. To provide continuity with the valuable framework provided by Rosenthal et al., the same terminology and threat categorization will be used in this paper.