D-Lib Magazine
April 1999

Volume 5 Number 4

ISSN 1082-9873

Reality and Chimeras in the Preservation of Electronic Records


David Bearman
Archives & Museum Informatics
[email protected]

(This Opinion piece presents the opinions of the author. It does not necessarily reflect the views of D-Lib Magazine, its publisher, the Corporation for National Research Initiatives, or its sponsor.)

The preservation of electronic records is a well-known and serious problem. A recent proposal for a "magic bullet solution" (a simple, universally applicable, one-time fix) is, therefore, troubling, particularly because it has been disseminated through a reputable agency, fails to take account of significant bodies of prior literature, and is not accompanied by the caveats that ought to be associated with untested hypotheses in a delicate sphere of public policy.1 In this brief note, I will examine why a proposed "emulation" approach does not adequately address the problems of maintaining electronic records, won't work as a strategy, and may encourage potentially dangerous wishful thinking.

In January 1999, the Council on Library and Information Resources published a paper by Jeff Rothenberg as "the first in a series" of papers on preserving digital information. Because no plan of future papers was announced, and because pre-publication reports suggested that CLIR had commissioned a "proof of concept" for the strategy based on "emulation", some have taken this to mean that emulation is being proposed as the approach, that it has been tested, and that it will free them from the need to take other action about their electronic records. As a matter of public policy, it is important to state that emulation is not a viable approach to preservation at this time and to note that even Rothenberg does not suggest that it is. Electronic records that are not moved out of obsolete hardware and software environments are very likely to die with them.

I will not discuss whether emulating hardware and software systems actually works now or is likely to work in a general way in the future. Many computer scientists who are more knowledgeable than I am about some of the technical issues believe emulation is very likely never to be viable. Nor do I want to belabor the specific examples of emulation that Rothenberg cites, though I think they are better proofs that emulation does not work than that it does. Instead, I would note that Rothenberg is fundamentally trying to preserve the wrong thing by preserving information systems functionality rather than records. As a consequence, the emulation solution would not preserve electronic records as evidence even if it could be made to work, and it is serious overkill for most electronic documents where preserving evidence is not a requirement.

Jeff Rothenberg quite properly calls attention to the serious need for "a long-term strategy to ensure that digital information will be readable in the future". Unfortunately, being "readable" in the future is not the sole, or a sufficient, functional requirement for electronic preservation.2 Failure to examine in detail what makes an electronic record evidence over time has led Rothenberg, and many others, to assume that they want to preserve system functionality.

Let me explain by example, in this brief note, what has been documented in great detail elsewhere.3 Let us imagine a personnel information system operated by your employer. It stores data in a way that is proprietary to its software system, which runs on other software and hardware proprietary to others. The "records" of the system are the various transactions entered into the system (when you were hired, promoted, given a new retirement plan), as well as the various transactions which were reported from the system (your paychecks, your annual retirement benefits report, the monthly payroll report, etc.). The state of the database at any given moment is not a record. If we captured the state of the database at 12:15:32 today, and emulated the entire environment in which this system existed fifty years from now, we would not have any records of your employment. We might have the theoretical capability of making records that look like ones which could have been created about your employment in the original system. If we want to preserve electronic records, what we really want are records of the actual inputs and outputs from the system to be maintained as evidence over time. This does not require the information system to function as it once did. All (!) it requires is that we can capture all transactions entering and leaving the system when they are created, ensuring that the original context of their creation and content is documented, and that the requirements of evidence are preserved over time.
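The distinction drawn above, between the state of a database and the transactions that pass into and out of it, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `TransactionRecord` and `RecordLog`, the field names, and the sample employee data are all hypothetical and do not come from any system or specification discussed in this note.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TransactionRecord:
    """One transaction entering or leaving the system, kept as evidence."""
    direction: str    # "input" (entered into the system) or "output" (reported from it)
    kind: str         # e.g. "hire", "promotion", "paycheck" (illustrative values)
    content: dict     # what the transaction actually said
    context: dict     # who created it and under what process (assumed fields)
    captured_at: str  # ISO 8601 time of capture


class RecordLog:
    """Hypothetical append-only, in-memory store of captured transactions."""

    def __init__(self):
        self._records = []

    def capture(self, direction, kind, content, context):
        # Records are captured when the transaction occurs, not reconstructed later.
        if direction not in ("input", "output"):
            raise ValueError("direction must be 'input' or 'output'")
        record = TransactionRecord(
            direction=direction,
            kind=kind,
            content=dict(content),
            context=dict(context),
            captured_at=datetime.now(timezone.utc).isoformat(),
        )
        self._records.append(record)
        return record

    def records_about(self, employee_id):
        # Evidence of an employment history is the list of actual transactions.
        return [r for r in self._records if r.content.get("employee_id") == employee_id]


log = RecordLog()
log.capture("input", "hire", {"employee_id": "E-17", "title": "Archivist"},
            {"entered_by": "personnel office"})
log.capture("output", "paycheck", {"employee_id": "E-17", "period": "1999-03"},
            {"issued_by": "payroll run"})
history = log.records_about("E-17")
```

Nothing in the sketch depends on the personnel system continuing to run: the log holds what was actually entered and reported, together with its context, which is what the argument above treats as the record.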

The hope that emulation could be a viable strategy for preserving records (except for those records which are themselves executable code) is misplaced.4 Although, or in part because, the objective of emulation is mistaken, it is important to review the path which Rothenberg took to get there.

After arguing well and convincingly the importance of long-term preservation of digital objects, Rothenberg examines and dismisses three approaches which I agree won't work. He also dismisses a fourth, which I believe is the most promising strategy for preservation of electronic records and the only one that has worked to date. Rothenberg's strawmen are:

However, Rothenberg also dismisses the strategy of migrating electronic records systematically before they become inaccessible, so that they are always available to current generations of software and hardware. Here, I believe Rothenberg is wrong. Because Rothenberg imagines that preserving electronic records requires preserving the information systems by which they were generated, he asserts that "migration is labor-intensive, time-consuming, expensive, error-prone, and fraught with danger of losing or corrupting information" and that "automatic conversion is rarely possible". In support of this argument, Rothenberg cites only two pieces of secondary literature, both discussing aspects of systems migration which are not relevant to preservation of records but rather to migration of systems, and both of which were published prior to 1993. Though claiming the issues have to do with cost and difficulty of migrating "documents", he does not cite a single migration case study.

Rothenberg's abstract argument, that translation always involves loss of information, is plausible, but not, as he imagines, very relevant.8 If it were true, his own case for emulation, which depends on a much more complex translation than that envisioned by those who would move each generation of records forward incrementally, would be fatally flawed. But for the case to be validated, Rothenberg would need to specify just what characteristics of records are crucial to preserve (or, as I would put it, crucial to their properties as evidence) and how these are affected by "translation". Rothenberg's failure to address the nature of evidence and the functional requirements for its retention is a serious weakness. It ignores the results of the largest study of electronic records funded by the National Historical Publications and Records Commission, and the dominant theme of discussion among archivists outside the United States over the past five years.9 At the very least, Rothenberg and others who wish to argue for approaches that do not satisfy the functional requirements for recordkeeping need to demonstrate why they have not considered these basic requirements.

It is worth noting that systems migration literature suggests that a significant contributor to high-cost and low-reliability migrations is the absence of unambiguous specifications of the source and target environments. The problem, quite often, is that either (or both) source and target are protected by proprietary interests of the software producers. If this is a hurdle in systems migration, it is an absolute barrier to the viability of emulation, which will try to replicate a system environment years, decades, or perhaps centuries after it became obsolete. Rothenberg acknowledges that "saving proprietary software, hardware specifications and documentation, as required by this emulation strategy, raises potential intellectual property issues". He correctly identifies this as "required" for the solution to function, yet refers to it as an "ancillary issue" in the title he gives the two paragraphs (section 8.3) in which it is discussed. Nowhere does Rothenberg suggest the dimensions of the Herculean social task that would be involved in creating a trusted agency for acquiring such proprietary information and administering it over time.

This is not the place to advance the complete case for a migration-based solution.10 However, even if we could imagine some software or hardware emulation made possible by deposit of instructions in metadata encapsulated form, we would still need an architecture for how the layers of dependency would be specified in metadata, or a concrete definition of the kinds of metadata contents or methods envisaged. Rothenberg's proposal does not even try to define the elements of metadata specifications that would be required for the almost unimaginably complex task of emulating proprietary application software of another era, running on, and in conjunction with, application interface programs from numerous sources, on operating systems that are obsolete, and in hardware environments that are proprietary and obsolete. At the same time, the alternative migration-based strategy, which has advanced such metadata specifications and proposed possible architectures, has been overlooked.11

The good news is that Rothenberg's emulation strategy shares a common research agenda with proposed solutions based on continuing migration which I feel are much more practical. Since there is a convergence of opinion that research and demonstration must 'prove the metadata encapsulation' strategies for any of these electronic preservation models to succeed, it is critical that we move forward in specifying this metadata.

Minimally, this will require that we articulate clear functional requirements for preserving records, define concrete metadata contents and value-schemes to represent the necessary attributes, and suggest ways to layer the resulting metadata to support its long-term management and efficient processing. But specifying metadata, even if we could agree on complete, necessary and sufficient sets of metadata, will not in itself solve the question of how to implement the capture, storage, retrieval, management, presentation and re-presentation of metadata encapsulation-based solutions.

Serious proposals for metadata encapsulation strategies need to address how the required metadata will be identified, created or captured at the time of the creation of the records; by what means it will be stored in inviolable conjunction with the record contents; how it will support the use of the record by authorized users over time; and by whom, where, and at what costs the infrastructure for recordkeeping will be constructed and maintained.

Architectural models, with concrete local and universal implementation plans, are crucial. Interestingly, the three proposals discussed here again converge in suggesting strategies in which records are captured, at the time of their creation, and encapsulated in a single logical file format structure with associated metadata. This added metadata shell surrounding this single, universal logical file format can then be read, either to migrate the records as necessary, to emulate systems that they lived in, or to open them in their native format at some distant future time. In the BAC proposal,3 I suggest how the metadata and the architecture together support the on-going management of the records -- their retention and disposition, their rights clearance and privacy protection, their access, their use and reuse.12 It would be valuable in the next stages of this collective endeavor if other functional requirements for these preserved containers could be identified so that architectures, implementations and metadata declarations could be developed to satisfy them.13
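To make the idea of a metadata shell a little more concrete, here is a minimal Python sketch of a record encapsulated with its metadata at creation, bound to it by a digest, and later migrated by reading the shell. It is not the BAC architecture or any published specification: the container layout, the `seal` field, the use of SHA-256, the function names, and the sample metadata values are all assumptions chosen for illustration.

```python
import hashlib
import json


def _seal(content: bytes, metadata: dict) -> str:
    # Digest over content and metadata together, so neither can change unnoticed.
    canonical = json.dumps(metadata, sort_keys=True).encode("utf-8")
    h = hashlib.sha256()
    h.update(len(content).to_bytes(8, "big"))  # length prefix separates the two parts
    h.update(content)
    h.update(canonical)
    return h.hexdigest()


def encapsulate(content: bytes, metadata: dict) -> dict:
    """Wrap record content in a metadata shell (hypothetical container layout)."""
    return {"metadata": dict(metadata), "content": content,
            "seal": _seal(content, metadata)}


def is_intact(container: dict) -> bool:
    """Check that content and metadata still match the seal made at capture."""
    return container["seal"] == _seal(container["content"], container["metadata"])


def migrate(container: dict, convert, new_format: str) -> dict:
    """Produce a new container in a current format, documenting the migration."""
    if not is_intact(container):
        raise ValueError("container failed its integrity check; refusing to migrate")
    old = container["metadata"]
    step = {"from": old["format"], "to": new_format, "prior_seal": container["seal"]}
    new_metadata = {**old, "format": new_format,
                    "migrations": list(old.get("migrations", [])) + [step]}
    return encapsulate(convert(container["content"]), new_metadata)


# Illustrative values only: a short text record moved from Latin-1 to UTF-8.
original = encapsulate(
    "Promoted to Senior Archivist".encode("latin-1"),
    {"format": "text/latin-1", "creator": "personnel office"},
)
migrated = migrate(original,
                   lambda b: b.decode("latin-1").encode("utf-8"),
                   "text/utf-8")
```

The design choice worth noting is that migration reads only the shell to decide what to do and appends to the metadata rather than replacing it, so each generation of the record carries an account of how it came to be in its present form. A real implementation would also need the retention, rights, and access metadata discussed above, which this sketch leaves out.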

Copyright © 1999 David Bearman

Links to specific notes and references are located throughout the text above. The complete listing of notes and references is at < http://www.dlib.org/dlib/april99/bearman/bearman-notes.html >.


DOI: 10.1045/april99-bearman