
D-Lib Magazine
July/August 2007

Volume 13 Number 7/8

ISSN 1082-9873

Government Information in Legacy Formats

Scaling a Pilot Project to Enable Long-Term Access

 

Gretchen Gano
New York University
<gretchen.gano@nyu.edu>

Julie Linden
Yale University
<julie.linden@yale.edu>


Introduction: the CD-ROM "problem"

For nearly twenty years, CD-ROMs and (more recently) DVD-ROMs have been popular delivery media for digital U.S. federal government information: more than 5,400 have been distributed since 1987. [1] While the number of CD-ROMs [2] distributed through the Federal Depository Library Program (FDLP) has decreased in recent years as more government information and data have become available on the web [1], depository libraries that selected CD-ROMs in that medium's heyday now have large legacy collections to manage. Data and information stored on those CD-ROMs have largely not been transferred to agency web sites or to servers hosted by depository institutions. Depository libraries attempting to fulfill their mission of providing public access to these materials [3] face a range of challenges. Government CD-ROMs contain files in a wide range of formats, proprietary and non-proprietary, some still very much in use, some obsolete or nearly so. Certain software and file formats are simply incompatible with newer hardware and operating systems, leading some libraries to maintain one or more older computers solely to provide access to the CD-ROMs. [4]

Government agencies that publish CD-ROMs are not themselves required to have an archival plan for those materials. While the National Archives and Records Administration may accession and archive data collections that overlap with those issued on depository CD-ROMs, discovering, accessing, and obtaining the data in usable formats with appropriate documentation may prove time-consuming and difficult. Further complicating the matter, some government CD-ROMs package public domain data or information with commercial software, thus raising intellectual property issues for the library that wishes to migrate or otherwise preserve these CD-ROMs' contents.

The CD-ROM "problem" is recognized within the depository library community and by the U.S. Government Printing Office (GPO), which manages the depository library program. In 2004, librarian Tim Byrnes conducted a thorough analysis of approximately 5,200 CD-ROMs held at the University of Kentucky library. Byrnes shared his analysis with the Government Information Technology Committee of the American Library Association's Government Documents Round Table, which concluded that "meaningful analysis and correction of [the] CD-ROM legacy problem are extremely labor intensive" and that an "optimal solution" to the long-term preservation problem "must come from GPO & agencies with depository involvement." [1]

GPO has taken a step toward addressing the problem by launching a pilot project to test migration processes on a sample of federal agency CD-ROMs. However, GPO has set no timeline either for completing the test or for the longer-term project of migrating content from all depository CD-ROMs, and the pilot thus far does not involve participation by depository libraries. [5] In addition to GPO's work, other efforts to address collections in legacy formats include the University of California San Diego "GPO Data Migration Project," which addressed 5 1/4-inch MS-DOS floppy disks distributed by GPO, and the "CIC Floppy Disk Project," a GPO-Indiana University partnership in which files from 3 1/2-inch floppy disks were transferred to a server but the data on those disks were not migrated to archival formats. [6] The CD-ROM collection distributed through the FDLP is much larger than either of these floppy disk collections, and its preservation challenges have not yet been tackled on a large scale.

Despite the software and hardware problems that these CD-ROMs pose, the main challenges of a large-scale CD-ROM "rescue" project are not primarily technological. Files from CD-ROMs can be systematically copied to redundant, stable server environments. Obsolete file formats can be migrated to non-proprietary formats for continued use of the data; unusual or obsolete software programs can be made available through web-based virtualization. Rather, the main challenges are to organize and fund a collaborative rescue project so that institutions can contribute to different tasks as they are able and willing; to establish a decision-making framework so that portions of the collection that are at highest risk can be addressed first, according to agreed-upon standards; and to ensure quality control of both "rescued" CD-ROM files and associated metadata. The Yale Library pilot project described here has served not only as a means for analyzing and documenting aspects of a CD-ROM migration approach, but also as a launching pad for a community-wide consideration of a large-scale, distributed project to migrate this legacy collection and ensure permanent public access to government information distributed on CD-ROMs.

Yale Library CD-ROM migration pilot

Yale Library's Government Documents & Information Center (GDIC) holds more than 3,000 CD-ROMs in its U.S. federal depository collection. GDIC also holds approximately 360 CD-ROMs in its United Nations, European Union, Canadian, and Food and Agriculture Organization depository collections. While these collections are much smaller than the U.S. collection, they are similar in that their CD-ROMs may contain information that is not duplicated in other media, that is stored in obsolete or proprietary file formats, or that has not been preserved in a format more stable than CD-ROM.

In early 2006, under an internal library grant, GDIC librarians undertook a pilot project to address and document the major challenges to long-term access to government information on CD-ROMs by migrating a sample set of government and inter-governmental organization CD-ROMs to a more stable server environment. [7] Expected outcomes of the pilot included assessment and migration of selected data from the at-risk collection, with appropriate metadata created for the migrated files. The project's documentation includes a draft workflow for migrating government information from CD-ROMs, initial preservation metadata requirements for these materials, and an analysis of the project's scalability and potential for collaboration with other institutions.

While librarians analyzed and migrated CD-ROMs from all the GDIC collections, this discussion focuses on considerations for a large-scale migration of the U.S. federal depository CD-ROM collection, both because of the size of the problem and because a large, concerned community – depository libraries – could be engaged in a collaborative solution. We believe that similar projects could be undertaken to migrate CD-ROMs from other governments and from intergovernmental organizations.

Methodology

Selection

Librarians determined selection criteria for the CD-ROMs to be migrated for this pilot project:

Criterion: CD-ROMs have circulated
Rationale: Indicates that the information has been used by library patrons

Criterion: Information not available elsewhere in digital format
Rationale: Information is therefore more at risk; migrating preserves it and makes it potentially more accessible

Criterion: Information accessed with software that is proprietary, uncommon, or obsolete
Rationale: Migration frees the data from software dependency and increases its accessibility

Criterion: Information is free of copyright restrictions
Rationale: Copyright restrictions may prohibit migration of data or providing networked access to data

Criterion: CD-ROMs come from all GDIC collections and a variety of agencies
Rationale: Provides a broad, diverse sample, potentially exposing a range of migration issues

Criterion: Information is in a variety of file formats
Rationale: Provides a broad, diverse sample, potentially exposing a range of migration issues

As a first cut, librarians analyzed circulation data for CD-ROMs in the GDIC collections and eliminated from consideration CD-ROMs that had not circulated, as well as CD-ROMs that were part of GPO's planned migration pilot (i.e., CD-ROMs from the Department of Justice, Department of Education, and U.S. Geological Survey).

Analyzing every remaining title against each selection criterion was not practical, given time constraints. We thus focused on titles that had circulated more frequently and that we knew from experience might meet several of the selection criteria. Many titles we examined did not meet all of the criteria and were not selected as candidates for migration. For example, some CD-ROMs contain copyrighted material (the Foreign Broadcast Information Service publications); some file formats were too difficult to deal with in this pilot (the Environmental Protection Agency's Site Characterization Library, which contains agency-specific software); and some information was already available elsewhere, often at the Inter-university Consortium for Political and Social Research, and thus available to the Yale community, or on the web sites of agencies or other academic institutions. [8] Sixteen titles (representing 25 individual CD-ROMs) were ultimately chosen for further analysis and attempted migration.

Analysis and migration

In general, the workflow for evaluating a given CD-ROM title and migrating its contents consisted of transferring the files to the server and then analyzing each CD-ROM's contents – its file storage hierarchy, file formats, and accompanying software and documentation. The graduate student worker conducting the sample migration documented her evaluation process for each title, including screen shots where appropriate. Next, where possible, data files were normalized to ASCII text: common transformations included converting Excel, SETS [9], or Microsoft Word files to plain text. The student recorded errors or problems encountered during the migration and worked with a Yale Social Science Statistical Laboratory consultant to troubleshoot those issues.
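To make the workflow concrete, the per-disc steps above (copying files to the server, recording each file's size and format, and logging a fixity checksum for later quality control) could be scripted roughly as follows. This is a minimal sketch under assumed paths and an assumed inventory layout, not the pilot's actual tooling:

    # Sketch: copy a mounted CD-ROM's files to the server and write an
    # inventory (path, size, extension, SHA-1) for analysis and QC.
    # All paths and the CSV layout are illustrative assumptions.
    import csv
    import hashlib
    import shutil
    from pathlib import Path

    def copy_and_inventory(cdrom_mount, dest_dir, inventory_csv):
        src, dest = Path(cdrom_mount), Path(dest_dir)
        with open(inventory_csv, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(["relative_path", "bytes", "extension", "sha1"])
            for path in sorted(src.rglob("*")):
                if not path.is_file():
                    continue
                rel = path.relative_to(src)
                target = dest / rel
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(path, target)  # copy2 preserves timestamps
                digest = hashlib.sha1(path.read_bytes()).hexdigest()
                writer.writerow([str(rel), path.stat().st_size,
                                 path.suffix.lower(), digest])

    # Example: copy_and_inventory("/mnt/cdrom", "/data/rescue/disc042",
    #                             "disc042_inventory.csv")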

Metadata creation

Project organizers chose to generate MARC21XML, MODS, DDI, Dublin Core, and PREMIS metadata to describe each CD-ROM title and its data collection. Together, these metadata make up what the Open Archival Information System Reference Model calls the submission information package (SIP). [10] They provide bibliographic description suitable for library catalog searching (MARC21), cross-collection searching (MODS and DC), and domain-specific searching (DDI). The most detailed of these is DDI, the Data Documentation Initiative metadata specification, created by the social science data community to provide detailed markup of social science datasets. [11] DDI was appropriate for this project because the majority of the selected CD-ROMs contained numeric datasets. DDI describes the "study" (that is, the intellectual content of the dataset), the individual files that make up the dataset, and even the individual variables within each file; it can also describe external materials related to the dataset. PREMIS holds the technical and administrative information necessary to ensure data integrity and long-term preservation of the migrated files.

We decided to preserve both the original files copied from the CD-ROMs onto the server and the normalized (migrated) files, and therefore created a separate SIP for each instance (one for the original-format dataset, one for the normalized dataset). The DDI records for the original and normalized formats differ primarily at the file level, where information about individual file formats and migration activities is recorded. The migrated files will become part of Yale's Social Science Data Archive, and the DDI records will be loaded into and searchable through StatCat, Yale's data catalog.

To begin the metadata generation process, the student used the free software MarcEdit [12] to generate MARC21XML and MODS from the record for each title in the Yale Library catalog. To create the domain-specific metadata, librarians used an XSL stylesheet to transform identified MARC fields into the study section of the DDI record. [13] The stylesheet also populated certain DDI elements with boilerplate text common to all records, e.g., a generic statement about the migrated dataset ("Yale University Social Science Libraries and Information Services migrated the files in this dataset from their original formats.").
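As an illustration of this step, such a transform can be applied programmatically; the sketch below uses Python's lxml library (our choice for illustration), with placeholder file names standing in for a MARC21XML record exported from MarcEdit and for the stylesheet cited in note [13]:

    # Sketch: apply the MARC-to-DDI XSL stylesheet to one MARC21XML record.
    # File names are placeholders, not the pilot's actual files.
    from lxml import etree

    marc_doc = etree.parse("record_marc21.xml")             # MARC21XML from MarcEdit
    transform = etree.XSLT(etree.parse("marc_to_ddi.xsl"))  # stylesheet, see note [13]
    ddi_doc = transform(marc_doc)                           # study-level DDI section

    with open("record_ddi.xml", "wb") as out:
        out.write(etree.tostring(ddi_doc, pretty_print=True,
                                 xml_declaration=True, encoding="UTF-8"))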

The additional metadata formats, Dublin Core and PREMIS, were generated using scripts associated with the ingest of the collection into a pilot Fedora repository [14].
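To illustrate the kind of technical information PREMIS carries, an ingest script might record each migrated file's size and a fixity digest along the following lines. The element layout is a simplified, non-namespaced rendering of PREMIS object characteristics, and the function name is our own; it is not code from the pilot's Fedora ingest:

    # Sketch: build a simplified PREMIS-style object entry (fixity + size)
    # for one migrated file. Real PREMIS records are namespaced and far
    # richer; this is a deliberately reduced illustration.
    import hashlib
    from pathlib import Path
    from xml.etree import ElementTree as ET

    def premis_object_sketch(file_path):
        data = Path(file_path).read_bytes()
        obj = ET.Element("object")
        oc = ET.SubElement(obj, "objectCharacteristics")
        fixity = ET.SubElement(oc, "fixity")
        ET.SubElement(fixity, "messageDigestAlgorithm").text = "SHA-1"
        ET.SubElement(fixity, "messageDigest").text = hashlib.sha1(data).hexdigest()
        ET.SubElement(oc, "size").text = str(len(data))
        return ET.tostring(obj, encoding="unicode")

    # Example: print(premis_object_sketch("/data/rescue/disc042/table1.txt"))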

Cost and time analysis

The majority of the student worker's time was spent analyzing the CD-ROMs and converting the data when necessary (some data were already in non-proprietary formats and simply needed to be copied to the server). She analyzed and migrated 25 separate CD-ROMs (some of which were part of a series, and thus contained similar file structures and formats), averaging 1.75 hours per CD-ROM. At a student worker rate of $11 an hour, that is $19.25 per CD-ROM.

The time varies considerably by individual CD-ROM, however, and in fact only 13 of the CD-ROMs (representing 8 datasets) were successfully migrated. Of the remaining CD-ROMs, some could potentially be converted, but because the data were stored in numerous separate files and the accompanying software provided no bulk extract capability, the migration process would be very time-consuming and error-prone. It is possible that programs could be written to batch-convert data from, say, 100 individual .dbf files to comma-delimited ASCII; such work was outside the scope of this pilot project, but we believe it is a fruitful area for investigation (a sketch of the idea follows below). Programming costs would of course be higher than the per-hour student worker rate, but if a data conversion program could be used on multiple CD-ROMs, the cost per CD-ROM might not be higher.
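A hedged sketch of such a batch converter follows; it assumes the third-party dbfread package (our choice for illustration, not a tool named by the pilot) and illustrative paths:

    # Sketch: convert every .dbf file in a directory to comma-delimited
    # ASCII, one CSV per input file. Requires dbfread (pip install dbfread).
    import csv
    from pathlib import Path
    from dbfread import DBF

    def dbf_dir_to_csv(src_dir, dest_dir):
        dest = Path(dest_dir)
        dest.mkdir(parents=True, exist_ok=True)
        for dbf_path in sorted(Path(src_dir).glob("*.dbf")):
            table = DBF(str(dbf_path), encoding="ascii",
                        char_decode_errors="replace")
            with open(dest / (dbf_path.stem + ".csv"), "w", newline="") as out:
                writer = csv.writer(out)
                writer.writerow(table.field_names)
                for record in table:  # records stream as dicts, one per row
                    writer.writerow([record[name] for name in table.field_names])

    # Example: dbf_dir_to_csv("/data/rescue/disc042", "/data/normalized/disc042")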

Because this project deliberately sampled CD-ROMs from a variety of collections and, within the U.S. collection, from a variety of federal agencies, we do not know what efficiencies might be gained by concentrating migration efforts on CD-ROMs from a single agency. If an agency tends to use the same proprietary software (e.g., SETS from the National Center for Health Statistics) or the same data formats for many of its data products, an institution migrating CD-ROMs from only that agency might significantly reduce the amount of time needed for analysis – because each CD-ROM would not be very different from the next, and CD-ROMs in a series (e.g., Department of Commerce import or export data) would likely be identically formatted and structured. An institution might also save migration time by applying the same workflow, scripts, or programs to several CD-ROMs from a single agency. A pilot project examining a sample of CD-ROMs from a single agency would provide documentation of this approach and would help inform a larger-scale collaborative approach, as discussed below.

Scaling the pilot

Scaling such a pilot to provide access to and preserve content from the universe of government information distributed in now-legacy formats requires attention to how efficiencies might be achieved, how data quality can be optimized, how the burden of migrating a large collection might be reduced by a decentralized model involving depository libraries with appropriate staffing and facilities, and how such an effort might be organized, coordinated, and funded.

This pilot surfaced two main concerns that should inform future efforts: how to gain efficiencies through a different approach to selection and migration, and the question of original data quality. Pilot organizers concluded that a "hunt and peck" approach to selecting CD-ROMs for migration is too time-consuming. For any series that uses common software, file, and documentation structures, an economy of scale can be achieved when a migration workflow is designed once and then applied to the whole series. If a CD-ROM "rescue" project were scaled and distributed, the work could be streamlined by selecting judiciously by agency, common software platform, and data survey series.

With long-term preservation, as well as access, as a goal, GDIC librarians questioned whether it is prudent to treat any given depository CD-ROM or DVD-ROM as the best-quality source for the digital content. [15] We made initial inquiries with the National Archives and Records Administration about the possibility that data for some GPO releases are held there in an accessible format. It may also be possible, on an agency-by-agency basis, to locate original data sources for some CD-ROM products. If the content of a given CD-ROM is already publicly accessible and its preservation assured, then that CD-ROM need not be "rescued" – the need instead is for detailed metadata describing and linking the "legacy" CD-ROM product to its elsewhere-available content, and for that metadata to be included in the universe of FDLP CD-ROM collection metadata.

Toward a collaborative model

Access to the FDLP CD-ROM collection is currently supported at nearly 1,260 libraries by depository librarians with a large collective body of knowledge about the CD-ROMs and their access and preservation challenges. Developing a robust collaborative network of depository librarians actively participating in a CD-ROM rescue project furthers a goal articulated by the Depository Library Council to the Public Printer: that depository librarians "should take the lead in organizing systems for transparent, cost effective collaborations to provide services and resources to end-users and colleagues." [16] If depository libraries are to become involved in processing such collections, documents units should have access to local programming expertise to develop scripts and metadata crosswalks that facilitate migration and auto-generate metadata wherever possible.

An ideal collaborative solution would be decentralized enough to allow participants to work on tasks suited to each institution's interest and expertise, yet coordinated enough to ensure efficiency, avoid duplication of effort, enforce adherence to standards and quality control, and guarantee that all necessary tasks were claimed by participants.

An "open repository" model would allow institutions to approach tasks and sets of CD-ROMs in different ways. Working alone or together, participating libraries could contribute disk images of the CD-ROMs (as sketched below); migrate select portions of the collection; generate descriptive, administrative, and preservation metadata; experiment with emulation tools [17]; engage federal agencies to try to obtain original data sources and documentation; assure data quality; and address copyright questions. The result would be a virtual, comprehensive collection of the FDLP CD-ROMs that are currently scattered among depository libraries, most of which have incomplete tangible collections and minimal metadata.
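As a concrete rendering of the disk-image contribution task, a participating library might image a disc and deposit it with a checksum manifest so a central repository could verify the deposit. The sketch below assumes a Unix-like workstation where the dd utility is available; all device and file names are illustrative:

    # Sketch: create an ISO image of a CD-ROM with dd and write a SHA-1
    # manifest alongside it for verification on deposit.
    import hashlib
    import subprocess
    from pathlib import Path

    def image_cdrom(device, iso_path):
        # Read the disc sector by sector (CD-ROM data sectors are 2048 bytes).
        subprocess.run(["dd", "if=" + device, "of=" + iso_path, "bs=2048"],
                       check=True)
        sha1 = hashlib.sha1()
        with open(iso_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                sha1.update(chunk)
        Path(iso_path + ".sha1").write_text(sha1.hexdigest() + "  " + iso_path + "\n")
        return sha1.hexdigest()

    # Example: image_cdrom("/dev/cdrom", "fdlp_disc042.iso")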

Since the Yale pilot was completed, we have learned of two fledgling efforts to collect and serve content from disk images of the FDLP CD-ROMs: one based in the computer science department at Indiana University and the other at the University of California Berkeley's Doe & Moffitt Libraries. The IU SuDoc virtualization project proposes to build tools to evaluate the software requirements of this large document collection and to configure and automate the delivery of individual collections through emulation (virtualization). [17, 18] The UC Berkeley library project aims to provide local access using virtual machine software. [19] While these projects represent initial approaches to the issue of government information in legacy formats, their organizers recognize the need to identify a funder to scale the effort in the ways this article has discussed. For updates on recent activities, see the Yale pilot web site. [7]

Acknowledgements

We are grateful to Geoffrey Brown, Indiana University; John Hernandez, Princeton University; and Stefan Kramer, Yale University, for their valuable comments on a draft of this article. We would also like to recognize Yale University colleagues Michael Appleby, Matthew Beacom, David Gewirtz, and graduate student Chinyelum Morah, for their contributions to the pilot project. Finally, we are indebted to several colleagues in the government documents and data library communities who have generously shared ideas and advice about the CD-ROM problem and potential solutions, including Ann Green; Myron Gutmann and Amy Pienta, ICPSR; Jane Weintrop and Jerry Breeze, Columbia University; Grace York and Jennifer Green, University of Michigan; Diane Geraci, Harvard University; Valerie Glenn, University of Alabama; Will Wheeler and Elizabeth Cowell, Stanford University; and Jim Jacobs, formerly of University of California San Diego.

Notes and References

[1] John Hernandez and Tim Byrnes, "CD-ROM Analysis Project" (presentation at Spring 2004 Depository Library Council meeting, St. Louis, MO, April 21, 2004); unpublished PowerPoint available: <http://www.princeton.edu/%7Ejhernand/Depository_CD-ROM_Legacy.ppt>.

[2] In this article, "CD-ROM" is used as shorthand to refer to both CD-ROMs and DVD-ROMs.

[3] U.S. Government Printing Office, "Depository Library Public Service Guidelines For Government Information in Electronic Formats," <http://www.access.gpo.gov/su_docs/fdlp/mgt/pseguide.html>.

[4] "Legacy CD-ROMs and computers," posting to GOVDOC-L, December 1, 2006, <http://lists1.cac.psu.edu/cgi-bin/wa?A2=ind0612A&L=GOVDOC-L&P=R463&I=-3>.

[5] U.S. Government Printing Office, "Update for ALA, January 2006," <http://www.access.gpo.gov/su_docs/fdlp/events/ala_update06.pdf>.

[6] UCSD GPO Data Migration Project: <http://ssdc.ucsd.edu/dmp/>; CIC Floppy Disk Project: <http://www.indiana.edu/~libgpd/mforms/floppy/floppy.html>.

[7] For project reports and other materials, see <http://www.library.yale.edu/govdocs/cdmigration/>.

[8] Though a number of titles were available elsewhere online, the ability to locate the data and evaluate whether content matched that of the original CD-ROM was difficult due to lack of documentation, indexing, and standards to indicate data authenticity.

[9] SETS, the Statistical Export and Tabulation System, is included on many National Center for Health Statistics CD-ROMs. See: <http://www.cdc.gov/nchs/sets.htm>.

[10] Consultative Committee for Space Data Systems, "Reference Model for an Open Archival Information System (OAIS)," January 2002, <http://www.ccsds.org/publications/archive/650x0b1.pdf>.

[11] <http://www.ddialliance.org/>.

[12] <http://oregonstate.edu/~reeset/marcedit/html/index.php>.

[13] The stylesheet was created by Youn Noh, Yale's Digital Resources Catalog Librarian, and is available on the Yale Library CD-ROM migration project web site, see <http://www.library.yale.edu/govdocs/cdmigration>.

[14] <http://fedoraproject.org/>.

[15] Personal communication with Jim Jacobs, then data librarian at the University of California San Diego, June 29, 2006.

[16] Depository Library Council, "'Knowledge Will Forever Govern': A Vision Statement for Federal Depository Libraries in the 21st Century," September 29, 2006, <http://www.access.gpo.gov/su_docs/fdlp/council/dlcvision092906.pdf>.

[17] Geoffrey Brown, "Ensuring Long-Term Access to Government Documents through Virtualization," presentation at IASSIST 2007, Montreal, May 17, 2007, <http://www.edrs.mcgill.ca/IASSIST2007/presentations/E1(1).pdf>.

[18] <http://cgi.cs.indiana.edu/~geobrown/svp/>.

[19] Harrison Dekker, "Virtual Machines in the Data Lab," presentation at IASSIST 2007, Montreal, May 17, 2007, <http://www.edrs.mcgill.ca/IASSIST2007/presentations/E1(3).pdf>.

Copyright © 2007 Gretchen Gano and Julie Linden

doi:10.1045/july2007-linden