D-Lib Magazine
November 2003

Volume 9 Number 11

ISSN 1082-9873

Public Opinion Polls and Digital Preservation

An Application of the Fedora Digital Object Repository System

 

Ronald Jantz
Alexander Library
Rutgers University
<rjantz@rci.rutgers.edu>

Introduction

The Eagleton Center for Public Interest Polling (CPIP) has been sampling the opinions of New Jersey citizens since the early 1970s. Although much of the resulting data deals with government and political issues within the state of New Jersey, there are many questions that deal with other aspects of life in the state. In effect, the complete collection of polls offers a view of the rich diversity of life in New Jersey.

In fall 2000, the Scholarly Communication Center (SCC) of the Rutgers University Libraries (RUL) entered into a partnership with the Eagleton Center for Public Interest Polling. These institutions had just received grants to provide Web access to over 30 years of public interest polling numeric data. The project was launched in spring 2001 and was fully operational by July 2002. In addition to standard features, such as full-text indexing of all questionnaires and the ability to execute basic statistical functions on the web, a unique aspect of the Eagleton project was the agreement between the Eagleton Institute and the Rutgers University Libraries stipulating that RUL would provide the official archival service for the Center's public interest polls. This agreement established a new role for RUL as a digital archival agent and has led to the development of a digital preservation framework as a service provided by RUL. This service can, in fact, be applied to many other types of material such as electronic journals, special digital libraries, and subject-specific digital collections. Although the approach being used for the Eagleton polls is described here, a similar approach is planned for other digital resources at RUL as well. This article provides an overview of the Eagleton Poll Archive, the website and, more specifically, the underlying preservation framework.

Eagleton Public Opinion Polls

The Center for Public Interest Polling (CPIP) at the Eagleton Institute of Politics has been conducting regular polling in New Jersey since 1971. Since 1983, Eagleton's public survey enterprise has been conducted in partnership with the Star-Ledger, the state's largest newspaper, and is now known as the Star-Ledger/Eagleton-Rutgers Poll. CPIP contributes to the public dialogue by surveying and interpreting the opinions of citizens on major policy concerns and key state issues, as well as on political races. Its key mission is to provide high-quality, non-partisan information about public opinion in New Jersey. In addition to conducting the Star-Ledger/Eagleton-Rutgers Poll, CPIP assists government agencies, non-profit organizations, and academic entities with research services.

The Eagleton Poll Archive website provides access to the public opinion polls dating back to 1971. Generally, there are four polls per year dealing with a variety of topics of interest to New Jersey citizens. These polls provide an incredibly rich resource for understanding New Jersey and its citizens. The polls are intended for use not only by researchers and those involved with public policy but also by the general public.

The complete poll questionnaires and survey results are available at the Eagleton Archive website (see Note 1). Users can access the polls by doing a full-text search of the questionnaires. All questionnaires are completely indexed, so a search returns every question containing the specific search term or phrase. A user can then select the "frequency" or "cross-tab" functions to view results for a particular question. In addition, users can generate charts to view the frequency results graphically. A browse mode enables users to browse by subject, poll date, and poll number.

For researchers who want to do their own analysis, the full questionnaires and numeric data files in SPSS portable format can be downloaded from the Eagleton Archive website. For those using the data files, it should be noted that a weighting variable appears in the raw data. Weighting is a statistical technique used to adjust for differences between the demographic profile of the survey sample and that of the full population. For the general user of this website, these weights have been automatically applied to the survey results.
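For readers working with the raw files, the effect of a weighting variable can be illustrated with a minimal sketch. The function name, the response labels, and the weight values below are hypothetical and are not taken from any Eagleton data file; the sketch only shows how per-respondent weights change a frequency table.

```python
from collections import defaultdict


def weighted_frequencies(responses, weights):
    """Return the weighted percentage for each response category."""
    if len(responses) != len(weights):
        raise ValueError("responses and weights must have the same length")
    totals = defaultdict(float)
    for answer, weight in zip(responses, weights):
        totals[answer] += weight  # each respondent counts as `weight` people
    grand_total = sum(totals.values())
    return {answer: 100.0 * total / grand_total for answer, total in totals.items()}


# Hypothetical sample: the second respondent belongs to an under-sampled group
# and therefore carries twice the weight of the others.
answers = ["approve", "disapprove", "approve"]
weights = [1.0, 2.0, 1.0]
result = weighted_frequencies(answers, weights)
print(result)
```

Unweighted, "approve" would account for two of three responses; with the weights applied, the two categories are tied.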

The Partnership

Rutgers University Libraries (RUL) is using digital technology to offer improved and new services to users. Although RUL continues to provide print archival support within the university, the concept of RUL becoming a digital archival agent opens opportunities for new roles and also poses significant challenges [1].

For the Eagleton Archive project, Professor Cliff Zukin, Director of the Star-Ledger/Eagleton-Rutgers Poll (and a member of the Public Policy Department at RU), and the author, a member of RUL's Scholarly Communication Center, proposed in the grant application that RUL serve as the digital archival agent for the Eagleton polls. This approach served a long-standing Eagleton objective to make its data publicly available and also provided an opportunity for RUL to develop both the technological infrastructure and the process for digital archiving. In parallel with the grant implementation, RUL continued to research and examine various architecture alternatives for a digital repository and preservation platform. Through these efforts as well as the activities of the Digital Architecture Working Group, the Fedora (Flexible Extensible Digital Object and Repository Architecture) Management System was selected to provide the basic technological infrastructure for digital preservation [2].

The University of Virginia and Cornell University jointly developed Fedora (see Note 2) with funding provided by a grant from the Andrew W. Mellon Foundation. Earlier, related repository architectures included those described in Kahn and Wilensky [3]; Arms, et al. [4]; Mönch [5]; and Nelson and Maly [6].

Available through an open source Mozilla Public License, Fedora is designed to be a foundation upon which interoperable web-based digital libraries, institutional repositories and other information management systems can be built.

Digital Preservation

Digital preservation is a complex and evolving field that includes aspects of economics, processes, policy, and technology infrastructure. In this article, we will not discuss the ongoing debate of what to preserve. In some respects we would like to preserve everything and, in fact, this approach would greatly simplify the decision process. Storage technologies, with current prices at about one dollar per gigabyte, may well provide this capability in the not too distant future, leaving us with the somewhat more challenging problem of defining and executing our role with regard to digital preservation. Suffice it to say, the archive of Eagleton public opinion polls was readily recognized as a valuable information source not only for researchers, politicians, government officials and students but also for the New Jersey public.

For the purposes of this article, we will restrict our focus to the preservation of electronic materials for which no comparable print format could be readily preserved. With that in mind, we define digital preservation as consisting of a set of managed activities that will: a) ensure the long-term maintenance of a byte stream sufficient to reproduce the document and b) provide continued accessibility of the contents over time and through evolving technology [1].

This definition captures the essence of digital preservation while not identifying precisely what a "document" is or how to interpret "long-term". These issues can be more readily addressed within the operational aspects of a specific archive. Note that there are a number of other definitions for digital preservation. Susan Lazinger includes several of these definitions in her book on the theory and practice of digital preservation [7].

Conceptually, the above definition is easy to understand, but on closer examination, we begin to realize the complexity of putting in place a comprehensive digital preservation framework. For example, if one uses a commercial vendor's database product and the associated software to migrate across versions of the software, one has no assurance that the byte stream has been maintained over many versions of the software. In this case, since the digital object is encapsulated in a proprietary format, techniques such as performing cyclic redundancy checks or employing other types of digital signatures are not very useful. In the end, one might not uncover until many years later a problem in the migrated data that had been introduced three or four generations earlier in the migration process. This situation dictates that the preservation architecture be based on non-proprietary and open-source technologies in order to achieve long-term sustainability. Similarly, in order to provide continued accessibility over time and through evolving technology, we need to think about how the content will be served to users at a future time when some of the most heavily used current operating systems and application software may no longer be available. These tasks are indeed daunting, and the framework presented here will not provide comprehensive solutions; nevertheless, we must begin the process of digital preservation and evolve as our knowledge and technology mature.
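To make the idea of checking a byte stream concrete, the following minimal sketch computes a cyclic redundancy check and a cryptographic digest for a byte stream held in an open format, then compares the values before and after a simulated migration. The choice of CRC-32 and SHA-256 is an assumption made for illustration (the article does not name an algorithm), and the sample bytes are invented.

```python
import hashlib
import zlib


def fixity(data):
    """Return two fixity values for a byte stream: a CRC-32 and a SHA-256 digest."""
    return {
        "crc32": zlib.crc32(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Invented questionnaire text standing in for an archived byte stream.
original = b"Q1. Do you approve or disapprove of the proposal?\n"

# A faithful migration copies the bytes unchanged.
migrated_ok = bytes(original)

# A faulty migration silently alters a single character.
migrated_bad = original.replace(b"approve or", b"apprave or")

print(fixity(original) == fixity(migrated_ok))   # intact copy matches
print(fixity(original) == fixity(migrated_bad))  # altered copy does not
```

Such a comparison is only meaningful when the archive controls the byte stream itself; when a vendor tool rewrites a proprietary container on every upgrade, the values change even if the content has not.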

Archiving Philosophy

As Thibodeau has suggested [8], "the challenge of digital preservation must incorporate the capability to accommodate and use changing technology and the unforeseeable products of that technology." To learn more about managing this technology risk, the Scholarly Communication Center (SCC) within Rutgers University Libraries has launched a digital preservation framework referred to as REALITI (Rutgers Electronic Access to Library Information through Technology Integration). In this project, we are prototyping various technologies and processes, and much of the learning on the REALITI project is reflected in the preservation work described here. Currently, our digital preservation focus is on selected, non-licensed resources owned or managed by RUL. Through these projects, we expect to develop policies, roles, and infrastructure generic to a wide variety of materials and sustainable over time.

At RUL, our preservation architecture is based on open source software products that will enable maximum flexibility in dealing with an evolving infrastructure. We are also emphasizing base technologies and standards such as XML, the Metadata Encoding and Transmission Standard (METS; see Note 3), and non-proprietary persistent identifiers that have significant support in digital archival communities.

For a given collection, the decision as to what formats to archive is significant since this endeavor can be very labor intensive. Inevitably, there are many design tradeoffs to be made. Some designs might improve archival integrity, but they must be eliminated because of the investment in both people and equipment they would require. The philosophy used for the Eagleton polls is generic and could be applied to other types of collections. With Eagleton and other similar web-database projects, our objective is to archive the metadata for each object (e.g., a poll) and to archive the digital object in both a presentation format and a non-proprietary format. In the Eagleton project, this results in a single archival object for each poll that includes metadata in Dublin Core format. Four types of administrative metadata are provided: technical, rights, source, and digital provenance. For the actual poll data, there are four data streams (or byte streams) as follows:

  1. the poll questionnaire in Acrobat Reader PDF format,
  2. the numeric data in SPSS portable format,
  3. the poll questionnaire in plain text format, and
  4. the numeric data in comma delimited plain text format.

The complete digital object is formatted in XML and the metadata is structured according to the METS standard (see Figure 1).
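The shape of such a poll object can be sketched as follows. This is a simplified skeleton and not the actual Eagleton or Fedora object format: the element names (`dmdSec`, `amdSec`, `techMD`, `rightsMD`, `sourceMD`, `digiprovMD`, `fileSec`, `fileGrp`, `file`) come from the METS schema, but the `OBJID` value, the datastream IDs, the `USE` labels, and the generic MIME type chosen for the SPSS file are assumptions, and the result has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return "{%s}%s" % (METS_NS, tag)


# (hypothetical datastream ID, MIME type) grouped by the two roles the article names.
DATASTREAMS = {
    "presentation": [("DS1", "application/pdf"),            # questionnaire, PDF
                     ("DS2", "application/octet-stream")],  # numeric data, SPSS portable
    "non-proprietary": [("DS3", "text/plain"),              # questionnaire, plain text
                        ("DS4", "text/plain")],             # numeric data, delimited text
}


def build_poll_object(poll_number):
    """Build a skeletal METS document for one poll."""
    root = ET.Element(q("mets"), {"OBJID": "eagleton.poll." + poll_number})
    ET.SubElement(root, q("dmdSec"), {"ID": "DC"})  # Dublin Core would go here
    amd = ET.SubElement(root, q("amdSec"), {"ID": "ADM"})
    for tag in ("techMD", "rightsMD", "sourceMD", "digiprovMD"):
        ET.SubElement(amd, q(tag), {"ID": tag.upper()})
    file_sec = ET.SubElement(root, q("fileSec"))
    for use, streams in DATASTREAMS.items():
        group = ET.SubElement(file_sec, q("fileGrp"), {"USE": use})
        for ds_id, mime in streams:
            ET.SubElement(group, q("file"), {"ID": ds_id, "MIMETYPE": mime})
    return root


poll = build_poll_object("001")
print(ET.tostring(poll, encoding="unicode"))
```

The four administrative sections mirror the technical, rights, source, and digital provenance metadata listed above; in a real object each would wrap its own XML payload.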

There are several rationales for this level of archiving. First, we preserve the presentation copies in the repository independently of the external application. Thus, if something untoward happens to the external application, a user can still access the presentation formats through the digital repository. Secondly, as noted above, we also archive the essential data (questionnaire and numeric data) in a non-proprietary format in the event that a commercial vendor no longer supports a particular format. Although most vendors provide migration capabilities with new releases, archiving in a non-proprietary format provides an additional safeguard in the event that a vendor's file cannot be migrated. With additional research and standardization, these non-proprietary formats will ultimately serve as canonical forms [9] allowing an archive to verify that content has been preserved over migrations. In a similar vein, the metadata is exported from the proprietary database to a text-based XML format, which provides additional flexibility both in exchanging metadata and in not relying on the format of the proprietary database.
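One way to picture a canonical form is a normalization that removes incidental differences (delimiter, line endings, padding) before a digest is computed, so that two serializations of the same numeric data compare equal. The sketch below is only an illustration of that idea under assumed rules; it is not the canonicalization proposed in [9], and the sample rows are invented.

```python
import csv
import hashlib

UNIT_SEPARATOR = "\x1f"  # assumed field separator for the canonical form


def canonical_digest(text, delimiter):
    """Digest a delimited text file after reducing it to a canonical form."""
    rows = csv.reader(text.splitlines(), delimiter=delimiter)
    canonical = "\n".join(
        UNIT_SEPARATOR.join(cell.strip() for cell in row)
        for row in rows
        if row  # skip blank lines
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same invented data in two serializations: comma-delimited with
# Windows line endings, and tab-delimited with Unix line endings.
comma_copy = "id,q1,weight\r\n1,2,0.85\r\n2,1,1.10\r\n"
tab_copy = "id\tq1\tweight\n1\t2\t0.85\n2\t1\t1.10\n"
altered_copy = "id\tq1\tweight\n1\t2\t0.85\n2\t1\t1.01\n"

print(canonical_digest(comma_copy, ",") == canonical_digest(tab_copy, "\t"))
print(canonical_digest(comma_copy, ",") == canonical_digest(altered_copy, "\t"))
```

The first comparison succeeds because only the packaging differs; the second fails because a data value has changed.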

Finally, all software and data in the external application is backed up routinely according to RUL data integrity procedures. Although we occasionally modify the software to add small features or to fix software bugs, we have decided to compress all of the software for major releases into one datastream and include this object as part of the total archive.

Given the archiving philosophy outlined above, each poll in the Eagleton website will be archived in the RUL installation of the Fedora system. Fedora offers a very flexible architecture that allows the digital preservationist to make decisions that are optimal for a specific archive. Accordingly, it is important to carefully design the object architecture to be used. The object architecture for the Eagleton Archive is depicted in Figure 1.

Figure 1 - Poll Object and Collection Object (object architecture for the Eagleton Archive)

The poll object detail in Figure 1 shows the four datastreams discussed earlier. All the objects for each poll, and the software and associated database for the presentation website, are brought together under the Eagleton Poll collection level object. This object, in turn, has a METS structure map that identifies each poll in the collection for the purposes of searching, browsing, and managing through the administrative interface discussed in the next section of this article. As a new poll is added to the presentation website, the object for that poll is ingested into the archive. It should be noted here that all of the metadata and the METS wrapper are generated automatically, which minimizes manual overhead for placing a poll object in the archive.
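A structure map of this kind can be sketched as a nested set of METS `div` elements, one per poll. The sketch assumes that each poll is referenced through a METS `mptr` pointer carrying a handle of the form quoted later in the article; how the actual collection object points to its polls is not stated here, so the pointer type, labels, and `TYPE` values are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def mets(tag):
    return "{%s}%s" % (METS_NS, tag)


def build_structure_map(poll_numbers):
    """Build a structMap with one div per poll, ordered by poll number."""
    struct_map = ET.Element(mets("structMap"), {"TYPE": "logical"})
    collection = ET.SubElement(
        struct_map, mets("div"), {"TYPE": "collection", "LABEL": "Eagleton Poll Archive"}
    )
    for number in sorted(poll_numbers):
        poll_div = ET.SubElement(
            collection, mets("div"), {"TYPE": "poll", "LABEL": "Poll %03d" % number}
        )
        # Assumed: the poll object is located through its external handle.
        ET.SubElement(
            poll_div,
            mets("mptr"),
            {"LOCTYPE": "HANDLE",
             "{%s}href" % XLINK_NS: "1782.1/eagleton.poll.%03d" % number},
        )
    return struct_map


structure_map = build_structure_map([2, 1])
labels = [d.get("LABEL") for d in structure_map.findall("./%s/%s" % (mets("div"), mets("div")))]
print(labels)
```

When a new poll is ingested, regenerating the map from the full list of poll numbers keeps the browse order stable.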

Collection Level Object

The collection level object includes metadata about the collection as a whole and also points to all of the objects in the collection. It should also be noted that the Eagleton application was designed and is executed on a Windows NT platform. Although RUL is moving away from this platform for digital projects, there are numerous existing Windows-based projects requiring preservation, and operating system characteristics need to be clearly defined in the metadata. Accordingly, the technical component of the METS administrative metadata describes the OS environment for the Eagleton project. The structure and function of the collection level object are illustrated in Figure 2 and the object in XML format can be viewed at <http://www.scc.rutgers.edu/altek/eagleton/obj-eagleton-collection.xml>.

Figure 2 - Collection Level Object (structure and function of the collection level object)

The Digital Preservation Framework and Archiving Process

This section outlines the functions in the digital preservation framework and briefly describes the standards and technologies being used. The diagram in Figure 3 represents a generic application (e.g., the Eagleton Poll Archive) that is external to, and functions independently of, the digital archival repository. The application database contains the descriptive metadata for each poll and the full text of each question. The primary digital objects are the questionnaires in Adobe Acrobat PDF format and the numeric data in SPSS portable file format. As discussed above, the metadata and data streams are exported in METS-XML format for each poll and ingested into the RUL Digital Repository. This process is depicted in Figure 3.

Figure 3 - The Digital Preservation Framework and Archiving Process (a generic application of the digital object repository)

The Eagleton website administrator archives each poll after the poll has been imported into the Eagleton application website. Archiving is accomplished via a link on the administrator's page labeled "Archive" that initiates a series of actions as described below and illustrated within the dotted box in Figure 3. Four datastreams will be archived for each poll: 1) the poll questionnaire in PDF format, 2) the numeric data in SPSS .por format, 3) the poll questionnaire in text format, and 4) the numeric data in tab-delimited text format. On selection of the "Archive" link, a Fedora-XML file will be created. (For an example of the specific format for a poll digital object, see <http://www.scc.rutgers.edu/altek/eagleton/obj-eagleton-001.xml>.) The basic functional steps are performed automatically as follows:

  1. Map the descriptive metadata to Dublin Core.
  2. Assign a handle (see the CNRI Handle System® website, Note 4) by querying the SCC handle server. The handle syntax for an Eagleton poll is: "1782.1/eagleton.poll.[poll number]". Note that this handle is an external handle and is different from the internal ID shown in the poll object detail of Figure 1. For example, the handle for Eagleton Poll 001 is "1782.1/eagleton.poll.001".
  3. Insert the handle into a Dublin Core identifier in full URL format (for example: <http://hdl.rutgers.edu/1782.1/eagleton.poll.001>).
  4. Compute a digital signature (see Note 5) for each of the four datastreams and insert the signature into the digital provenance metadata for the digital object.
  5. Generate the METS administrative metadata sections: technical, source, rights, and provenance.
  6. Create the XML for each datastream, including the insertion of "M" into the OWNERID tag for each data stream. This action ensures that a copy of the file will be ingested into the Fedora archive.
  7. Output the complete XML file and call the Fedora ingest function to create the poll object in the RUL digital repository.
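The first four steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the SCC implementation: the handle is simply formatted as a string (the real system queries a handle server), the three-digit zero padding is inferred from the "001" example, the database field names are hypothetical, and a SHA-256 digest stands in for the digital signature. The METS generation and the Fedora ingest call are not modeled.

```python
import hashlib

HANDLE_PREFIX = "1782.1"                  # naming authority quoted in the article
RESOLVER_BASE = "http://hdl.rutgers.edu"  # resolver shown in the article's example


def poll_handle(poll_number):
    """Format the external handle for a poll (stand-in for the handle server query)."""
    return "%s/eagleton.poll.%03d" % (HANDLE_PREFIX, poll_number)


def handle_url(handle):
    """Express a handle in full URL form for the Dublin Core identifier."""
    return "%s/%s" % (RESOLVER_BASE, handle)


def prepare_poll(poll_number, record, datastreams):
    """Assemble the results of steps 1 to 4 for one poll."""
    handle = poll_handle(poll_number)
    dublin_core = {
        "title": record["title"],  # hypothetical database field names
        "date": record["date"],
        "identifier": handle_url(handle),
    }
    # One digest per datastream, destined for the digital provenance metadata.
    provenance = {
        name: hashlib.sha256(data).hexdigest()
        for name, data in datastreams.items()
    }
    return {"handle": handle, "dublin_core": dublin_core, "provenance": provenance}


prepared = prepare_poll(
    1,
    {"title": "placeholder title", "date": "placeholder date"},
    {"questionnaire.txt": b"Q1 ...", "data.txt": b"1\t2\n"},
)
print(prepared["handle"])
print(prepared["dublin_core"]["identifier"])
```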

Management and Migration

The process for putting new polls online is straightforward and, for the most part, automated. Periodically (about four times every year on average), the Eagleton Center for Public Interest Polling (CPIP) will electronically send a questionnaire and the numeric data for a new poll to the Scholarly Communication Center of Rutgers University Libraries. Using the website administrative interface, staff at the SCC will import the questionnaire and SPSS files into the website application and do the full text indexing. After this import step, CPIP uses the administrative interface to assign subject headings to each question in the poll. The final step is to archive the poll to the digital repository as described in the previous section.

In order to manage digital projects into the future, the SCC assigns a project owner to each project. This person is responsible for addressing technological problems and working with the principal investigator to provide current engineering support and to improve or enhance the website as necessary. Included in these responsibilities are the tasks of migration. The migration process remains a significant step in the overall management process, and it is typically driven by external events such as a new version of software being released, a vendor going out of business, media failure, corruption (e.g., from a virus) or more unthinkable events, such as natural and man-made disasters. To manage and migrate material in the RUL digital repository, a web-based management interface is available that enables one to search, browse, and view the metadata and the datastreams that have been preserved for a particular collection and object. In addition, objects can be ingested and purged. Key migration events that can be readily managed include upgrading to a new version of a commercial database or installing a new version of Fedora. Given the multiple copies of the objects, both in the presentation website and the digital repository, and in physically separate online and off-site storage areas, it should be possible to restore or migrate any data that has succumbed to unpredictable events like a corrupted object or an attack by a hacker. Technological assists such as the handle and the digital signature will help ensure both referential and content integrity. It should be noted that there is considerable work to do in this area before a standard can be defined and promulgated (see for example, <http://www.imsglobal.org/digitalrepositories/driv1p0/imsdri_bestv1p0.html>).
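The role that a recorded signature plays in restoration can be sketched as follows: given several copies of a datastream and the value stored in the provenance metadata, pick the first copy that still matches. The storage location names, the sample bytes, and the use of SHA-256 are assumptions for illustration only.

```python
import hashlib


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


def find_intact_copy(copies, recorded_digest):
    """Return the name of the first location whose copy matches the recorded digest."""
    for location, data in copies.items():
        if sha256_hex(data) == recorded_digest:
            return location
    return None  # no copy can be trusted


# Invented example: the website copy has been corrupted, the others are intact.
good = b"1,2,0.85\n"
recorded = sha256_hex(good)  # value stored at ingest time
copies = {
    "presentation website": b"1,2,0.35\n",
    "digital repository": good,
    "off-site storage": good,
}

source = find_intact_copy(copies, recorded)
print(source)
```

A restore procedure would then copy the intact bytes over the damaged ones and re-verify every location.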

Digital preservation strategies and decisions will require many tradeoffs to balance costs against a high degree of long-term integrity for the data. RUL is working diligently to put in place the management framework and policies required in the critical areas of project creation, description, maintenance, and management. We are, as well, learning from others who have been advancing the state of the art in this area. We have used and adapted certain aspects of the National Library of Australia's preservation metadata (see Note 6), and Project Prism at Cornell [10] has been very useful in clarifying risk management strategies. At the infrastructure level, we believe our directions are compatible with the logical architecture proposed by the Library of Congress [11]. Focused research has also been extremely helpful, such as Lavoie's work [12] on the economics of and incentives for preserving digital materials.

Since the Rutgers University Library system is committing to long-term digital preservation, this service requires ongoing training for the management and execution of migration processes that are yet to be defined or that will change over time. As others have suggested [13], there is a strong basis of knowledge available, especially for static digital objects, and while the practices available are not perfect, they are sufficient to serve as a basis for implementation. RUL is using this knowledge base to begin the important work of digital preservation.

Conclusions

The variety and complexity of digital objects are at times overwhelming and will be increasingly so as we develop new technologies for delivering information. It isn't practical to try to preserve all of this complexity. Ultimately digital preservation becomes a matter of trust in some person, group, commercial enterprise or institution [14]. Digital preservation and archiving are natural extensions of the traditional roles of academic libraries with regard to the preservation of non-digital content. As an institution, the library offers a degree of permanence in academia not easily provided by other university departments or by commercial information organizations. Although considerable change will be required in roles, processes, and policies, it seems appropriate that academic libraries take on this challenge. The Eagleton Poll Archive represents one of the first such projects for Rutgers University Libraries; however, we expect the insight and experience gained from this project will also benefit many of our other digital projects.

Acknowledgment

Bringing the Eagleton Polls online has been a collaborative effort between the Eagleton Center for Public Interest Polling and the Scholarly Communication Center of Rutgers University Libraries. The Co-Principal Investigators on the grant-funded work were Professor Cliff Zukin, Director of the Star-Ledger/Eagleton-Rutgers Poll and Professor in the Public Policy Department at Rutgers, and Ron Jantz, Government & Social Sciences Data Librarian at Rutgers University Libraries. The following people from the Eagleton Institute provided guidance and consultation throughout the development: Professor Monika McDermott (now at the University of Connecticut), Patrick Murray (Associate Director at Eagleton, CPIP), and Thomas Regan (Research Analyst and Manager of Information Services). In addition, many student assistants from Eagleton/CPIP, the Rutgers Computer Science Department, and the Rutgers School of Communication, Information, and Library Studies have provided their services both in the development and continuing support of the Eagleton Poll Archive.

Notes

1 Eagleton Archive website, <http://www.scc.rutgers.edu/eagleton>.

2 Fedora, <http://www.fedora.info>.

3 Metadata Encoding and Transmission Standard - METS, <http://www.loc.gov/standards/mets>.

4 The Handle System®, <http://www.handle.net>.

5 IETF/W3C, XML-Signature Syntax and Processing, <http://www.w3.org/TR/xmldsig-core>.

6 National Library of Australia preservation metadata, <http://www.nla.gov.au/preserve/pmeta.html>.

References

[1] Research Libraries Group, (2002). Trusted Digital Repositories: Attributes and Responsibilities - An RLG Report. Mountain View, California. Available at: <http://www.rlg.org/longterm/repositories.pdf>.

[2] Staples, T., Wayland, R. and Payette, S. (2003). The Fedora Project: An open-source digital object repository management system. D-Lib Magazine, 9(4). Available at <doi:10.1045/april2003-staples>.

[3] Kahn, Robert and Robert Wilensky, "A Framework for Distributed Digital Object Services," Corporation for National Research Initiatives, 1995, Available at <http://www.cnri.reston.va.us/k-w.html>.

[4] Arms, William Y., Christophe Blanchi, and Edward A. Overly, "An Architecture for Information in Digital Libraries," D-Lib Magazine, February 1997. Available at <doi:10.1045/february97-arms>.

[5] Mönch, Christian, "INDIGO - An Approach to Infrastructures for Digital Libraries," Fourth European Conference on Research and Advanced Technology for Digital Libraries, Portugal, Springer, 2000, Lecture Notes in Computer Science, Vol. 1923.

[6] Nelson, Michael L. and Kurt Maly, "Buckets: Smart Objects for Digital Libraries," Communications of the ACM, 44(5), May 2001, pp. 60-62.

[7] Lazinger, S. (2001). Digital Preservation: History, Theory, Practice. Englewood, CO: Libraries Unlimited, A Division of Greenwood Publishing Group, Inc.

[8] Thibodeau, K. (2001). Building the archives of the future: Advances in preserving electronic records at the National Archives and Records Administration. D-Lib Magazine, 7(2). Available at <doi:10.1045/february2001-thibodeau>.

[9] Lynch, C. (1999). Canonicalization: A fundamental tool to facilitate preservation and management of digital information. D-Lib Magazine, 5(9). Available at <doi:10.1045/september99-lynch>.

[10] Kenney, A., McGovern, N., Botticelli, P., Entlich, R., Lagoze, C., and Payette, S. (2002). Preservation risk management for web resources: Virtual remote control in Cornell's Project Prism. D-Lib Magazine, 8(1). Available at <doi:10.1045/january2002-kenney>.

[11] Library of Congress, (2003). Plan for the National Digital Information Infrastructure and Preservation Program. Available at <http://www.digitalpreservation.gov/repor/ndiipp_plan.pdf>.

[12] Lavoie, B. (2003). The Incentives to Preserve Digital Materials: Roles, Scenarios, and Economic Decision-Making. OCLC Online Computer Library Center, Inc. Available at: <http://www.oclc.org/research/projects/digipres/incentives-dp.pdf>.

[13] Hedstrom, M. (2002). The digital preservation research agenda. Available at: <http://www.clir.org/pubs/reports/pub107/hedstrom.html>. Conference Proceedings, Documentation Abstracts, Inc., Institutes for Information Science, Washington, D.C., April 24-25, 2002

[14] Bellinger, M., Campbell, L., Hedstrom, M., Marcum, D., Thibodeau, K., Waters, D., van der Werf, T., and Webb, C. The state of digital preservation: An international perspective. Available at: <http://www.clir.org/pubs/abstract/pub107abst.html>. Conference Proceedings, Documentation Abstracts, Inc., Institutes for Information Science, Washington, D.C., April 24-25, 2002.

Copyright © Ronald Jantz
DOI: 10.1045/november2003-jantz