Preservation Risk Management for Web Resources: Virtual Remote Control in Cornell's Project Prism

D-Lib Magazine
January 2002

Volume 8 Number 1

ISSN 1082-9873

Preservation Risk Management for Web Resources

Virtual Remote Control in Cornell's Project Prism

Anne R. Kenney
Nancy Y. McGovern
Peter Botticelli
Richard Entlich
Carl Lagoze
Sandra Payette
Cornell University

Main Project Prism: <http://www.prism.cornell.edu/>
Library Project Prism: <http://www.library.cornell.edu/preservation/prism.html>

Actuaries spend their careers figuring out what benefits a company should offer, at what price, and for how long. Their job is to make sense of all the empirical and statistical evidence of age, gender, health, heredity, life styles, physical habits, and living and working conditions that serve as indicators of longevity, productivity, and obligation. How well they do their job depends on how good their evidence is, how skilled they are at reading it, and how risk tolerant their customers are.

Archivists and research librarians interested in preserving Web resources face a similar challenge. Libraries increasingly depend on digital assets they neither own nor manage. This article describes current Web preservation efforts by libraries and archives and suggests how a new preservation strategy could use a risk management methodology. Cornell's Project Prism is exploring technologies and tools to assess the lifestyle and habits of the Web that research libraries and other entities can monitor and use to develop retention policies for online resources.

Project Prism's approach begins with characterizing the nature of preservation risks in the Web environment, develops a risk management methodology for establishing a preservation monitoring and evaluation program, and leads to the creation of management tools and policies for virtual remote control. The approach will demonstrate how Web crawlers and other automated tools and utilities can be used to identify and quantify risks; to implement appropriate and effective measures to prevent, mitigate, and recover from damage to and loss of Web-based assets; and to support post-event remediation. Project Prism is producing a framework for developing an ongoing comprehensive monitoring program that is scalable, extensible, and cost effective.

Growing Dependence on Web-Accessible Resources

Academic libraries have dramatically increased their offerings of online resources. A survey of the 21 members of the Digital Library Federation revealed that 40% of their costs for digital libraries in 2000 went for commercial content.¹ The big-ticket items were electronic scholarly journals that libraries license rather than own. Yet little direct evidence shows that publishers have developed full-scale digital preservation capabilities to protect this material, and research libraries continue to purchase the print versions for preservation purposes. However, none appears ready to forgo access to the licensed content just because its long-term accessibility might be in question.

Research libraries are also including in their catalogs and gateways more open-access Web resources that are not covered by licenses or other formal arrangements. A spring 2001 survey of Cornell's and Michigan's Making of America collections revealed that nearly 250 academic institutions link directly to the MOA collections, although neither university has committed to provide other entities with long-term access. Similarly, a review of the holdings of several research library gateways over the past few years indicates growth in the number of links to open-access Web resources that are managed with varying degrees of control.² Approximately 65% of the electronic resources on Cornell's Gateway are unrestricted, and additional open resources are included in aggregated sets that are available only to the campus community.³ One of the links is to the University of California, Berkeley's CPU Info Center. This resource is notable because Tom Burd, the site manager, has done several things to advance its preservation, including establishing a mirror site, documenting changes, and providing a checksum in the source page. A recent note posted on this site, however, demonstrates how fragile such resources can be:

"I am no longer affiliated with U.C. Berkeley, and it has become very difficult to maintain this site. With the state of the web now, as compared to when I started this site in 1994, I'm not sure if it even warrants continuing on in light of many other online resources. As such, I will probably bring this site to a close in the coming weeks. If someone wanted to take over maintaining this site, I would be happy to tar up all the files and hand them over. Please drop me a line if you are interested..." ⁴

Estimates put the average life expectancy of a Web page between 44 days and two years, and a significant proportion of those that survive undergo some change in content within a year. Since 1998, OCLC's Web Characterization Project has tracked trends in growth and content of the publicly available Web space. One of the more revealing statistics, IP Address Volatility, reports the percentage of IP addresses that persist from one year to the next. In a fairly consistent trend since 1998, slightly over half (55-56%) of the IP addresses identified in one year are still available the next. Within two years, a little over a third (35-37%) remain. Four years later only 25% of the sample 1998 IP addresses could be located.⁵

OCLC's annual review points to the instability of Web resources; it doesn't indicate whether those resources still exist elsewhere on the Web or whether the content has changed. While some resources disappear, others become unfindable due to the well-known problem that URLs change.⁶ A recent preservation review of the 75 Smithsonian Institution Web sites noted that an exhaustive search could not locate a copy of the first Smithsonian Web site, created in 1995. A URL may persist while content changes wildly: the editors of RLG DigiNews discovered that links in several past issues pointed to lapsed domain names that had been converted by others into porn sites.

Much attention has been paid to unstable URLs and to creating administrative/preservation metadata, but to date no evidence suggests that research libraries are privileging open access sites that utilize some form of URN or that document content change. Even if such precautions were fully implemented, Web resources are particularly vulnerable to external attacks. This past year, the Internet was hit hard by the Nimda worm, which took down 150,000 computers, and the Code Red virus, which struck more than 12,000 Web sites in the U.S. In June 2001, Microsoft had to issue a patch to protect its attack-vulnerable Internet Information Server (IIS) software, which is used by approximately 16 million Web sites.⁷

Current Web Preservation Efforts by Libraries and Archives

With the growing dependence on external digital assets, libraries and archives are undertaking some measures to protect their continued use of these resources. Efforts can be grouped into three areas: collaborating with publishers to preserve licensed content, developing policies and guidelines for creating and maintaining Web sites, and assuming archival custody for Web resources of interest.

Licensed Content

Publishers and librarians alike are grappling with how best to preserve licensed content. Publishers are developing their own preservation strategies as they realize the commercial benefits of creating deep content databases. Elsevier Science among others has committed to building an electronic back file collection for all its publications and intends to maintain these electronically archived copies "forever." In 1999 the company developed a formal archiving policy (updated in 2001) that has been added to all licenses for ScienceDirect.⁸ A number of publishers are also working with third parties to back up, store, and refresh digital content. OCLC recently announced the formation of the Digital and Preservation Resources Division to provide integrated solutions for creating, accessing, and preserving digital collections. With planning grants from The Andrew W. Mellon Foundation, seven research libraries and key commercial and scholarly publishers are exploring formal archiving arrangements for e-journals. In 2002, Mellon intends to fund up to four continuing projects to gain practical experience with the functions and costs of constructing and operating e-journal archives for several years.⁹

Policies and Guidelines for Creating and Maintaining Web Sites

Recommendations for building Web sites have addressed digital preservation indirectly. The World Wide Web Consortium's (W3C) Web Content Accessibility Guidelines, Techniques, and Checklist provide some recommendations for good resource management (e.g., use of standard formats and backward compatible software) and have had a major impact on the development of Web materials worldwide, especially for institutions affected by legislative requirements to meet the needs of disabled users, as outlined in such legislation as the Americans with Disabilities Act and the Rehabilitation Act (section 508).¹⁰ However, the W3C guidelines do not expressly address content stability, documentation of change, or good database management. In fact preservation and records management issues are noticeably absent.

In the United States, Web preservation is more directly supported through government policies and guidelines to promote accountability, spurred in part by such legislation as the Paperwork Reduction Act.¹¹ For at least the past five years, Charles R. McClure and J. Timothy Sprehe have investigated policies and guidelines affecting state and federal agency Web sites.¹² Their 2000 study, Performance Measures for Federal Agency Websites, evaluated federal policies and defined criteria and performance measures—including those pertaining to record keeping—for assessing agency compliance with those policies.¹³

Governments are also promulgating specific policies and recommendations for preserving government-supported Web content. In January 2001, the U.S. National Commission on Libraries and Information Science published "A Comprehensive Assessment of Public Information Dissemination," which recommends legislation that would "formally recognize and affirm the concept that public information is a strategic national resource."¹⁴ Another recommendation is to "partner broadly, in and outside of government, to ensure permanent public availability of public information resources."

The archivist's perspective has been quite influential, as arguments are advanced to treat Web sites as important records in their own right.¹⁵ National archives in many countries are developing policies and guidelines. The U.S. Federal Records Act, as amended, requires that agencies identify and transfer Web site records to agency record keeping systems, including the National Archives and Records Administration (NARA), for permanent retention.¹⁶ NARA has issued several bulletins on the disposition of electronic records that include Web sites.¹⁷ It has also slowly begun to respond to this new form of record keeping and has appraised at least one federal Web site as a permanent record. In late 2000, NARA established an initiative to capture a snapshot of all federal Web sites at the end of the Clinton Administration.¹⁸ NARA has also contracted with the San Diego Supercomputer Center for a project to investigate the preservation of presidential Web sites.¹⁹

The National Library of Australia (NLA) has been a world leader in promulgating guidelines for preservation. In December 2000 the NLA issued Safeguarding Australia's Web Resources, which provides advice on creating, describing, naming, and managing Web resources.²⁰ The Council on Library and Information Resources funded NLA's Safekeeping Project, which targets 170 key items accessible through PADI (Preserving Access to Digital Information).²¹ NLA staff wrote to the resource managers encouraging them to voluntarily preserve these materials and outlined nine strategies for long-term access.²² Responses have been received from 116 resource owners and to date, safekeeping arrangements have been made for 77 items. Negotiations are in progress for an additional 33 resources. Eight resource owners lacked the appropriate infrastructures to comply with the recommendations. Alternative "safekeepers" have been approached for four of these. By the end of 2001, 54 resource owners had not responded.²³

Assuming Archival Custody

The third major focus of Web preservation has been to identify and ingest Web content into digital repositories. The best-known example is the Internet Archive, a not-for-profit organization associated with Alexa Internet, which has been automatically collecting all open access HTML pages since 1996.²⁴ Also in 1996, the National Library of Australia's Pandora adapted Web crawling to archive selected Australian online publications.²⁵ That same year, the Royal Library of Sweden launched Kulturarw3 to collect, preserve, and make accessible Swedish electronic documents published online.²⁶ For Pandora, ingest includes manual creation and/or clean up of metadata and the establishment of content boundaries. This approach may be cost effective for a few highly valuable documents, but may be prohibitively expensive for large collections. Important Web archiving projects continue throughout the world.²⁷

On October 24, 2001, the Internet Archive released the Wayback Machine, which lets users view snapshots of Web sites as they appeared at various points in the past. With over 10 billion Web pages exceeding 100 terabytes of data and growing at a rate of 12 terabytes a month, the Internet Archive provides the best view of the early Web as well as a panoramic record of its rapid evolution over the past five years. It provides an invaluable tool for documenting change and filling some of the void in record keeping in the Web's early days. We owe a debt of gratitude to the founders of the Internet Archive for the foresight and plain boldness of such an imposing task. Nevertheless, it would be a mistake to conclude that the challenge has been met and the rest of us can relax. As impressive as the accomplishments of the Internet Archive are to date, this approach to Web preservation is only part of the solution to a much larger problem.

The Internet Archive and similar efforts to preserve the Web by copying suffer from common weaknesses that they readily acknowledge:²⁸

Snapshots may or may not capture important changes in content and structure.²⁹

Technology development, including robot exclusions, password protection, Javascript, and server-side image maps, inhibits full capture.

A Web page may serve as the front end to a database, image repository, or a library management system, and Web crawlers capture none of the material contained in these so-called "deep" Web resources.³⁰

The sheer volume of material on the Web is staggering. The high-speed crawlers used by the Internet Archive take months to traverse the entire Web; even more time would be needed to treat anomalies associated with downloading. Not all sites merit the same level of attention, especially given limited resources, and means must be devised for honing selection and treating materials according to their needs.

Automated approaches to collecting Web data tend to stop short of incorporating the means to manage the risks of content loss to valuable Web documents.

File copying by itself fails to meet the criteria RLG and OCLC have identified in Attributes of a Trusted Digital Repository.³¹ For example, the Internet Archive has not overtly committed to continued access through changing file formats, encoding standards, and software technologies.

Legal constraints limit the ability of crawlers to copy and preserve the Web.

Project Prism: Preservation Risk Management for Web Assets

Web preservation efforts to date address major areas of concern, but fail to consider the challenge of preserving content that an institution does not control or for which it cannot negotiate formal archiving arrangements or assume direct custody. Over time, preserving Web content will require substantial resource commitments, as well as flexible and innovative approaches to changes in technologies, organizational missions, and user expectations. Cornell University's Project Prism is a joint research effort by the Computer Science Department and the University Library to support libraries and archives as they extend their role from custodians of physical artifacts to managers of selected digital objects distributed over the network. Digital curatorial responsibilities will need to be reconsidered and undertaken in light of cost, level of participation by cooperative or uncooperative partners, and technical feasibility. At the same time, we aim to design archiving tools and services that will enable non-librarians to raise the information integrity of research collections that are now managed haphazardly, if at all. Ultimately, we seek an approach to archiving distributed Web content that takes custody of digital files as a last resort, though the methodology could also be used for pre-ingest management.

We are exploring a noncustodial, distributed model for archiving, in which resources are managed along a spectrum, from, at the highest level, a formal repository to, at the lowest level, the unmanaged Web. One of our goals is to show how the integrity of unmanaged resources can be raised at minimal cost, using automated routines for monitoring and validating files according to policies established by organizations that value the longevity of those resources. Our overall goal is to create archiving tools that will enable libraries, archives, commercial database providers, scholarly organizations, and individual authors to manage different sets of risks affecting the same resources remotely.

A risk-based preservation management program begins with two key questions: what assets may be at risk and should be included in the program, and what constitutes risks to those assets? Risk is a relative term—an event or threat may be risky in one environment but not in another.³² Therefore, risk management programs should be developed and implemented within an organizational context: each institution will need to define its own "worry radius"—the context that provides definitions of perceived risk and acceptable loss.³³ Effective risk management also requires determining the scope and value of assets. The cost of implementing the program should be appropriate to the estimated value of the assets and the impact of their loss on operations and services.³⁴

Overview of Risk Management

Risk management is becoming a business in itself. That was true before September 11, but in its wake demand for risk management policies, organizations, consortia, and consultants has escalated.³⁵ The Internet is crammed with tools, guides, and services for every size and type of organization or industry.³⁶ Insurance providers, healthcare providers, nonprofit organizations, environmental monitors, and financial investors are all particularly active in these developments.³⁷ On the business side, the literature reflects an evolution from disaster planning to business continuity to risk management. The rapidly growing areas of capital assets and digital assets management are closely aligned with these risk management developments. In the academic arena, institutions such as the Wharton Risk Management and Decision Processes Center are developing multidisciplinary approaches that combine methods and techniques from a wide range of domains, including decision analysis, public policy, economics, political science, and psychology.³⁸

The boom in risk management has not applied to digital preservation. The NLA's Archiving Web Resources - Guidelines contains a chapter on assessing risk, but it largely addresses the need to track changes to Web sites for record keeping, primarily to reduce liability and accountability risks to the organization, not the risks to its Web-based content. Risk Management of Digital Information: A File Format Investigation develops a risk management methodology for migration.³⁹ The report dissects the migration process and identifies risk categories and specific risks. Project Prism will adapt this kind of risk management to Web-based materials.

Much of the risk management literature presents practical, commonsense approaches to generic problems or domain-specific requirements. Even traditional risk management is addressing technology-related issues. Regarding Web resources, the literature acknowledges that a key challenge is to balance flexible access to Web sites against the security needed to protect them.⁴⁰

Risk Management Models

Many of the proposed models cited in the literature share a common progression for establishing a risk management program.⁴¹ Project Prism has four main phases that map well to the typical stages of risk management programs.

Table 1: Risk Management Stages

Risk identification is the process of detecting potential risks or hazards through data collection. A range of data collection and manipulation tools and techniques exists.⁴² In Phase 1 of Prism, the team is using both automated and manual techniques to collect data and begin to characterize potential risks to Web resources. Web crawling is one effective way to collect information about the state of Web pages and sites. The Prism team employs the Mercator Web crawler to collect and analyze data to test hypotheses about the relationship between observable characteristics of Web resources and threats to longevity.⁴³ The modular and extensible nature of Mercator makes it a powerful tool for customized analyses.

Risk classification⁴⁴ is the process of developing a structured model to categorize risk and fitting observable risk attributes and events into the model. The OECD's Chemical Accident Risk Assessment Thesaurus (CARAT™) is a good example of such a risk classification.⁴⁵ The Prism team combines quantitative and qualitative methods to characterize and classify the risks to Web pages, Web sites, and the hosting servers.⁴⁶

Risk assessment is the process of defining relevant risk scenarios or sequences of events that could result in damage or loss and the probability of these events.⁴⁷ Many sources focus on risk assessment. Rosenthal describes the characteristics of a generic standard for risk assessment as "transparent, coherent, consistent, complete, comprehensive, impartial, uniform, balanced, defensible, sustainable, flexible, accompanied by suitable and sufficient guidance."⁴⁸ Variables to consider in assessing risk include the value of assets, possible threats, known vulnerabilities, likelihood of loss, and potential safeguards. In Project Prism, we are defining a data model for storing risk-significant information. This model reflects key attributes about Web assets, observed events in the life of these resources, and information about the resources' environment. A key aspect of risk assessment in Prism is defining and detecting significant patterns that may exist in this data.

Risk analysis determines the potential impact of risk patterns or scenarios, the possible extent of loss, and the direct and indirect costs of recovery.⁴⁹ This step identifies vulnerabilities, considers the willingness of the organization to accept risk given potential consequences, and develops mitigation responses.⁵⁰ Artificial intelligence methods, decision support systems, and profiles of organizations all support risk analysis. The resulting knowledge and exposure databases provide evolving sources of information for analyzing potential risks.⁵¹ Project Prism is developing a knowledge base that could be characterized as a risk analysis engine.

Risk management implementation defines policies, procedures, and mechanisms to manage and respond to identifiable risks. The implemented program should balance the value of assets and the direct and indirect costs of preventing or recovering from damage or loss. The program should be known and understood both within the organization and by relevant stakeholders.⁵² An effective program includes comprehensive scope, regular audits, tested responses and strategies, built-in redundancies, and openly available, assigned responsibilities.⁵³

Bringing all of the pieces together for a fully implemented risk management program involves establishing holistic policies and compliance monitoring, developing ways to measure program effectiveness, managing the development and deployment of countermeasures, identifying incentives, building the risk management team, and developing or adapting supporting tools for the program.⁵⁴
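The data model mentioned under risk assessment (key attributes of Web assets, observed events in the life of these resources, and information about their environment) might be sketched roughly as follows. This is only an illustration under stated assumptions, not Project Prism's actual schema: the class names, field names, and the 1-to-5 value rating are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class ObservedEvent:
    """One observation in the life of a resource (hypothetical fields)."""
    observed_at: datetime
    kind: str      # e.g. "http_status" or "content_change"
    detail: str    # coded value for the event, e.g. "404"


@dataclass
class WebAsset:
    """A monitored Web resource: attributes, environment, and event history."""
    url: str
    value_rating: int  # assumed institution-assigned scale, 1 (low) to 5 (high)
    attributes: Dict[str, str] = field(default_factory=dict)   # e.g. file format
    environment: Dict[str, str] = field(default_factory=dict)  # e.g. domain type
    events: List[ObservedEvent] = field(default_factory=list)

    def record(self, event: ObservedEvent) -> None:
        """Append a new observation to the asset's history."""
        self.events.append(event)

    def count_events(self, kind: str, detail: str) -> int:
        """Count how often a given kind of event with a given detail was seen."""
        return sum(1 for e in self.events if e.kind == kind and e.detail == detail)
```

Pattern detection of the kind described above would then operate on the accumulated `events` lists, for instance by asking how many "404" observations an asset has collected.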

Automated Strategies to Support Preservation Risk Management

Project Prism is exploring technologies that will form the basis for a suite of tools to support risk-based preservation monitoring and evaluation of Web resources. From a technical perspective, our goal is to design feasible and appropriate mechanisms for off-site monitoring. Assuming that over time libraries and other information intermediaries will extend their collecting scope over greatly increasing amounts of distributed content and that the longevity of these resources will be a primary concern, automatic methods will be needed to deal with such volume cost effectively and to produce consistent results that are less prone to human error. The methods will need to accommodate both content providers who cooperate in the effort, for example by contributing metadata, and content providers who, while not hostile to the idea of monitoring, are not collaborating. The methods will also need to be flexible enough to suit the variety of management requirements of diverse institutions.

These monitoring mechanisms should be deployable in a range of systems contexts. For a university research library, that context might be a management system used to collect lists of URLs that faculty and librarians have deemed important through some rating scale. The library might then employ the monitoring schemes outlined in the rest of this section as it assumes a role of "managing agent" for those external resources. At the other end of the spectrum, a preservation service might be a program that users could install on their own workstations to monitor Web resources of their own choice. This tool could be launched like other utility tools such as a disk defragmenter or an anti-virus scanner.

The Web resources within an organization's worry radius might be a Web site, a subset of resources in a Web site, or a single Web page or document. Furthermore, a Web resource might live in an individual's informally managed Web page or in an organization's highly controlled Web site. Defining the boundaries of a Web resource for preservation monitoring is not easy. Mechanisms for preservation risk management must address four levels of context:

A Web page as a stand-alone object, ignoring its hyperlinks.

A Web page in local context, considering the links into it and out from it.

A Web site as a semantically coherent set of linked Web pages.

A Web site as an entity in a broader technical and organizational context.

Time is part of each of these contexts. For risk analysis, some threats can be detected from the examination of a single static snapshot of a resource, while other threats become visible through analysis of how the resource changes over time. Project Prism is concerned with both the snapshot view and the time-elapsed view. For each of the four contexts, we hypothesize appropriate technical approaches for risk detection. By testing these hypotheses we can transform our results into the suite of tools we need.

Monitoring a Web Page as a Stand-Alone Object

As a stand-alone object, a Web page must be considered without regard to its hyperlinked context. What risk attributes are visible by looking at a single Web resource minus its link structure? Given a one-time snapshot of a single Web page, automated tools can observe these significant features:

Tidiness of HTML formatting: Just as sloppy work habits reflect badly on an employee, untidy HTML is a reason for some unease about the management of a Web resource. While early versions of HTML had poorly defined structure, the recent redefinition of HTML in the context of XML (XHTML) has now formally defined HTML structure.⁵⁵ The TIDY tool makes it possible to determine how well an HTML document conforms to this structure, revealing the sophistication and care of the page's manager.⁵⁶

Standards conformance: Data format standards, such as the popular JPEG image standard, change over time, sometimes making previous versions unreadable.⁵⁷ A monitoring mechanism could automatically determine whether a Web resource conformed to current standards. Conformance to open standards could also be considered. Arguably, Web resources formatted according to a nonpublic standard—for example Microsoft Word documents—may be a greater longevity risk than those formatted to public standards. On the other hand, industry dominance can privilege some proprietary formats over formats that are standard but not widely adopted, e.g., PNG.⁵⁸

Document structure: Like HTML formatting, a document that manifests good structure, in the manner of a good research paper, may be more dependable than one that consists of text with no apparent order. Automated digital libraries such as ResearchIndex have had success with heuristics for deriving structure from PDF, PS, and HTML documents.⁵⁹ These techniques could be used to measure the level of structure in a Web resource.

Metadata: The presence or absence of metadata tags conforming to standards such as Dublin Core may indicate the level of management.⁶⁰
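Two of the snapshot features above, HTML tidiness and the presence of metadata, can be approximated with a few lines of code. The sketch below is a crude stand-in for a tool such as TIDY, not a replacement for it: it uses Python's standard `html.parser` module, counts elements left unclosed (treating every non-void element as requiring an end tag, in the spirit of XHTML), and looks for `<meta>` names beginning with "DC." as a sign of Dublin Core metadata. The names `SnapshotInspector` and `inspect_snapshot` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class SnapshotInspector(HTMLParser):
    """Collects crude tidiness and metadata signals from one HTML page."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # stack of currently open elements
        self.unclosed = 0         # elements closed implicitly by an outer end tag
        self.stray_end_tags = 0   # end tags with no matching open element
        self.dc_fields = []       # names of Dublin Core style <meta> tags

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            name = dict(attrs).get("name") or ""
            if name.lower().startswith("dc."):
                self.dc_fields.append(name)
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in self.open_tags:
            # Pop back to the matching opener; anything above it was left unclosed.
            while self.open_tags:
                if self.open_tags.pop() == tag:
                    break
                self.unclosed += 1
        else:
            self.stray_end_tags += 1


def inspect_snapshot(html_text):
    """Return simple tidiness and metadata indicators for one page snapshot."""
    inspector = SnapshotInspector()
    inspector.feed(html_text)
    inspector.close()
    return {
        # Elements still open at end of input also count as unclosed.
        "unclosed_tags": inspector.unclosed + len(inspector.open_tags),
        "stray_end_tags": inspector.stray_end_tags,
        "dublin_core_fields": inspector.dc_fields,
    }
```

How such counts should be weighted as risk indicators is exactly the kind of hypothesis that would need testing against real data; the sketch only shows that the raw signals are cheap to collect.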

Automatic mechanisms could track the following characteristics over time:

HTTP response code: The HTTP protocol defines response codes that indicate transfer error or success.⁶¹ An off-site monitor could record the incidence of HTTP response codes over time, and certain patterns of codes, such as a high frequency of 404 (Not Found) codes, could be used to measure risk.

Response time: A server with widely fluctuating response times or consistently slow response time indicates a higher level of risk than one that is responsive.

Page changes: For certain types of pages, no changes at all might indicate complete lack of management or maintenance. On the other hand, unpredictable and large changes might indicate chaotic management. Pages that change on some predictable schedule with some predictable delta might indicate high-integrity management. Monitoring mechanisms that employ copy detection methods or page-similarity metrics would be useful for developing a measurement for page changes over time.⁶²

Page relocation: The lack of persistence of URLs is a well-known problem. Certainly, the disappearance of a selected resource, evidenced by consistent "page-not-found" errors, should be a cause for alarm. Techniques such as "robust hyperlinks" might make it possible to track the movement of a resource across the Web and use that movement and/or replication to determine risk.⁶³
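Several of the characteristics tracked over time reduce to simple summaries of a monitor's observation log. The following sketch assumes the monitor has already recorded, for one URL, a list of HTTP status codes, a list of response times in seconds, and successive text snapshots. The function names are hypothetical, and `difflib.SequenceMatcher` from the Python standard library is used here as a simple substitute for the copy detection and page-similarity methods cited above, not as an implementation of them.

```python
from collections import Counter
from difflib import SequenceMatcher
from statistics import mean, pstdev


def status_profile(status_codes):
    """Fraction of observations per HTTP status code, e.g. {200: 0.75, 404: 0.25}."""
    if not status_codes:
        return {}
    counts = Counter(status_codes)
    total = len(status_codes)
    return {code: n / total for code, n in counts.items()}


def response_time_summary(seconds):
    """Mean and population standard deviation of observed response times."""
    return {"mean": mean(seconds), "stdev": pstdev(seconds)}


def change_ratio(old_text, new_text):
    """0.0 means the two snapshots are identical; 1.0 means nothing in common."""
    return 1.0 - SequenceMatcher(None, old_text, new_text).ratio()
```

A high share of 404 responses, a large standard deviation in response time, or erratic change ratios between snapshots would each be candidate risk signals; the thresholds at which they become alarming are left open here.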

Monitoring a Web Page in a Hyperlinked Context

The hyperlinked structure of a Web page, its in-links and out-links, has been successfully exploited in the development of better Web search engines.⁶⁴ Similarly, such "link context," the links out from a page and the links from other pages to that page, may prove useful in deducing longevity risks.

Using a page snapshot, risks can be detected by analyzing:

Out-link structure: Consider a page that links to a number of pages on the same server, in contrast to another page that either has no out-links or only links to pages on other servers. Intuitively, the "intralinked page" may be more integrated into a site and at lower risk. Pages with no links at all might be considered highly suspicious, having the appearance of "one-offs" rather than long-term Web resources.

In-link structure: An equal if not greater indicator of longevity risk is the number of links from other pages to a page and the nature of those links. Isolated pages, ones with no in-links, should be highly suspect. Ascertaining the absence of in-links in the Web context is hard, since it requires crawling the entire Web. Two more tractable and meaningful in-link measurements are:

Intra-site links: As noted, a page that is integrated into a Web site structure seems more trustworthy than one not pointed to by any pages on its site. It is possible to crawl that Web site (defined by stripping the page URL down to its root DNS component) to determine if any page on that site links to the page in question.

Hub links: Kleinberg's HITS algorithm describes a method for finding authoritative Web resources relative to a specific query.⁶⁵ The presence or absence of links to a page from one or more of these authoritative Web resources might be an indicator of risk. In related work, we are developing methods for classifying Web pages automatically in collection categories, each of which is characterized by a set of authoritative pages on the Web. We could then initiate a Web crawl from these authorities and find direct or "transitive" links to a given page.

Page provenance: The URL of a Web page can itself provide metadata about the page's provenance and management structure. The host name often provides useful information on the identity (the "address") of the Web server hosting a page, and, less reliably, the name of the institution responsible for publishing the page. A top-level domain name can help classify a publishing organization by type (.edu, .gov, .com). Project Prism will investigate the correlation between top-level domain name and preservation risks.⁶⁶ Also, the path name may provide clues about organizational subunits that may be responsible for managing a Web page or site. In Illustration 1, "preservation" in the path name may indicate a department or subunit of Cornell University Library, although it could also refer to the topic of preservation; either way, it may help establish responsibility for the page.

Link volatility: Once the nature of the links to and from a page is determined, it is useful to compare changes in those links over time. If out-links are added or updated, a page is evidently being maintained and is at reduced risk. A decrease in in-links may indicate approaching isolation and should cause concern.
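Out-link structure and link volatility can both be read off stored page snapshots. The sketch below is one possible approach using only Python's standard library; the class and function names are hypothetical, the two HTML snapshots are invented, and "same server" is approximated by comparing host names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href targets of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))


def out_links(page_url, html):
    """Return the set of absolute URLs that a page snapshot links to."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    return collector.links


def split_by_server(page_url, links):
    """Partition links into same-server ("intralinked") and other-server sets."""
    host = urlsplit(page_url).hostname
    intra = {link for link in links if urlsplit(link).hostname == host}
    return intra, links - intra


def link_changes(old_links, new_links):
    """Links added and links removed between two snapshots."""
    return new_links - old_links, old_links - new_links


# Invented snapshots of a hypothetical page, taken at two different times.
page = "http://my.org/document/root.html"
first = out_links(page, '<a href="a.html">A</a> <a href="http://your.org/z.html">Z</a>')
second = out_links(page, '<a href="a.html">A</a> <a href="sub/b.html">B</a>')

intra, external = split_by_server(page, second)
added, removed = link_changes(first, second)
print(len(intra), len(external), sorted(added), sorted(removed))
```

Measuring in-links would need the same extraction applied to other pages, which is why the text treats it as the harder problem.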

Illustration 1: Parsing the URL
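The kind of URL parsing the illustration refers to can be sketched with Python's `urllib.parse`. This is only an illustration under stated assumptions: `describe_url` is a hypothetical helper, the URL used is the Library Project Prism address given at the head of this article (the illustration itself may use a different one), and taking the last host label as the top-level domain is a simplification.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Break a URL into the parts that may hint at provenance."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "scheme": parts.scheme,
        "host": host,
        # Simplification: treat the last dot-separated label as the TLD.
        "top_level_domain": host.split(".")[-1],
        "path_segments": [segment for segment in parts.path.split("/") if segment],
    }


info = describe_url("http://www.library.cornell.edu/preservation/prism.html")
for key, value in info.items():
    print(f"{key}: {value}")
```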

Monitoring a Web Site

There will be many cases where the unit of preservation is a Web site—a coherent collection of interlinked pages rather than a single page. The notion of a Web site lacks good formal definition, with just a few ideas on how to define a metadata structure for such an entity.⁶⁷ McClure and Sprehe define a Web site as "a set of Uniform Resource Locators (URLs) that fall under a single administrative control."⁶⁸ For Prism, a Web site is a set of URLs that are syntactically appended to a root URL.

For example, a root URL:
http://my.org/document/root.html

has linked pages with URLs like:
http://my.org/document/a.html
http://my.org/document/aub/b.html

but not like:
http://your.org/z.html
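One way to read "syntactically appended to a root URL" is as a prefix test on the directory that contains the root document. The sketch below encodes that reading; `site_prefix` and `in_site` are hypothetical names, and treating the root's directory (rather than the root file itself) as the prefix is an assumption inferred from the example URLs above.

```python
from urllib.parse import urlsplit


def site_prefix(root_url):
    """Scheme, host, and directory of the root URL, ending in a slash."""
    parts = urlsplit(root_url)
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc.lower()}{directory}"


def in_site(root_url, url):
    """True if url falls under the directory that holds the root URL."""
    parts = urlsplit(url)
    candidate = f"{parts.scheme}://{parts.netloc.lower()}{parts.path}"
    return candidate.startswith(site_prefix(root_url))


root = "http://my.org/document/root.html"
for url in (
    "http://my.org/document/a.html",
    "http://my.org/document/aub/b.html",
    "http://your.org/z.html",
):
    print(url, in_site(root, url))
```

Keeping the trailing slash in the prefix prevents a sibling directory such as `/documentation/` from being mistaken for part of the site.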

Assessing the longevity risk of a Web site will require algorithms for aggregating the risk metrics of its individual pages. Additionally, the structure of the site might serve as an indicator of risk. To analyze this structure we can exploit the wealth of work and algorithms on graphs and the characterization of the Web as a directed graph.⁶⁹ In this characterization, resources (documents) at URLs are nodes and the hyperlinks from documents at URLs to documents at other URLs are directed edges in the graph. The organization of a site's internal structure might be appropriate for risk analysis, just as for an individual page. Using graph analysis methods to derive cliques or strongly connected components from graph representations of site structure may make it possible to develop a set of patterns that reflect good site management.
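As a concrete instance of the graph view, the sketch below finds the strongly connected components of a small site graph using Kosaraju's algorithm, one standard method chosen here for illustration rather than one named by the project. The site graph is invented, and which component patterns actually reflect good management remains the open question posed above.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm. graph maps each node to a list of successors."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Pass 1: depth-first search, recording nodes in order of completion.
    visited, order = set(), []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: this node is finished.
                order.append(node)
                stack.pop()

    # Build the reversed graph (every edge flipped).
    reverse = {node: [] for node in nodes}
    for source, targets in graph.items():
        for target in targets:
            reverse[target].append(source)

    # Pass 2: search the reversed graph in decreasing completion order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


# Invented site: pages are nodes, hyperlinks are directed edges.
site = {
    "index": ["a", "b", "pdf"],
    "a": ["index", "b"],
    "b": ["a"],
    "orphan": ["index"],
    "pdf": [],
}
components = strongly_connected_components(site)
print(sorted(sorted(component) for component in components))
```

In this toy site the mutually reachable core is one component, while a page that nothing links to and a document with no out-links each form a component of their own.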

Based on the static analysis of a site's structure, it would then be possible to analyze changes to it over time. How the Web site evolves should be considered another indicator of risk. A site where links are added or modified regularly and which conforms to a discernible structure exemplifies good management practices, and thus lower risk. Site evolution patterns could be measured through one of the graph similarity algorithms such as edit distance or maximal common subgraphs.⁷⁰
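When every node carries a unique label, as pages do with their URLs, both measures become simple set operations. The sketch below relies on that simplifying assumption: edit distance counts only node and edge insertions and deletions (no relabeling), and the maximal common subgraph is taken to be the pages and links present in both snapshots. The function names and the two snapshots are invented for illustration.

```python
def nodes_of(graph):
    """All pages mentioned in a snapshot, as sources or link targets."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    return nodes


def edges_of(graph):
    """All hyperlinks in a snapshot as (source, target) pairs."""
    return {(src, dst) for src, targets in graph.items() for dst in targets}


def edit_distance(old, new):
    """Node and edge insertions plus deletions needed to turn old into new."""
    return len(nodes_of(old) ^ nodes_of(new)) + len(edges_of(old) ^ edges_of(new))


def common_subgraph_similarity(old, new):
    """Size of the shared subgraph relative to the larger snapshot, in [0, 1]."""
    shared = len(nodes_of(old) & nodes_of(new)) + len(edges_of(old) & edges_of(new))
    largest = max(
        len(nodes_of(old)) + len(edges_of(old)),
        len(nodes_of(new)) + len(edges_of(new)),
    )
    return shared / largest if largest else 1.0


# Invented snapshots of the same hypothetical site at two points in time.
earlier = {"index": ["a", "b"], "a": ["index"], "b": []}
later = {"index": ["a", "b", "c"], "a": ["index"], "b": ["index"], "c": []}

print(edit_distance(earlier, later))
print(round(common_subgraph_similarity(earlier, later), 2))
```

If pages are renamed or moved between snapshots, URL labels stop being reliable and a general graph-matching algorithm would be needed instead.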

Monitoring a Web Site in a Technical and Organizational Context

A Web site is a collection of Web pages, but it also resides on a server within an administrative context, all of which may be affected by the external technical, economic, legal, organizational, and cultural environment. Identifying, monitoring, and managing the ecology of a Web site involves the individual and collective analysis of a number of factors at these different levels—more than just checking for HTTP codes that indicate a page is unavailable or has moved. Problems can be caused by server software misconfiguration, bad cables and router failure, denial-of-service attacks, and many other factors. It is entirely possible that the biggest threat to the continued health of a Web site has nothing to do with how well the site is maintained or even how often it is backed up, but rather the fact that the backup tapes are stored in the same room as the server and a single catastrophic event (fire, flood, earthquake) could destroy them both.

Illustration 2: Ecology of a Web Site

Comprehensive care of a Web site has to include:

Hardware and software environment, including any upgrades to the operating system and Web server, the installation of security patches, the removal of insecure services, use of firewalls, etc.

Administrative procedures, such as contracting with reputable service providers, renewing domain name registration, etc.

Network configuration and maintenance, including load balancing, traffic management, and usage monitoring.

Backup and archiving policies and procedures, including the choice of backup media, media replacement interval, number of backups made and storage location.

Physical location of the server and its vulnerability to fire, flood, earthquake, electric power anomalies, power interruption, temperature fluctuations, theft, and vandalism.

Some of these environmental factors can be monitored remotely, in tandem with direct monitoring of the Web site itself. Slowness or unresponsiveness could indicate hardware failure or power interruption, excessive load on the server from legitimate use, Web crawling, hacker attack, or a network problem. Network utilities such as ping and traceroute can help determine whether the problem is confined to Web services, the particular machine, or the larger network. Just as dataloggers monitor environmental states in physical libraries and send alerts when an undesirable condition arises, and just as more traditional alarm systems can signal breaches in physical security, specialized software for the Web can reveal internal security hazards, such as viruses, Trojan horses, outdated software, missing patches, and incorrect configurations. Adapting these tools and utilities will add to Project Prism's preservation risk management toolkit.
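The reasoning behind using ping and traceroute to localize a problem can be written down as a small decision procedure. The sketch below is an assumption-laden illustration: `fault_scope` and its three inputs are hypothetical, the probe results are presumed to have been gathered separately (for example by an HTTP request, a ping of the host, and a traceroute that reaches the host's network), and `tcp_port_open` is a TCP-connect stand-in for ICMP ping rather than an equivalent of it.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (not an ICMP ping)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def fault_scope(http_ok, host_reachable, network_path_ok):
    """Narrow a failure to Web services, the machine, or the larger network."""
    if http_ok:
        return "no fault observed"
    if host_reachable:
        # The machine answers, so the problem is likely the Web server software.
        return "Web service"
    if network_path_ok:
        # The route to the host's network works, but the host itself is silent.
        return "machine"
    return "network"


# Invented probe results for three hypothetical monitoring runs.
print(fault_scope(http_ok=False, host_reachable=True, network_path_ok=True))
print(fault_scope(http_ok=False, host_reachable=False, network_path_ok=True))
print(fault_scope(http_ok=False, host_reachable=False, network_path_ok=False))
```

A host that filters ping traffic would be misclassified by this logic, so results from any single probe should be read with caution.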

Assessing the Impact of Technological Watersheds on Web Site Integrity

Just as some print publications failed to make a successful transition from hot type to cold type, or cold type to completely electronic production, the continuing success of an Internet venture depends in part on its ability to adapt to new technologies. Technological change always puts an enterprise under some stress, because it interrupts routines, necessitates staff changes or retraining, incurs expense for equipment and supplies, and in some cases may require a complete reconceptualization of the business plan or method of operation. Since growth of the Internet really took off in the early 1990s, the continued robustness of any Net-based enterprise has required a significant level of technological flexibility and adaptability.

What kinds of technological change place the continued existence of content at greatest risk? To answer, we must first understand how technological change induces risk. Several mechanisms can be postulated; their applicability to any particular site depends on the content and its audience. A few examples:

**Table 2: Examples of How Technological Changes Can Induce Risk**
| Mechanism | Nature of threat | Means to detect | Tools to detect | External discernability |
| --- | --- | --- | --- | --- |
| Failure to maintain up-to-date software operating environment | Vulnerability to malicious code, such as viruses, worms, and other hacks | Examine current status of operating environment | Web crawler (partially) and specialized software tools | Partial or full, depending on consent of site operator |
| Failure to upgrade file formats, encoding schemes, etc. | Incompatibility with modern software; unreadable content due to obsolescence | Examine current status of MIME types and other attributes | Standard Web crawler | Partial |
| Failure to use modern tools | Competitive disadvantage: less visual appeal, harder to navigate, or less functionality | Examine current status of MIME types and other attributes | Standard Web crawler | Partial |

Predicting what kinds of technological change most seriously threaten content will require retrospective analysis. Through the longevity study (http://www.library.cornell.edu/preservation/prism.html) and future crawls of the Internet Archive, Project Prism is identifying significant technology watersheds that may put Web sites at risk. Whether a sea change, such as the shift from HTML to XML, will put much content at risk may be at least partially revealed by examining past shifts of similar magnitude, such as the critical-mass shift from the Gopher protocol to HTTP.

The Web crawler and other tools can be used to analyze the use of markup languages, MIME types, and other attributes of Web pages that reflect evolving standards and practice. Certain periods may merit closer scrutiny than others. Times of intense and rapid growth generally coincide with greater competition and the need to be more agile and flexible to survive. Periods when many new standards and features are introduced would also be expected to involve greater risk to content. The Web sites that have been captured in the Internet Archive provide an ideal set of materials for testing these hypotheses by allowing characterization of the introduction and domination of markup languages and formats, the introduction of various types of dynamic behavior, and changes in the use of header fields and tags.
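A census of MIME types by crawl year is one simple way to characterize when formats are introduced and when they come to dominate. The sketch below assumes the crawler yields (year, Content-Type header) pairs; the helper names are hypothetical and the records are invented for illustration, not drawn from the Internet Archive.

```python
from collections import Counter, defaultdict


def media_type(content_type):
    """Strip parameters such as charset and normalize case."""
    return content_type.split(";", 1)[0].strip().lower()


def mime_census(records):
    """Count media types per crawl year from (year, Content-Type) pairs."""
    census = defaultdict(Counter)
    for year, content_type in records:
        census[year][media_type(content_type)] += 1
    return census


def first_seen(census):
    """Earliest crawl year in which each media type appears."""
    first = {}
    for year in sorted(census):
        for mtype in census[year]:
            first.setdefault(mtype, year)
    return first


# Invented crawl records (illustration only, not real measurements).
records = [
    (1996, "text/html"),
    (1996, "image/gif"),
    (1998, "text/html; charset=iso-8859-1"),
    (1998, "application/pdf"),
    (2001, "text/xml"),
    (2001, "TEXT/HTML"),
]
census = mime_census(records)
for year in sorted(census):
    print(year, dict(census[year]))
print(first_seen(census))
```

The same tallying approach could be applied to markup versions or header fields, given a crawler that records them.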

Combining the Pieces into a Program

Project Prism is using the Web crawler to study risk factors for Web pages and Web sites. At the server level, we are reviewing the kinds of tools that can be developed or adapted to analyze and mitigate potential risks. While an organization may take on the preservation management of its own Web sites, Project Prism is interested in scenarios that must consider two kinds of organizational players: the entities that control the Web sites and the entities that are interested in the longevity of those Web sites. In the first round, significant factors in the administrative context and external environment are being identified, but in-depth work in these areas will be part of follow-on research.

While Project Prism is currently exploring the passive monitoring of Web sites that are not within an organization's control, the team expects to develop a methodology that also allows for other mandates an organization might have to:

Monitor changes to a Web site, which may require negotiated access.

Recommend modifications to Web sites and Web pages to enhance longevity, in addition to monitoring.

Actively enforce policy for a Web site and ensure compliance with specified standards, which will require cooperation and collaboration.

Just as an actuarial assessment changes with the times, with better understanding of life styles, and with medical breakthroughs, so too will the ability to detect potential risks to digital and Web-based resources. By developing a flexible and adaptive risk management strategy, Project Prism adds to our knowledge base and offers a methodology for conceptualizing the problem while research moves forward.

Acknowledgements

We thank Liz Chapman for her keen editorial skills. The work described in this paper is supported by the Digital Libraries Initiative, Phase 2 (Grant No. IIS-9905955, the Prism Project).

Notes and References

To see the extensive list of notes and references, click here.

Copyright 2002 Anne R. Kenney, Nancy Y. McGovern, Peter Botticelli, Richard Entlich, Carl Lagoze, and Sandra Payette

(On January 21, 2002 the article was corrected to add acknowledgements.)

DOI: 10.1045/january2002-kenney
