D-Lib Magazine
July/August 1998
ISSN 1082-9873
Archiving Digital Cultural Artifacts
Organizing an Agenda for Action
Peter Lyman
School of Information Management and Systems
University of California
Berkeley, California
[email protected]
Brewster Kahle
Alexa Internet
San Francisco, California
[email protected]
Both authors are directors of the Internet Archive.
Introduction
Our cultural heritage is now taking digital form, whether born digital or born-again by conversion to digital from other media. What will be the consequences of these changes in the nature of the medium for creating and preserving our cultural heritage? What new technological and institutional strategies must be invented to archive cultural artifacts and hand them on to future generations? We will explore these questions through the practical perspective gained in building Alexa Internet and the Internet Archive.
Our purpose is not so much to answer these questions in a definitive manner, but to organize a discussion between communities which must learn to work together if the problem of digital preservation and archiving is to be solved -- computer scientists, librarians and scholars, and policy makers. This paper, then, is the product of a dialogue between a computer scientist/entrepreneur and a political theorist/librarian, and represents an attempt to create a common agenda for action.
Defining the Problem
Who among us can read our old WordStar or VisiCalc files? Find our first email message? Or, on a larger scale, what happens to the history of science if we can't read the first data from the first interplanetary exploration by the Viking mission to Mars? The origins of the digital era are probably already lost, and millions must be spent if strategic government and corporate records are to survive the transition to the year 2000 on digital clocks. (See, for example, Business Week, "From Digits to Dust," April 20, 1998, pp. 128-129; see also U.S. News and World Report, February 16, 1998, "Whoops, there goes another CD-ROM," http://www.usnews.com/usnews/issue/980216/16digi.htm). Digital information is seemingly ubiquitous as a medium for communication and expression, and increasingly strategic for scientific discovery and for the records that constitute institutional order in a modern society. Yet it is at the same time fugitive: the pace of technical change makes digital information disappear before we realize the importance of preserving it. Like oral culture, digital information has been allowed to become a medium for the present, neither a record of the past nor a message to the future. Unless, that is, we redesign it now.
In exploring the consequences of digital artifacts and their use for the way we preserve and archive our culture, we will focus upon the World Wide Web. The Web is a born-digital cultural artifact, intrinsic to the electronic environment, and it defines the parameters of the question in useful ways. The advantage of this example is that our work on Alexa Internet and the Internet Archive provides practical experience with the archival problem. The disadvantage is that, although now ubiquitous, the Web is only one example of a digital document in a rapidly changing environment. Moreover, in many ways the Web is modeled on print traditions, a publishing technology more than an unprecedented way of representing cultural expression. Other digital artifacts that might yield different kinds of insights are, for example, simulation software like SimCity, visualization and scenario software, Jurassic Park dinosaur animations, or collaboratories and virtual communities. Nevertheless, precisely because of the ubiquity of Web-based information resources, and precisely because of their conceptual proximity to known artifacts whose archiving is well understood (i.e., books and manuscripts), the Web offers a comparatively powerful starting point.
What Are Digital Cultural Artifacts?
Culture is something we do, a performance which fades into memory and then disappears; but the record of culture consists of artifacts which we make, which persist but inevitably decay. Like other media, digital culture is simultaneously performance and artifact, although digital artifacts are profoundly different from physical artifacts. We will not attempt a formal definition of something still being shaped by experimentation and practice, except to describe some of the parameters that differentiate digital artifacts from other kinds of cultural artifacts and that may be useful in building digital archives. Most notably, while things occupy places (and are therefore always local), digital documents are electronic signals with local storage but global range. As things, digital cultural artifacts are dramatically different from those in other media, as illustrated by these estimates of size:
Type                        Example                Size When Digital
----                        -------                -----------------
Newspaper                   Wall Street Journal    100 MB/year (text)
Computer discussion         Netnews                300 GB/year
Television                  CNN News               1 GB/hour; 6 TB/year (compressed)
Radio                       WABC                   270 GB/year (uncompressed)
Internet publishing         World Wide Web         4 TB in 1997
Video rental store          Blockbuster Video      9 TB
Research library            Library of Congress    20 TB (text of all books)
Card catalog                Library of Congress    17 GB
Branch library (scanned)    Palo Alto, CA          1.4 TB
Composer's work             Mozart                 100 MB?
Many other cultural performances and artifacts are entering the digital realm, such as classroom lectures, lecture notes and textbooks, scanned paintings, and government publications. Thus, digital documents are both ubiquitous, by virtue of their global range, and are a universal medium for archiving the record of culture, by virtue of their size and ability to represent cultural expressions in all other media. Relative to physical counterparts:
- Copying these works millions of times is inexpensive; and,
- Distributing these works to millions is possible in seconds; and,
- Saving these works is relatively inexpensive and compact; and,
- Organizing these works is easier because they can be searched, and reordered in seconds; and,
- Collaboration in making these works is possible between people all over the world, and
- Processing these artifacts directly with a computer opens the possibility of building a library of human knowledge that can find patterns that people would be unlikely to find.
Two consequences for the problem of digital archives are worth noting. First, digital documents are at once tangible (a representation in code) and intangible (the code is meaningless unless transmitted and rendered); thus what must be preserved is the totality of a dynamic performance consisting of both text and context -- the unit of knowledge is the entire Web, over time. Second, digital cultural artifacts are not the property of cultural elites, for this medium is profoundly democratic -- millions of people are creating cultural artifacts in intangible forms, using computers and networks. Neither are these artifacts archived by the traditional cultural institutions organized and funded by cultural elites.
The World Wide Web as a Cultural Artifact
Although the Web is an original new medium for cultural expression, like all new modes of representation of knowledge the first experiments are likely to imitate the forms of past media. It is original because although other communication technologies are global, this one has no central control points (other, perhaps, than the definition of technical standards). It is new because its cultural expressions will be in multimedia (to use that redundant term), even though today its guiding metaphors are derived from print publication. We know very little about its character as a cultural artifact, both because it is new and because it is decentralized. The key questions about it are not to be answered in the nature of its artifacts alone, but in the emerging social forms which are made possible by these new media: What is a virtual community? A transnational financial market? A collaboratory? Who is "the public" on the Web? What is the nature of personal identity on the Web?
The first step in the archeology of the Web has been to use other kinds of cultural artifacts as guiding metaphors -- as if it were a text or a library -- in order to understand its deep structures. These metaphors are useful, but limited.
- A library provides the user with catalog technologies and services to search collections. And yet, most of the guide services through the Web are not digital libraries, because they fail to describe its totality, having adapted the model of library catalogs designed to index the intellectual world of a century ago (see, for example, Steve Lawrence and C. Lee Giles, "Searching the World Wide Web." Science (3 April 1998) 280:98-100, or http://www.research.digital.com/SRC/personal/Krishna_Bharat/estim/367.html).
- An archive has the mission of preserving primary documents, generally associated with the history of a particular institution, and often requires specialist knowledge to search.
- But the Web, of course, is not a true library collection, one selected specifically to meet the information needs of a given community, nor is it an archive, preserving the historical memory of a given institution. Within these parameters, then, the Internet Archive might be described as a true archive, seeking to collect and preserve the entire Web, past, present and future (http://www.archive.org/). Alexa Internet is a kind of digital library, seeking to identify the intellectual structures of the whole Web in order to provide users with technologies to find the right information, but through dynamic link analysis rather than a catalog based upon structured records (http://www.alexa.com). Alexa Internet analyzes the quality of information on the Web by describing the patterns of its use, using indicators that analyze the link structure of a database containing the contents of the public portions of the World Wide Web (those not requiring a password or fee) since October 1996. The database is later donated to the Internet Archive, a not-for-profit organization that provides access to the data to scholars and researchers interested in the growth of a new medium.
However, these metaphors compare the Web to an institution -- a library or an archive -- rather than defining it as a new kind of cultural artifact that will require the invention of new kinds of institutions and management techniques. Described as a cultural artifact:
- the Web is a medium for publishing; and,
- uses a rhetorical structure based on hypertext; and,
- is a multimedia text including mostly words and numbers, some fixed and some dynamic, and images equivalent in size to a library of 1 million volumes; and,
- was written by seven million authors; and,
- and most of it is distributed free around the world.
Like reading a book, every reading is a unique performance in which the user links information together; but unlike reading a book, every reading leaves a trail, which can be collected and archived. These links are the trails through an information wilderness. Alexa Internet is mapping these trails, trying to discover the structure of the Web by understanding how its information is used.
Alexa Internet's Web statistics sketch a remarkable picture of this new domain of cultural expression; the statistics cited below are from May 1997, compiled by Z Smith (http://www.webtechniques.com/features/1997/05/burner/burner.shtml). See also the inventory of Internet statistics and demographics at http://www.yahoo.com/Computers_and_Internet/Internet/Statistics_and_Demographics/.
Collectively, they begin to answer some baseline questions about the Web as a cultural artifact: What is the Web, described as a technical artifact? What is the Web, described in terms of the social functions of digital documents?
What is the Web, from a technical point of view? As of January 1997, one million Web site names were in common usage, on 450,000 unique host machines (of which 300,000 appear to be stable, 95% accessible at a given point in time), and there were 80 million HTML pages on the public Web. The figure is incomplete, because some sites are dynamic (generating unique pages in response to queries). The typical Web page had 15 hypertext links to other pages or objects and 5 embedded objects such as sounds or images. The typical HTML page was 5 KB, the typical image (GIF or JPEG) was 12 KB, the average object served via HTTP was 15 KB, and the typical Web site was about 20% HTML text and 80% images, sounds, and executables (by size in bytes). The median size for a Web site was about 300 pages; only 50 sites had more than 30,000 pages; and the 1,000 most popular sites accounted for about half of all traffic. In mid-1997 it took about 400 GB to store the text of a snapshot of the public Web, and about 2 TB to store non-text files.
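As a rough consistency check (a back-of-the-envelope sketch using only the figures quoted above, not additional data), the page count and typical page size multiply out to approximately the quoted snapshot size:

    # Back-of-the-envelope check of the mid-1997 figures quoted above.
    pages = 80_000_000          # HTML pages on the public Web (January 1997)
    typical_page_kb = 5         # typical HTML page size, in kilobytes
    text_snapshot_gb = pages * typical_page_kb / 1_000_000
    print(f"Text snapshot: ~{text_snapshot_gb:.0f} GB")   # ~400 GB, as cited above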
Because the Web is dynamic and seems to be doubling yearly, the typical Web page is only about two months old. The typical user downloads around 70 KB of data for each HTML page visited, and visits 20 Web pages per day. One percent of all user requests result in '404, File Not Found' responses. After analyzing search engine data from Alexa Internet, Mike Lesk commented that "free" is the most used search term, not "sex," as might have been predicted; "sex" ranks second, and "free sex" is the most used phrase.
Who uses the Web? The Web is also a society that, although global, is not universal. Worldwide, English speakers are about 65% of the online population (http://www.euromktg.com/globstats/), but hundreds of languages and dialects are used on the Internet. Business Week (May 5, 1997) commissioned a census of the use of the Web by 1,000 U.S. households, which begins to document its expanding cultural domain. 21% of adults, an estimated 40 million people, use the Internet or the World Wide Web, double the number a year earlier. The online census is now 41% female (up from 23% in 1995), but still 85% white, and 42% have incomes over $50,000 a year; the study comments that student users probably overrepresent the use of the Net by the poor. According to the survey, the Net is primarily used for research (82%), education (75%), news (68%), and entertainment (61%); online shopping was only 9% of use, but about 25% have bought something on the Net, and the number of .com sites has expanded dramatically since the survey. Entertainment is a more likely use among the young (51% of 10-29 year olds), and surfing among those 18-29 (47%); surfing was common among only 30% of those 50 and older. Most preferred sites that were not interactive (77%), but among those using interactive sites, 57% said they felt a sense of community.
Who Will Preserve Digital Cultural Artifacts?
The only problem is that digital documents are disappearing. Alexa Internet has created an archive of "404 -- document missing" Web pages, because although the World Wide Web is now growing at an estimated 1.5 million pages a day, most pages disappear within a year. The Alexa Internet archive of "404" Web pages is now 10 TB, and may be accessed by downloading the Alexa software (http://www.alexa.com). Given the dramatic growth of digital media, it is paradoxical that we do not yet know how to preserve digital cultural artifacts.
Print made possible institutions in the modern sense of the word, in which social order is based upon record keeping. Record keeping, combined with the archival preservation of other kinds of documents, makes possible the historical memory that gives culture continuity and depth. Cultural institutions like universities, publishers, theaters, and symphonies are dedicated to enacting the cultural traditions which we call civilization, and institutions like libraries, museums, and archives are dedicated to collecting, organizing, conserving, and preserving cultural artifacts. What are, and will be, the social contexts and institutions for preserving digital documents? Indeed, what new kinds of institutions are possible in cyberspace, and what technologies will support them? What kinds of new social contexts and institutions should be invented for cyberspace? Consider just a few of these questions, seen through the lens of the transition from manuscripts to print.
- What is a digital document?
It took a century after the invention of the printing press to define new formats for culture (See Elizabeth L. Eisenstein, The Printing Press as an Agent of Change. Cambridge: Cambridge University Press, 1979). What new formats for the invention and representation of culture will be derived from the computer and the network? See, for example, the digital documents track of the Hawaii International Conference on Systems Science http://www.cba.hawaii.edu/HICSS/hicss_31/toc2.htm.
- Who will be the Internet Publishers?
Who will serve consumers by producing authoritative high quality digital formats? Although HTML seems new, it is ultimately an implementation of 3x5 card technology on a global scale. Innovation is not simply a matter of digitizing the print record, but of inventing new kinds of literatures. For example, sociologist Howard Becker notes that the emergence of new modes of digital visualization is the first time that there has been innovation in the representation of data since the invention of the bar chart, pie chart and histogram several centuries ago. If modern science emerged from the interaction of new technologies (the telescope, the microscope) with new kinds of documents (the scientific article, the journal), how will science as an institution change with the introduction of visualization, which enables enormous quantities of data to be rendered meaningful for humans, and scenarios, which make possible dynamic if/then arguments?
- What is an Internet Library?
What kinds of libraries and archives will preserve the sciences made possible by digital documents? Who will serve researchers by collecting, filtering, and organizing digital documents? See "Preserving Digital Information: Report of the Task Force on Archiving of Digital Information," commissioned by the Council on Library and Information Resources (CLIR) and Research Library Group (RLG), at http://www.rlg.org/ArchTF/, and Mike Lesk, "Preserving Digital Objects: Recurrent Needs and Challenges," http://community.bellcore.com/lesk/auspres/aus.html.
- What is a Digital Archive?
Who will serve scholars and historians by collecting and preserving digital culture, including the technology necessary to access and use it, and guarantee its authenticity? See Brewster Kahle, "Preserving the Internet," Scientific American, March 1997, http://www.sciam.com/0397issue/0397kahle.html. For a dialogue between technologists, libraries, and museums on the technical agenda for a digital archive, see Time and Bits (http://www.gii.getty.edu/timeandbits/).

Many of these concerns resolve into sociological questions. While discussion of this social agenda is now beginning to take off in national information policy debates, we believe that it is premature to define final institutional forms before there is a technological response: an agenda for a more robust design for digital documents which recognizes their cultural importance.
Examples of Cultural Innovation on the Web
This is a time of both technological and social invention; indeed, the two are inseparable. Alexa Internet and the Internet Archive are only two examples of cultural innovation in the development of a Web literature. In print, a literature is interconnected by citations and has structure because it has been filtered by editorial boards before publication; on the Web, new technologies must be created to define quality and to discover organization. The following list contains only a few examples among the many works in progress on the production of a new kind of cultural artifact.
- Search engine/navigation services (AltaVista, Yahoo, Excite, etc.)
Since the Web is governed by technical standards, its overall content has no inherent structure. But it is searchable, and numerous navigation companies have grown up around it. At a recent talk at Stanford's Gates Hall, the AltaVista navigation technology was described in these terms: 550 gigabytes of text indexed; 110 million web pages; 30 million words in the index; 90 million hits per day (up to 5 million per hour). Most striking is that AltaVista can search the equivalent of a stack of paper 100 km high for a single word in half a second, and can do this simultaneously for 100 users.
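At the heart of such services is an inverted index mapping each word to the pages that contain it. The sketch below is a toy illustration of the general technique, not AltaVista's actual implementation, which adds ranking, compression, and distribution across many machines:

    # Toy inverted index: word -> set of page URLs containing that word.
    # Illustrative only; not AltaVista's actual design.
    from collections import defaultdict

    index = defaultdict(set)

    def add_page(url, text):
        for word in text.lower().split():
            index[word].add(url)

    def search(word):
        return index.get(word.lower(), set())

    add_page("http://example.org/a", "free software for the digital archive")
    add_page("http://example.org/b", "archive of free netnews postings")
    print(search("free"))     # both pages ("free" was the top query term above)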
- Internet Archive ( http://www.archive.org)
The Internet Archive is a non-profit organization that accepts donations of digital materials and preserves them, serving historians and researchers. Preserving the materials requires copying them periodically. Currently, the storage medium is Digital Linear Tape (DLT), which is specified to last 30 years, but the Archive will recopy the tapes within 10 years. Some of the historical and scholarly users of this data have been the Smithsonian Institution, Xerox PARC, AT&T Labs, Cornell, Bellcore, and Rutgers University. Some of the uses have been the display of historic websites and studies of human languages, of the growth of the Web, and of the development of human information habits.
- Metadata services (PICS, Truste, Alexa)
A new breed of services is starting to emerge: metadata services. By allowing a user to view both the source and the metadata at the same time, these services will change how information is used, giving a "heads-up display" of information about the information. For example, PICS is a standard for expressing structured information about a website (http://www.w3c.org/pics). RSACi is a non-profit organization that uses the PICS format to rate websites for adult material (http://www.rsac.org/). Other related standards are XML and RDF (specifications at w3c.org). Truste (http://www.truste.org) is an organization that establishes privacy policies, helps organizations adhere to them, and promotes those that do.
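Schematically, a PICS label attaches machine-readable ratings to a page. The sketch below assembles an RSACi-style label as an HTML <META> tag; the service URL and the category values (n, s, v, l for nudity, sex, violence, language) are illustrative assumptions here, not an actual rating of any site:

    # Sketch of a PICS label in abbreviated PICS-1.1 syntax, embedded in an
    # HTML <META> tag. Service URL and rating values are illustrative only.
    service = "http://www.rsac.org/ratingsv01.html"   # assumed RSACi service URL
    ratings = {"n": 0, "s": 0, "v": 0, "l": 0}
    pairs = " ".join(f"{cat} {val}" for cat, val in ratings.items())
    label = f'(PICS-1.1 "{service}" l r ({pairs}))'
    print(f'<META http-equiv="PICS-Label" content=\'{label}\'>')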
Alexa Internet has created a free navigation service that annotates the Web page the user is looking at, showing meta-information about that site and links to other related pages. The information about the site can be seen as a card catalog entry for that section of the Web. The related pages are computed based on where other users have gone, the link structure of the Net, and the contents of the Web pages. Thus Alexa is creating a dynamic organization based on use of the Net and presenting it as "metadata" on the page and site. As part of this effort, Alexa has archived 8 TB of Web documents and netnews postings. These are used in the data mining efforts, and later are donated to the Internet Archive. Alexa Internet has metadata about every site, including a compendium of RSACi, Truste, and other ratings.
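One ingredient of such a "related pages" computation can be sketched as co-citation in the link graph: two sites are related when many pages link to both. This is a minimal sketch of that single signal; the link data is hypothetical, and Alexa's actual algorithms, which also weigh usage trails and page contents, are not published in this detail:

    # Co-citation sketch: count how often two sites are linked from the
    # same page. One signal only; the link data below is hypothetical.
    from collections import Counter
    from itertools import combinations

    page_links = {
        "page1": ["imdb.com", "amazon.com"],
        "page2": ["imdb.com", "amazon.com", "dejanews.com"],
        "page3": ["imdb.com", "dejanews.com"],
    }

    cocited = Counter()
    for targets in page_links.values():
        for a, b in combinations(sorted(set(targets)), 2):
            cocited[(a, b)] += 1

    def related(site):
        scores = Counter()
        for (a, b), n in cocited.items():
            if a == site:
                scores[b] += n
            elif b == site:
                scores[a] += n
        return scores.most_common()

    print(related("imdb.com"))   # amazon.com and dejanews.com, ranked by co-citation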
- DejaNews (http://www.dejanews.com)
DejaNews is making pieces of the past available to all comers. By collecting, indexing, and offering a smooth interface to public Netnews postings (a bulletin board system with over 30,000 discussions and 200,000 postings a day), DejaNews changes the nature of Netnews into an informal literature that can be used as a reference resource. To handle the issue of authors not wanting their writings to be archived and searched, DejaNews has a "purge the complainer" policy and deletes their messages.
- Internet Movie Database (http://www.IMDB.com)
IMDB uses hypertext with grace and power, serving as a resource for other websites. It is richly woven into the fabric of the Internet by having stable URLs to which other sites can point. Thus, it has become the de facto standard for information about actors, movies, and other related information. In this way it is similar to Amazon.com. It goes beyond Amazon, however, in linking to other sites for reviews and schedules. With only 15% of all websites having any links to other websites, IMDB stands out as a model for future integration.
Archiving such a site is both a resource and a challenge. Archived in isolation, it would not show the power of the site; as part of a larger archive, it can serve as an index to the rest of the Web. In this way, we find that archiving the Web may have to be done comprehensively, rather than by sampling here and there, if the Web phenomenon is to be understood.
- Digipaper (http://www.parc.xerox.com)
The Digipaper technology is a network-oriented format for scanned paper. Its player is written in Java, so no special software is needed, and the technology is designed to compress scanned images for network access. It differs from the FAX format (CCITT Group 4) and Adobe's Acrobat (PDF) in being network oriented. This work might open the door to large-scale scanning and distribution projects.
Conclusion: What Technical Work Remains to be Done?
Our goal has been to define problems that might be solved collaboratively rather than to propose solutions; thus, we propose the following schema to organize the work:
- Infrastructure technologies;
- Technologies for Digital Publishing;
- Technologies for Digital Libraries;
- Technologies for Digital Archives;
- Time Capsule technologies.
1.0 Infrastructure technologies.
Infrastructure technologies are both technological and political, including both legal and engineering standards. The issues and requirements include the following:
1.1 Build a legal infrastructure tolerant of digital libraries, archives, and museums. Will intellectual property law, now being optimized for electronic commerce, allow for the preservation of digital documents by public institutions like libraries and archives? While copyright law evolved provisions for libraries, archives, and museums, the proposed laws tend to treat digital documents exclusively as private property, governed by contract rather than copyright (http://sims.berkeley.edu/BCLT/events/ucc2b/). Trends towards licensing information rather than having information under copyright may end the Fair Use of digital documents for educational purposes, the circulation of information, and copying for preservation and archiving. For example, in the early 20th century, laws governing the preservation of film were differentiated from those governing the preservation of books; as a result, today there is no comprehensive archive of radio or television. In addition, national legal codes governing digital cultural artifacts must be coordinated on a global scale through treaties, since in a sense, local or national jurisdictions are no longer enforceable through traditional means.
1.2 Build a high-speed data network. In the United States, most Internet traffic uses voice communication lines. While this helped make the Internet quickly deployable, it limits cost reduction. Where computer components show cost-performance improvements of 100% every 18 months, the evolution of long-distance phone technology is measured in decades. Thus computer processors, RAM, disk, and LAN speeds have all been improving rapidly while the long-distance systems have not.
1.3 Define a standard public video format. Sometimes when a public standard is established, use flourishes (e.g., TCP/IP, HTML). Several proprietary video formats are now being promoted on the Internet. To build a popular and long term archival medium, it is very helpful if a format is internationally standardized and non-proprietary.
2.0 Technologies for digital publishing.
The current Web publishing tools use freeform hypertext and do not yet encourage a set of templates. If printed book publishing can be used as a precedent, then idioms such as page formats, tables of contents, indexes, and page numbers will emerge, so that different websites will have standard structures that can be counted on. If these paradigms existed on the Web, then navigation and categorization tools could be applied effectively: it would then be easy, for instance, to distinguish a personal homepage, a scholarly publication, or a corporate brochure from an email message. At this point, it is very difficult to tell the difference with any certainty. These formats, fortunately, are maturing in the tools and standards committees for websites, but currently digital publishing is a mess. Metadata standards for declaring the authors' intentions in the areas of structure and usage rights will also be helpful (a sketch follows). Furthermore, making it easy for authors to use URLs that stay stable across many versions of their websites will help others refer to the documents and services over time.
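As a sketch of what such declarative metadata might look like, the snippet below emits Dublin Core-style <META> tags. The DC element names are standard Dublin Core conventions; the genre vocabulary ("scholarly-publication" and the like) is invented here for illustration, not an established standard:

    # Sketch: emit Dublin Core-style <META> tags declaring a document's
    # genre, so navigation tools could distinguish a homepage from a paper.
    # Genre values are hypothetical; the DC element names are standard.
    def meta_tags(title, author, genre):
        fields = {"DC.Title": title, "DC.Creator": author, "DC.Type": genre}
        return "\n".join(f'<META name="{k}" content="{v}">' for k, v in fields.items())

    print(meta_tags("Archiving Digital Cultural Artifacts",
                    "Lyman, Peter; Kahle, Brewster",
                    "scholarly-publication"))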
3.0 Technologies for digital libraries.
Alexa Internet is an example of a library of digital materials -- the Web and netnews -- but there are many other examples, such as Paul Ginsparg's XXX server at Los Alamos, and archives of astronomical, meteorological, or medical data, which can be rendered as images, statistical data sets, and so on. The technologies helpful in building such collections are gatherers, storage mechanisms, data mining tools, and serving tools. In the case of Alexa Internet, some of the components could be purchased, but most of the technology had to be developed internally. More tools for dealing with terabyte digital collections would be very helpful in this field; most software tools do not support terabytes very well, even though the cost of the equipment is quite low.
We see the opportunities in these areas as exciting and essential to the evolution of this field.
3.1 Gathering. Alexa and the Internet Archive act as a centralized repository for data, so that every researcher or company does not need to write its own gathering software. This service needs to mature, and could be helped by cooperation with other organizations that gather large collections.
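A gatherer is, at its core, a crawler that fetches pages and follows their links. The sketch below shows the shape of one in modern Python, as an illustration only; a production gatherer must also respect robots.txt, rate-limit per host, and recover from failures at terabyte scale:

    # Minimal breadth-first gatherer sketch (illustrative, not Alexa's code).
    import re
    from urllib.request import urlopen
    from urllib.parse import urljoin

    def crawl(seed, limit=10):
        seen, queue, pages = set(), [seed], {}
        while queue and len(pages) < limit:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue          # dead links are common (cf. the 404 archive above)
            pages[url] = html
            for href in re.findall(r'href="([^"]+)"', html):
                queue.append(urljoin(url, href))
        return pages

    # pages = crawl("http://www.archive.org/")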
3.2 Storing. Storing and managing 300 million objects (increasing to 1 billion soon) taxes most existing storage and database technologies. The commercial database technology that Alexa tried could not perform transactions fast enough on low-cost hardware; therefore, we needed to write all of our own management and indexing code. Offsite storage and redundancy are essential. While Alexa does create a copy and does give it to the non-profit Internet Archive for storage and protection, both institutions are in the United States, where a change in law could well make certain kinds of collections and activities illegal. A more resilient strategy would be to engage a set of active organizations and to exchange collections.
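One common technique for managing hundreds of millions of objects without a conventional database is to address each object by a hash of its URL and shard the hashes across directories, so that no single directory or index grows unmanageably. This sketch illustrates the general technique, not Alexa's actual (unpublished) storage layout:

    # Hash-sharded object store sketch: a stable key per URL, with the key
    # prefix selecting one of 65,536 directories. Illustrative only.
    import hashlib
    from pathlib import Path

    ROOT = Path("archive")

    def path_for(url):
        digest = hashlib.md5(url.encode()).hexdigest()
        return ROOT / digest[:2] / digest[2:4] / digest

    def store(url, content):
        p = path_for(url)
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_bytes(content)

    store("http://www.archive.org/", b"<html>...</html>")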
3.3 Data mining. Finding patterns in terabyte collections of semi-structured information is different enough from much of the data mining being done by mailing list companies that we have not yet found a match with commercial tools. Therefore, we have written our own. We hope this area will be of interest to academics, because this large semantic network is fertile ground for new ideas in pattern finding and artificial intelligence.
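As a toy example of pattern-finding over this semantic network, counting in-links gives a crude signal of attention. The link data below is hypothetical, and real mining over terabytes requires external sorting and custom tools, as noted above:

    # Toy link-graph mining: rank sites by in-degree (number of incoming
    # links), a crude attention signal. Hypothetical data, toy scale.
    from collections import Counter

    links = [("page1", "imdb.com"), ("page2", "imdb.com"),
             ("page3", "dejanews.com"), ("page4", "imdb.com")]
    in_degree = Counter(target for _, target in links)
    for site, count in in_degree.most_common():
        print(site, count)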
3.4 Serving. Alexa Internet serves information about a user's current Web page, such as site statistics and related sites. This information must be dispensed for every page turn of every user; as these technologies are built into browsers, that means every page turn of every user on the Web. Alexa Internet has built servers able to do this. Further work in this area can be quite fruitful for building high-capacity services.
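Serving metadata on every page turn is essentially a very high-volume key-value lookup. This toy sketch shows the shape of such a service; the metadata values are hypothetical, this is not Alexa's server code, and at Web scale the service must be replicated and cached aggressively:

    # Toy metadata-lookup service: answer "what do we know about this
    # site?" per page turn. Hypothetical data; not Alexa's server code.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    METADATA = {"www.imdb.com": {"related": ["amazon.com"], "pages": 30000}}

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            site = parse_qs(urlparse(self.path).query).get("site", [""])[0]
            body = json.dumps(METADATA.get(site, {})).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    # HTTPServer(("", 8080), Handler).serve_forever()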
3.5 Critical Mass Digitization Project. Michael Lesk has argued that the cost of scanning a book and making it available on the Internet is less than building the space to store the book in a library (Practical Digital Libraries: Books, Bytes and Bucks. Morgan Kaufmann, 1997). While the Web is often the information resource of first resort, print publication has been the medium of record for quality, and print libraries are far more comprehensive and reliable. If a large-scale library were to be scanned and offered on the Internet, then we would be confronted with a potential testbed for new models of copyright and royalties and might then develop new economic models for digitization of print. Current projects relevant to this goal include:
- Digital Library Federation (http://www.clir.org/diglib/dlfhomepage.htm)
The Digital Library Federation was founded to develop standards and best practices ensuring that distributed digital library collections can be shared. Members include twelve university research libraries, the Library of Congress, the National Archives and Records Administration, the New York Public Library, and the Commission on Preservation and Access. Currently the DLF is developing a project called The Making of America, Part 2, to build a shared national archive of documents illustrating American history.
- NSF DLI projects (http://dli.grainger.uiuc.edu/national.htm; also http://www.dlib.org/script.html)
While the DLF is focused on standards governing digitized paper archives, the first-phase NSF Digital Library projects have focused on developing computer science research that could support the digital libraries of the future. In practice, some have produced innovations which make possible entirely new kinds of documents, often including new ways of visualizing and organizing scientific data. Two strategic planning papers on the technical infrastructure for a distributed national library, by Bernie Hurley, Chief Scientist of the University of California at Berkeley, may be found at http://sunsite.berkeley.edu/moa2.
4.0 Technologies for digital archives.
4.1 Low Cost Bulk Storage. If we assume that the material worth archiving is limited by our server storage systems, then we need an archival medium that is less expensive than the original copy. Tape storage systems have historically been the inexpensive mechanism for storing large amounts of data. If tapes are put into a robotic tape system, they can be accessed slowly and inexpensively. Unfortunately, given the historic trends, the cost per gigabyte of these robotic systems may not offer much cost advantage over disk subsystems for long. Currently, the cost per gigabyte on disk is about $50, and on a robotic tape system about $12. David Patterson, of UC Berkeley, said in discussion: "If you follow the curves, they will cross in 4 years." Thus archival storage might soon be on the same medium as the originally published material, which will make it as costly to archive as to serve. This could severely limit what can be cost-effectively saved.
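Using the figures above ($50/GB disk, $12/GB tape) together with the rough rule from section 1.2 that component costs halve every 18 months, and assuming (our assumption) that tape robotics improve only slowly, a sketch of the crossover Patterson describes:

    # Crossover sketch from the figures quoted above. The 10%-per-step
    # improvement for tape robotics is an assumption for illustration.
    disk, tape = 50.0, 12.0     # $/GB today
    years = 0.0
    while disk > tape:
        years += 1.5
        disk /= 2               # components halve in cost every 18 months
        tape /= 1.1             # assumed slow improvement for tape robotics
    print(f"Disk undercuts tape after ~{years:.1f} years")   # roughly the 4 years quoted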
4.2 Long Term Storage. While good paper lasts 500 years, computer tapes last 10. As long as there are active organizations making copies, we can keep our information safe, but we do not have an effective mechanism for making 500-year copies of digital materials. There is one company, Norsam (http://www.norsam.com), with a technology for writing micro-images on silicon wafers, which could be used for this purpose.
4.3 Archive Television. We are not aware of any comprehensive collections of television or radio. The original producers may have copies, but the networks often do not have copies, nor does any library. Before television grows much older and mutates much further, we believe it would be important to have a record of these cultural artifacts.
5.0 Time capsule technologies.
What will be an Internet Time Capsule, serving archeologists of the distant future? The techniques suggested in much of the digital archiving work require an institution to "refresh" magnetic media every 10 years. If future technologies also require this frequent refreshing, then the digital artifacts will not last through a dark age of the future. Therefore, another writing technology would be needed to endure 1000 years without maintenance. To be viable, this "time capsule" technology would not have to be as easily readable as the archival technology, and could be more expensive to write because it is assumed that it will be more selectively written and read.
How will historians of the digital age read old code? Is it possible, as Danny Hillis has speculated, to build a universal Turing machine that would emulate all of the operating systems of the past?
A discussion of these technologies has been started by The Long Now Foundation, http://www.longnow.org/.
©1998 Peter Lyman and Brewster Kahle. Permission to reproduce this story has been granted by the Authors.