Research in Support of Digital Libraries
at Xerox PARC

Part I: The Changing Social Roles of Documents

Marti A. Hearst
Xerox PARC
3333 Coyote Hill Rd,
Palo Alto, CA, USA, 94304,
[email protected]

D-Lib Magazine, May 1996

ISSN 1082-9873

[The second of this two-part story appears in the June issue of this magazine.
Editor, November 26, 1996]

Introduction

When Xerox PARC (Palo Alto Research Center) was founded in 1970, its charter was to develop technology to support the ``architecture of information'' [Pake]. Many important contributions resulted from this call, including the first personal computer with a user-friendly interface and bitmapped display, the first WYSIWYG text editor, laser printing, and the Ethernet network that could flexibly connect workstations, file servers, and printers in order to provide communication among many workers. Perhaps because of the social nature of information creation and use, much of the technical research at PARC has emphasized human-computer interaction [Card et al.] and the social aspects of computing.

Today, there is a growing understanding throughout the technical, business, and legal worlds of the importance of the social aspects of technology, and social factors are recognized as being especially important in digital library research [Levy and Marshall]. There is a very wide range of problems in digital libraries that link the social to the technical. Issues surrounding the coordination of naming and cataloging of documents and other information artifacts, compensating authors and publishers while at the same time promoting access to written works, and the role of human and automated intermediaries in information seeking tasks are being extensively studied.

The digital library-related research at PARC spans the spectrum from the examination of the social role of documents to state-of-the-art technology for finding documents about social roles. This two-part article describes some of the ongoing digital library-related research at PARC.

This first part introduces very briefly some recent investigations into the social roles of documents and how they are changing as digital representations of published works become more readily accessible. The second part, to appear next month, will present overviews of relevant technologies that support the creation, capture, use, search, synthesis, and presentation of documents and information.

Work Practices and Digital Libraries

For many years, researchers at PARC have studied what is known as work practices, that is, looking closely at the use of technologies in specific organizational settings, uncovering the implicit and perhaps unrecognized assumptions that workers bring to their tasks, and understanding what technical and social channels workers use to cooperate and teach one another. There is a very large literature associated with this work (see [Suchman] for references). Most recently, several work practice researchers at PARC are helping design and evaluate the ideas and technologies developed for the NSF-sponsored digital library project at UC Berkeley [Van House]. This work is ongoing and so is not described in detail here.

The Social Role of Documents

PARC has a long history of research on innovative, interactive document representations, including significant work in hypertext (see, for example, [Halasz et al.]). The growing influence of the World Wide Web and the Internet is also contributing to the mutation of our understanding of what it means to publish. As is commonly noted today, the notion of what constitutes a document is becoming increasingly complicated and amorphous. As one example, documents containing non-textual media and even dynamic elements may soon no longer be considered an aberrant form.

The social role of documents and how this will change as the influence of on-line, digital forms continues to grow is quite relevant to digital library research. Several PARC researchers have written on the role of books and documents in society, past, present, and future, and the remainder of this article attempts to acquaint the reader with a few of their ideas.

John Seely Brown and Paul Duguid have written about the social role of documents [Brown and Duguid]. They argue that documents, both pre-dating and within the digital age, are as much a means for creating and maintaining social structure as they are a means for constructing and conveying information. For example, fan magazines and other cheaply produced newsletters are often put together at home by one or two people and are ``mid-cast,'' that is, sent out to small groups, and thus unite geographically scattered people who have never met, giving them a common sense of community. This phenomenon also occurs, and to a larger and growing extent, in their newer on-line variants (often referred to as ``zines'').

One important aspect of this kind of publishing is its volatility and how this volatility is reflected in the corresponding social groups. Brown and Duguid write: ``[T]he growth of zine titles, both on and off the Internet, may also indicate how much more volatile new documents make social worlds. The key to forming a new group is starting a new publication to help hold it together. Consequently, as publication costs come down, formation becomes much easier. ... Equally, however, disintegration is also easier. ... [O]nce formed, social worlds continually face disintegration (as dissenting members split off into `sub-worlds'). In the past, the cost of starting a new sub-group undoubtedly put limits on dissent. As the costs descend, forming a splinter group becomes easier. ... Old paper forms may, then, have been a resource for stability.'' However, Brown and Duguid are careful to note that they do not claim that documents themselves determine social processes. Rather, technology is an enabler with the potential to support various scenarios.

Brown and Duguid also discuss the use of documents as a means for negotiation. They note that conventional forms of publishing sever the link between the original document and the commentary made on it, moving comments from the margins to the bottom of pages to the back of the book. The rise of hypertext reintroduces the usefulness of the document as a means for supporting dialogue and commentary. Because writing often promotes more writing, documents can be used either to extend debate or as a common basis for agreement. As another example, they consider the case of faxes, non-digital documents that can be easily annotated. Annotated faxes show the trail of an argument as well as the participants in the discussion, as comments are written on comments, and addressees' names are appended to addressees' names. They suggest that the popularity of fax machines, a non-digital technology, is not surprising precisely because of the close analog link between the text and the commentary.

David Levy of PARC writes about a related idea: the perceived contrast between the fixity of paper documents and fluidity of digital documents [Levy]. He writes that traditionally one of the most salient characteristics of documents has been their fixity, that is, the fact that their contents remain stable and unchanged across time and space, allowing people through the ages to have access to the same meanings or communicative intent. Today, however, with the increasing use of digital technologies, it is often asserted that we are moving from the fixed world of paper documents to the fluid world of digital documents. Levy argues that all documents, regardless of medium, are both fixed and fluid. He notes that paper documents are subject to change, as in the fax example given above, and that digital documents have fixed properties. For example, before someone can edit a digital document, a fixed version must be loaded into the word processing program. Only those parts that are explicitly edited are changed; the rest of the document remains unaltered.

Brown and Duguid comment on the social consequences of the fixity/fluidity contrast:

``[T]he fixed, immutable `document' is best understood not as an inferior and outdated alternative to conversation or other types of unmediated and immediate communication, but, in appropriate places, as an object that plays valuable social roles because it mediates and temporizes, records traces and fixes spaces, and demands institutions as well as technologies of distribution. Attempts to introduce time stamps, hash marks, and other forms of electronic version identification stress how important to social and particularly legal institutions the idea of a fixed state of a document is. ... Already, many documents retain a constant text while their links are continually changed. As the social roles of continuity and change, of areas of status and areas open to dynamic revision, are better understood, social institutions may develop around this joint capacity [of fixity and fluidity] in intriguing ways, much as libraries developed their usefulness out of the juxtaposition of fixed individual texts combined to an ever expanding collection and a continually revised set of interlinked catalogues. This interplay between fixity and fluidity, formerly possible only on the scale of collections may now become a central feature of individual documents.''

Another important aspect of the changing social role of documents is the effect on what it means to publish. Geoff Nunberg, in an essay entitled ``The Places of Books in the Age of Electronic Reproduction,'' writes about the interaction between publication and the social creation of a body of knowledge [Nunberg]:

``[T]he shift to electronic publication wouldn't be possible in the absence of a social organization that enables scientific communities to compensate for features of print discourse that are lost in the transition. For example, electronic publication by itself can't canonize an article in the way that publication in a prestigious print journal or review can, partly because of the reduction of editorial authority, and partly because the form of publication provides no guarantee that other members of the community will have seen the article. In scientific communities, however, formal publication isn't the only or even the most important way of bringing research to the attention of the relevant audience. A large part of scientific discourse is transacted through seminars, conference papers, exchanges of photocopies, and most important, in informal discussions among practitioners (a type of discourse that electronic communication extends and enhances in very useful ways).''

Digital Property Rights and Document Distribution

The discussion above centered around how freely accessible documents help to shape the social space. This section addresses social aspects of document use and distribution in the commercial world.

A great deal of attention in the discourse surrounding electronic publishing and digital libraries centers on the question of how documents can be copied and distributed while at the same time fairly and efficiently compensating the authors of the works. Mark Stefik of PARC has proposed a set of ideas called Digital Property Rights that include a technological component that has the potential to enable new forms of exchange and distribution of digital documents and other intellectual commodities [Stefik].

Digital property rights take into account the practices and uses of documents and their newly mutable forms, and attempt to satisfy the needs of publishers and users of published works. The technological base rests on the idea of trusted systems, that is, ``a computer system that can be relied upon to respect the rules governing the use of a digital work.'' A trusted system can keep track of which rights are associated with which works and who has access to those rights. However, the aspect of the work that is of interest to this discussion is the rights language and what it entails about how published works are used, and this can be understood independently of the underlying technology.
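The bookkeeping role of a trusted system can be illustrated with a minimal sketch. This is not Stefik's design: the class name, method names, and the right labels (``copy'', ``print'') are placeholders invented for illustration. The sketch only shows the idea that a system records which rights are associated with which works, who holds them, and refuses any use that is not covered by a recorded right.

```python
class TrustedSystem:
    """Hypothetical registry: which party holds which rights to which work."""

    def __init__(self):
        # Maps (work, party) to the set of right names granted (assumed labels).
        self._grants = {}

    def grant(self, work, party, right):
        """Record that `party` holds `right` for `work`."""
        self._grants.setdefault((work, party), set()).add(right)

    def revoke(self, work, party, right):
        """Remove a previously recorded right; a no-op if it was never granted."""
        self._grants.get((work, party), set()).discard(right)

    def may(self, party, right, work):
        """Return True only if the right has been recorded for this party."""
        return right in self._grants.get((work, party), set())

    def exercise(self, party, right, work):
        """Refuse any use of a work that is not covered by a recorded right."""
        if not self.may(party, right, work):
            raise PermissionError(f"{party} has no {right!r} right for {work!r}")
        return f"{party} exercised {right!r} on {work!r}"
```

In this sketch the rules live entirely in the registry, which is one way to see why the rights language can be discussed independently of the technology that enforces it.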

Developing the specifications for a digital rights language makes it possible to address, or at least to clarify to some extent, perplexing philosophical questions such as what it means to make a copy, and complicated social issues such as how to provide fair use of digital documents.

Stefik observes that there is confusion about what it means to make a copy. With a photocopier, making a copy means putting marks on paper that can be used in the same way as the original. This analogy also applies well to the copying of videotapes. However, it does not extend well to making copies of documents on computers. Simply copying the bits from a network to an input buffer to some part of main memory could be considered making three copies of the document. But this kind of bit replication does not constitute the creation of three usable copies, and this is the critical point. The usability of the copy is what is of interest; publishers and authors should be able to expect to be compensated for usable copies.

Further extending this idea, Stefik suggests making a distinction between a Copy right and a Transfer right. A Copy right makes a new usable digital copy without destroying the old one. A Transfer right makes a new usable copy and destroys, or makes inaccessible, the old one. The Transfer operation is similar in behavior to a bank transaction in which a customer transfers money from one account to another; once the transfer has occurred the money no longer exists in the original account. Similarly, when a person loans a book to a friend, the lender no longer has a copy of the book. Stefik discusses the possibility of a Loan right, which is similar to a Transfer right, but has time limits associated with it. After the loan period is over the rights to the use of the document revert automatically to the lender and the book is no longer accessible by the lendee. The transaction could be set up to offer an extension to the loan period, potentially for a fee, as well as offering an option to buy the work. Furthermore, a loaning library could offer a combination of for-free and for-fee usage rights. For example, a library could have five copies of a popular book available for free and ten copies available for a small fee. Those patrons who did not wish to wait for a free copy to become available could pay a fee for faster Loan access (but still pay less than would be required for purchasing a copy outright), and this fee could be used to subsidize more for-free copies.
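The difference between the Copy, Transfer, and Loan rights can be made concrete with a small sketch that counts usable copies. Everything named here (Holder, Loan, copy_right, transfer_right, loan_right, expire, and the use of integer time units) is an assumption made for illustration; it is not the rights language Stefik specifies.

```python
from dataclasses import dataclass, field


@dataclass
class Holder:
    """A hypothetical party (library, patron) known to a trusted system."""
    name: str
    usable: dict = field(default_factory=dict)  # work title -> usable copies held


def _require_copy(holder, work):
    if holder.usable.get(work, 0) < 1:
        raise ValueError(f"{holder.name} holds no usable copy of {work!r}")


def copy_right(src, dst, work):
    """Copy: the destination gains a usable copy and the source keeps its own."""
    _require_copy(src, work)
    dst.usable[work] = dst.usable.get(work, 0) + 1


def transfer_right(src, dst, work):
    """Transfer: the destination gains a usable copy and the source loses one."""
    _require_copy(src, work)
    src.usable[work] -= 1
    dst.usable[work] = dst.usable.get(work, 0) + 1


@dataclass
class Loan:
    lender: Holder
    borrower: Holder
    work: str
    due: int  # time at which the rights revert (arbitrary integer units)


def loan_right(lender, borrower, work, now, period):
    """Loan: a Transfer that carries a time limit."""
    transfer_right(lender, borrower, work)
    return Loan(lender, borrower, work, due=now + period)


def expire(loan, now):
    """Revert the copy to the lender once the loan period is over."""
    if now >= loan.due and loan.borrower.usable.get(loan.work, 0) > 0:
        transfer_right(loan.borrower, loan.lender, loan.work)
        return True
    return False
```

Under these assumptions a Copy raises the total number of usable copies by one, while a Transfer or a Loan leaves the total unchanged, which is the property the bank-account analogy relies on.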

Stefik argues that if digital libraries made use of a mechanism like the Loan right, publishers and authors would not need to be concerned about loaning libraries undermining the value of their digital works because the number of copies available would be kept constant. At the same time, loaned copies will never be lost or turned in late, because as soon as the time period has expired the library will recover the rights to access its copy of the document.
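The combination of for-free and for-fee loan copies described above can also be sketched in a few lines. The LoanDesk class, its method names, and the fee amount are hypothetical; the sketch assumes only that a library holds a fixed number of copies in each pool and that a patron who declines to pay must wait for a free copy.

```python
class LoanDesk:
    """Hypothetical loan desk with a fixed pool of free and for-fee copies."""

    def __init__(self, free_copies, fee_copies, fee):
        self.free_available = free_copies
        self.fee_available = fee_copies
        self.fee = fee          # assumed flat fee per paid loan
        self.revenue = 0.0      # could subsidize additional free copies

    def borrow(self, willing_to_pay=False):
        """Return 'free', 'fee', or 'wait' depending on what is available."""
        if self.free_available > 0:
            self.free_available -= 1
            return "free"
        if willing_to_pay and self.fee_available > 0:
            self.fee_available -= 1
            self.revenue += self.fee
            return "fee"
        return "wait"

    def give_back(self, kind):
        """Return a loaned copy to the pool it came from."""
        if kind == "free":
            self.free_available += 1
        elif kind == "fee":
            self.fee_available += 1
        else:
            raise ValueError(f"unknown loan kind: {kind!r}")
```

Because copies only move between the desk and its borrowers, the number of copies in existence stays constant, which is the point of the argument about preserving the value of the work.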

Stefik points out [private communication] that it is perhaps paradoxical that, given the availability of digital documents, library patrons might have to wait in order to read a book. On the other hand, publishers are trying to find ways to recoup costs and some fair way to amortize costs across users. The ``conservation of copies'' idea is one way that we already understand to amortize these costs, but it is not the only way, and alternative models should be considered.

Once digital property rights are established, mechanisms are needed for fast and efficient distribution of the services those rights support. Bernardo Huberman, Tad Hogg, and other PARC researchers have been studying the social and computational aspects of large distributed systems. (See [Huberman et al.].) In the context of global distributed markets, Huberman has developed algorithms for what he calls Market Based Document Services. He notes that current technology for document services is predominantly manual, requiring the user to be aware of what services are available in advance. Having the user specify the service can also lead to inefficiencies, since the user probably does not know about the most appropriate and cost-effective resources. For example, a high-quality printer may be preferable for some tasks, but the user might not know of the existence of such a printer, or how to send documents to it. To improve this situation, Huberman has developed a novel way of providing document services which relies on computer-mediated auctions. These auctions automatically pair the needs of the user with the best matched providers, using the Internet as the communication medium. Based on previous experience with auction-based algorithms for resource allocation in distributed computer systems, he conjectures that this mechanism will be fast and efficient enough to lead to true market fair prices, large savings for customers, and good matches between the needs of the user and the available resources.
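The article does not give the rules of Huberman's auctions, so the following is only a sketch of the general idea under a simple assumption: each provider submits one sealed bid, and the job goes to the lowest-priced provider that offers every capability the user requires. The names Bid and run_auction, and the capability labels in the usage note, are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    """A hypothetical sealed bid from one document service provider."""
    provider: str
    capabilities: frozenset  # what the provider can do (assumed labels)
    price: float             # price quoted for the job


def run_auction(required, bids):
    """Pick the cheapest bid that covers every required capability.

    Returns None when no provider can satisfy the request.
    """
    eligible = [b for b in bids if required <= b.capabilities]
    return min(eligible, key=lambda b: b.price, default=None)
```

With bids from three printers, for example, a request for frozenset({"color", "duplex"}) would ignore a cheaper printer that offers only color and select the least expensive printer offering both. The user states a need rather than naming a device, which is the shift away from manual service selection that the passage describes.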

To Appear in Part II

This article has given a brief account of how some PARC researchers expect the nature and use of documents to change as digital libraries and electronic publishing continue to expand in importance. Part II will describe some of the technology created at PARC in support of these emerging phenomena. The focus will be on three main areas:

Capture, analysis, and presentation of document images, including document image decoding, image search and retrieval and creation of new paper presentations that combine information from multiple sources.
Information access and visualization, including search, browsing, and visualization of large text collections, summarization, category assignment, and automatic detection of thematic structure.
Middleware for the support of document services, including a system architecture to support connectivity of distributed document services and a uniform programming interface to document management systems.

Some of this work is being used in the NSF-sponsored digital libraries projects.

Acknowledgments

Gary Kopec provided invaluable assistance with the construction and contents of this document, and Mark Stefik, Per-Kristian Halvorsen, and Amy Friedlander provided many helpful comments.

References

[Brown and Duguid] John Seely Brown and Paul Duguid, ``The Social Life of Documents,'' http://www.firstmonday.dk, May 1996 (also, in Release 1.0, Esther Dyson (ed.), October 11, 1995).

[Card et al.] Stuart K. Card, Thomas P. Moran, and Allen Newell, ``The Psychology of Human-Computer Interaction,'' L. Erlbaum Associates, 1983.

[Halasz et al.] Frank G. Halasz, Thomas P. Moran, and Randall H. Trigg, ``Notecards in a Nutshell,'' in the Proceedings of the CHI + GI 1987 Conference on Human Factors in Computing Systems and Graphics, Toronto, Ontario, April 5-9, 1987.

[Huberman et al.] Bernardo Huberman and Tad Hogg, ``Distributed Computation as an Economic System,'' Journal of Economic Perspectives, 9 (1), 141-152, Winter 1995.
Natalie Glance and Bernardo Huberman, ``The Dynamics of Social Dilemmas: Individuals in Groups Must Often Choose Between Acting Selfishly or Cooperating for the Common Good,'' Scientific American, 270 (3), 76-81, March 1994.
For more information: ftp://parcftp.xerox.com/pub/dynamics/dynamics.html

[Levy] David M. Levy, ``Fixed or Fluid? Document Stability and New Media,'' in the Proceedings of the 1994 European Conference on Hypermedia Technology, ACM Press, 1994.

[Levy and Marshall] David M. Levy and Catherine C. Marshall, ``Going Digital: A Look at Assumptions Underlying Digital Libraries,'' Communications of the ACM, 38 (4):77-84, April 1995.

[Nunberg] Geoffrey Nunberg, ``The Places of Books in the Age of Electronic Reproduction,'' Representations, (42), 13-37, Spring 1993.

[Pake] George E. Pake, ``Research at Xerox PARC: A Founder's Assessment,'' IEEE Spectrum, October 1985.

[Stefik] Mark Stefik, ``Letting Loose the Light: Igniting Commerce in Electronic Publication,'' in Mark Stefik (Ed.), Internet Dreams: Archetypes, Myths, and Metaphors. Cambridge, Mass: MIT Press, (October 1996)

[Suchman] Lucy Suchman, ``Making Work Visible,'' Communications of the ACM, 38 (9), 56-64, September 1995.

[Van House] Nancy Van House, ``User Needs Assessment and Evaluation for the UC Berkeley Electronic Environmental Library Project,'' in the Proceedings of Digital Libraries '95: The Second International Conference on the Theory and Practice of Digital Libraries, June 11-13, 1995, Austin, Texas. http://www.csdl.tamu.edu/DL95/contents.html

Marti Hearst
Monday, May 13, 1996

Copyright © 1996 Xerox Corporation. Permission to copy without fee of this material is granted provided that the copies are not made or distributed for direct commercial advantage, this copyright notice and the title of this publication and its date appear. To copy otherwise, or republish, requires a fee and/or specific permission.

hdl://cnri.dlib/may96-hearst