Safeguarding Digital Library Contents and Users

Assuring Convenient Security and Data Quality

H.M. Gladney and J.B. Lotspiech
IBM Almaden Research Center
San Jose, California 95120-6099
gladney,[email protected]

D-Lib Magazine, May 1997

ISSN 1082-9873

Abstract. Digital library (DL) services must protect copyright holders, owners, users, and themselves against deliberate and inadvertent misuses of contents. Unobtrusive programs can provide sufficient protection, yet spare users irksome choices between distracting inconveniences and imprudent risks. Which safeguards are important depends on the kind of information resource and on the circumstances of its use.
DL service can be regarded as a dramatic improvement of certain aspects of traditional library services. What computer scientists mean by integrity is similar to what librarians mean by authenticity; confidentiality is a basis for privacy; and security is a basis for maintaining authors' and owners' intellectual property rights. Collectively, confidentiality, integrity, and security are what we mean by data quality; DL service can deliver quality with good performance, convenience, and economy. We sketch some risks, identify generic mitigations, and show direction towards cost-effective and convenient solutions.
DL protection extends well-known computing security practices. Novel elements include means of representing contractual obligations associated with information ownership, means of integrating different access policies chosen by different communities, means for handling very large numbers of objects and access rules, means for reducing loss outside library protection perimeters, and means for doing these and other tasks without adding to people's administrative burdens.

Introduction

Intellectual property policy is being debated [Roscheisen95] in venues that will affect litigation and legislation [Lehman95]. As can be seen from news about information moved across national borders, policies will evolve differently in different political and contractual domains. Different mixes of protective measures will be required in different venues, for different kinds of information, and for various administrative and digital system circumstances. An objective of the research and development group to which the authors belong is customizable software combinations which support whatever policy choices are made by those with authority to make them.

Safeguarding the economic, quality, and confidentiality interests of copyright holders and end users is an essential DL service. DL applications span widely disparate content types, values, origins, and longevities, and widely disparate user purposes and operating environments. What a multi-campus university with 100,000 students wants is different from what a movie studio with 100 film editors wants, and different from what business information users want [Choy96b]. Content owners, librarians, and information consumers teach us that we must address all important kinds of data, all important kinds of service, and all important risks.

Intellectual property management is often construed as protecting the rights of document owners to collect revenue. (In this report, document means any package of information that can be conveyed digitally, e.g., a film together with associated administrative information, pre-recorded music, a set of web pages, etc.) This viewpoint is too narrow. Readers must be confident of their privacy and of information authenticity, i.e., that information comes from purported sources and is not fraudulently altered. Publishers need authenticity controls to protect against lawsuits, flexibility to create their own product look and feel, and automatic tools to protect data from unauthorized use. Vocal university groups insist that "fair use" be continued, with legislative affirmation. Other university groups, concerned with escalating prices of the primary research periodicals, are launching on-line journals for which costs may have to be recovered by subscription fees.

We have for several years been collecting and analyzing what people say they need and are devising a comprehensive framework. We plan a modular framework which allows implementation of any portion just in time for the applications it enables. Our solution will be flexible, allowing whatever protection policies and royalty schemes enterprises and applicable law require. It will be open, allowing inter-operation among standard platforms and exploitation of offerings from many vendors. We know that assuring the quality of information delivery and measuring its flow cannot be accomplished without basic computer and communication security. Thus, our intellectual property protection will integrate complementary tools which impede all likely forms of deliberate and inadvertent abuse.

Through pilot implementations (as in an IBM joint study with a Case Western Reserve University team), we have been exploring the intricacies of applicable practice and are developing prototype components for certain service and data types. We are considering combining in a new prototype the best from several studies including mechanisms such as:

WWW browsers providing as much protection against unauthorized copying as is feasible;
Envelopes for safe document delivery, administration, and authentication;
Key management for superdistribution; ("superdistribution" is open dissemination of information in a form which is useless to each receiver until (s)he obtains keys from a publisher's agent. This allows unhindered information redistribution without interfering with the copyright holder's ability to collect revenue.)
Document marking for identification of sources and destinations;
Back-office support which helps administer contracts with information owners;
Interfaces with funds-transfer components of electronic commerce systems; and
Management of different document versions requiring different protections.

In what follows we sketch each of several qualities of distributed information (or its opposite, a risk of degradation or misuse) and then suggest mitigating software. In each case, the program must execute in a context which includes physical barriers, such as "the glass wall" of a computing center, and administrative deterrents, such as punishments for offenders. We will not mention these further, because they are mostly obvious and well-known.

No reader should expect 100% protection of anything. Safeguards are nearly always imperfect. Thus a realistic objective is to make misappropriation economically unattractive or, alternatively, to maximize benefits to the property owner, as suggested by Figure 1.

Figure 1. Balance between too little and too much protection: revenue is nil with either no security or too much security; by the extreme value theorem, if revenue varies continuously with the level of protection and is positive at some point, it attains a maximum between these extremes.

Structure of Library and Document Protection

In what follows, we sketch an orderly identification of users' expectations for quality data delivery and of what shortfalls can occur. In each case, we indicate how a specific mechanism can reduce risks. The mechanism choices are mostly determined by where in the delivery system the risks arise; the choice of specific mechanism from a class is sensitive to the value of data at risk and to user preferences.

Envelopes For Distributing Documents And Administrative Addenda

For received information, the viewer V would like to know that (s)he has at hand "the real McCoy, the whole real McCoy, and nothing but the real McCoy", and would like to know that it is McCoy's mother who assures him/her that it is McCoy (s)he is seeing. Since the lady who says she is McCoy's mother might be lying, V might ask her to show some trusted agency's certificate of her identity [Blaze96]; a driver's license with picture and signature, all encased in plastic, might be enough.

V might want, if there is a fee for seeing McCoy, to know the price before agreeing to pay it. The theater owner and Mrs. McCoy want to be sure that they get the fee for each visit, and they might want a demographic survey of McCoy visitors without violating their anonymity, e.g., so that neither V's spouse nor V's worst enemy knows or can prove that V saw McCoy.

We can come close to accomplishing all this by shrouding all information about visits in carefully labeled wrappings, adding new files which package administrative rules, and locking all these things up with a package of rule keys.

Bill of Materials

Clear Text "teaser" (HTML)

Encrypted fingerprinting and watermarking instructions

Encrypted document part Key record

Encrypted document part Key record

Encrypted document part Key record

Terms and Conditions

Integrity protection and signatures

Figure 2. Information packing, called a Cryptolope(r)

The digital equivalent of a McCoy package is wrappings for file sets and tokens which ensure that any change to the package can be detected. We have adopted a packaging (Figure 2) which includes all necessary administrative information, such as terms and conditions of access and protected signatures used to validate authenticity. The primary content is compressed and encrypted. Each package, called a Cryptolope(r), can include information so that a user can determine not only the price of any protected piece, but also its value; for instance an encrypted financial analyst's report might be accompanied by its abstract in the clear, and a feature movie by the kind of preview theaters commonly show. Each package includes a bill of materials (BOM), digital signatures and cryptographic checksums of document portions, clear text elements sufficient to convey the value of encrypted document portions, and so on. The only obligatory portion is the BOM.
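The role of the bill of materials in detecting changes can be illustrated with a minimal Python sketch. It is not the Cryptolope format itself: the field names, the JSON encoding, and the functions build_package and verify_package are assumptions made for illustration, the document parts are treated as opaque bytes that have already been compressed and encrypted, and an HMAC with a shared key stands in for the digital signatures mentioned above.

```python
import hashlib
import hmac
import json


def build_package(parts, teaser, terms, signing_key):
    """Assemble a package with a bill of materials (BOM).

    `parts` maps part names to bytes that are assumed to be already
    compressed and encrypted; this sketch treats them as opaque.
    """
    bom = {
        "teaser": hashlib.sha256(teaser).hexdigest(),
        "terms": hashlib.sha256(terms).hexdigest(),
        "parts": {name: hashlib.sha256(blob).hexdigest()
                  for name, blob in parts.items()},
    }
    bom_bytes = json.dumps(bom, sort_keys=True).encode("utf-8")
    # Assumption: an HMAC stands in for a public-key signature over the BOM.
    seal = hmac.new(signing_key, bom_bytes, hashlib.sha256).hexdigest()
    return {"bom": bom, "seal": seal, "teaser": teaser,
            "terms": terms, "parts": dict(parts)}


def verify_package(package, signing_key):
    """Return True only if the BOM seal and every checksum still match."""
    bom = package["bom"]
    bom_bytes = json.dumps(bom, sort_keys=True).encode("utf-8")
    expected = hmac.new(signing_key, bom_bytes, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, package["seal"]):
        return False
    if hashlib.sha256(package["teaser"]).hexdigest() != bom["teaser"]:
        return False
    if hashlib.sha256(package["terms"]).hexdigest() != bom["terms"]:
        return False
    if set(package["parts"]) != set(bom["parts"]):
        return False
    return all(hashlib.sha256(blob).hexdigest() == bom["parts"][name]
               for name, blob in package["parts"].items())
```

Because the seal covers the BOM and the BOM covers every element, altering the teaser, the terms, or any encrypted part makes verification fail.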

This packaging enables superdistribution--information delivery before it is needed, possibly by inexpensive channels (e.g., network services at off hours). Cryptolopes can be safely delivered over broadcast channels, or by CD-ROM. In fact, superdistribution has been used for over five years for software delivery. CD-ROMs are delivered by the post office; if you want part of a disc's valuable content, you pay for its encryption key.

A Cryptolope can safeguard information it does not carry as surely as it can safeguard embedded, encrypted document parts.

Managing Secrets to Interpret Cryptographic Envelopes

In one of our implementations, when a user decides to accept the terms and conditions of document use, a fast exchange of encryption, authentication, and financial data occurs between the user's workstation, the library service, and an associated clearance center. The system decrypts the information granted and makes whatever accounting records are called for (Figure 3). We do this safely by a combination of public and private key encryption. This avoids burdening anyone by demanding a password or any other secret beyond what was asked as part of an authenticated login. The figure suggests the network protocols and processing elements which accomplish this efficiently. This realization provides an interactive dialogue which opens document portions after their owners' conditions are met, e.g., by checking that the user has a subscription.
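The conditional release of keys can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the protocol of Figure 3: the ClearanceCenter class and its methods are hypothetical, the public-key exchange and the funds transfer are omitted, and keystream_xor is a toy cipher built from SHA-256 that stands in for a vetted encryption algorithm.

```python
import hashlib
import secrets


def keystream_xor(key, data):
    """Toy cipher: XOR data with a SHA-256 counter keystream.

    Illustrative only; a real system would use a vetted cipher.
    Applying the function twice with the same key restores the data.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


class ClearanceCenter:
    """Hypothetical agent that holds part keys and accounting records."""

    def __init__(self):
        self.part_keys = {}       # part id -> encryption key
        self.subscribers = set()  # user ids with a valid subscription
        self.ledger = []          # (user id, part id) accounting records

    def register_part(self, part_id):
        """Create and remember a fresh key for one document part."""
        key = secrets.token_bytes(32)
        self.part_keys[part_id] = key
        return key

    def request_key(self, user_id, part_id, accepted_terms):
        """Release a key only if the terms were accepted and the
        user holds a subscription; record the access when granted."""
        if not accepted_terms or user_id not in self.subscribers:
            return None
        self.ledger.append((user_id, part_id))
        return self.part_keys[part_id]
```

The encrypted part can travel freely; only a reader who meets the owner's conditions obtains the key, and each release leaves an accounting record.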

Figure 3. Safe document retrieval in two-tier delivery: unfortunately for this scheme, hardware to make workstations trustworthy will not be available soon.

This particular scheme is practical with honest users, and also with dishonest ones who cannot manipulate specialized workstation software, but it can easily be subverted by a determined university student. It cannot provide good security without impervious key management built into every workstation. Such kernels will not be widely deployed in the foreseeable future.

The Three Level Hierarchy

The problem with which the prior section ended can be solved [Choy96] by adding isolating way-houses we call campus servers. We will manage three kinds of document stores: source libraries (alpha in Figure 4, A/A' in Figure 5), which belong to publishers or other information owners; campus libraries (beta in Figure 4, B in Figure 5), which play a similar role to university or city libraries, distributing copies obtained from source libraries; and end-users' workstations.

Documents flow from their owners toward their users. Figure 4 suggests all the pertinent flows except for flows within each box, e.g., transfers among document owners. People considering rights management tend to focus on alpha-beta, alpha-gamma, and beta-gamma transactions. Figure 4 reminds us to include delta-beta transactions (e.g., a professor giving class notes to the university library) and delta-alpha transactions (e.g., an author dealing with her publisher). We ignore the delta-gamma path because we do not expect to mediate it.

Figure 4. Document distribution pathways: each pathway potentially has different control needs.

Document distribution is executed by a hierarchy of computer processes as suggested by Figure 5. Higher hierarchical levels are clients of lower level processes, e.g., Campus Library Services (B) is a client of Source Library Services (A). A lower level server cannot distinguish such a client process from a human user (except that it may seem to be a voracious reader). The client process is subject to the same kind of authorization controls as a human user of the lower level might be; e.g., the storage subsystem in (A) connects to a database in (A') using user identity authentication provided by system components we do not discuss. Each higher level gives more constrained information access than the level it depends on, i.e., it filters the information from lower levels.
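The client relationship between levels can be sketched as follows. The classes SourceLibrary and CampusLibrary and their methods are hypothetical names chosen for illustration, and authentication is reduced to a lookup of identifiers; the point is only that the campus process is authorized exactly as a human reader would be, and that it filters what it passes on.

```python
class SourceLibrary:
    """Level A: holds documents and the clients allowed to read them."""

    def __init__(self):
        self.documents = {}  # doc id -> content
        self.readers = {}    # doc id -> set of authorized client ids

    def add(self, doc_id, content, readers):
        self.documents[doc_id] = content
        self.readers[doc_id] = set(readers)

    def fetch(self, client_id, doc_id):
        # The same check applies whether the client is a person or a process.
        if client_id not in self.readers.get(doc_id, set()):
            raise PermissionError(f"{client_id} may not read {doc_id}")
        return self.documents[doc_id]


class CampusLibrary:
    """Level B: a client of the source library that filters further."""

    def __init__(self, campus_id, source):
        self.campus_id = campus_id
        self.source = source
        self.members = set()  # readers registered with this campus
        self.cache = {}       # doc id -> locally held copy

    def fetch(self, user_id, doc_id):
        if user_id not in self.members:
            raise PermissionError(f"{user_id} is not a campus member")
        if doc_id not in self.cache:
            # The source sees only the campus identity, not the reader's.
            self.cache[doc_id] = self.source.fetch(self.campus_id, doc_id)
        return self.cache[doc_id]
```

In this sketch the source library never learns which member asked for a document, and repeated requests are served from the campus copy.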

Figure 5. Document distribution hierarchy: documents and metadata can be stored into libraries only by users of the (A) and (A') layers, from which they are delivered to each customer's library (B) and eventually to end users' environments (C).

The campus library level is optional. End users might access source libraries directly. This is how we expect a video production studio to operate. The campus library level conveys several benefits:

It enables immense performance improvements for large user groups distant from their source libraries;
It makes it easy for a library service to preserve the anonymity of its readers;
It provides a single point of user authentication for readers of subscriptions. For example, a university library might want to register each student as a byproduct of class registration; and
It allows us to emulate the common practice of institutional subscriptions to which free access is granted to institution members.

Figure 6. Document distribution network: suggesting that having multiple document sources forces us to find a unique document-naming scheme.

Figure 5 depicts one instance of each client/server pair. In fact each source library service must supply many distribution services, possibly with keys which depend on the distribution service even for the same content objects (Figure 6). Each campus library service may draw on several source services. Clients mostly use campus library services, but can use source library services directly. Campus library services can be cascaded.
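One way to picture keys that depend on the distribution service, together with names that stay unique across sources, is the sketch below. It is an assumption made for illustration and not the key management scheme referred to in this article: document_name simply prefixes a local identifier with its source, and distribution_key derives a per-distributor key with HMAC-SHA-256 from a master key held by the source library.

```python
import hashlib
import hmac


def document_name(source_id, local_id):
    """Qualify a local identifier with its source so that names
    from different source libraries cannot collide."""
    return f"{source_id}/{local_id}"


def distribution_key(master_key, distributor_id, doc_name):
    """Derive a key that differs per distribution service,
    even for the same content object."""
    message = f"{distributor_id}|{doc_name}".encode("utf-8")
    return hmac.new(master_key, message, hashlib.sha256).digest()
```

With such a derivation, a key leaked from one campus library would not open the copies held by another.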

Key management that overcomes the weakness associated with Figure 3 is beyond the scope of the current article. We hope to describe that [Lotspiech97] in a future D-Lib article.

Databases For Access Control And Royalty Management

Cryptolopes, key management, and contractual obligations accepted by campus library managers provide most of the software needed to administer rights management in campus libraries and in users' workstations. (This assumes that well-known computer and communications security measures are prudently managed and properly administered in source and campus libraries.) However, publishers need to keep for many years records of what they have promised each copyright holder and to make this information available to their editors whenever material is reused; this is a complex data management challenge. We are preparing database schema and manipulators for such rules and using them to construct Cryptolopes (Figure 3, top right). Such databases keep track of the terms and conditions under which each document is held in the library and made accessible to users.

In general, each source library has many users authorized to add to the collection or modify its contents. Such changes might include many versions of each original document. For some collections (such as original video "takes"), the authorized users themselves are a source of significant risk; that's why the military talks of "need to know". Thus, to whatever basic catalog is needed to keep track of the history, location, and interrelationships of document parts, information must be held to guard against violations by update users and also to record the permission rules for distribution of each document part.

In our source library catalog we include access control information which is similar to but more elaborate than typical computing file system access control data [Gladney92]. We further provide a database for permissions management (terms, conditions, prices, provenances, etc.) which represents what is typically found in contracts, copyright statements, and similar, legally significant documents.
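A permissions database of this kind might, in its simplest form, look like the sketch below, which uses Python's built-in sqlite3 module. The table and column names, the use kinds, and the function price_for are hypothetical; real contracts need far richer terms than a price and an expiry date.

```python
import sqlite3

# Hypothetical schema: one row per document part, and one row per
# permitted kind of use of that part.
SCHEMA = """
CREATE TABLE document_part (
    part_id     TEXT PRIMARY KEY,
    provenance  TEXT NOT NULL
);
CREATE TABLE permission (
    part_id     TEXT NOT NULL REFERENCES document_part(part_id),
    use_kind    TEXT NOT NULL,      -- e.g. 'view', 'print', 'republish'
    price_cents INTEGER NOT NULL,
    expires     TEXT,               -- ISO date, NULL if open-ended
    PRIMARY KEY (part_id, use_kind)
);
"""


def open_catalog():
    """Create an empty in-memory catalog with the schema above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def price_for(conn, part_id, use_kind, on_date):
    """Return the price in cents, or None if the use is not permitted
    on the given ISO date."""
    row = conn.execute(
        "SELECT price_cents FROM permission "
        "WHERE part_id = ? AND use_kind = ? "
        "AND (expires IS NULL OR expires >= ?)",
        (part_id, use_kind, on_date),
    ).fetchone()
    return None if row is None else row[0]
```

An editor reusing material would query such a catalog before publication; the absence of a matching row means the use has not been granted.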

Language in Cryptolopes, at User Interface, and in Db

Copyright holders' rules for re-use of their materials can be complex and idiosyncratic, and are binding for as long as a century. Consider the administrative problems associated with an operetta whose favorite tunes are wanted for advertising background. What we have described so far can be used to ensure that the terms and conditions remain safely bound to the valuable material, but we have not yet said anything about how the terms and conditions are expressed.

We can distinguish at least three domains in which we need language to articulate terms and conditions:

To transport such rules among stores, as part of Cryptolopes (near bottom of Figure 2);
To store rules in document databases; and
To help authors and other copyright holders to express their terms and conditions for injection into these databases.

These languages need not be identical; in fact, what is best is different in each domain. Stefik [Stefik97] at Xerox PARC, Barker at CWRU, and Walker [Walker95] in IBM Research have prototypes in the respective domains. We believe them semantically compatible and are considering putting them together in future prototypes.
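To make the distinction between domains concrete, the sketch below holds rules as Python objects (a stand-in for the stored form) and converts them to and from JSON text (a stand-in for the transport form). It does not represent any of the prototypes named above; the Rule fields, the audience values, and the function names are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Rule:
    use_kind: str      # e.g. "view" or "advertising"
    price_cents: int
    audience: str      # "anyone" or "subscribers"


def to_transport(rules):
    """Serialize rules to text, as they might travel inside a package."""
    return json.dumps([asdict(rule) for rule in rules], sort_keys=True)


def from_transport(text):
    """Rebuild rule objects from their transport form."""
    return [Rule(**item) for item in json.loads(text)]


def applicable(rules, use_kind, is_subscriber):
    """Return the first rule covering this use for this reader, if any."""
    for rule in rules:
        if rule.use_kind != use_kind:
            continue
        if rule.audience == "anyone":
            return rule
        if rule.audience == "subscribers" and is_subscriber:
            return rule
    return None
```

Rule order matters in this sketch: placing the subscriber rule first lets subscribers view at no charge while other readers fall through to the priced rule.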

Document Marking to Dissuade Misuse

Tight protection in users' workstations will not always achieve the best balance between widespread document usage and revenue. Document marking, coupled with visible warnings and moral or legal sanctions, can help reduce risks to levels acceptable for the materials and applications at hand.

Marking can be patent or hidden (not easily seen without instruments), as on U.S. currency, or secret (not apparent or interpretable without comparison with an unmarked original). Marking can be used to identify the source of a document (often called a "watermark"), or to identify to whom the library delivered a document (often called a "fingerprint"). Methods of marking must be different for different kinds of data; what works for a photograph is different from what works for a radio performance. Moreover, choice of document marking might depend on anticipating the user's intent; visible marks are unacceptable for a photograph to be reproduced in a magazine.

Several software companies have developed markings for leading edge customers and are developing further markings. For example, we have visible watermarks which permit medium-quality renditions of rare-manuscript art to be delivered by Internet for viewing with widely available WWW browsers [Mintzer96]. This supports scholarship by providing access to rare and unique materials for phases of the work in which exact replicas are not needed. We also have invisible fingerprints for printed page images so that pirated copies can be traced to their distribution points, and visible fingerprints which are durable under photocopying (Figure 7).

Figure 7. Two-dimensional barcode: an example of a visible fingerprint.
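The idea of an invisible fingerprint can be illustrated with the simplest textbook technique, which hides a recipient number in the least significant bits of pixel values. This is not the marking method used in the work described here, and unlike the durable marks mentioned above it would not survive printing, scaling, or recompression; the function names and the 16-bit identifier width are assumptions made for illustration.

```python
def embed_fingerprint(pixels, recipient_id, bits=16):
    """Hide `recipient_id` in the least significant bits of the
    first `bits` grayscale pixel values (one byte per pixel)."""
    if len(pixels) < bits:
        raise ValueError("image too small for the fingerprint")
    if not 0 <= recipient_id < (1 << bits):
        raise ValueError("recipient id does not fit in the given bits")
    marked = bytearray(pixels)
    for i in range(bits):
        bit = (recipient_id >> (bits - 1 - i)) & 1
        marked[i] = (marked[i] & 0xFE) | bit  # overwrite the lowest bit only
    return bytes(marked)


def read_fingerprint(pixels, bits=16):
    """Recover the identifier from the lowest bit of each marked pixel."""
    value = 0
    for i in range(bits):
        value = (value << 1) | (pixels[i] & 1)
    return value
```

Each marked pixel changes by at most one gray level, which is why such a mark is not apparent to a viewer, and also why it is easily destroyed.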

Writers on marking technology tend to focus on resistance to individual malefactors, non-disturbance of the intended purpose of the document receiver, and on robustness under transformations such as scaling, cropping, printing followed by scanning, and so on. Dwork [Dwork97] reminds us that this is not enough, because if information is widely distributed the logistics may overwhelm an attempt to enforce compliance with copyright holders' restrictions. For example, if a fine rendition of fine art is available on the WWW, and downloaded by thousands of users who might each forward it to many friends, how is one to know which WWW users to suspect, and even if one knows, how is one to prove it--without invading their machines?

Even when an offending user is caught "with the goods", it may be impossible in a court of law to establish unequivocally that he rather than someone else misbehaved. If a watermark is removed, how can one prove who removed it? These two examples suggest that much thinking is ahead of us before we can confidently assert that we know what marking accomplishes and what exposures will remain forever. Nevertheless, marking is a significant deterrent, especially against formal republication.

Work in Progress

This is a brief and early report of incomplete work. It concentrates on technical aspects of intellectual property management, leaving legal, political, and economic recommendations to writers better qualified for such aspects. Our technical direction follows from an observation that services must be distributed into three domain classes: source domains within which authors and publishers manage original works, secondary domains which obtain copies from source domains for access by end users, and end user workstations. Complementing well-known basic security tools, the key elements for digital rights management are databases for administering rules expressing policies, document interchange structures which combine encrypted document parts and administrative information, and document markings which inhibit misuse outside library perimeters.

We are preparing an article [Lotspiech97] to provide more depth on how we manage encryption keys so that no user is inconvenienced beyond a possible initial network login, how we arrange that trust is propagated almost automatically from a few impeccable institutions, and how we guarantee the authenticity and provenance of information received.

Over time, our IBM colleagues and we will describe more carefully the protection components we touched on and more [Anderson95]. As part of working out a framework within which all known requirements can be addressed (to the extent that this is theoretically feasible), we are examining many service circumstances and malfeasance scenarios. We hope to share the more interesting among these as we expose the architecture to critical review.

Acknowledgments.

This material is drawn from the work of many colleagues, including most prominently L. Anderson, J. Barker, C. Dwork, M. Kaplan, M. Kline, L. Scarborough, J. McCrossin, F. Mintzer, N. Morimoto, and their associates. We are grateful for permission to include their ideas without tedious citations.

References

L.C. Anderson and J.B. Lotspiech, Rights Management and Security in the Electronic Library, Bull. Am. Soc. Inf. Science 22(1), 21-23, (Oct./Nov. 1995).

M. Blaze, J. Feigenbaum, and J. Lacy, Decentralized Trust Management, Proc. IEEE Conference on Security and Privacy, Oakland, CA, (May 1996).

D.M. Choy, J.B. Lotspiech, L.C. Anderson, S.K. Boyer, R. Dievendorff, C. Dwork, T.D. Griffin, B.A. Hoenig, M.K. Jackson, W. Kaka, J.M. McCrossin, A.M. Miller, R.J.T. Morris, and N.J. Pass, A Digital Library System for Periodicals Distribution, Proceedings of ADL96, (May 1996).

D.M. Choy and R.J.T. Morris, Services and Architectures for Electronic Publishing, Proc. IEEE COMPCON'96, 291-297, (1996).

C. Dwork, Copyright? Protection? Seminar given at IBM Almaden Research Center, (May 1996).

H.M. Gladney, Access Control for Large Collections, IBM Research Report RJ 8946, (August 1992); to appear in ACM Trans. Info. Systems, (April 1997).

B.A. Lehman (U.S. Asst. Secy. of Commerce) et al., Intellectual Property and the National Information Infrastructure, available from Office of Legislative and International Affairs, U.S. Patent and Trademark Office, Washington, D.C. See also <http://iitf.doc.gov/>. This administration bill is controversial.

J. Lotspiech, U. Kohl, and M.A. Kaplan, Cryptographic Envelopes and the Digital Library, IBM Research Report RJ 10069, (1997). To be presented at Verlaessliche Informationssysteme, German Computer Society (GI), Freiburg, Germany, (9/29 to 10/2/97).

F.C. Mintzer, L.E. Boyle, A.N. Cazes, B.S. Christian, S.C. Cox, F.P. Giordano, H.M. Gladney, J.C. Lee, M.L. Kelmanson, A.C. Lirani, K.A. Magerlein, A.M.B. Pavani, and F. Schiattarella, Towards On-Line Worldwide Access to Vatican Library Materials, IBM J. Research and Development 40(2), 139-162, (March 1996). Also visible at <http://www.almaden.ibm.com/journal/rd/mintz/mintzer.html>.

M. Roscheisen, T. Stanley, and S. Stern, legal resources, <http://www.findlaw.com/>. For intellectual property law, see particularly <.../01topics/23intellectprop/index.html>.

M. Stefik, Trusted Systems, Scientific American 276(3), 78-81, (1997).

A. Walker, Proposal for Rights, Billing, and Mass Customization in Digital Library and e-Commerce, private communication, (1995).

Copyright and Disclaimer Notice

(C) Copyright IBM Corp. 1997. All Rights Reserved. Copies may be printed and distributed, provided that no changes are made to the content, that the entire document including the attribution header and this copyright notice is printed or distributed, and that this is done free of charge.

We have written for the usual reasons of scholarly communication. This report does allude to technologies in early phases of definition and development, including IBM property partially implemented in products. However, the information it provides is strictly on an as-is basis, without express or implied warranty of any kind, and without express or implied commitment to implement anything described or alluded to or provide any product or service. IBM reserves the right to change its plans, designs, and defined interfaces at any time. Therefore, use of the information in this report is at the reader's own risk.

Intellectual property management is fraught with policy, legal, and economic issues. Nothing in this report should be construed as an adoption by IBM of any policy position or recommendation.

hdl:cnri.dlib/may97-gladney