On Making and Identifying a "Copy"

D-Lib Magazine
January 2003

Volume 9 Number 1

ISSN 1082-9873

Norman Paskin
International DOI Foundation
<n.paskin@doi.org>

Identifiers (unique labels for entities) and metadata (structured relationships between identified entities) are prerequisites for Digital Rights Management (DRM) [1]. Identifiers and metadata enable precise specification of an entity. The essence of digital rights management is the control (licensing, etc.) of copies of entities; identifiers and metadata are therefore essential to the management of this process, and to distinguishing and expressing relationships such as replicas and derivations. In the current DRM environment, "any widely distributed object will be available to a fraction of users in a form that permits copying; users will copy objects if it is possible and interesting to do so; users are connected by high bandwidth channels" [2].

In this article I do not necessarily use the term "copy" in the specific sense of some copyright legislation, where a copy may be specifically defined to be a material object: the term is used in the generic sense of an imitation or reproduction of an original; a duplicate. However, the same considerations will be relevant to copyright when "what is a copy" is defined by means of defining attributes. In the physical world, the concept of a copy is intuitively understood at a sufficient level to avoid problems of interpretation (though some cases might require close examination); this is not so in the digital world.

The purpose of this article is to explore some of the conceptual issues of "making a copy" digitally and in an automated DRM environment, which need to be clarified to enable rights expression languages and other DRM tools to deal with the concept of copying. We conclude that two digital entities are never the same in any absolute sense and can be considered copies of each other only in the context of some defined purpose. This naturally has implications for the assignment and use of identifiers and other metadata, which are two sides of the same coin.

Entities of interest in digital copying: abstractions

An identifier is an unambiguous string denoting an entity; an item of metadata is "a relationship that someone claims to exist between two entities" [3], each of which may have an identifier (and must, in an automated environment). These entities may include both objects and concepts: e.g., an item of metadata may be "I state that this book has a blue cover", and that blue may be specifically identified by a Pantone number: both the book and "blue" would be identified entities. An entity is a term used to mean simply something that is identified. The underlying idea, from the indecs project, is that nothing exists in any useful sense until it is identified (computers are dumb, and ignorant of anything that is not explicitly stated to them). An ontology is a tool that is able to structure relationships between entities; an explicit formal specification of how to represent the entities that are assumed to exist in some area of interest and the relationships that hold among them [4].

Schemes that consider the entities of interest in the digital library world include specification of digital representations. For example, the MPEG-21 Multimedia Framework world [5] consists of Users who interact with Digital Items. A Digital Item can be anything from an elemental piece of content (a single picture, a sound track) to a complete collection of audiovisual works: an MPEG "digital item" can be considered a sub-set of what IETF refers to as Resources in URI/URN addressing schemes, and what has been elsewhere defined as a Digital Object "a data structure whose principal components are digital material, or data, plus a unique identifier for this material" [6]. However, for the purposes of digital rights management, it is essential to consider not only digital entities, but all entities, since digital entities may be derived from or related to non-digital entities that determine the rights associated with them. (DRM is digital management of [all] rights; not management of digital rights.)

In fact, in most cases when an intellectual property entity is identified, the entity is not tangible, or even digital, but an abstraction. Clearly, this is the case when identifying abstractions such as the underlying work "Robinson Crusoe" that has many different manifestations as book editions, or "The Eroica Symphony" in many recordings, scores, and performances. Not as readily appreciated is that apparently "tangible" entities are also abstractions: e.g., the ISBN identifies not the copy of the book that you have in your hand, but the class of all such copies, an abstraction. Abstractions can only be perceived through their manifestations (an individual book; an individual recording), which is why we often confuse the two. Abstractions require an ontology to define them and understand their relationship to their manifestations.

More than one ontology can provide tools for dealing with any set of entities, but we need to be careful not to mix definitions from different ontologies without careful mapping: every schema has its own inherent contextual model and its elements are defined in those terms. For example, there is a fundamental difference in the way in which the library-derived FRBR model [7] defines the term "expression" and the way the <indecs> Metadata framework [8] defines "expression", but this is not to say that only one is right: each recognizes the entity that the other is calling "an expression" and wishes that the other had called it "foo". Mapping elements is a completely different and much more complex process than declaring data elements. The indecs/DOI/ONIX group, for example, can map more or less any other schema successfully within their models, but we would not assume that any other schema would adopt the same definitions of (say) agent, resource or event. It has been well said that "there are more abstractions than are ever conceived of".

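The definition of metadata used in this section, a relationship that someone claims to exist between two identified entities, can be pictured as a small data structure. The following is only an illustrative sketch: the class name Claim, the relation names, and every identifier string (including the ISBN-like and Pantone-like values) are hypothetical and are not taken from indecs or any other schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One item of metadata: a relationship somebody claims between two entities."""
    claimant: str  # identifier of the party making the claim
    subject: str   # identifier of the first entity
    relation: str  # name of the claimed relationship
    obj: str       # identifier of the second entity


# Hypothetical identifiers, invented for illustration only.
claims = [
    # "I state that this book has a blue cover": the colour is itself identified.
    Claim("party:me", "isbn:0000000000", "has_cover_colour", "pantone:286C"),
    # The ISBN denotes a class of copies, which manifests an abstract work.
    Claim("party:me", "isbn:0000000000", "manifestation_of", "work:robinson-crusoe"),
    # The copy in my hand is one member of the class the ISBN identifies.
    Claim("party:me", "item:my-shelf-copy", "instance_of", "isbn:0000000000"),
]


def entities(all_claims):
    """Anything mentioned by identifier in a claim is, by that fact, an entity."""
    found = set()
    for c in all_claims:
        found.update((c.claimant, c.subject, c.obj))
    return found


def related(all_claims, subject, relation):
    """Return the identifiers that the subject is claimed to stand in `relation` to."""
    return [c.obj for c in all_claims if c.subject == subject and c.relation == relation]
```

In this sketch the book, the colour, the work, and the claimant are all entities of equal standing; nothing is known to the program about any of them beyond what the claims state.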
Identity and Sameness

Currently we use the term "copy" too loosely for automation. We know what is meant by "it's the same, but it has a different identifier" or "I will give the copy a different identifier (i.e., because it is my copy and hence has different attributes (ownership) than your copy)". But all copies are "identical" by definition, aren't they? This leads to logical contradictions when applied to automated systems. A fundamental purpose of identifiers is to define when two things are "the same" and hence denoted by the same identifier: true copies. The intuitive meaning of "the same" needs some logical analysis if it is to be applied consistently for automation.

The word 'same' is used sometimes to indicate similarity (qualitative sameness), as in 'Alice is the same age as Bob, and the same height as last year', sometimes to indicate that what is named twice should be counted once (numerical sameness), as in 'The morning star and the evening star are the same planet'. The word 'identical' can also have the former sense (identical twins, identical dresses) as well as the latter; hence philosophers are liable to discuss both kinds of sameness under the label 'identity'. Qualitative sameness is a comparison of metadata: entity A and B share a relationship to entity C. Numerical sameness is a simple logical relation through comparison of identifiers, in which each thing stands only to itself. "Although everything is what it is and not anything else, philosophers try to formulate more precisely the criteria by means of which we may be sure that one and the same thing is cognised under two different descriptions or at two distinct times" [9].

Numerical sameness leads to a trap for the unwary: if we say, "Two entities are the same if they have the same identifier," we seem to create a puzzle: how can they be two if they are the same? If identity is a relation it must hold either between two distinct things or between a thing and itself.
To say that A is the same as B, when A and B are distinct, is bound to be false; but to say that A is the same as A is to utter a tautology. "Roughly speaking, to say of two things that they are identical is nonsense, and to say of one thing that it is identical with itself is to say nothing at all." [10]. Different solutions have been found by different philosophers for this "paradox of identity". This may seem like remote philosophising, but in fact lies at the heart of practical implementations of "copies". The crux of the problem is that in determining whether A is the same as B, we find that ultimately nothing is the same as something else; however, it makes sense to consider that A is the same as B for a defined purpose (i.e., in a defined context). A photocopy of this article is not the same as the original in some ways (it is printed on different paper stock, it is located in a different part of space, etc.); but it might be considered the same—a copy—for the purposes of intellectual property (it retains the typographical layout and semantic sense). Here, the attribute "paper stock" is irrelevant, the attribute "manifestation of the defined work X" is relevant, for the purpose of DRM. Whilst this seems almost trivial in a physical environment, where the purpose and context are intuitively understood even if not stated, in a fully automated digital environment the attributes and context are less intuitive. Two photocopies that are "the same" for the purposes of intellectual property may be different for another purpose (e.g., for the purpose of forensic examination, if one has my fingerprints on it and the other does not). It follows that the attributes one assigns to an entity will depend upon the purpose or application that is expected to be executed with that entity (the Digital Object Identifier (DOI) defines a set of these attributes as an "Application Profile"). 
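The photocopy example can be restated as a short sketch in which "the same" is always evaluated against a named purpose. Everything here is an assumption made for illustration: the function names, the attribute names, and the two profiles are hypothetical, and the profiles are not actual DOI Application Profiles.

```python
def numerically_same(a, b):
    """Numerical sameness: one thing named twice, decided by comparing identifiers."""
    return a["id"] == b["id"]


def same_for(purpose, a, b, profiles):
    """Qualitative sameness for a purpose: both entities must state, and agree on,
    every attribute the profile for that purpose declares relevant.
    Attributes outside the profile are ignored."""
    relevant = profiles[purpose]
    return all(
        k in a["attrs"] and k in b["attrs"] and a["attrs"][k] == b["attrs"][k]
        for k in relevant
    )


# Hypothetical profiles: which attributes matter for which purpose.
PROFILES = {
    "intellectual_property": ["work", "layout"],
    "forensics": ["work", "layout", "paper_stock", "fingerprints"],
}

original = {
    "id": "item:1",
    "attrs": {"work": "this-article", "layout": "as-published",
              "paper_stock": "offset", "fingerprints": "author"},
}
photocopy = {
    "id": "item:2",
    "attrs": {"work": "this-article", "layout": "as-published",
              "paper_stock": "copier", "fingerprints": "none"},
}

print(numerically_same(original, photocopy))                              # two distinct items
print(same_for("intellectual_property", original, photocopy, PROFILES))  # a copy for IP purposes
print(same_for("forensics", original, photocopy, PROFILES))              # not the same for forensics
```

The design choice worth noting is that there is no purpose-free comparison function at all: a caller cannot ask whether two entities are the same without naming the profile under which the question is asked.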
This is why it is difficult to translate intuitive concepts from the physical world into the digital; a definition of "to copy" in the digital environment makes no sense without a context. In recent MPEG-21 discussions, some technologists argued that there can be "no such thing as a digital copy"—A and B must differ because of the sequence in which their data representations are laid down on a hard disk, for example. This is true, yet it clearly is nonsense to say that "the action of copying is impossible in the digital domain": this would undermine copyright law as rampant copying is patently occurring in practice. Hard disk sequencing is an irrelevant attribute for the purpose of intellectual property (IP) law—though case law in this area is sparse—and similarly, in more traditional IP interests, photocopier technologists are not ideal intellectual property lawyers. Any replica (copy) is a derivation (a "copy but with some changes") when examined sufficiently closely. So it is meaningless to ask "Are A and B the same thing?" and only meaningful to ask "Are A and B the same thing for the purpose of....". For purposes of automation we do this by considering which attributes of A need to be retained in creating the replica B; some attributes are ignored, considered irrelevant for some defined purpose. A description is a set of properties that apply to a certain object: two incomplete descriptions denote the same object if they have an identifying property in common [11]; the descriptions are for a purpose, and the "identifying property" (or more likely set of properties) is the one by which we define that common purpose or context of the A and B comparison. When we make statements, we normally leave a great many attributes unstated because we assume general or specific knowledge on the part of our audience. 
However, when we come to fully automated DRM, which relies on exchange between computer systems, we cannot expect that any inferences from "common knowledge" will be applied. We need to consider an entity as no more than the sum of its stated attributes. I may say you can copy my CD and its entire contents and sell it in a jewel box: exactly what kind of jewel box, and what the printing on the CD and the inlay says is irrelevant to the copy. It is a replica if the stated attributes are the same at whatever level of granularity is explicit. It may even be a copy if it is not a CD, if the only stated attribute I have given is "this recording". DRM will rely on the same principle as any other computer system: computers are dumb, and if something is not specified it cannot be taken into account. These notions fit in with the digital library community experience over the years of the difference between dealing with the fixed and familiar physical world, versus the fluid and hidden digital world, which needs to be made more explicit. While you can use a car repair manual for a door stop, you know its intended use from looking at it. A digital object, a "bag of bits", on the other hand, offers no such clues and so has to be labeled 'repair manual' and has to carry its own operating instructions or some sort of known clue or pointer to its operating instructions. The same principle of considering a comparison relevant for some purpose applies to the use of metadata in automated applications: we must sort the metadata into sets (Application Profiles) that are relevant for the particular purpose of that application. As Karl Popper elucidated, there is no neutral purpose-free "tabula rasa", always a purpose that is inherent in a particular act of perception [12]. 
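The CD example can be sketched in the same spirit: a grant lists only the attributes that have been explicitly stated, and a candidate counts as a replica when it matches all of them, whatever else is true of it. The function name, attribute names, and identifier values below are hypothetical, and this is not a fragment of any real rights expression language.

```python
def is_replica(stated, candidate):
    """True if the candidate matches every attribute the grant explicitly states.
    Attributes the grant does not state are never consulted."""
    return all(candidate.get(k) == v for k, v in stated.items())


# Two hypothetical grants over the same recording, stated at different granularity.
grant_cd = {"recording": "rec:album-42", "carrier": "CD"}
grant_recording_only = {"recording": "rec:album-42"}

# Candidates carry extra attributes that neither grant mentions.
cd_copy = {"recording": "rec:album-42", "carrier": "CD",
           "jewel_box": "slimline", "inlay_text": "blank"}
audio_file = {"recording": "rec:album-42", "carrier": "file"}

print(is_replica(grant_cd, cd_copy))                 # jewel box and inlay are irrelevant
print(is_replica(grant_cd, audio_file))              # carrier was stated, and differs
print(is_replica(grant_recording_only, audio_file))  # only "this recording" was stated
```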
The recognition that all considerations of identity require recognition of context is fundamental to the context model underlying the indecs Data Dictionary, now the basis of the MPEG-21 Rights Data Dictionary, in which all things are ultimately part of events or situations, taking place in defined contexts.

Granularity

The paradox of identity is related to the concept of recognising granularity. Recognising sameness among a population, as we have seen, depends on choosing which attributes of a number of entities we consider relevant and which irrelevant, and on ordering the population into sets defined by the relevant attributes for the purpose in hand. Granularity refers to the level of content detail identified; and to this we must add again the qualifier "identified for a particular purpose". To take an example from text publishing, the ISBN [13] identifies the whole book; the BICI [14] identifies component parts of the book (e.g., chapters, sections, illustrations, tables). This may be enough for some uses but is clearly inadequate for others. If we are to be able to identify all rights owners in a particular piece of content, that may require a far finer degree of granularity of identification, to the level of the individual illustration or quotation from another source. Similarly, if information is to be traded with customers at a level of granularity finer than the "chapter" or the "article", then publishers may have compelling marketing reasons for being able properly to identify and to keep track of what is being traded. The level of granularity that may need to be identified becomes effectively arbitrary in a digital environment.
This might suggest a requirement for relational identification where (like the BICI) smaller fragments are identified by reference to the larger "whole" from which they come, although this "intelligence" would have some drawbacks, not least in terms of the size and structure of the codes, and a preferable route would be to express the relationship through readily accessible metadata. Considerations of granularity are fundamental to a logical analysis of DRM, and a key point is the purpose and context of the granularity choice.

Functional Granularity

The indecs Principle of Functional Granularity is that "it should be possible to identify an entity whenever it needs to be distinguished." When should an identifier be issued? In this deceptively simple question lies the most basic question of metadata: for which data is it meta-? Resources can be viewed in an infinite number of complex ways. The indecs metadata framework document, for example, has an identifier in the <indecs> domain: WP1a-006-2.0. But to what does this refer? Does it refer to the original Word document, or to a PDF version available on the Web site? Or does it refer to the underlying "abstract" content irrespective of delivery format? If it refers to the Web document, is this also adequate as a reference to local copies that have been downloaded onto other computers or servers? The document's parts may require identification at any level (for example, section 2.2, or Diagram 14). If you wish to make a precise reference to a sentence from another document, you will need a more precise locator, and its nature will depend on whether your reference is intended to allow automated linking. As the document has been through many stages of preparation, how many different versions need to be separately recorded? Each of these requires the exercise of functional granularity: the provision of a way (or ways) of identifying parts and versions whenever the practical need arises.
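One way to picture functional granularity is a registry that issues an identifier only when something needs to be distinguished, and that records the relationship to the larger whole as metadata rather than encoding it in the identifier string. The sketch below rests on assumptions throughout: the Registry class, its method names, the relation names, and the "id:N" format are all hypothetical and do not describe indecs, BICI, or DOI.

```python
import itertools


class Registry:
    """Mints opaque identifiers on demand and keeps relationships as metadata."""

    def __init__(self, prefix="id"):
        self._counter = itertools.count(1)
        self._prefix = prefix
        self.labels = {}     # identifier -> human-readable description
        self.relations = []  # (subject_id, relation, object_id) triples

    def identify(self, label, relation=None, target=None):
        """Issue a new identifier; optionally record how it relates to another entity."""
        new_id = f"{self._prefix}:{next(self._counter)}"
        self.labels[new_id] = label
        if relation is not None:
            self.relations.append((new_id, relation, target))
        return new_id

    def related_to(self, target, relation):
        """Identifiers recorded as standing in `relation` to `target`."""
        return [s for (s, r, o) in self.relations if r == relation and o == target]


reg = Registry()
work = reg.identify("framework document, abstract content")
pdf = reg.identify("PDF version on the web site", "manifestation_of", work)

# Nothing finer is identified until a practical need arises, for instance
# when somebody wants to cite a single diagram.
diagram = reg.identify("Diagram 14", "part_of", pdf)

print(diagram)                        # an opaque string with no built-in structure
print(reg.related_to(pdf, "part_of"))
```

The identifier issued for the diagram says nothing about which document it belongs to; that knowledge lives only in the recorded relation, which is the trade-off the text describes between "intelligent" codes and accessible metadata.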
The application of functional granularity depends on a huge range of factors, including the type of resource, its location in time and place, its precise composition and condition, the uses to which it is or may be put, its volatility, its process of creation, and the identity of the party identifying it. The implication of this is that a resource may have any number of identifiers. The same entity may be subjected to functional granularity across a range of views. The basic "elements" of a resource may be entirely different according to your purpose. Something may be analysed, for example, in terms of molecular entities (chemistry), particles such as electrons, quarks or superstrings (physics), spatial co-ordinates (geography), biological functions (biology, medicine), genres of expression (creations), price categories (commerce), and so on. In the digital environment, things can be relatively easily managed at extreme levels of granularity as minute as a single bit. Each of these processes will apply identifiers of different types at different levels of (functional) granularity in different "dimensions"; these may need to be reconciled to one another at a point of higher granularity. Functional granularity does not propose that every possible part and version is identified: only that the means exists to identify any possible part or version when the occasion arises. Identification is not the same as mark-up, though if a section is distinguishable by some mark-up coding it will be subsequently easier to specify it as separately identified.

Conflicting Views of Granularity: Difference within Sameness

What is "the same thing" ("a copy") for one user, purpose, or context will be "two different things" for another. The two users may have different purposes in mind when they ask "are X and Y the same?"; and as we have seen, this question is implicitly "are X and Y the same for the purpose of...?"
Failure to comprehend these different views (purposes) across a supply chain results in considerable friction. Some practical examples will illustrate this. For clarity, I refer in each case to two different users: the party who sees "the same thing" as X, and the party who sees "two different things" as Y. There has been much discussion (as yet not fully resolved) of this in the context of eBooks [15]: publisher X wishes to use one identifier (the ISBN) to refer to all technical formats of an eBook, since they are all "the same book" (the Open eBook Forum defines such a publication as "the digital content you read: a paperless version of a book, article or other document" [16]); yet supplier Y needs to distinguish different formats (a customer ordering one format wants that and no other). Some publishers have in fact suggested using the ISBN with some form of qualifier (or parameter) to do this; the International ISBN Agency prefers to recommend different ISBNs for each format [17]. These are the two general approaches to recognising difference within sameness, each of which may be valid in some circumstances: a "single identifier with qualifier" or "create new multiple fixed identifiers". The "single identifier with qualifier" approach is used in solving the "appropriate copy" problem in one application with DOIs [18]. The generalised case is that since an identifier is normally that of a class (an abstraction), it is assumed that each member of the class is equivalent; but in reality this may not be so in all contexts, and there are many instances when more than one legitimate copy is available, and some copies are not available, due to the context of the request.
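The generalised case just described, where one identifier denotes a class but the right member depends on who is asking, can be sketched as context-dependent resolution. This is a toy model under stated assumptions: the identifier, the URLs, the affiliation names, and the resolve function are all hypothetical, and the code does not reproduce the mechanism of any real resolver.

```python
# Hypothetical data: one identifier for the class, several member copies.
COPIES = {
    "doi:10.9999/example.1": [
        {"url": "https://publisher.example/article1", "available_to": {"public"}},
        {"url": "https://library-a.example/local/article1", "available_to": {"library-a"}},
        {"url": "https://aggregator.example/article1", "available_to": {"library-b"}},
    ]
}


def resolve(identifier, affiliation):
    """Return the locations appropriate to the requester's context.
    Copies specific to the requester's affiliation come first, then public ones."""
    copies = COPIES.get(identifier, [])
    own = [c["url"] for c in copies if affiliation in c["available_to"]]
    public = [c["url"] for c in copies if "public" in c["available_to"]]
    return own + [u for u in public if u not in own]


# The same identifier resolves differently for different requesters.
print(resolve("doi:10.9999/example.1", "library-a"))
print(resolve("doi:10.9999/example.1", "library-b"))
print(resolve("doi:10.9999/example.1", "unaffiliated"))
```

The identifier itself never changes; the qualification is supplied by the context of the request at resolution time.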
In the appropriate copy example, publisher X allocates one identifier to an article; library user Y finds that because of local loading, aggregator databases, paper copies or mirror copies, she needs to distinguish one copy from another; in each of these cases, the address to which the identifier given by X should appropriately resolve depends on the location or affiliation (in general, the context) of the user Y who is making the resolution request. To solve this problem, it makes sense to contextualise the use of the identifier by some tool such as OpenURL. A full analysis of any transaction, in the further work done using indecs for MPEG [19], shows that ultimately all transactions are contextual and can be expressed as an event or a situation; and a full analysis of the use of identifiers will show that ultimately, of course, they are all used in some context.

The "create new multiple fixed identifiers" approach is shown in the emergence of the ISTC (International Standard Text Code) [20]. New identifiers may be needed and require the creation of a new namespace if the namespace currently being used cannot satisfactorily include a new type of entity without disrupting the existing business. A good example is the identification of textual abstractions and the identification of their manifestations (books): ISBNs are in widespread use for identifying (separately) each different edition of, e.g., Cervantes' Don Quixote. These are different (if customer Y orders the leather bound limited edition with illustrations by Dali, he is unlikely to be happy to receive the $1.50 World's Classics paperback edition). Yet authors' agencies, rights organisations, and librarians X may all be interested in the general work and not concerned with specific editions for some purposes (a library reader wishing to find a copy of the work, for example).
This led to the development (with the full collaboration of the ISBN agency) of a new identifier, the ISTC, which can be used to identify this entity (the textual abstraction). This example also usefully shows that it is not always the smaller granularity entities that are the driver for the creation of new identifiers: in this case, a new identifier is required that may be related to "supersets" of ISBNs. These two ways of dealing with "difference within sameness" are not always clear black-and-white alternatives, and once again functional granularity will be the arbiter of which to use in which cases: is there a need to agree on a separate identification scheme (a new namespace), or can we live with the difference being defined by qualification after the identification step at a local level, which is not likely to be widely used across a supply chain? If the entities being finely differentiated are the objects of commercial transactions across multiple partners, or are likely to be stored and used in communication to identify precisely the differentiated entity (rather than the unqualified entity), then I believe the separate new identifiers approach is likely to be optimal in the long term. In each solution, the same logic applies: whether we refer to them as "a qualified identifier with two different qualifiers" or "two identifiers that have a relation" is semantics: "ISBN 1234" and "ISBN1234-as qualified-Z" are separate strings. They denote different entities. They must do. Otherwise there wouldn't be a need for two strings.
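The point that a qualified identifier is, in practice, just another string can be made concrete with a small sketch of a downstream party's store. The "#" separator, the function names, the relation name, and the "isbn:1234" values are assumptions chosen for illustration; no real ISBN qualifier syntax is implied.

```python
def parse(identifier, sep="#"):
    """Split a qualified identifier into (base, qualifier); qualifier is None if absent."""
    base, _, qualifier = identifier.partition(sep)
    return base, (qualifier or None)


catalogue = {}  # every distinguishable string becomes its own key
relations = []  # (identifier, relation, other_identifier) triples


def register(identifier, description):
    """Store the identifier as a fixed key and record its relation to the base, if any."""
    catalogue[identifier] = description
    base, qualifier = parse(identifier)
    if qualifier is not None:
        relations.append((identifier, "qualified_form_of", base))


register("isbn:1234", "the eBook, irrespective of format")
register("isbn:1234#pdf", "the eBook in PDF format")
register("isbn:1234#html", "the eBook in HTML format")

# Three strings, three entries, three descriptions: the store cannot tell
# "one identifier with qualifiers" apart from "three related identifiers".
print(len(catalogue))
print(relations)
```

Once the qualified strings have to be stored and exchanged, they behave exactly like separately issued identifiers with a recorded relationship between them.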
It may well be that party X only needs the unqualified identifier, but if party Y has a need to deal with all these different transformations generated by X at a business level and needs to know the various sub "qualified" identifiers, then Y is going to end up having to store the [qualified] identifiers and treat them as static separate strings, i.e., separate identifiers, probably in a separate database because the particular numbering system X has used isn't sufficiently granular for Y's needs. If entities need at some point to be differentiated for long-term purposes (which typically they do in any DRM chain for, e.g., audit, etc.), then inescapably someone somewhere will be managing multiple identifiers [strings] with multiple metadata [as there are multiple entities] that have a defined relationship. This need not be a concern if that management is in an isolated internal database, but increasingly such data is becoming exposed to interoperability, the heart of DRM. Wherever this happens, this is easier to do by treating all differentiable entities as having fixed identifiers (persistent opaque strings with associated data) rather than some as derived by qualification. This allows a common mechanism for persistence, registration, and interoperability.

There are many related identifier labels (namespaces) and no one can deal with all possible needs, which is why ISTC had to be added on top of ISBN rather than overloading one system and asking it to do two fundamentally opposing jobs. An identifier system or framework that can contain all these, such as DOI, is making more and more sense.

Author's note

An expanded version of this article is to appear as part of the forthcoming chapter "DRM Technologies: Identification and Metadata" of the volume Digital Rights Management: Technical, Economical, Juridical, and Political Aspects (ed. Eberhard Becker, Dirk Gunnewig, Willms Buhse, Niels Rump), in the series Lecture Notes in Computer Science (Springer Verlag, 2003).

References

[1] Rust, Godfrey. "Metadata: The Right Approach. An Integrated Model for Descriptive and Rights Metadata in E-commerce". D-Lib Magazine, Volume 4, Number 7/8, July/August 1998, <doi:10.1045/july98-rust>.
[2] Biddle, P; England, P; Peinado, M; Willman, B. "The Darknet and the Future of Content Distribution". DRM 2002, <http://crypto.stanford.edu/DRM2002/darknet5.doc>.
[3] <indecs> Web Site, <http://www.indecs.org>.
[4] Sowa, J F. "Knowledge Representation: Logical, Philosophical and Computational Foundations". Brooks/Cole, 2000.
[5] SC29/WG11 N 4333: MPEG21 Technical Report TR 21000-1:2001 (2001-07-20), Information technology - Multimedia framework (MPEG-21) - Part 1: Vision, Technologies and Strategy, <http://www.nlc-bnc.ca/iso/tc46sc9/mpeg21/wg11n4333.pdf>.
[6] Kahn, Robert E and Wilensky, R. "A Framework for Distributed Digital Object Services", 1995, <http://www.cnri.reston.va.us/home/cstr/arch/k-w.html>.
[7] IFLA: "IFLA Study Group on the Functional Requirements for Bibliographic Records - Functional Requirements for Bibliographic Records", 1998, <http://www.ifla.org/VII/s13/frbr/frbr.pdf>.
[8] Rust, Godfrey, and Bide, Mark. "The <indecs> Metadata Framework: Principles, model and data dictionary", 2000, <http://www.indecs.org/pdf/framework.pdf>.
[9] Kemerling, G. "A Dictionary of Philosophical Terms and Names"; www.philosophypages.com; February 2002, <http://www.philosophypages.com/dy>.
[10] Wittgenstein, Ludwig. "Tractatus Logico-Philosophicus, 5.5303", The Internet Encyclopedia of Philosophy, 2001, <http://www.utm.edu/research/iep/w/wittgens.htm>.
[11] Guarino, Nicola and Welty, Christopher. "Identity, Unity, and Individuality: Towards a Formal Toolkit for Ontological Analysis"; Proceedings of ECAI-2000: The European Conference on Artificial Intelligence, IOS Press; Amsterdam, August 2000.
[12] Popper, Karl R. Objective Knowledge: An Evolutionary Approach, Oxford University Press, 1972.
[13] International Standard Book Numbering - ISBN - ISO 2108:1992, <http://www.nlc-bnc.ca/iso/tc46sc9/standard/2108e.htm>.
[14] NISO Draft Standard - Book Item and Component Identifier, NISO Press, Aug 2000, <http://www.niso.org/pdfs/BICI-DS.pdf>.
[15] Anderson Consulting: "A Bright Future for eBook Publishing: Facilitated Open Standards", AAP Annual Meeting, 22 March 2000, <http://www.publishers.org/digital/dec2000anderson.ppt>.
[16] Open eBook Forum: "Open eBook Publication Structure Specification FAQ", August 2002, <http://www.openebook.org/oebps/oebps_faq.htm>.
[17] ISO: "Frequently Asked Questions about changes to the ISBN", November 2002, <http://www.nlc-bnc.ca/iso/tc46sc9/isbn.htm>.
[18] Beit-Arie, Oren et al. "Linking to the Appropriate Copy: Report of a DOI-Based Prototype", D-Lib Magazine, Volume 7, Number 9, September 2001, <doi:10.1045/september2001-caplan>.
[19] ISO/IEC JTC1/SC29/WG11 - Coding of Moving Pictures and Audio, MPEG-21 Part 6 - Rights Data Dictionary, <http://mpeg.telecomitalialab.com/standards/mpeg-21/mpeg-21.htm>.
[20] International Standard Text Code - ISTC: Draft ISO 21047, <http://www.nlc-bnc.ca/iso/tc46sc9/wg3.htm>.

(On September 11, 2003, minor corrections were made to the HTML code and to two of the references in this article.)

Copyright © Norman Paskin

DOI: 10.1045/january2003-paskin