Safeguarding Digital Library Contents and Users
A Note on Universal Unique Identifiers
Henry M. Gladney
IBM Almaden Research Center
San Jose, California 95120-6099
To protect intellectual property, we must unambiguously identify each item needing protection.
This seemingly simple requirement has attracted much attention recently, which surprises many software engineers because to them the essentials have long seemed obvious. One may question whether they have failed to understand complexities and issues that fuel the discussion. We argue that what has seemed simple is in fact truly simple.
Object identifiers figure in kernel software, which takes several years to write, test, and refine. This note communicates that the topic can be separated into two portions: one which figures in discussions that need to continue, and a second portion which is sufficiently understood for building durable kernel software.
If something important is written about a digital object or a real-world thing, readers should reliably know which thing is intended. Unambiguous, enduring object identification is essential to safeguarding intellectual property.
Natural language, like English or French, distinguishes "identifier" from "name" -- although sometimes the former is a synonym for the latter. In careful computer science usage, each valid identifier denotes exactly one object within some context. For example, the string "ISSN 1082-9873" unambiguously denotes D-Lib Magazine, given that in the context of a scholarly article we conventionally avoid using strings of this syntax for anything other than International Standard Serial Numbers. In contrast, a name might denote more than one entity. (How many New Yorkers are called "Jane Smith"?) In addition, names are intended to be convenient for human use, a feature often not shared by identifiers, which we want primarily for use in machines.
Note that "context" in the prior paragraph is intended to be a precise, technical term; the Random House Dictionary (1996) defines context as the set of circumstances or facts that surround a particular word. URIs, URNs, and URLs take Internet locations as their contexts. The Oxford English Dictionary is another context, as is a ghetto argot -- where the "argot" could be pidgin, Yiddish, or C++. For digital systems, contexts are typically particular directories, such as those within name servers.
Many software engineers are surprised at how much object identifiers are discussed; for example, they were prominent in a recent workshop on rights management [Klavans]. We are surprised not only because we have long understood what is needed and how to accomplish it, but also because workable procedures exist in well-known common practice (for drivers' licenses, bank account numbers, and so on). Accordingly, the current note captures and condenses essentials written in many places.
What's Seemingly Simple is Truly Simple
In work directed at Uniform Resource Names (URNs), Sollins and Masinter [Sollins and Masinter] detail Universal Unique Identifier (UUID) requirements suitable for data processing and communications. As summarized in a story for D-Lib Magazine, this means: global scope and uniqueness, persistence, scalability and extensibility, independence for name-issuing authorities, and as much legacy compatibility as the other requirements allow [URN implementors].
The literature and practical experience teach that:
- We should think of data object identifiers as we do about material object identifiers. In fact, it is helpful to think of them as within the same space. For example, a real estate contract is the same sort of thing as a license to perform some popular song. This example suggests four identifiers -- for a real estate parcel, for a song, and for each of two contracts.
- To avoid ambiguity, each UUID must be unique at any moment and for all time. It should continue to be tabulated indefinitely in some context, e.g., a catalog, to which it points and which points to it, and which either contains or points to the most important descriptors of the item.
- A global UUID system should permit different people to choose identifiers without communicating whenever a new one is wanted.
There are three ways to accomplish this:
- have a central ID-issuing agent;
- have several ID-issuing agencies which among themselves have negotiated that collisions do not occur;
- choose each ID independently in a way making accidental collisions extremely unlikely, e.g., combine a time stamp precise to 1 second, and latitude/longitude precise to 10 meters.
The third alternative is appealing, but impractical today; it requires reliable clocks and global position sensors embedded in computers, which is likely within a decade. A central ID-issuing agency is impractical both in theory and because society already chose option (2) as usual practice years ago. Whoever needs a UUID gets it from one of many ID-issuing agencies. Each of these ensures that each identifier given out is different from that of any other agency by prepending it with an agency identifier which has been (possibly implicitly) negotiated to be different from that of any other agency.
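Option (2) can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the original: the class name `Agency`, the agency identifiers 17 and 42, and the use of a simple per-agency counter are all hypothetical. It shows two agencies issuing identifiers without communicating, with collisions excluded solely by the previously negotiated agency prefix.

```python
from itertools import count
from typing import Tuple


class Agency:
    """Hypothetical ID-issuing agency: hands out (agency_id, local_id) pairs."""

    def __init__(self, agency_id: int) -> None:
        # Assumption: the agency ID was negotiated in advance so that
        # no two agencies share one.
        self.agency_id = agency_id
        self._next_local = count(1)  # local sequence, private to this agency

    def issue(self) -> Tuple[int, int]:
        """Return a new identifier, unique as long as agency IDs are unique."""
        return (self.agency_id, next(self._next_local))


# Two agencies issue identifiers independently; nothing is exchanged per ID.
agency_a = Agency(17)
agency_b = Agency(42)
issued = [agency_a.issue() for _ in range(3)] + [agency_b.issue() for _ in range(3)]
all_distinct = len(set(issued)) == len(issued)
```

Both agencies use the local number 1, yet the identifiers differ because the prefixes differ. The only coordination needed happens once, when agency identifiers are assigned.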
That this is common practice is illustrated by familiar examples: ISBN 0-12345-678-X; U.S. Patent Number 5,123,456; SSN 123-45-6789; California Driver's License Q123 4567. A machine or a person should be able to find whatever any identifier identifies, i.e., there needs to be a system, machine, person, or organization to resolve each kind of identifier into an access path to an object. It is sometimes also useful and possible to determine the identifier from (a copy of) the thing identified.
Identifiers can be represented by numbers, character strings, or bit strings. These are equivalent if mapping is done in well-known ways. The only question is how long such strings should be. The practical choices are multiples of 8 bits; it helps to use only multiples of 32 bits, because many programming language compilers use 32 bits as a word length and 64 bits as a double word length. 32 bits is too short, as it allows only about 4.3 billion different strings. 64 bits allows more than 10^19 different strings, which is probably enough. (Now that the year-2000 problem is so well known, it is unlikely that people will quarrel with allowing this much space wherever a digital system needs to store or convey a UUID.) We must choose how many bits are the agency ID and how many remain as the agency's choice; 32 bits and 32 bits is probably workable. However, the current way of choosing Internet IDs shows that variable schemes are feasible and have some advantage.
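The arithmetic behind these lengths, and the 32/32 split between agency ID and agency-chosen bits, can be checked with a short Python sketch. The function names `pack_uuid` and `unpack_uuid` and the placement of the agency ID in the high-order bits are assumptions made for illustration, not part of any scheme described here.

```python
AGENCY_BITS = 32  # assumed split: high 32 bits identify the agency
LOCAL_BITS = 32   # low 32 bits are the agency's own choice
LOCAL_MASK = (1 << LOCAL_BITS) - 1


def pack_uuid(agency_id: int, local_id: int) -> int:
    """Combine an agency ID and a local ID into one 64-bit integer."""
    if not 0 <= agency_id < (1 << AGENCY_BITS):
        raise ValueError("agency_id does not fit in 32 bits")
    if not 0 <= local_id < (1 << LOCAL_BITS):
        raise ValueError("local_id does not fit in 32 bits")
    return (agency_id << LOCAL_BITS) | local_id


def unpack_uuid(uuid64: int) -> tuple:
    """Split a 64-bit identifier back into (agency_id, local_id)."""
    return (uuid64 >> LOCAL_BITS, uuid64 & LOCAL_MASK)


print(2 ** 32)  # 4294967296, about 4.3 billion
print(2 ** 64)  # 18446744073709551616, about 1.8 * 10**19
```

With this split, each of roughly 4.3 billion agencies could issue roughly 4.3 billion identifiers before exhausting its share of the space.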
A UUID must be transmitted among systems as an untranslated bit-stream. Its printed form will depend on the character-set conversion chosen, i.e., will often be different in different environments.
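The distinction between the transmitted bits and the printed form can be made concrete. The following Python sketch assumes a 64-bit identifier sent as 8 bytes in big-endian order; the byte order and the sample value are assumptions chosen for illustration. Two printed forms are derived from the same bytes, and the receiver reconstructs the identifier from the bytes alone.

```python
import struct

uuid64 = (17 << 32) | 5  # hypothetical identifier: agency 17, local number 5

# Wire form: 8 bytes, big-endian, never passed through character conversion.
wire = struct.pack(">Q", uuid64)

# Printed forms differ by convention, yet both denote the same identifier.
as_hex = wire.hex()
as_decimal = str(uuid64)

# The receiver recovers the identifier from the bytes, not from a printed form.
(received,) = struct.unpack(">Q", wire)
```

Here `as_hex` and `as_decimal` look nothing alike, which is harmless as long as systems exchange `wire` and treat printed forms as local presentation.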
Some people want to encode mnemonics within their (favorite style of) UUIDs. There is little reason for anyone else to object. Those who want encoding should recognize that it usually reduces the number of accessible IDs within their agency's ID space. Their favorite encoding is not likely to be as compellingly valuable in 100 years as it seems today. ID-issuing agencies might choose to map human-convenient identifiers into machine-convenient ones to exploit the advantages of both forms.
Nothing more needs to be said or written, apart from two extraneous comments.
We've dealt with identifiers to be used by machines -- identifiers which only incidentally are seen or transcribed by human beings. Digital machinery usually includes hidden redundant bits to prevent errors. When machine identifiers are externalized for human transcription, they should include error-detecting or, better, error-correcting information. This could be check digits appended to each external representation; a standard for such check digits is needed. Last week, Norman Paskin pointed out that such a standard exists (ISO 7064:1983); however neither he nor I had a copy available to ascertain whether it is used in such applications as SICI (Serial Item and Contribution Identifier -- ANSI Z39.56 Version 2 (1996)).
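As an illustration of what check digits buy, the Python sketch below appends two decimal check digits computed modulo 97, in the manner commonly described for the ISO 7064 MOD 97-10 scheme. The text of the standard was not consulted for this sketch, so the correspondence should be treated as an assumption; the function names and the sample identifier are hypothetical. The scheme detects transcription errors but does not correct them.

```python
def append_check_digits(digits: str) -> str:
    """Append two check digits so that the whole number is 1 modulo 97."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a string of decimal digits")
    remainder = int(digits + "00") % 97
    return digits + f"{98 - remainder:02d}"


def is_valid(external_form: str) -> bool:
    """True if the transcribed string passes the modulo-97 check."""
    if not (external_form.isascii() and external_form.isdigit()):
        return False
    return int(external_form) % 97 == 1


# Protect a decimal rendering of an identifier for human transcription.
protected = append_check_digits("73014444037")

# Simulate a clerk mistyping the fourth digit.
wrong_digit = "0" if protected[3] != "0" else "1"
mistyped = protected[:3] + wrong_digit + protected[4:]
```

Because 97 is prime, any single wrong digit changes the remainder, so `is_valid(mistyped)` is false while `is_valid(protected)` is true.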
In application systems, we should be careful not to allow presentation of an identifier to support retrieval of too much information. This is illustrated by patient identifiers in health care systems, for which the hazards of a "longitudinal patient record" are eloquently argued by [Wägemann].
Some Questions Reconsidered
Lynch [Lynch] collects questions suitable for any specific identifier system. We sketch a beginning of answers, as concisely as possible.
- What is the scope of the identifier system -- what kinds of objects can be identified with it? Who is permitted to assign identifiers, and how are these organizations identified, registered, and validated?
There need be no restriction about the types of objects identified as long as types can be ascertained when the objects, or meta-data which can be found from the identifiers, are inspected. An ID-issuing agency makes itself and its identifier-resolving service known by publication. The public validates it by discovering whether it claims to meet the requirements set forth by Sollins and Masinter [Sollins and Masinter], and whether, in practice, the claims are met.
- What are the rules for assigning new identifiers; when are two instances of a work the same (that is, assigned the same identifier) within the system, and under what criteria are they considered distinct (that is, assigned different identifiers)? What communities benefit from distinctions that are implied by the assignment of identifiers?
These questions have little to do with identifiers, being instead directed at the existence and meaning of the objects identified. The considerations are much the same for information objects as they are for material objects. (The material within a human being changes more or less continuously, without our thinking of the person as a different being. The parts of an automobile are replaced when needed; some subassemblies get serial numbers for separate identification [engines, audio subsystems], others do not.) The circumstance under which an object should be relabeled as a distinct object, or as a version, is a human decision based on what is intended. These changes may be subject to rules, legal or otherwise, but neither the rules nor the practice of identifying derivatives as new objects has much to do with the identifier system.
- How does one determine the identifier for the work, and can one derive it from the work itself, or does one need to consult some possibly proprietary database maintained by a third party? To what class of objects are the identifiers applicable? Within this class of objects, is there an automatic method of constructing identifiers under the identifier system, or does someone have to make a specific decision to assign an identifier to an object? If so, who makes this decision, and why? Note that, if the identifier cannot be derived from the identified work, it is unsuitable for use as a primary identifier within any system of open citation. The act of reference should not rely upon proprietary databases or services.
Again, these questions have next to nothing to do with identifiers as needed to ensure unambiguous data processing and communication, or the design of computing system kernels, or intellectual property protection. Instead they have to do with means for finding objects from their identifiers. The notion that an identifier should be derivable from the object identified is controversial. For "large" objects that easily become detached from reliable catalogs, it is convenient and common to embed a representation of the identifier within the object (ISBNs, automobile serial numbers), but doing this is neither possible nor desirable for all objects. (For instance, it is done to race horses, but not to all horses.)
- How is the identifier resolved -- that is, how does one go from the identifier to the identified work, or to other identifiers or metadata to permit the instances of the work to be located and accessed? Again, what is the role of possibly proprietary third party databases in resolving the identifier? Do the operator or operators of these resolution services have monopoly control over resolution? What are the barriers to entry for new resolution services? What are the policies of the resolution services in areas such as user privacy and statistics gathering?
Each identifier must have at least one method of finding its referent (otherwise it is called a "dangling reference"). Beyond that, the first question leads to further questions about system performance, which is often aided by redundant access paths; these introduce well-known integrity problems with well-known solutions. The other questions implicitly object not to any scheme for identifiers, but instead to processing systems which require identifiers drawn from some particular exclusive or potentially expensive ID-issuing organization.
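A resolution service of the kind discussed here can be sketched as a directory from identifiers to access paths. In the Python sketch below, the class name `Resolver`, the access-path strings, and the choice of `LookupError` for a dangling reference are all hypothetical; it shows only that redundant paths are a property of the resolver, not of the identifier.

```python
from typing import Dict, List


class Resolver:
    """Hypothetical directory mapping identifiers to one or more access paths."""

    def __init__(self) -> None:
        self._paths: Dict[int, List[str]] = {}

    def register(self, uuid64: int, access_path: str) -> None:
        """Record an access path; several per identifier give redundancy."""
        self._paths.setdefault(uuid64, []).append(access_path)

    def resolve(self, uuid64: int) -> List[str]:
        """Return all known access paths, or fail on a dangling reference."""
        paths = self._paths.get(uuid64)
        if not paths:
            raise LookupError(f"dangling reference: {uuid64}")
        return list(paths)


resolver = Resolver()
resolver.register(73014444037, "library-a/stacks/item-5")    # made-up path
resolver.register(73014444037, "mirror-b/archive/item-5")    # redundant copy
paths = resolver.resolve(73014444037)
```

Keeping the two registered paths consistent when the object moves is the integrity problem mentioned above. Nothing in the identifier itself ties it to one resolver, so a second, independent resolver could register the same identifiers.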
- How persistent is the identifier across time? Can one still resolve it after the work ceases to be commercially marketed? Identifiers that rely on the state of the commercial marketplace are very treacherous for constructing citations or other references that can serve the long-term social or scholarly record.
Here, Lynch [Lynch] exposes the tip of a very large iceberg -- the longevity of intellectual content and responsibility for its maintenance, an issue needing its own careful discussion. However, as long as the identifiers used meet the criteria put forward by Sollins and Masinter [Sollins and Masinter], the issues are not ones of identifier systems or even of the implementing technology. (To appreciate the challenges of 100-year archiving, see Garrett et al. [Garrett et al.].)
An Open Question -- Identifier Lengths
My IBM colleague Jeff Lotspiech has raised a question that is not yet answered but deserves attention before anyone builds more identifier-using kernel software.
Above we have assumed that fixed-length identifiers would be adequate; Lotspiech recommends working with variable, unbounded-length identifiers even in computing system kernels. Bitter experience suggests that even careful choice of a length bound risks unanticipated and unwelcome restrictions years after the choice.
A hurdle is that unbounded variable length causes space and processing overhead which might be intolerable in kernel software. Before we build a proposed access control prototype, we will carefully consider design for variable length identifiers.
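To make the overhead concrete, here is one possible encoding of unbounded-length identifiers, sketched in Python. It is not a design proposed in this note: the length prefix in base-128 "varint" form, and the function names, are assumptions chosen only to show where the extra space and processing arise.

```python
def encode_varlen(identifier: bytes) -> bytes:
    """Prefix the identifier with its length as a base-128 varint."""
    n = len(identifier)
    prefix = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            prefix.append(low7 | 0x80)  # high bit set: more length bytes follow
        else:
            prefix.append(low7)
            break
    return bytes(prefix) + identifier


def decode_varlen(buffer: bytes, offset: int = 0):
    """Return (identifier, next_offset) for the record starting at offset."""
    n = 0
    shift = 0
    while True:
        byte = buffer[offset]
        offset += 1
        n |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    end = offset + n
    if end > len(buffer):
        raise ValueError("truncated identifier")
    return buffer[offset:end], end


# An 8-byte identifier and a 200-byte identifier share one stream.
short_id = bytes(8)
long_id = bytes(range(200))
stream = encode_varlen(short_id) + encode_varlen(long_id)
first, position = decode_varlen(stream)
second, position = decode_varlen(stream, position)
```

An 8-byte identifier costs one extra byte here and a 200-byte identifier two, but every reader must parse the prefix before it can locate the next field, whereas a fixed-length field can be addressed by offset alone.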
Our objective has been to partition the topic of digital object identifiers so that a fundamental portion can be fixed in the very near future sufficiently for getting on with the business of constructing and testing core software. We believe that identifiers and systems that handle them are sufficiently understood for us to develop the kernels of digital computing and communication systems with low risk of future breakage.
The public issues alluded to above and carefully described by Paskin [Paskin] have less to do with identifiers than with aspects of using them -- efficient and fast object retrieval, concerns about agencies which monopolize the generation of widely needed identifiers, archiving of the intellectual heritage and critical operational information, and the semantics of how objects are related to each other. To some extent, such issues have been mislabeled as issues of identifiers.
Acknowledgements
Critical reading by William Y. Arms, Sebastian Gladney, Jeff Lotspiech, and Norman Paskin helped create an exposition which I hope is concise, complete, and intelligible.
[URN implementors] Uniform Resource Names, D-Lib Magazine, (February 1996).
[Garrett et al.] J. Garrett, D. Waters, P.Q.C. Andre, H. Besser, N. Elkington, H.M. Gladney, M. Hedstrom, P.B. Hirtle, K. Hunter, R. Kelly, D. Kresh, M.E. Lesk, M.B. Levering, W. Lougee, C. Lynch, C. Mandel, S.B. Mooney, A. Okerson, J.G. Neal, S. Rosenblatt, and S. Weibel, Preserving Digital Information: Report of the Task Force on Archiving of Digital Information for the Commission on Preservation and Access and the Research Libraries Group, (1 May 1996).
[Klavans] Judith Klavans et al., Workshop on Rights Management: Workshop Summary Jointly sponsored by the National Science Foundation and the Digital Library Federation.
[Lynch] C. Lynch, Identifiers and Their Role in Networked Information Applications, (1998).
[Paskin] Norman Paskin, Digital Information Objects and the STM Publisher, STM Annual Report, (1997) summarizes many object identifier discussions. His earlier Information Identifiers, Learned Publishing 10(2), 135-156, (April 1997) links much of the pertinent literature.
[Sollins and Masinter] K. Sollins and L. Masinter, Functional Requirements for Uniform Resource Names, Internet Engineering Task Force RFC 1737, (December 1994).
[Wägemann] C.P. Wägemann, Patient Identifiers: Religious Dogma, Passion, and Misconception, Toward an Electronic Patient Record, Conference Proceedings v.3, 53-5, (1997).
Copyright and Disclaimer Notice
© Copyright IBM Corp. 1998. All Rights Reserved. Copies may be printed and distributed, provided that no changes are made to the content, that the entire document including the attribution header and this copyright notice is printed or distributed, and that this is done free of charge. We have written for the usual reasons of scholarly communication. Wherever this report alludes to technologies in early phases of definition and development, the information it provides is strictly on an as-is basis, without express or implied warranty of any kind, and without express or implied commitment to implement anything described or alluded to or provide any product or service. Use of the information in this report is at the reader's own risk. Intellectual property management is fraught with policy, legal, and economic issues. Nothing in this report should be construed as an adoption by IBM of any policy position or recommendation.
The opinions expressed are those of the author, and should not be construed to represent or predict any IBM position or commitment.