Text is more than just words on a page
This month, D-Lib Magazine inaugurates our occasional series of guest editorials by distinguished members of the research community. Susan Hockey, a classicist with a 25-year career in humanities computing, writes for us on text encoding and digital library issues.
Ever since writing systems were first developed, text has been the major tool for communicating information. Much of the material that exists in our libraries today is text. We use text to disseminate the results of scholarly research, to create new literary works, and to communicate with each other socially. At present almost all text is written with the assumption that it will appear on paper and be read on paper. Most of the electronic text in use today is either a surrogate of something which already exists in another form, usually paper, or it is put into electronic form for word-processing or typesetting in order to produce something which will ultimately appear on paper. Many of the functions performed on it are the same as those performed on traditional forms - reading, browsing, and so on.
However, the electronic medium offers much more potential for working with text. It can be searched, analyzed and manipulated in many different ways. How best can we exploit this? How can we create an electronic text which can be used for all these purposes and still be printed or displayed easily for reading? And what in the recent experience of humanities scholars who have worked extensively with electronic texts will contribute to the new world of networked information? For the last forty-five years humanities scholars have created electronic representations of scholarly texts in order to make concordances or to carry out stylistic and other types of analysis. Almost all early electronic text conversion projects began by creating an electronic text which attempted to reproduce exactly what was on the page. They faithfully copied typographic features such as italics, page layout, centering, etc., but then found that these features are ambiguous and limit the kinds of analyses which can be performed. Something in italic can be a title, a foreign word or an emphasized word. Something centered could be a page heading, a section heading or even a stage direction in a play.
Typographic features are one form of encoding or markup, in this case intended to help the reader by reinforcing what the text says. It took a long time to realize that, for computer processing, markup needs to be more specific in some ways and more general in others. Markup makes explicit features which are implicit for the human reader. Exactly what those features are depends on who is creating the electronic text and what it might be used for. They may be paragraphs, lists, abbreviations, names, dates, or even linguistic and literary interpretations. Since creating an electronic text is an expensive process, it makes sense to insert markup which is not specific to one purpose but allows the same text to be used for display, for retrieval, or as part of a hypertext system. It also makes sense for the markup to be hardware- and software-independent, ensuring that the text can migrate easily from one platform to another. The Standard Generalized Markup Language (SGML) was developed to meet this need and became an international standard in 1986.
SGML is built on the principle that you do not say what you want to do with the components or structure of an electronic text; you merely say what they are: title, heading, name, list, paragraph, etc. Different application programs can then do different things with the same text: search all titles, print titles in italic, make titles "hot text" in a hypertext. In SGML it is up to the document designer to determine what components make up the structure of a text. These components, or elements as they are technically called, can be absolutely anything, even analytic or interpretive information. The set of elements defined for one particular type of text is called an SGML application.
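The principle can be illustrated with a small sketch. The fragment below uses hypothetical element names (chapter, heading, p, title) and, for simplicity, XML-style syntax, which is a restricted form of SGML; the point is only that the markup says what each component is, and two different "applications" then treat the same elements differently:

```python
import xml.etree.ElementTree as ET

# A hypothetical descriptive-markup fragment: the tags say what each
# component IS (a heading, a title), not how it should look.
doc = """
<chapter>
  <heading>On Markup</heading>
  <p>The novel <title>Middlemarch</title> is discussed here.</p>
  <p>So is <title>Bleak House</title>.</p>
</chapter>
"""

root = ET.fromstring(doc)
titles = [t.text for t in root.iter("title")]

# Application 1: search all titles.
print(titles)  # ['Middlemarch', 'Bleak House']

# Application 2: render titles in italic for display.
print(" ".join(f"<i>{t}</i>" for t in titles))
```

Neither application needed to guess what the italics "meant"; the element name carries that information.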
Hypertext Markup Language (HTML) is probably the most widely familiar SGML application, given the ubiquity of the World Wide Web. HTML is best suited to the display of text and has propagated the concept of structured text more widely. But despite its comparatively recent popularity and continued evolution, HTML is neither the only nor the most sophisticated application of SGML. Indeed, the humanities computing community was one of the earliest to adopt SGML. The three major text analysis computing associations began a project in 1988, the Text Encoding Initiative (TEI), to create an SGML application for encoding scholarly texts in the humanities. SGML was chosen because it is much better than any other markup scheme at representing the complexities of humanities texts, which may include several layers of footnotes, variant spellings, and different structures such as those found in verse and drama, as well as non-standard characters.
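To give a flavour of how such complexities are handled, here is a hypothetical fragment in the style of the TEI scheme, recording two variant spellings of one word in a line of verse. The element names (l for line, app for an apparatus entry, rdg for a reading, with wit identifying the witness) follow the TEI Guidelines; the sigla and the variants themselves are invented for illustration:

```sgml
<l>The barge she sat in, like a
   <app>
     <rdg wit="F1">burnisht</rdg>
     <rdg wit="F2">burnish'd</rdg>
   </app>
   throne</l>
```

A display program can print one reading and relegate the other to a note; a concordance program can index both; neither layer interferes with the verse structure encoded around it.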
A number of important issues can be noted in the context of creating and working with electronic texts. Among these are metadata, the suitability of optical character recognition, encoding as an interpretation, and new forms of writing. The TEI was one of the earliest encoding schemes to acknowledge the need for metadata for electronic texts. In print form, conventions exist for a text's metadata in the form of the title page, cataloging in publication data, etc. It would be possible to put this information into an electronic text file, but if it were not encoded in some way it would be treated as part of the text - leading perhaps to the copyright notice being searched as well as the text. For electronic texts, the metadata needs to contain bibliographic information about the text, but to be really useful it ought also to contain information about the properties of the electronic text: what features are encoded within it, who created it, what changes have been made to it? There is little provision for recording this information in the Anglo-American Cataloguing Rules, 2nd edition (AACR2), where the chapter on computer files deals with numeric data files and computer programs, placing much emphasis on the physical characteristics of the computer file and little on how the information is represented within that file. The new field 856 in the standard for Machine Readable Cataloging (MARC) is really intended to locate electronic information on the network, rather than to give some properties of the electronic information.
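The TEI's answer is a structured header prefixed to every text. A minimal sketch, in the style of the TEI header, might look like the following; the element names are those of the TEI Guidelines, but the content shown here is invented:

```sgml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Selected Poems: a machine-readable transcription</title>
    </titleStmt>
    <sourceDesc>
      <p>Transcribed from the printed edition of 1890.</p>
    </sourceDesc>
  </fileDesc>
  <encodingDesc>
    <p>Quotations, personal names and place names are tagged;
       end-of-line hyphenation has been removed.</p>
  </encodingDesc>
  <revisionDesc>
    <change>Proofread against the source and corrected.</change>
  </revisionDesc>
</teiHeader>
```

Because the header is itself marked up, a retrieval program can use it without ever confusing it with the text proper - precisely the failure described above for an unencoded title page or copyright notice.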
For creating electronic texts, much hope has been held out for optical character recognition, but there has been little improvement in readily available OCR software in the last ten years. It is possible to get about 99.9% accuracy with a good OCR program on a very clean text, but that still leaves one error every 1000 characters, or about one every 10-12 lines. While this may be adequate for the more conventional type of retrieval system which finds all documents on a particular topic, it would not satisfy many humanities projects which need an accurate text in order to answer questions such as "does this word ever occur in the text?" Furthermore, OCR creates a typographic representation of a text which, as we have seen, is not very suitable for computer analysis.
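The arithmetic behind that estimate is easy to check, assuming a line of roughly 80 characters (the line length is an assumption, not a figure from the text):

```python
accuracy = 0.999                 # 99.9% character accuracy
chars_per_line = 80              # assumed typical line length

errors_per_char = 1 - accuracy
chars_per_error = 1 / errors_per_char          # one error per ~1000 characters
lines_per_error = chars_per_error / chars_per_line  # roughly one per 12 lines

print(round(chars_per_error), round(lines_per_error, 1))
```

At somewhat longer lines the figure falls toward one error every 10 lines, which is where the 10-12 line estimate comes from.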
Creating an electronic text is an act of interpretation. Decisions must be made on how to represent features in the original. Only very rarely is it possible to represent absolutely every feature of the original text, and in many cases this may not be necessary. Many scholars who have created electronic texts for their own use inevitably concentrated on representing those features in which they are interested. But some features need to be encoded for an electronic text to be processed in any meaningful way. For example, if the text is in different languages, the languages need to be encoded, otherwise interesting confusions occur, such as between English "pain" and French "pain" (bread). Names really need to be encoded, with standardized spellings, probably distinguishing personal names from place names, so that the person Madison and the town Madison are not confused. Dates are important in historical documents and they may not necessarily fit into modern western calendar systems, one of my favorites being "in the reign of Nero", a genuine date I encountered in one of the first humanities computing projects I worked on.
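The language example can be made concrete with a small sketch. The element and attribute names below (seg, lang) are hypothetical choices for illustration; the point is that once the language is encoded, a search for the English word "pain" can ignore the French homograph:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the same letter sequence "pain" appears in
# two languages; the markup disambiguates them.
doc = """
<text>
  <seg lang="en">The pain was severe.</seg>
  <seg lang="fr">Le pain était frais.</seg>
</text>
"""

root = ET.fromstring(doc)

# A search for the English word consults only English segments.
hits = [seg.text for seg in root.iter("seg")
        if seg.get("lang") == "en" and "pain" in seg.text]
print(hits)  # only the English sentence is found
```

A plain string search over the unencoded text would have returned both sentences, silently mixing suffering with bread.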
Writing specifically for the electronic medium is still considered a new type of literature - best illustrated by the hypertext fictions created by Michael Joyce, Stuart Moulthrop and others. The story that has no beginning, middle or end but simply allows you to delve in and follow whatever thread you want may be the very beginning of a new way of expressing ourselves in writing. Ever since writing began, its form has been determined by the physical medium on which it is placed. Since the invention of printing, this has been rectangular pieces of paper bound together, which has encouraged us to write in a single linear stream. However, much scholarly writing is hypertextual in form and can be better expressed in a non-linear format. It contains cross-references to notes, bibliography and other works. Conventionally, notes are at the bottom of the page or the end of an article or chapter. In electronic form they are available at the click of a mouse, and they can be searched separately from the main text if desired. In a network environment a cross-reference to an item in the bibliography should lead to the item itself, not just a reference for it - we have already seen how this can happen in a simple way in the World Wide Web (WWW).
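The note-at-the-click-of-a-mouse mechanism is already easy to sketch in HTML, itself an SGML application. The fragment below is a hypothetical example, not taken from any real document:

```html
<p>Markup makes explicit what the reader infers.<a href="#note1">[1]</a></p>

<p><a name="note1">[1]</a> See the TEI Guidelines for a fuller treatment.</p>
```

Clicking the reference jumps straight to the note; and because the note is a separately marked component, it could equally well be searched on its own, apart from the main text.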
Beyond that we can begin to think of hypertextual writing of scholarly papers where differing arguments or interpretations are presented in parallel as hypertext links. The WWW offers a glimpse of the potential of a global network of linked resources and linking mechanisms are likely to become more and more important. They are fundamental to scholarship in any discipline which is about making connections between items of information and associating some intellectual rationale with those connections. The scholarly article or monograph of the future may well consist of links to other globs of information stored elsewhere on the network with some electronic text that explains the reasons for those links. Setting up the infrastructure to support this could be a major challenge for digital library research in the 21st century.
Director, Center for Electronic Texts in the Humanities
Rutgers and Princeton Universities