C. Montgomery Burns: "I'd like to send this letter to the Prussian consulate in Siam by aeromail. Am I too late for the 4:30 autogyro?"
From "Mother Simpson," The Simpsons Television Show, Episode 3F06
Consider a sentence such as "the current price of tea in China is 35 cents per pound." In a library with millions of books we might find many statements of the above form that we could capture today with relatively simple rules: rather than pursuing every variation of a statement, programs can wait, like predators at a water hole, for their informational prey to reappear in a standard linguistic pattern. We can make inferences from sentences such as "NAME1 born at NAME2 in DATE" that NAME1 more likely than not represents a person and NAME2 a place and then convert the statement into a proposition about a person born at a given place and time. The changing price of tea in China, pedestrian birth and death dates, or other basic statements may not be truth and beauty in the Phaedrus, but a digital library that could plot the prices of various commodities in different markets over time, plot the various lifetimes of individuals, or extract and classify many events would be very useful. Services such as the Syllabus Finder1 and H-Bot2 (which Dan Cohen describes elsewhere in this issue of D-Lib) represent examples of information extraction already in use.3 H-Bot, in particular, builds on our evolving ability to extract information from very large corpora such as the billions of web pages available through the Google API.
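The inference described above, from a sentence of the form "NAME1 born at NAME2 in DATE" to a proposition about a person, a place and a time, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expression, the `BirthFact` record and the `extract_birth` function are hypothetical names introduced here, not part of any system discussed in this article, and a production system would need far more tolerant patterns.

```python
import re
from typing import NamedTuple, Optional


class BirthFact(NamedTuple):
    """A proposition: this person was born at this place in this year."""
    person: str
    place: str
    year: int


# Assumption: a "name" is one or more capitalized tokens separated by spaces.
NAME = r"[A-Z][\w.'-]*(?: [A-Z][\w.'-]*)*"

# Hypothetical pattern for "NAME1 (was) born at NAME2 in DATE".
BIRTH_PATTERN = re.compile(
    rf"(?P<person>{NAME}),? (?:was )?born at (?P<place>{NAME}) in (?P<year>\d{{4}})"
)


def extract_birth(sentence: str) -> Optional[BirthFact]:
    """Return a BirthFact if the sentence matches the pattern, else None."""
    match = BIRTH_PATTERN.search(sentence)
    if match is None:
        return None
    return BirthFact(match["person"], match["place"], int(match["year"]))
```

The pattern simply waits for its "informational prey": sentences that do not fit the expected form return `None` rather than a guess.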
Aside from identifying higher order statements, however, users also want to search and browse named entities: they want to read about "C. P. E. Bach" rather than his father "Johann Sebastian" or about "Cambridge, Maryland", without hearing about "Cambridge, Massachusetts", Cambridge in the UK or any of the other Cambridges scattered around the world. Named entity identification is a well-established area with an ongoing literature. The Natural Language Processing Research Group at the University of Sheffield has developed its open source General Architecture for Text Engineering (GATE4) for years, while IBM's Unstructured Information Management Architecture (UIMA) is "available as open source software to provide a common foundation for industry and academia."5 Powerful tools are thus freely available and more demanding users can draw upon published literature to develop their own systems. Major search engines such as Google and Yahoo also integrate increasingly sophisticated tools to categorize and identify places. The software resources are rich and expanding.
The reference works on which these systems depend, however, are ill-suited for historical analysis. First, simple gazetteers and similar authority lists quickly grow too big for useful information extraction. They provide us with potential entities against which to match textual references, but existing electronic reference works assume that human readers can use their knowledge of geography and of the immediate context to pick the right Boston from the Bostons in the Getty Thesaurus of Geographic Names (TGN).6 With the crucial exception of geographic location, however, the TGN records do not provide any machine readable clues: we cannot tell which Bostons are large or small. If we are analyzing a document published in 1818, we cannot filter out those places that did not yet exist or that had different names: "Jefferson Davis" is not the name of a parish in Louisiana (tgn,2000880) or a county in Mississippi (tgn,2001118) until after the Civil War.
Although the Alexandria Digital Library provides far richer data than the TGN (5.9 vs. 1.3 million names),7 its added size lowers, rather than increases, the accuracy of most geographic name identification systems for historical documents: most of the extra 4.6 million names cover low frequency entities that rarely occur in any particular corpus. The TGN is sufficiently comprehensive to provide quite enough noise: we find place names that are used over and over (there are almost one hundred Washingtons) and semantically ambiguous (e.g., is Washington a person or a place?). Comprehensive knowledge sources emphasize recall but lower precision. We need data with which to determine which "Tribune" or "John Brown" a particular passage denotes.
Secondly and paradoxically, our reference works may not be comprehensive enough. Human actors come and go over time. Organizations appear and vanish. Even places can change their names or vanish. The TGN does associate the obsolete name Siam with the nation of Thailand (tgn,1000142) but also with towns named Siam in Iowa (tgn,2035651), Tennessee (tgn,2101519), and Ohio (tgn,2662003). Prussia appears but as a general region (tgn,7016786), with no indication when or if it was a sovereign nation. And if places do point to the same object over time, that object may have very different significance over time: in the foundational works of Western historiography, Herodotus reminds us that the great cities of the past may be small today, and the small cities of today great tomorrow (Hdt. 1.5), while Thucydides stresses that we cannot estimate the past significance of a place by its appearance today (Thuc. 1.10). In other words, we need to know the population figures for the various Washingtons in 1870 if we are analyzing documents from 1870.
The foundations have been laid for reference works that provide machine actionable information about entities at particular times in history. The Alexandria Digital Library Gazetteer Content Standard8 represents a sophisticated framework with which to create such resources: places can be associated with temporal information about their foundation (e.g., Washington, DC, founded on 16 July 1790), changes in names for the same location (e.g., Saint Petersburg to Leningrad and back again), population figures at various times and similar historically contingent data.
But if we have the software and the data structures, we do not yet have substantial amounts of historical content such as plentiful digital gazetteers, encyclopedias, lexica, grammars and other reference works to illustrate many periods and, even if we do, those resources may not be in a useful form: raw OCR output of a complex lexicon or gazetteer may have so many errors and have captured so little of the underlying structure that the digital resource is useless as a knowledge base. Put another way, human beings are still much better at reading and interpreting the contents of page images than machines.
While people, places, and dates are probably the most important core entities, we will find a growing set of objects that we need to identify and track across collections, and each of these categories of objects will require its own knowledge sources. The following section enumerates and briefly describes some existing categories of documents that we need to mine for knowledge. This brief survey focuses on the format of print sources (e.g., highly structured textual "database" vs. unstructured text) to illustrate some of the challenges involved in converting our published knowledge into semantically annotated, machine actionable form.
Digital Reference Materials
Carefully transcribed primary sources
Most users do not see source texts as comparable to reference works such as gazetteers and encyclopedias, but well edited digital primary sources are crucial tools in any large digital library. Each canonical chunk of a text combines a unique identifier with a textual key. Thus the text in Vergil, Aeneid, book 3:
<l n="22">Forte fuit iuxta tumulus, quo cornea summo</l>
defines a seven word sequence as the content of Verg. A. 3.22. A Google search currently retrieves 170 pages that contain this line. We can then automatically identify all 170 instances of the same canonical chunk of text. These pages can include other editions, excerpts in scholarly articles, epigraphs at the heads of chapters, and collections of quotations to name only a few.
Such alignment requires flexible searching, since different texts will vary from one another, exhibiting different choices of vocabulary, punctuation, even spelling conventions, and these differences will sometimes make retrieval impractical. Collocations of even a few words are, however, typically so unusual that most individual lines of poetry will provide effective queries when searching very large corpora, and any substantial paragraph of prose will constitute a virtually unique key. The same techniques by which we track plagiarism will allow us to align editions and track the influence of a given text over time as it is quoted either verbatim or in part.
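A minimal sketch of such flexible matching follows, assuming that we normalize case, punctuation and the common Latin u/v and i/j spelling variants and then compare word sequences with the standard-library `difflib.SequenceMatcher`. The function names `normalize` and `best_match` are illustrative assumptions, and a real aligner working over very large corpora would need indexing rather than this linear scan.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lower-case, strip accents and punctuation, and fold u/v and i/j variants."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    folded = no_marks.replace("v", "u").replace("j", "i")
    return re.findall(r"[a-z]+", folded)


def best_match(key: str, passage: str) -> tuple[float, str]:
    """Slide a window the length of the key over the passage.

    Returns the highest word-level similarity (0.0 to 1.0) and the
    normalized window that produced it.
    """
    key_words = normalize(key)
    words = normalize(passage)
    n = len(key_words)
    best = (0.0, "")
    for start in range(max(1, len(words) - n + 1)):
        window = words[start:start + n]
        score = SequenceMatcher(None, key_words, window).ratio()
        if score > best[0]:
            best = (score, " ".join(window))
    return best
```

Given the line of Vergil quoted above as the key, a passage that prints "juxta" for "iuxta" and omits the comma still matches the normalized key exactly, while a score threshold below 1.0 would admit editions that differ by a word.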
Classicists showed great wisdom in establishing and maintaining authoritative citation schemes for most of their primary sources well before the twentieth century. Automatic alignment may prove even more useful where disciplines have redefined the citation schemes. (Publishers often have seen such changes as good for business, since students and their professors need to have up-to-date texts, much to the frustration of professors with notes geared to the old editions.) Commentary A may be keyed to the Riverside edition of Hamlet, while commentary B may be keyed to the older Globe Shakespeare: by aligning Globe and Riverside editions of Shakespeare, we can align entries in the previously incommensurate commentaries.
If we combine carefully marked-up source texts with cross-language information retrieval, we should be able to find virtually all translations of any text or substantial text excerpts in languages for which cross-language information retrieval is available, a set of languages that includes the major scholarly languages represented in our research libraries.
Thus, one marked-up edition of a major canonical text in a large digital library can, with standard techniques available today, provide the structure to align dozens of editions, hundreds of translations and quotations in thousands of secondary sources.
Gazetteers and semi-structured text sources
The most commonly cited gazetteers list place names and provide geographic information, but the term can be applied to many reference works that contain regularly structured articles that more closely resemble database records than running prose.
Please consider the following excerpt, which comes from a nineteenth-century gazetteer.
<div 2 type=entry><head>AARONSBURG</head>
The underlying data lends itself well to propositional representation but language is cryptic: short abbreviations whose meaning depends upon context (e.g., "co." as "county" vs. "company," "m." here is "miles" but elsewhere "mill") and other idiosyncrasies (e.g., "N. W" = "northwest" but "181 W." = "181 miles from Washington, DC") make extraction tricky. Most of the structure can be extracted:
<div2 type=entry id="aaronsburg-is-within-centre-county-is-within-
The statement "181 W." in this case has not been recognized as "181 miles from Washington," but the other major abbreviations have been expanded and the propositional information identified. The above reflects automated analysis designed for this particular reference work: as a major knowledge source for nineteenth-century American knowledge of geography with a fairly regular structure, Harper's warrants specialized work. More general routines will subsequently attempt to analyze place names such as "Penn's valley" and "Bellefonte."
In this case, we were able to align this entry against the TGN. Of the 78,202 entries currently identified in this work, only 29,180 (37.3%) map to unique TGN identification numbers. In 45,484 instances (58.2%) we can at present find no corresponding entry in the TGN. The remaining 3,538 (4.5%) are ambiguous and could correspond to more than one TGN entry. One-to-one alignments between Harper's and the TGN are important because they not only attest to the existence of a place 150 years ago but also give us information about its mid-nineteenth-century population, the region of which it was a part, and other features: knowing that Bellefonte is a point of reference for this Aaronsburg can help us identify each city. Given a reference to Bellefonte, we can increase the probability that a nearby Aaronsburg is the town in Pennsylvania. Places that we cannot align with the TGN are even more important because they alert us that something has changed: the place no longer exists, its name has changed, or something else prevents us from recognizing it with a modern knowledge base.
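The three-way division of entries into unique, ambiguous and unmatched can be illustrated with a short Python sketch. It assumes a simplified lookup by lower-cased name alone, whereas the alignment described here also draws on the containing county and state; the index contents and identifiers used in the usage note are invented placeholders, not real TGN numbers, and the function names are hypothetical.

```python
from collections import Counter


def classify_alignment(entry_names, tgn_index):
    """Label each gazetteer entry by how many candidate identifiers it has.

    tgn_index maps a lower-cased place name to a list of candidate ids
    (an assumed, simplified stand-in for a real gazetteer lookup).
    """
    labels = {}
    for name in entry_names:
        candidates = tgn_index.get(name.lower(), [])
        if len(candidates) == 1:
            labels[name] = "unique"
        elif candidates:
            labels[name] = "ambiguous"
        else:
            labels[name] = "missing"
    return labels


def summarize(labels):
    """Return the percentage of entries carrying each label, to one decimal place."""
    counts = Counter(labels.values())
    total = len(labels)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}
```

With a toy index such as `{"aaronsburg": ["id-1"], "washington": ["id-2", "id-3"]}`, the entries "Aaronsburg", "Washington" and "Penn's Valley" fall into the unique, ambiguous and missing groups respectively, which are the same three groups counted above.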
Many historical reference works are the products of index card data and anticipate databases. Consider an example from a much simpler, but strategic, reference work, George P. Rowell's 1869 American Newspaper Directory.
<div4 type=place><head>ATHENS Post</head>
The structure is fairly straightforward and can be made more explicit.
The main statements have been broken out into separate propositions. The entry comes from the Alabama section, and we can thus automatically look for Athens, Alabama, and assign this with its TGN number. The title is separated out from the place name so that we can recognize "the Post" as a shorter designation of this newspaper.
Structured reference works similar to this may include bibliographic data, battles, military commands, lists of dates and events, rosters of soldiers, public officials, products. Often such data appears embedded in larger works. Harper's 1902 Encyclopedia of US History, for example, has timelines for the US and the different states with more than 10,000 entries. The common principle is that each of these textual resources lends itself to being parsed into more structured data records with little, if any, raw text left uncaptured.
Citation-based authority lists
Book indices are the most common examples of citation based authority lists and are extremely valuable knowledge sources. The early modern Latin language digital libraries MATTEO and CAMENA,10 developed in Germany and described elsewhere in this issue of D-Lib, spawned a third collection, TERMINI,11 which is extracting and organizing into a single database data from indices, marginalia, notes, and other tools added for human readers. These indices include explanatory information about their entries (e.g., full names, dates, locations) and the contents as a whole constitute models of the semantic field implicit in each document.
The following illustrates an index where entries from three individual author indices have been aligned:
<div1 type="entry" id="abdera">
The excerpt above has defined two unique entities. The first (
The following excerpt comes from the index to the four-volume collection of articles Battles and Leaders of the Civil War.
The personal name Silas Adams has been correctly divided into forename and surname. Similarly, the affiliation (Adams as a Union, rather than a Confederate soldier) and the military unit "1st Ky. Cavalry" have been recognized, as has the location of a citation (vol. 4, page 416). The abbreviation "col." has not been recognized because accepting lower case "col." as an abbreviation of "colonel" without a following potential name (e.g., "col. Smith") has proven too ambiguous as a general rule. By modifying the general rules for the above index, we can capture not only the individual element "col. = colonel" but also then infer a higher order proposition that Silas Adams was the Colonel of the 1st Kentucky Cavalry.
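The difference between the cautious general rule and a rule tuned to one index can be sketched as follows. The regular expressions, the `expand_colonel` function and the sample strings are assumptions made for illustration; the actual layout of the index entry and the rules used in our system are not reproduced here.

```python
import re

# General rule: expand lower-case "col." only when a capitalized word,
# i.e. a potential name, follows (e.g. "col. Smith").
GENERAL = re.compile(r"\bcol\.(?=\s+[A-Z][a-z])")

# Index-specific rule (assumption): within this one index, every "col."
# is a rank, so it can be expanded regardless of what follows.
INDEX_SPECIFIC = re.compile(r"\bcol\.")


def expand_colonel(text: str, index_specific: bool = False) -> str:
    """Expand "col." to "colonel" under the general or the index-specific rule."""
    pattern = INDEX_SPECIFIC if index_specific else GENERAL
    return pattern.sub("colonel", text)
```

Under the general rule `expand_colonel("col. 1st Ky. Cavalry")` leaves the abbreviation untouched, because a numeral rather than a name follows; with `index_specific=True` the same string yields "colonel 1st Ky. Cavalry", which a later step could combine with the personal name to form the higher order proposition.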
Indices to proper names in authors or books vary widely in coverage. The index of Apollodorus records almost every proper name, while the Herodotus index captures approximately 30% and the Battles and Leaders index approximately 25%. Nevertheless, even partial indices provide evidence of relative frequency. Preliminary analysis on the above texts suggests that if we know what the most frequent entity is (e.g., "Boston, MA" is more common than "Boston, UK"), then simply assigning each ambiguous name to its most common representative gives us an overall accuracy of 91% to 97%.
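The most-frequent-entity baseline mentioned above is simple enough to state in code. The sketch below assumes labeled data in the form of (name, entity) pairs, such as an index might supply; the function names and the toy data in the usage note are hypothetical, and the accuracy figures reported above come from our texts, not from this example.

```python
from collections import Counter, defaultdict


def train_most_frequent(labeled_mentions):
    """Map each ambiguous name to the entity it most often denotes.

    labeled_mentions is an iterable of (name, entity) pairs.
    """
    counts = defaultdict(Counter)
    for name, entity in labeled_mentions:
        counts[name][entity] += 1
    return {name: c.most_common(1)[0][0] for name, c in counts.items()}


def accuracy(baseline, test_mentions):
    """Fraction of test mentions for which the baseline picks the labeled entity."""
    correct = sum(1 for name, entity in test_mentions if baseline.get(name) == entity)
    return correct / len(test_mentions)
```

If nine of ten labeled mentions of "Boston" denote "Boston, MA", the baseline assigns every "Boston" to Massachusetts and is right nine times in ten on that same data; any more sophisticated disambiguation method has to beat this figure to justify its cost.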
Machine readable dictionaries
Machine readable dictionaries are sufficiently well documented to have their own acronyms (MRD). They are fundamental resources for cross-language information retrieval, machine translation, and a range of analytical tasks. They can contain immense amounts of data, tightly packed. Consider a typical entry from the Liddell Scott Jones Greek-English Lexicon:
<entryFree id="n3709" key="a)krwth/rion" type="main"><orth
This lexicon contains 100,000 entries and 1,000,000 source citations. When the citations are tagged and linked to the primary sources, the lexicon becomes a virtual commentary, with specific comments on one word in every seventeen in the texts. This Greek-English lexicon is probably the most developed single XML document in the Perseus Digital Library. Note that it follows a very precise bibliographic scheme, where we define Herodotus as an "abstract bibliographic object," linked to the dominant authority list for Greek literature, the Thesaurus Linguae Graecae as author number 16 with his History as work number 1.
We originally entered our Greek and Latin lexica less for their definitions than for the morphological data. Thus, in the above example, <orth extent="full" lang="greek">a)krwth/rion</orth>, <gen lang="greek">to/</gen> allows us to infer morphological information so that we can recognize akrotêriou and akrotêriois as forms of the same word.
Comprehensive digital libraries will contain many different lexica, monolingual and bilingual, designed for various purposes. Canonical authors often have their own specialized lexica with extensive coverage. The following excerpt is drawn from Alexander Schmidt's Shakespeare Lexicon, in which we have identified 233,402 annotations on particular words (specific comments on about one in four of the 900,000 words in the Globe Shakespeare on which the lexicon is based).
<entryFree key="Abhorring" type="main">
Other lexica may be organized by theme. Knight's Mechanical Encyclopedia (1877) documents contemporary technology:
<div2 n="Actinometer" id="actinometer" type="entry">
The opening of the above entry includes not only a name and brief description but also who invented this device and when. The named entity system has captured the name and the date, laying the foundation for extraction of the factoid that Herschel invented the actinometer in 1825.
Smith's (1890) Dictionary of Greek and Roman Antiquities provides a multilingual survey of words relevant to the daily life and society of Greco-Roman antiquity:
<div2 type="entry" id="domus-cn">
This entry includes both the Latin term domus and various Greek equivalents. The subsequent article compares houses across time in both Greek and Roman cultures.
Many reference works contain entries of standard full text. These will often contain regularities that can be mined: e.g., most of the biographic entries in Harper's (1902) Encyclopedia of United States History contain phrases such as "born at X in Y." Even when such propositional data are not easily determined, by analyzing the people, places, organizations, and dates associated with particular entities we can build language models to improve retrieval and disambiguation. While the open source reference work Wikipedia, with nearly a million entries in late February 2006, is probably the best single such resource, it does not capture historical data that is embedded in earlier reference materials and is crucial for analyzing contemporary documents.
Massive, expensive digital libraries should allocate a measurable portion of their operational budgets to the careful transcription and analysis of printed reference materials. It may be impractical and unnecessary to have 10,000,000 books entered by conventional double-keyboarding methods. Reference materials do not, however, grow linearly along with collection sizes: reference materials have traditionally been capital projects developed at great expense and/or over long periods of time. Suppose that converting a reference work into a well-structured knowledge source costs 100 times more than producing the initial image book. If we create a reference library of 10,000 books for our 10,000,000 book digital library, we add only 10% to the overall cost of the project. The return on such an investment would be considerable: the knowledge sources would not only support better automatically generated metadata but also could improve the quality of every service, from OCR through machine translation to searching and browsing.
We need more publications designed for machine as well as human readers. We need traditional specialties such as palaeography (the study of earlier forms of handwriting in manuscripts) and bibliography (the study of the form and history of books) to produce knowledge with which document recognition systems can understand more about the content, text, graphics and layout of our collections. We need grammars and lexica based on scientifically designed corpora and able to support machine translation, cross-language information retrieval, syntactic analysis, and similar technologies. We need encyclopedias, gazetteers, indices, directories, and other reference works engineered from the start to improve named entity identification and information extraction. We may find machine readability emerge as an essential feature that characterizes serious publication.
We need more people who can apply research in computer and information science to their domains. Some domain specialists need to be able to do more than install and operate complex open source applications such as GAMERA, GATE, or LEMUR12 and the other tools of great utility but with much less documentation. An ability to write code is not enough. We need domain specialists who understand the relevant areas of computer science, can critically assess emerging techniques, and can identify instances where it is worth augmenting existing tools or building new tools. The humanities in particular need corpus editors: humanists who can bridge the gap between ongoing research in computer and information science and the aspirations of their field, present and future.13 While we need computer scientists who press for the most general and scalable solutions and traditional editors who work through primary sources one word at a time, we need corpus editors who occupy an intermediate space, sacrificing generality for higher performance on their domains, while sacrificing review of every word and every decision for minimized error rates and statistical measures of accuracy so that they can work with collections too large for manual methods.
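The arithmetic behind the 10% figure can be checked directly. The sketch below takes the cost of one image book as the unit and uses the hypothetical multiplier of 100 from the paragraph above; the function name is an assumption introduced for illustration.

```python
def added_cost_fraction(collection_books: int, reference_books: int,
                        cost_multiplier: float) -> float:
    """Cost of the reference library as a fraction of the base digitization cost.

    One image book costs 1 unit; each reference book costs cost_multiplier units.
    """
    base_cost = collection_books * 1
    reference_cost = reference_books * cost_multiplier
    return reference_cost / base_cost
```

For 10,000 reference books in a 10,000,000 book library, `added_cost_fraction(10_000_000, 10_000, 100)` gives 0.1: the reference library costs as much as 1,000,000 image books, one tenth of the base project.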
Computer scientists, internet powers such as Google, Yahoo, and MSN, academics and librarians must forge new relationships with broader communities to create and maintain the structured knowledge to manage and extend vast collections. Our automated processes will never be perfect. We need to merge reference works (e.g., the individual tagged as "Alexander-2" in work 1 = "Alexander-5" in work 2), and refine the results of automated document recognition, machine translation and information extraction. Our domain specialists will never be able to correct all the errors of our automated systems or complete the inevitably partial data that we could extract from existing reference works. We need new ways to distribute intellectual labor. Wikipedia and the distributed proof-reading system of Project Gutenberg illustrate the immense energies available from beyond the academy for sustained, substantive work. We need new partnerships between internet giants such as Google, Yahoo and Microsoft, academics, teachers, students, librarians and those whose passion for intellectual life is independent of their professional positions.
Massive digital libraries, embodying much of the published record of humanity, can provide the structured data that we need to build increasingly sophisticated services that grow more able to provide us with the information that we need not only to increase our productivity but also to learn and to grow intellectually. The articles by Wolfgang Schibel and Jeff Rydberg-Cox and by Dan Cohen illustrate how far-sighted humanists can use the resources at their disposal not only to serve the needs of the present but to build a foundation for the future as well. Sayeed Choudhury and his collaborators and David A. Smith document key elements of an infrastructure on which humanists will not only build but to which they will also need to contribute. On the one hand, we can model a workflow that carries us from collection development to digitization, from digitization through document recognition to conversion from one language to another, then from language to data, and finally to services on which researchers may build. At the same time, we can also see how all of these stages influence one another: structured collections such as gazetteers and lexica can improve document recognition, collection design inspires added value service development, while research applications shape collection design. Each component affects every other.
Movable type and industrial print allowed us to reproduce writing with far greater speed and accuracy, and at lower cost, than hand-writing could accomplish. The quantitative effects are accelerated reproduction and dissemination, but the human mind remains unchanged. Writing was a qualitative shift, in that it allowed us to trace our ideas in a material form separate from our memories. Increasingly intelligent systems have begun to transfer modest, but increasingly powerful, processes: we can not only store information but also search and process it automatically, encoding practices in machine actionable form. The articles in this issue can only point in a general way to a much larger process, but one that has begun to change the intellectual life of our species in ways that we cannot fully imagine. But if the potential is daunting, we all have ways in which we can contribute.
3. For an overview of current work in information extraction and named entity extraction please consult our list for further reading at the end of this article.
9. Harper's Gazetteer of the World (New York 1855).
13. For a discussion of these, please see Crane, Gregory and Jeffrey Rydberg-Cox. "New technology and new roles: the need for corpus editors," Proceedings of the Fifth ACM Conference on Digital Libraries, 2000, pp. 252-253.
Information extraction (IE), of which named entity recognition is typically viewed as a subtask, has been explored for many years and has an extensive body of literature. The information extraction task is the backbone of the influential Message Understanding Conference (MUC). A good general overview of IE and MUC can be found in (Cowie 1996). The following list provides sources of further reading for exploring this topic.
Appelt, D.E., "Introduction to Information Extraction." AI Communications, 12.3 (1999), pp. 161-72.
Cowie, J. and W. Lehnert, "Information Extraction," Communications of the ACM 39.1 (1996), pp. 80-91.
Cunningham, H. "Information Extraction, Automatic". Encyclopedia of Language and Linguistics, 2nd Edition, (2005).
McCallum, Andrew. "Information Extraction: Distilling Structured Data from Unstructured Text." ACM Queue, 3.9 (2005), pp. 48-57. (Also includes an excellent list of articles for further reading.)
Mooney, Raymond J., et al., "Mining Knowledge from Text Using Information Extraction." SIGKDD Explorations, 7.1 (2005), pp. 3-10.
Valero-Tellez, A., et al. "A Machine Learning Approach to Information Extraction." Proceedings of CICLING (2005), pp. 539-47.
Named Entity Recognition
Bikel, D. M., R. L. Schwartz, and R. M. Weischedel, "An Algorithm that Learns What's in a Name." Machine Learning, 34.1-3 (1999), pp. 211-231.
Evans, R. "A Framework for Named Entity Recognition in the Open Domain." Proceedings of RANLP (2003), pp. 137-144.
Hasegawa, T., et al. "Discovering Relations among Named Entities from Large Corpora." Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 04), 2004; Barcelona, Spain.
Pedersen, T., et al. "Name Discrimination by Clustering Similar Contexts." Proceedings of CICLING (2005), pp. 226-237.
Sekine, S., "Named Entity: History and Future." <http://www.cs.nyu.edu/~sekine/papers/NEsurvey200402.pdf>.
Zhou, G. and J. Su, "Machine Learning Based Named Entity Recognition Via Effective Integration of Various Evidences." Natural Language Engineering, 11.2, pp. 189-206.
Copyright © 2006 Gregory Crane and Alison Jones