Indexing the Historical Collections

The indexing that allows users to search for individual items within the historical collections is currently implemented using INQUERY, an indexing and retrieval "engine" developed at the Center for Intelligent Information Retrieval (CIIR). CIIR is partially funded by the National Science Foundation as a State/Industry/University Cooperative Research Center. Government agencies, universities, and corporations that join the Center may license software developed at CIIR to build customized information retrieval applications.

INQUERY is not so much a "package" as a kit of parts from which an information retrieval application can be assembled. The Library runs the software on an IBM RS/6000 computer under AIX, IBM's version of UNIX. The Library has worked actively with CIIR, sharing feedback from users and suggesting enhancements to INQUERY that have been implemented in later versions. Some of the Library's concerns are reflected in "What Do People Want from Information Retrieval?", an article by Bruce Croft of CIIR in the November 1995 issue of D-Lib Magazine.

INQUERY at the Library of Congress

INQUERY is used to index text for several other Library of Congress services for the public, including:

THOMAS, for access to current legislative information. THOMAS was implemented in less than a month and launched in early January 1995 at the request of the newly elected leadership of the U.S. House of Representatives. Early experiences with INQUERY in the THOMAS system were described at the Second International Conference on the Theory and Practice of Digital Libraries in June 1995.
Country Studies/Area Handbook Program, for a collection of books describing and analyzing the political, economic, social, and cultural systems and institutions of foreign countries. On-line versions of over sixty handbooks from the Federal Research Division will be available later in 1996.
Global Legal Information Network (GLIN), for over 50,000 abstracts of national laws passed in countries around the world since 1976. Older records were compiled by the U.S. Law Library of the Library of Congress; some countries are now contributing information directly over the Internet.
Vietnam Era Prisoner of War/Missing in Action Database, for information on documents held by the Library of Congress (on microfilm) relating to military personnel missing, killed, or imprisoned in Southeast Asia.

Selected features of INQUERY

INQUERY can index text in a number of formats, including HTML. It can index a collection of files or a collection of records within a file. A recent upgrade allows customers to develop their own translators to parse records in other structured formats. The Library has developed a translator that parses MARC catalog records, selects MARC fields and subfields for indexing, and assigns their contents to indexing fields. For example, the different MARC subject fields could be combined into a single field for searching and retrieval. The choice of whether or how to group MARC fields for indexing can be made independently for different applications. The ability to search for terms in particular fields has not yet been integrated into the interface for American Memory. However, searching by title, creator, subject, and call number has recently been built into the Digital One-Box, a parallel effort that provides access in the Prints and Photographs reading room to a catalog of images, some of which are available in digital form. Lessons learned from this experience may be adapted and integrated into the American Memory interface.
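
The grouping of MARC fields into indexing fields can be pictured with a short sketch. It is illustrative only and is not the Library's translator: the record layout (a list of tag and subfield pairs), the field names, the mapping table, and the sample record are all assumptions made for the example. Only the MARC tag numbers themselves (100, 110, 111, 245, 050, 600, 610, 650, 651) are standard.

```python
# Hypothetical mapping from MARC tags to indexing field names.
# Several subject tags are deliberately folded into one "subject" field.
FIELD_MAP = {
    "245": "title",
    "100": "creator", "110": "creator", "111": "creator",
    "600": "subject", "610": "subject", "650": "subject", "651": "subject",
    "050": "call_number",
}

def to_index_fields(record, field_map=FIELD_MAP):
    """Group the text of selected MARC fields under indexing field names."""
    index_fields = {}
    for tag, subfields in record:
        name = field_map.get(tag)
        if name is None:
            continue  # this MARC field was not selected for indexing
        text = " ".join(value for _code, value in subfields)
        index_fields.setdefault(name, []).append(text)
    return index_fields

# A made-up record, represented as (tag, [(subfield code, value), ...]).
record = [
    ("100", [("a", "Example, Author")]),
    ("245", [("a", "Portrait of a soldier")]),
    ("650", [("a", "Soldiers"), ("y", "1860-1870")]),
    ("651", [("a", "United States"), ("x", "History")]),
    ("300", [("a", "1 photograph")]),
]
fields = to_index_fields(record)
```

Because the mapping is a plain table, a different application could group the same MARC fields differently simply by supplying another table.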

To support a variety of search interfaces, INQUERY provides over fifteen different query operations that can be combined when performing a search. These operations include functions familiar to librarians, such as strict boolean operators, proximity operators that specify order, and proximity operators that retrieve documents containing a set of words within a certain number of words of one another, in any order. Other operators incorporate a probabilistic framework that has proved valuable for retrieving documents from collections of full text by ranking the documents in order of "relevance." One such operator, usually the default for INQUERY, adjusts the rank of a document upwards if a query term occurs frequently in the document and downwards if the term occurs in many documents within the indexed collection (and therefore is of little value for discriminating between documents). A variation of this basic operator allows terms in a query to be weighted. Combined with the ability to search in individual fields, this weighted operator can be used to give higher weight to terms found in titles than to terms found in the main text.
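
The behavior of the default ranking operator and its weighted variation can be illustrated with a generic term-frequency and inverse-document-frequency score. The sketch below is not INQUERY's actual formula, which the text does not give; the function name, the smoothing in the logarithm, and the sample documents are assumptions chosen only to show the two adjustments described above.

```python
import math
from collections import Counter

def rank(query_weights, docs):
    """Rank documents by a weighted sum of tf * idf over the query terms."""
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    n_docs = len(counts)
    scores = {}
    for doc_id, words in counts.items():
        score = 0.0
        for term, weight in query_weights.items():
            tf = words[term]  # occurrences of the term in this document
            df = sum(1 for other in counts.values() if other[term] > 0)
            if tf:
                # Frequent in the document raises the score;
                # common across the collection lowers it.
                score += weight * tf * math.log((n_docs + 1) / df)
        scores[doc_id] = score
    matching = [doc_id for doc_id in scores if scores[doc_id] > 0]
    return sorted(matching, key=scores.get, reverse=True)

docs = {
    "a": "civil war photographs of civil war camps",
    "b": "war photographs",
    "c": "maps of the war",
}
equal_weights = rank({"civil": 1.0, "maps": 1.0}, docs)
maps_favored = rank({"civil": 1.0, "maps": 3.0}, docs)
```

With equal weights, document "a" ranks first because "civil" occurs twice in it; raising the weight on "maps" moves document "c" to the top. Weighting title terms above body terms would follow the same pattern, with one weight per field rather than per term.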

The Library has made selective use of INQUERY's array of query operations. Collections of text may have different characteristics, and different choices of query operations may be called for to produce rankings of retrieved items that are satisfactory to the general user of that collection. The structure of documents, the pattern of vocabulary usage, and the most common information tasks associated with the legislative documents in THOMAS are very different from those associated with documents in the American Memory collections. The Library has customized the underlying query operations for each service independently. The customization is implemented through program modules (in the C programming language) invoked through the Common Gateway Interface (CGI) feature for World Wide Web servers. These modules build and execute INQUERY queries and assemble and display retrieval hit lists. These lists are saved in temporary files, ready for use by other modules that build displays for individual items on the fly.
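
The hand-off between the module that produces a hit list and the modules that display individual items can be sketched as follows. The Library's modules are written in C; this Python version is only an illustration of the idea, and the JSON file format, the function names, and the sample hits are assumptions.

```python
import json
import os
import tempfile

def save_hit_list(hits):
    """Write a ranked hit list to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as handle:
        json.dump(hits, handle)
        return handle.name

def load_hit(path, position):
    """Read back one hit, as an item-display module might."""
    with open(path) as handle:
        return json.load(handle)[position]

# Made-up hits standing in for the result of a search.
hits = [
    {"id": "item-17", "title": "Example title A"},
    {"id": "item-42", "title": "Example title B"},
]
path = save_hit_list(hits)
second = load_hit(path, 1)   # a later request asks for the second item
os.remove(path)              # temporary files are cleaned up afterwards
```

Saving the list lets a later request for "the next item" be answered without repeating the search.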

Current indexing and retrieval approaches for American Memory

For American Memory, each collection is currently indexed separately; for text collections, bibliographic records are indexed separately from the full text of searchable documents. For queries of more than one term, the Library decided to perform four distinct searches and combine the results in a way that takes advantage of the best characteristics of both boolean and relevance-ranking systems. If a document or record includes a precise match for the query as a phrase or contains all the query terms, the document will be sure to appear near the beginning of the retrieval list. Documents that contain all the terms within a 20-word window will appear higher than documents in which the words are more dispersed. However, if there are no close matches, the default INQUERY operation will almost always retrieve some documents, even if they contain only one of the terms. This fallback matters because it is well documented that novice users of boolean systems are often frustrated by frequent searches that return no hits. Since a primary audience for the American Memory resources is the K-12 educational community, it is important to provide a simple interface that students can use without training.
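
One way to picture the combination of the four searches is as a set of tiers, with closer matches placed in earlier tiers. The sketch below is an assumption about how such results could be merged, not a description of the Library's modules: the tier order, the function names, and the sample documents are invented for illustration, and ordering within a tier (which would normally come from relevance ranking) is reduced here to sorting by identifier.

```python
def tokens(text):
    return text.lower().split()

def has_phrase(words, terms):
    """True if the query terms occur consecutively and in order."""
    n = len(terms)
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))

def within_window(words, terms, size=20):
    """True if every term occurs inside some run of `size` consecutive words."""
    needed = set(terms)
    starts = range(max(1, len(words) - size + 1))
    return any(needed <= set(words[i:i + size]) for i in starts)

def tier(text, query):
    """Return 0 for the closest kind of match, 3 for the loosest, None for no match."""
    words, terms = tokens(text), tokens(query)
    if has_phrase(words, terms):
        return 0                      # exact phrase
    if within_window(words, terms):
        return 1                      # all terms within a 20-word window
    if set(terms) <= set(words):
        return 2                      # all terms, more widely dispersed
    if set(terms) & set(words):
        return 3                      # at least one term
    return None

def search(query, docs):
    hits = [(tier(text, query), doc_id) for doc_id, text in docs.items()]
    return [doc_id for t, doc_id in sorted(h for h in hits if h[0] is not None)]

docs = {
    "d1": "letters from the civil war",
    "d2": "war is never civil",
    "d3": "civil " + "filler " * 30 + "war",
    "d4": "civil engineering",
    "d5": "railroad maps",
}
results = search("civil war", docs)
```

Here "d4" is still returned even though it contains only one of the two terms, which is the behavior that keeps a novice user from facing an empty result list.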

At present, the American Memory search interface presents a single "box" in which the user can enter terms. Using the approach described above, the search will retrieve any document that contains any of the words entered, wherever those words occur. The user has the choice of requiring matches to be exact or accepting word variants, such as plurals. For resources that are more structured and less heterogeneous than the historical collections, specialized query interfaces have been developed. Examples include search forms for THOMAS and for the Global Legal Information Network, released for public access on April 30, 1996. It is likely that an "expert" search form will be developed as an option for access to the historical collections.

This is only a snapshot

This brief summary describes the Library's indexing practices for its historical collections today. The details will certainly change as the archive expands and feedback from users suggests enhancements. INQUERY is a system under continuous development. The ability to search across the American Memory collections was added when an INQUERY upgrade in early 1996 provided support for simultaneous searching across INQUERY indexes. Another recently added feature is the ability to index incrementally when a few documents are added, rather than having to re-index an entire body of text. The next upgrade will support the assignment of variable weights to the individual indexes named in a query that spans several indexes. As the Library of Congress continues to mount text for public access, it will explore whether these features will serve to improve services. Meanwhile, the Library hopes that CIIR will add support for features now lacking, such as support for searching for terms within a sentence (as distinct from "within N words").
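
The three features mentioned above (incremental indexing, searching several indexes at once, and weighting indexes differently) can be sketched with a toy inverted index. This is a minimal illustration under stated assumptions, not INQUERY's design: the class, the scoring by raw term counts, the index names, and the sample text are all hypothetical.

```python
from collections import defaultdict

class Index:
    """A toy inverted index: term -> {document id: term count}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        """Add one document without rebuilding the existing postings."""
        for word in text.lower().split():
            self.postings[word][doc_id] = self.postings[word].get(doc_id, 0) + 1

    def search(self, query):
        scores = defaultdict(int)
        for term in query.lower().split():
            for doc_id, count in self.postings.get(term, {}).items():
                scores[doc_id] += count
        return dict(scores)

def search_all(query, weighted_indexes):
    """Search several indexes and merge hits, scaling scores by index weight."""
    merged = []
    for name, (index, weight) in weighted_indexes.items():
        for doc_id, score in index.search(query).items():
            merged.append((weight * score, name, doc_id))
    merged.sort(key=lambda hit: -hit[0])
    return [(name, doc_id) for _score, name, doc_id in merged]

photos, texts = Index(), Index()
photos.add("p1", "civil war camp")
texts.add("t1", "war war diary")

equal = search_all("war", {"photos": (photos, 1.0), "texts": (texts, 1.0)})
photos_first = search_all("war", {"photos": (photos, 3.0), "texts": (texts, 1.0)})

texts.add("t2", "war letters")   # incremental addition, no re-indexing
after_add = search_all("war", {"photos": (photos, 1.0), "texts": (texts, 1.0)})
```

Raising the weight on one index moves its hits up the merged list, which is the kind of control the planned upgrade is described as providing.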

To date, INQUERY has been used to index all text associated with the historical collections available over the World Wide Web. However, the informal model for access to the historical collections introduced in a diagram in part 1 of this article allows for many indexes. Indexes for text collections could be supported by another indexing engine in the future. More than one indexing engine could be integrated into the overall system. The flexibility of the INQUERY toolkit has allowed the Library of Congress to build experience in the retrieval of information from full text of various types. This experience will be invaluable in the evaluation of other indexing and retrieval systems.