
D-Lib Magazine
January 1999

Volume 5 Issue 1
ISSN 1082-9873

Geographic Names

The Implementation of a Gazetteer in a Georeferenced Digital Library


Linda L. Hill, James Frew, and Qi Zheng
Alexandria Digital Library Project
University of California, Santa Barbara

[email protected]

 

Abstract

The Alexandria Digital Library (ADL) Project has developed a content standard for gazetteer objects and a hierarchical type scheme for geographic features. Both of these developments are based on ADL experience with an earlier gazetteer component for the Library, based on two gazetteers maintained by the U.S. federal government. We define the minimum components of a gazetteer entry as (1) a geographic name, (2) a geographic location represented by coordinates, and (3) a type designation. With these attributes, a gazetteer can function as a tool for indirect spatial location identification through names and types. The ADL Gazetteer Content Standard supports contribution and sharing of gazetteer entries with rich descriptions beyond the minimum requirements. This paper describes the content standard, the feature type thesaurus, and the implementation and research issues.

Introduction

A gazetteer is a list of geographic names, together with their geographic locations and other descriptive information. A geographic name is a proper name for a geographic place or feature, such as Santa Barbara County, Mount Washington, St. Francis Hospital, and Southern California.

There are many types of printed gazetteers. For example, the New York Times Atlas has a gazetteer section that can be used to look up a geographic name and find the page(s) and grid reference(s) where the corresponding feature is shown. Some gazetteers provide information about places and features; for example, a history of the locale, population data, physical data such as elevation, or the pronunciation of the name. Some lists of geographic names are available as hierarchical term sets (thesauri) designed for information retrieval; these are used to describe bibliographic or museum materials. Examples include the authority files of the U.S. Library of Congress and the GeoRef Thesaurus produced by the American Geological Institute. The Getty Museum has recently made its Thesaurus of Geographic Names available online. This is a major project to develop a controlled vocabulary of current and historical names to describe (i.e., catalog) art and architecture literature. U.S. federal government mapping agencies maintain gazetteers containing the official names of places and/or the names that appear on map series. Examples include the U.S. Geological Survey's Geographic Names Information System (GNIS) and the National Imagery and Mapping Agency's Geographic Names Processing System (GNPS). Both of these are maintained in cooperation with the U.S. Board on Geographic Names (BGN). Many other examples could be cited -- for local areas, for other countries, and for special purposes. There is remarkable diversity in approaches to the description of geographic places and no standardization beyond authoritative sources for the geographic names themselves.

The Alexandria Digital Library (ADL) at the University of California at Santa Barbara is a geolibrary where a primary attribute of collection objects is their location on Earth, represented by geographic footprints. A footprint is the latitude and longitude values that represent a point, a bounding box, a linear feature, or a complete polygonal boundary. Figure 1 illustrates this approach by showing a screen shot of the ADL client known as JiGi, the "Java Interface to Georeferenced Information."

Figure 1. Screen Shot from the ADL client (JiGi 1.7) showing collection object footprints (upper left) associated with a query result set (upper right). One item is highlighted at the upper right, a thumbnail view of the image is shown in the same window, and the associated footprint is shown in red in the upper left window. The windows along the bottom show (from left to right) the Search Window where query parameters are set, the context-sensitive Help Window, and the Query Status Window.

Footprints can be attached to any type of information about a geographic location, including maps, aerial photographs, and remote sensing images as well as text items, museum objects, specimens, data sets, music -- any georeferenced piece of information. Figure 2 shows another screen shot of the ADL client, this time illustrating the results of a search of the gazetteer for entities in the query area of the type volcanoes.

Figure 2. Screen shot of the ADL client (JiGi 1.7). The upper window shows the footprints in the query region for a set of volcanoes from the gazetteer. The lower window shows a portion of the volcano listing. The entry for "Etna" is highlighted in the lower window and its footprint is shown in red in the upper window.

A gazetteer supports several functions of a geolibrary:

  1. It answers the "Where is" question; for example, "Where is Santa Barbara?"
  2. It translates between geographic names and locations so that a user of the library can find collection objects through matching the footprint of a geographic name to the footprints of the collection objects.
  3. It allows a user to locate particular types of geographic features in a designated area. For example, the user can draw a box around an area on a map and find the schools, hospitals, lakes, or volcanoes (as in Figure 2) in the area. This is possible because of the third required component of a gazetteer entry -- the type (or category) of place.
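
The three functions listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the ADL implementation: the `Entry` class, the helper names, and the two sample entries (with approximate coordinates) are assumptions made for the example, and footprints are reduced to points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Minimum gazetteer entry: a name, a footprint (here a point), and a type."""
    name: str
    lat: float   # decimal degrees, north positive
    lon: float   # decimal degrees, east positive
    ftype: str


# Illustrative entries only; coordinates are approximate.
GAZETTEER = [
    Entry("Santa Barbara", 34.42, -119.70, "populated place"),
    Entry("Etna", 37.75, 14.99, "volcano"),
]


def where_is(name):
    """Function 1: return the point locations recorded for a name."""
    return [(e.lat, e.lon) for e in GAZETTEER if e.name.lower() == name.lower()]


def box_around(lat, lon, half_size):
    """Function 2 helper: build a (south, west, north, east) query box
    centered on a point, as a user might draw one around a gazetteer hit."""
    return (lat - half_size, lon - half_size, lat + half_size, lon + half_size)


def in_box(entry, south, west, north, east):
    """True if the entry's point lies inside the box (no date-line wrap)."""
    return south <= entry.lat <= north and west <= entry.lon <= east


def find_by_type(ftype, south, west, north, east):
    """Function 3: names of features of a given type inside a query box."""
    return [e.name for e in GAZETTEER
            if e.ftype == ftype and in_box(e, south, west, north, east)]
```

A box returned by `box_around` could then be matched against the footprints of collection objects in the same way that `in_box` matches gazetteer points.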

 

Figure 3. Essential elements of a gazetteer for digital libraries (latitude and longitude coordinates shown in decimal degrees)

ADL took the first steps toward building its online gazetteer by integrating the GNPS (non-U.S. names) and the GNIS (U.S. names) gazetteers. The combined gazetteer had approximately 5.9 million geographic names categorized by a hybrid of the feature classes and types from the two sources. All of the names were identified by point locations. Point locations serve to disambiguate geographic names from one another and thus were sufficient for the purposes of the two government agencies. For a digital library application, the spatial extent of the feature, either approximately with a bounding box or more accurately with a polygonal representation, is better, but there are no large sets of gazetteer data with spatial extents. The point locations proved to be useful because they do serve to locate a place approximately and the user is free to draw a box around a selected point to use as a query area.

This gazetteer building experience demonstrated both the value of online gazetteers in digital libraries and their current limitations as spatial identification and retrieval tools. Specifically, we identified the following criteria for integrating gazetteers into digital libraries:

  1. Content Standard: There is a need for a standard conceptual schema for gazetteer information, so that this information may be more easily created and shared. There are many sources of spatially referenced geographic names, but they are mostly for specific purposes only and not designed to be interoperable or shareable.
  2. Feature Types: There is a need for a type scheme to categorize individual features for shared gazetteers. This scheme needs to be hierarchical, rich in term variants, and extensible to accommodate greater depth in terminology where needed. To be practical, this scheme needs to incorporate variant forms of terminology from established feature type schemes so that it can provide mappings between the various schemes.
  3. Temporal aspects: Geographic names, their footprints, their relationships to other places, and their associated descriptive elements all change through time. Gazetteers must therefore incorporate temporal ranges for this data.
  4. "Fuzzy" footprints: Since the extent of a geographic feature is often approximate or ill-defined (e.g., Southern California), there is a need for rules and methods of elicitation by which these "fuzzy" boundaries and locations are derived and presented to users.
  5. Quality aspects: Several aspects of gazetteer data quality need to be addressed. One is how to indicate the accuracy of latitude and longitude data. Another is the need to ensure that the reported coordinates agree with the other elements of the description. In general, data quality checks should be built in wherever possible for all data elements.
  6. Spatial extents: Many currently available gazetteers contain point locations only, often derived as a by-product of map production. Points do not represent the extent of the geographic locations and are therefore only minimally useful. Bounding boxes, while sufficient for many search purposes, often misrepresent the feature by including too much territory (for example, the bounding box for California also includes Nevada). In general, there is a need to represent the spatial extent of gazetteer entries with more bounding boxes and detailed boundaries. Establishing standards that enable the sharing of gazetteer data will help harvest data from many sources, but ultimately spatial locations and extents will need to be derived automatically from digital mapping products and other sources.

International and multilingual gazetteers raise additional issues. Perhaps the most difficult are the use of non-ASCII character sets, transliterations, and multilingual category sets. Fortunately, the use of coordinates for georeferencing operates across national and language barriers.

The next section of this paper gives details about the implementation of the original ADL gazetteer. This is followed by a description of the ADL Gazetteer Content Standard and the ADL Feature Type Thesaurus. Finally, there is a section on the issues involved in applying the Content Standard and the Thesaurus to imported gazetteer data.

Construction of the Initial ADL Gazetteer

The ADL model requires a comprehensive gazetteer as part of its spatial query function to provide access to spatially defined information by geographic name. As a start, ADL obtained two databases from the U.S. federal government and combined them. The first was the Geographic Names Information System (GNIS) from the U.S. Geological Survey and the second the Geographic Names Processing System (GNPS) from the Defense Mapping Agency (now the National Imagery and Mapping Agency). Both agencies are partners in the Alexandria Digital Library and both operate under the aegis of the Board on Geographic Names (BGN). The Board is authorized to establish and maintain uniform geographic name usage throughout the federal government. Sharing its responsibilities with the Secretary of the Interior, the Board has developed principles, policies, and procedures governing the use of both domestic and foreign geographic names as well as undersea and Antarctic feature names. Although established to serve the federal government as a central authority to which all name problems, name inquiries, and new name proposals can be directed, the Board also plays a similar role for the general public.

The GNIS and the GNPS offer the best available worldwide coverage of geographic place names. Each database has its own characteristics and the ADL staff attempted to retain the structure and detail of both. In the combining process, several alterations were needed (see below). The combined data set contains nearly six million entries.

The GNIS was developed to handle geographic name information for the official mapping and publishing programs of the federal government; to answer public inquiries about geographic names; and to provide an official geographic database for other data management activities. The system includes the official names (and often variant spellings and names) of all named places, features, and areas under the sovereignty of the United States. Each feature is identified by state, county, geographic coordinates, type of feature, and reference to the appropriate 1:24,000-scale U.S. Geological Survey (USGS) topographic map on which it appears. GNIS records almost 2 million geographic features -- from populated places, schools, reservoirs, and parks, to streams, valleys, springs, and ridges. The database is being compiled in two phases. The first phase is complete for all States and areas under U.S. jurisdiction and includes most feature names found on the 1:24,000-scale maps published by the USGS, as well as names on National Ocean Service charts, U.S. Forest Service maps, and in data files of the Army Corps of Engineers, the Federal Aviation Administration, and the Federal Communications Commission. The second phase of compilation is ongoing, and involves the collection of current and historical names from official State publications and local materials.

The GNIS can be searched through a World Wide Web interface <http://mapping.usgs.gov/www/gnis/>. Figure 4 shows an example of the type of report that is returned. The columns provide information for Feature Name, State, County Name, Type, Latitude and Longitude (in degrees, minutes, seconds), and the name of the USGS 7.5 minute topographic quadrangle map where the feature can be found.

FEATURE NAME                            | ST | COUNTY NAME | TYPE    | LATITUDE | LONGITUDE | USGS 7.5' MAP
Tulsa                                   | OK | Tulsa       | pop pl  | 360914N  | 0955933W  | Tulsa
Tulsa Country Club                      | OK | Osage       | locale  | 360958N  | 0960012W  | Sand Springs
Tulsa County                            | OK | Tulsa       | civil   | 360600N  | 0955400W  | Jenks
Tulsa Helicopters Incorporated Heliport | OK | Tulsa       | airport | 360500N  | 0955205W  | Broken Arrow

Figure 4. Example of report from the GNIS Gazetteer (March 29, 1998).
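
The GNIS report gives coordinates as packed degrees, minutes, and seconds followed by a hemisphere letter (for example, 360914N), whereas ADL footprints are expressed in decimal degrees. A conversion along the following lines would bridge the two. It is only a sketch: the function name is hypothetical, and the layout (two digits of degrees for latitude, three for longitude) is inferred from the sample report rather than from GNIS documentation.

```python
def packed_dms_to_decimal(value):
    """Convert 'DDMMSSH' or 'DDDMMSSH' (H is N, S, E, or W) to decimal degrees.

    South and west values are returned as negative numbers.
    """
    value = value.strip()
    if len(value) < 2:
        raise ValueError(f"unrecognized coordinate: {value!r}")
    hemisphere = value[-1].upper()
    digits = value[:-1]
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unrecognized hemisphere in {value!r}")
    if not digits.isdigit() or len(digits) not in (6, 7):
        raise ValueError(f"unrecognized coordinate: {value!r}")
    seconds = int(digits[-2:])      # last two digits
    minutes = int(digits[-4:-2])    # two digits before the seconds
    degrees = int(digits[:-4])      # whatever remains (2 or 3 digits)
    decimal = degrees + minutes / 60 + seconds / 3600
    return -decimal if hemisphere in ("S", "W") else decimal
```

For the Tulsa row, `packed_dms_to_decimal("360914N")` gives roughly 36.1539 and `packed_dms_to_decimal("0955933W")` gives -95.9925.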

The Geonet Names Server provides access to the GNPS through a National Imagery and Mapping Agency (NIMA) web site <http://164.214.2.59/gns/html/index.html>. Figure 5 shows an example of the type of report that is returned. The columns provide information for Name, Designation (Type), Latitude and Longitude (in degrees, minutes, seconds), FIPS code for the area, Universal Transverse Mercator (UTM) grid number, and Joint Operations Graphic (JOG) map number for the area.

NAME                      | DESIG | LATITUDE   | LONGITUDE | AREA | UTM   | JOG NO
Preston                   | PPL   | 50°39'00"N | 2°25'00"W | UK00 | WB41  | NM30-05
Preston: see East Preston | PPL   | 50°47'00"N | 0°28'00"W | UK00 | 0XB72 | NM30-06
Preston                   | PPL   | 50°50'00"N | 0°08'00"W | UK00 | YB03  | NM30-06
Preston                   | PPL   | 51°18'00"N | 1°14'00"E | UK00 | CS78  | NM31-01

Figure 5. Example of report from the NIMA Geonet Names Server (GNPS) (March 29, 1998).

ADL merged these two gazetteers to build the original ADL gazetteer. Creating a single set of categories for the merged feature types proved to be the most challenging part of the merging process.

The GNIS Feature Type is designed to group similar features into broadly designated categories. These categories are not a government standard; they were designed solely to facilitate retrieval. The GNIS Feature Type list is shown in Figure 6.

  airport   building   dam        isthmus    po          stream
  arch      canal      falls      lake       ppl         summit
  area      cape       flat       lava       range       swamp
  arroyo    cave       forest     levee      rapids      tower
  bar       cemetery   gap        locale     reserve     trail
  basin     channel    geyser     mine       reservoir   tunnel
  bay       church     glacier    oilfield   ridge       valley
  beach     civil      gut        other      school      well
  bench     cliff      harbor     park       sea         woods
  bend      crater     hospital   pillar     slope
  bridge    crossing   island     plain      spring

Figure 6. GNIS Feature Type List

GNIS has implemented a mapping from the "generic" terms in geographic names to these categories. For example, in the name "June Lake" the word "lake" is mapped to the category lake. Other generic terms mapped to lake are "backwater," "lac," "lagoon," "laguna," "pond," "pool," "resaca," and "waterhole." When ADL first processed the GNIS data, there were approximately 1900 different generic words mapped to 62 type categories. We found the following problems with this mapping:

  1. For about 100 out of the 1900 generic words, there are one-to-many relationships between the generic terms and possible categories, so some other clues (usually human judgement) have to be used to determine which categories to use. Some GNIS generic terms have comments to help with this decision (for example, "draw (deep)" maps to valley but "draw (narrow)" maps to arroyo). Some of the most difficult generic words in this regard are terms that refer to an aspect of the feature such as its shape. An example is "finger," which could be either a lake or a pillar feature.
  2. Geographic names can be poor indicators of feature type, since any name can be given to any feature. Often the generic part of the name does not reflect the authoritative definition of the feature type; that is, someone knowledgeable about the definitions of the terms would likely have assigned a different generic term.
  3. The GNIS feature type definitions are terse. The generic term mappings are the best source of information about what a category includes. Some GNIS feature types show a lot of separation from other types; for example, the generic mappings for airport, military, and church do not show any overlap with other feature types. Other categories, however, are difficult to separate; for example, stream, channel, canal, gut, swamp, bay, and lake have many generic mappings in common. When a category is very broad, such as the GNIS area and locale categories, it essentially becomes a dumping ground for a wide range of feature types not otherwise categorized.
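
The generic-term mapping and its one-to-many problem can be illustrated with a small lookup table. The sketch below uses only the generic terms quoted in the text; the table name, the function names, and the rule of deferring ambiguous names to manual review are assumptions for illustration, not a description of the GNIS or ADL software.

```python
# Generic term -> candidate GNIS feature types. Only terms quoted in the
# text are listed; the GNIS data held roughly 1900 generic words.
GENERIC_TO_TYPES = {
    "lake": {"lake"},
    "backwater": {"lake"},
    "lac": {"lake"},
    "lagoon": {"lake"},
    "laguna": {"lake"},
    "pond": {"lake"},
    "pool": {"lake"},
    "resaca": {"lake"},
    "waterhole": {"lake"},
    "finger": {"lake", "pillar"},  # one-to-many: needs another clue
}


def candidate_types(name):
    """Collect every feature type suggested by a generic word in the name."""
    found = set()
    for word in name.lower().split():
        found |= GENERIC_TO_TYPES.get(word, set())
    return found


def classify(name):
    """Return a single type when the name is unambiguous, otherwise None
    (no generic word recognized, or several candidate types)."""
    types = candidate_types(name)
    return next(iter(types)) if len(types) == 1 else None
```

With this table, `classify("June Lake")` yields `lake`, while a hypothetical name such as "Devils Finger" yields `None` because "finger" maps to both lake and pillar.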

The GNPS category system is a hierarchy of nine feature classes and 638 feature types -- each type appears in only one class. To integrate the two gazetteers, the GNPS scheme was compared to the GNIS feature types and generic terms. Complex relationships were found that made the integration difficult, but it was possible to use most of the GNPS feature classes for the top level of the hierarchy, and the GNIS feature types as the next level down. Although most of the categories were maintained, some classes and feature types had to be adjusted:

  1. The GNPS classes Area Features and Spot Features were not used because they were based on the size or extent of a feature rather than on its type. Likewise, geographic names in the GNIS "area" category were moved to more specific categories, or to the "other" feature type.
  2. Other GNIS feature types were kept in the merged ADL file, with the following exceptions:

The resulting class and feature type scheme for the original ADL gazetteer is shown in Figure 7.

ADMINISTRATIVE AREAS: civil, military, reserve

HYDROGRAPHY: bay, canal, channel, falls, geyser, glacier, gut, lake, rapids, reservoir, sea, spring, stream, swamp, well

HYPSOGRAPHIC/RELIEF (INCLUDING UNDERSEA FEATURES): arch, arroyo, bar, basin, beach, bend, cape, cave, cliff, crater, flat, gap, island, isthmus, lava, pillar, range, ridge, slope, summit, valley

MANMADE/CULTURAL: building, cemetery, church, dam, harbor, hospital, levee, mine, oil field, other, school, tower

PPL (Populated Place): locale, ppl

TRANSPORTATION: airport, bridge, road, tunnel, trail

VEGETATION: agriculture, forest, park, vegetation, woods

Figure 7. Original Feature Class and Type Scheme Based on a Merger of the GNIS and GNPS.
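
Because each feature type belongs to exactly one class, the two-level scheme in Figure 7 can be held as a simple mapping and inverted to look up the class of any type. The sketch below covers only four of the seven classes, and the variable and function names are hypothetical.

```python
# Excerpt of the Figure 7 scheme: class -> feature types.
SCHEME = {
    "ADMINISTRATIVE AREAS": ["civil", "military", "reserve"],
    "PPL (Populated Place)": ["locale", "ppl"],
    "TRANSPORTATION": ["airport", "bridge", "road", "tunnel", "trail"],
    "VEGETATION": ["agriculture", "forest", "park", "vegetation", "woods"],
}

# Invert the mapping; this is lossless only because no type is repeated.
TYPE_TO_CLASS = {ftype: cls for cls, types in SCHEME.items() for ftype in types}


def class_of(ftype):
    """Return the class for a feature type, or None if it is not in the excerpt."""
    return TYPE_TO_CLASS.get(ftype.lower())
```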

ADL Gazetteer Content Standard

The original ADL gazetteer database structure is very simple. An entry records only the full-character name (with representation of non-ASCII characters), the sort name (with non-ASCII characters removed), the class, the feature type, the coordinates for the point location, and the source of the entry (i.e., either GNIS or GNPS). These attributes were sufficient for the initial purpose of providing geographic name identification and access through ADL. Users can search for geographic names or types of geographic features in the gazetteer, within a specified area or worldwide. ADL returns matching names and displays their point locations on a map. This is often sufficient to tell the user where a feature is located. A point location can also be the focal point of a user-drawn query box on an interactive map to search for library holdings whose footprints overlap or are contained within the query box.

The original gazetteer design, however, did not satisfy ADL's requirements for a gazetteer that is (1) populated with more expressive footprints (e.g., bounding boxes), (2) able to represent historical names and footprints (i.e., date attributes), and (3) associated with a well-designed hierarchical type scheme. The next stage of development was therefore to develop a Gazetteer Content Standard (GCS) and a Feature Type Thesaurus (FTT).

The GCS is designed to support contribution and sharing of gazetteer information. A thesaurus structure for the gazetteer was considered but rejected because it is not as well suited to contribution and sharing between multiple originating sources. ADL has developed a relational database model for the GCS and is populating the database with sets of gazetteer data. Our initial experiences in converting data to the new GCS are discussed below.

The GCS specifies required and repeatable attributes. It supports contribution from multiple sources by (1) having metadata entries for each geographic name, (2) recording the contributor, source, and entry date for each attribute-value component of the entry, and (3) representing the relationships between geographic names through links between entries rather than by a thesaurus structure. The descriptive elements in the standard include:

  1. Designated name by which the entity is known and any variant names. All names can have the following optional attributes: name source, etymology, language, pronunciation, transliteration scheme, and character set. All names are flagged as current or historical and can have beginning and ending dates.
  2. Footprints can be entered as points, lines, bounding boxes, or bounding polygons. A single gazetteer entry may have multiple footprints and each can be described as current or historical with beginning and ending dates. Measurement method, date, and accuracy can be recorded. Street addresses are also supported.
  3. The "relationship" section of the standard supports the relationship of one gazetteer entry to another (e.g., "IsPartOf"). These relationships can be specified as current or historical and date ranges given. These explicit relationships supplement the inherent geometric and topological relationships that can be derived from the entries' footprints.
  4. Descriptive information about the gazetteer entry is structured into (1) textual description, (2) geographic feature data, such as population, elevation, and length, (3) links (e.g., URLs) to related sources of information, and (4) a supplemental note.
  5. Contributor Information and Source Information are described by linking to the appropriate Contributor and Source descriptive entries.
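
To make the shape of a GCS entry concrete, the sketch below models a subset of the elements listed above as Python dataclasses. It is not the GCS relational schema: the class and field names are hypothetical, dates are kept as plain strings, and only names, footprints, and relationships are represented. What it does preserve is the idea that every attribute value carries its own contributor, source, and entry date.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Provenance:
    """Who contributed a value, from which source, and when."""
    contributor: str
    source: str
    entry_date: str


@dataclass
class Name:
    text: str
    provenance: Provenance
    current: bool = True                # current or historical
    begin_date: Optional[str] = None
    end_date: Optional[str] = None
    language: Optional[str] = None


@dataclass
class Footprint:
    kind: str                               # "point", "line", "box", or "polygon"
    coordinates: List[Tuple[float, float]]  # (latitude, longitude) pairs
    provenance: Provenance
    current: bool = True


@dataclass
class Relationship:
    kind: str                           # e.g. "IsPartOf"
    target_id: str                      # identifier of the related entry
    provenance: Provenance
    current: bool = True


@dataclass
class GazetteerEntry:
    entry_id: str
    feature_type: str
    names: List[Name] = field(default_factory=list)
    footprints: List[Footprint] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)

    def current_names(self):
        """Names flagged as current, in the order they were added."""
        return [n.text for n in self.names if n.current]
```

Keeping provenance on each name, footprint, and relationship, rather than on the entry as a whole, is what allows a single entry to combine values from several contributors.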

Figure 8. Illustration of the ADL Gazetteer Content Standard

Figure 8 illustrates the use of the GCS in describing a place. This is an entry converted from the NIMA gazetteer, with some additional information. The full GCS specification, and PDF diagrams of the GCS relational database schema, are viewable through the first author's homepage (http://www.alexandria.ucsb.edu/~lhill).

ADL Feature Type Thesaurus

The Feature Type Thesaurus (FTT) has been built as a hierarchical scheme in accordance with the ANSI/NISO Guidelines for the Construction, Format, and Management of Monolingual Thesauri (Z39.19-1993). A thesaurus software package was selected and installed to support the building and maintenance of the thesaurus. Terms from various sources were reviewed and evaluated, especially the work described above to develop the feature type scheme for the original ADL Gazetteer. A small set of top terms (major categories) was chosen to give structure to the hierarchy:

Preferred terms were chosen based on their use in reference sources and available dictionaries. Other closely related terms were entered as non-preferred terms, providing a rich set of pointers to the preferred terms. This process iterated until the structure and design became stable enough that new terminology could be easily integrated into the existing structure.

Sources consulted in building the thesaurus included:

  1. Documentation from the building of the original ADL gazetteer.
  2. Appendix to U.S. Geological Survey Circular 1048 which describes Enhanced Digital Line Graph feature types.
  3. The feature categories used by the National Imagery & Mapping Agency (NIMA) for its Information System Data Model.
  4. Feature types used by the Canadian Permanent Committee on Geographic Names (CPCGN).
  5. Terms used to describe geographic place entries in the GeoRef Thesaurus.
  6. Terms used to categorize the entries in the Getty Thesaurus of Geographic Names.
  7. Terms from the Getty Art & Architecture Thesaurus.
  8. Terms from the Library of Congress Thesaurus of Graphic Materials II.
  9. Definitions from A Dictionary of the Natural Environment.
  10. Definitions from The New Penguin Dictionary of Geology.
  11. Definitions from The American Heritage Dictionary.
  12. Experience in converting gazetteer data sets to the new ADL gazetteer database schema.

The basic thesaurus design decisions were:

  1. Generic relationships are implied between broader and narrower terms. That is, the narrower term is a member of the broader class.
  2. Plural terms are used instead of singular terms.
  3. Multiple hierarchies are allowed but not often used.
  4. The depth of the hierarchy (i.e., the specificity of the preferred terminology) was heuristically determined based on the specificity likely to be needed by ADL. Terms more specific than needed by ADL were entered as non-preferred terms pointing to the broader term. If needed later, these non-preferred terms for more specific concepts can be changed to preferred narrower terms. For example, specific types of wetlands, such as swamps and bogs, are currently entered as non-preferred terms pointing to wetlands. A user looking for swamp features will be advised to use wetlands for the search. One reason for limiting the depth of the hierarchy is that ADL must perform feature type conversions when bringing in sets of gazetteer data and choice of feature type can be no more specific than can be justified by the clues in the incoming data.
  5. All terms are in English. The MultiTes software, however, can handle parallel versions of thesauri in different languages and such an enhancement would be a natural development as the thesaurus matures.
  6. Terms are not capitalized and are used in their natural order, rather than inverted. For example, drainage basins is used instead of basins, drainage.
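
The interplay of preferred terms, non-preferred ("USE") references, and broader terms can be sketched as two small tables. The wetlands, swamps, and bogs entries follow the example in the design decisions above; placing wetlands and lakes under hydrographic features is an assumption made for illustration, as are the table and function names.

```python
# Preferred term -> its broader term (assumed placement, for illustration).
BROADER = {
    "wetlands": "hydrographic features",
    "lakes": "hydrographic features",
}

# Non-preferred term -> preferred term ("USE" references).
USE = {
    "swamps": "wetlands",
    "bogs": "wetlands",
}

# Every term that appears as a preferred term somewhere in the hierarchy.
PREFERRED = set(BROADER) | set(BROADER.values())


def preferred(term):
    """Resolve a term to its preferred form, or None if it is unknown."""
    term = term.lower()
    if term in USE:
        return USE[term]
    return term if term in PREFERRED else None


def broader_chain(term):
    """List the broader terms above a term, nearest first."""
    chain = []
    current = preferred(term)
    while current in BROADER:
        current = BROADER[current]
        chain.append(current)
    return chain
```

Promoting a non-preferred term later, as the fourth design decision anticipates, would amount to moving it from `USE` to `BROADER`.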

Figure 9 illustrates the FTT by showing the entry for one term, hydrographic features. This set of terms can be compared to the hydrography section of the original ADL class and feature type scheme. Note that the Narrow Terms can themselves have narrower terms that are not shown here. The Used For terms are referrals from terms considered to be synonymous for the purposes of this set of terminology.

Figure 9. Illustration of the ADL Feature Type Thesaurus

The ADL Feature Type Thesaurus can be browsed at <http://alexandria.sdc.ucsb.edu/~lhill/html/index.htm>.

Feature Type Conversion Issues

To populate the new ADL gazetteers, incoming entries must be mapped to the GCS relational database and incoming category terminology must be mapped to the FTT. Mapping the feature type terminology is always one of the hardest parts of the conversion. The problems can be characterized as ones of specificity (where the incoming data has more specific terms than the FTT); generality (where the incoming data has broader categories than the FTT); definition (where it isn't clear how to interpret the scope of an incoming category); and no category (where there has been no attempt to categorize the placenames).

Specificity mismatches are solved by relating the more specific incoming terms to the appropriate broader terms in the FTT. For example, when a new term for a type of wetlands is found in the incoming data, this term is added to the FTT with a reference to "USE wetlands." Wetlands will then be used as the FTT category for all of the geographic names of that category in the incoming data. Sometimes a more specific term cannot be completely mapped to just one FTT term. For example, a category such as abandoned banana plantation (a NIMA term) is mapped to both agricultural sites and historical sites in the FTT.

Generality mismatches are much more difficult. Most of the category schemes we have seen have catchall categories such as other and area that have no equivalent in the FTT. Applying a more specific category to these geographic names is problematic. In some cases, the names themselves may provide some clues to what they are. But, as described above, there are many problems with determining category from the placenames themselves. Our choices in these cases are to (1) consider these entries as "uncategorized" and include them in the ADL gazetteer without an ADL feature type or (2) develop conversion algorithms that use name components and other clues to pick an appropriate type. We make the choice on the basis of how successful we think we will be in deriving appropriate categories.
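
The handling of specificity and generality mismatches might be expressed as a conversion table plus a rule for catchall categories. The following sketch takes option (1) for catchalls, loading them as uncategorized. The table contents beyond the examples in the text, the function name, and the choice to raise an error for unmapped terms are all assumptions.

```python
# Incoming category -> one or more FTT preferred terms.
INCOMING_TO_FTT = {
    "swamp": ["wetlands"],
    "abandoned banana plantation": ["agricultural sites", "historical sites"],
}

# Catchall categories with no FTT equivalent.
CATCHALL = {"other", "area"}


def convert_type(incoming):
    """Map an incoming category to FTT terms.

    Returns an empty list for catchall categories (entry is loaded as
    uncategorized) and raises LookupError for terms not yet mapped, so
    that a new reference can be added to the table.
    """
    term = incoming.strip().lower()
    if term in CATCHALL:
        return []
    if term not in INCOMING_TO_FTT:
        raise LookupError(f"no FTT mapping for {incoming!r}")
    return list(INCOMING_TO_FTT[term])
```

Raising an error for unknown terms, instead of silently dropping them, mirrors the practice of adding each newly encountered term to the thesaurus with a reference to its broader term.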

Definitions for categories are very useful for developing conversion mappings. Where definitions do not exist, unfamiliar terms can be looked up in other reference sources. Although this could lead to the category being interpreted incorrectly, the FTT is general enough that it is usually possible to map to a broader term with confidence.

Some sources of gazetteer information do not categorize their geographic names explicitly. This is true, for example, of the geographic name entries in the GeoRef Thesaurus. For their purposes, the type of a place is not a significant descriptive element. Fortunately, most of their "scope notes" for terms refer to a place by its type, and this can be used to assign a category. This does not hold, however, for state and country names. For our conversion, we can compare these names to lists of states, provinces, and countries to make this determination. Lack of type information is also true for geographic names and extents that can be extracted from digital map products. The map layer that contains the feature and some broad categories (e.g., major cities) applied to features provide some information, but not as much as we need to assign the most appropriate type category from the FTT.

Other Conversion Issues

We recently converted the NIMA gazetteer to the ADL GCS database. The problems we encountered are illustrative. The next conversion will bring its own set of issues, but these show how difficult even a single conversion can be and how much we could benefit from a shared approach to gazetteer representation.

  1. We have two versions of the NIMA gazetteer data. One is a Sybase(tm) table dump. The geographic names in this version contain diacritical marks that our software cannot display or search correctly, so we could not use it. The other is in HTML format, with a file for each country. Since most of these files have the same structure, we developed an algorithm to delete the HTML tags and create ASCII plain text files that we then used as the source data for our conversion.
  2. NIMA has several ways to indicate variant names for a single geographic feature. In some cases, they have separate "see" references from the variant name to the "authorized name." In other cases, they include several names in one entry and modify the names with either (1) the countries in which the name is used, or (2) the language that the name is in. The only difference between the country and language modifiers is that the country modifiers are all upper-case whereas the language modifiers are mixed case. We identified the "authorized name" and linked the other names to it. Case-sensitive algorithms were required to identify language and country modifiers: language was recorded in the Language element and country was recorded in the Etymology element (for lack of a better treatment) for each name version.

    We also found about 4,000 variant names that point to an "authorized" name that is either "NAME NOT KNOWN" or that does not exist in the data we received. In these cases, we used the variant name as the name for the feature. We did not load any of the "NAME NOT KNOWN" entries, since we consider a gazetteer to contain entries for named features.

  3. NIMA uses FIPS Publication 10-4 to identify geographic entities. Unfortunately, the FIPS code in a NIMA entry may be either (1) the code for the place itself or (2) the code for the larger place that contains it. Moreover, many codes were either clearly incorrect or missing from the lists of FIPS codes publicly available from NIST or included on the NIMA CD-ROM. We ended up deciding not to use the codes directly. We did use the first two characters of the code to find the place's country, which in most cases worked well for creating a relationship entry. We would like to find a way to add FIPS codes later.
  4. We cannot verify the accuracy of either the UTM areas or the JOG map areas listed for each gazetteer entry (see Figure 5). We entered the information as-is, but wish we could identify possible misplaced points.
  5. All of the NIMA data includes point locations only. This includes countries for which we already have bounding box entries in our new ADL gazetteer from another source. A simple solution would be to ignore the NIMA country entries that duplicate the ADL entries, but the NIMA data includes variant names and other useful information. We identified 153 duplicate country entries and replaced the NIMA point data with bounding box footprints, noting the contributor and source for each piece of data (a feature of the Gazetteer Content Standard).
  6. The relationships of point locations to their corresponding features are not explained. For example, we cannot tell if a point is a central point, a point where a feature was labeled on a map, or the mouth of a river or some other random point along a river. This type of information would be very useful for interpretation and subsequent use of the data.
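
Three of the steps in the list above lend themselves to a short sketch: stripping HTML tags (item 1), telling country modifiers from language modifiers by letter case (item 2), and taking the country portion of a FIPS code (item 3). The Python below is an illustration under stated assumptions, not the algorithms we actually ran. It assumes the modifier strings have already been separated from the names, it uses the standard library HTML parser in place of our own tag-deletion routine, and all function names are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML document and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text):
    """Return the plain text of an HTML string (item 1)."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


def classify_modifier(modifier):
    """Classify a name modifier by letter case (item 2).

    All upper-case letters means a country modifier; mixed case means
    a language modifier. Returns None if there are no letters at all.
    """
    letters = [ch for ch in modifier if ch.isalpha()]
    if not letters:
        return None
    if all(ch.isupper() for ch in letters):
        return "country"
    return "language"


def country_prefix(fips_code):
    """Return the first two characters of a FIPS-style code (item 3).

    Returns None when the code is missing or too short to contain a
    country portion. The result is not validated against any code list.
    """
    code = (fips_code or "").strip()
    if len(code) < 2:
        return None
    return code[:2].upper()
```

In this sketch, a "country" result would be recorded in the Etymology element and a "language" result in the Language element, as described in item 2. The prefix returned by country_prefix would still have to be checked against a list of country codes before a relationship entry is created.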

Converting the NIMA gazetteer to ADL took about 2 weeks of staff time to develop the conversion mapping, and about a month to actually convert and load the data. Some of this effort is reusable, especially the new terms added to the FTT, but most of it was custom mapping from the NIMA structure to ours and will not be applicable to any other conversion.

Current Status and Future Development

ADL is currently populating the new gazetteer database. Several sets of gazetteer data have been obtained and will be converted and loaded. This process is very labor intensive, as discussed above. We anticipate that ADL will build several gazetteers, rather than one huge one as before; but the details of how this will be done, and how references among them will be made, have yet to be worked out.

Six issues were identified at the beginning of this paper. The work that ADL has reported on here responds to the first three issues: a content standard, a system of feature types, and the representation of the temporal aspects of gazetteer data. We will continue to engage the creators and users of gazetteers in testing and endorsing this approach, as well as gaining support for research into the last three issues: eliciting and representing fuzzy areas, quality control of gazetteer data, and obtaining bounding boxes and polygonal extents.

Acknowledgements

Funding from NSF, DARPA, and NASA under NSF IR94-11330 supported the Alexandria Digital Library Project. The NASA EOSDIS Project has also contributed to the funding of the design and implementation of the gazetteer. We would also like to thank the members of the ADL Implementation Team and the Map and Imagery Laboratory of the Davidson Library at the University of California, Santa Barbara.

References

Alexandria Digital Library. (1998). Homepage. http://www.alexandria.ucsb.edu.

American Geological Institute. (1994). GeoRef Thesaurus. (7th ed.). Alexandria, Va.: American Geological Institute.

American Heritage Dictionary of the English Language. (1992). (3rd ed.). Boston: Houghton Mifflin.

Aurand, M. (1995). Gazetteer Report. (Internal report in looseleaf binder). Santa Barbara: Map and Imagery Laboratory, Davidson Library, University of California, Santa Barbara, 1995.

Getty Information Institute. (1997). Thesaurus of Geographic Names. http://www.ahip.getty.edu/tgn_browser/.

Getty Information Institute. (1998). The Art and Architecture Thesaurus Browser. http://www.ahip.getty.edu/aat_browser/.

Goodchild, M. F. (1998 forthcoming). The geolibrary. In S. Carver (Ed.) Innovations in GIS. London: Taylor and Francis.

Guptill, S. C. (1990). An Enhanced Digital Line Graph Design: A Feature-based Data Model for Digital Spatial Data Bases that Represents Geographic Phenomenon. (Government report) U.S. Geological Survey Circular 1048. Washington, DC: U.S. Geological Survey, 1990.

Kearey, P. (1996). The New Penguin Dictionary of Geology. London: Penguin Books.

Mackay, D., John Bartholomew and Son, & Times Books. (1992). The New York Times Atlas of the World. (3rd rev. concise, 3rd US ed.). New York, N.Y.: Times Books.

Monkhouse, F. J., & Small, J. (1978). A Dictionary of the Natural Environment. New York: Halsted.

Multisystems. (1997). MultiTes: Thesaurus Construction Made Easy (Version 6). Miami, FL: Multisystems, P.O. Box 833205, Miami, FL 33283-3205.

National Information Standards Organization (U.S.). (1994). Guidelines for the construction, format, and management of monolingual thesauri. Z39.19-1993. Bethesda, MD: NISO Press.

Secretariat of the Canadian Permanent Committee on Geographical Names. (1998). Canadian Geographical Names, Natural Resources Canada. http://geonames.nrcan.gc.ca/english/Home.html.

U.S. Board on Geographic Names. (1998). http://mapping.usgs.gov/www/gnis/bgn.html and http://164.214.2.59/gns/html/BGN.html.

U.S. Geological Survey. (1998). Geographic Names Information System (GNIS). http://mapping.usgs.gov/www/gnis/.

U.S. Library of Congress. (1998). Thesaurus for Graphic Materials II: Genre and Physical Characteristic Terms. http://lcweb.loc.gov/rr/print/tgm2/.

U.S. National Imagery & Mapping Agency. Architecture Office. (1997). The United States Imagery & Geospatial Information System Data Model (USIGS/DM); Interim Report. Looseleaf government report. Washington, DC, July 18, 1997.

U.S. National Imagery and Mapping Agency. (1998). Geonet Names Server. http://164.214.2.59/gns/html/index.html.

U.S. National Institute of Standards and Technology (NIST). (1998). Countries, Dependencies, Areas of Special Sovereignty, and Their Principal Administrative Divisions. Federal Information Processing Standard (FIPS) Publication 10-4. http://www-09.nist.gov/div897/pubs/index.htm

Copyright © 1999 Linda L. Hill, James Frew, and Qi Zheng

DOI: 10.1045/january99-hill