A Content Standard for Computational Models

D-Lib Magazine
June 2001

Volume 7 Number 6

ISSN 1082-9873

Linda L. Hill
Scott J. Crosier
Terence R. Smith
Michael Goodchild
Department of Geography
University of California, Santa Barbara
Santa Barbara, CA 93106

Abstract

Computational models are created to simulate a set of processes observed in the natural world in order to gain an understanding of these processes and to predict the outcome of natural processes given a specific set of input parameters. Conceptual and theoretical modeling constructs are expressed as sets of algorithms and implemented as software packages. The modeling software packages, if adequately described for human understanding and machine processing, can become objects in digital library collections where they can be found and used in applications without the direct involvement of the creator. This amounts to the publishing of modeling software with accompanying metadata in the same way that other publications are treated in library collections. This paper addresses the requirements for a content standard to describe such computational models. This work is part of the Alexandria Digital Earth Prototype (ADEPT) project at the University of California, Santa Barbara, an NSF Digital Library II project (Alexandria Digital Library Project, 2001). The intent is to add modeling software packages as collection objects in the ADEPT collections to support research, education, and learning activities and to enable the matching of appropriate datasets in the digital library collections to modeling software.

Introduction

The creation of computational modeling software has grown at an accelerating rate since the earliest applications to the modeling of real-world phenomena during the 1940s. Supported by increasingly powerful hardware, software, and networking environments, growing numbers and varieties of computational models are being developed to support research, development, and education in all areas.
For a variety of historical and technical reasons, including major problems of interoperability at all semantic levels and weak support for "publishing" computational models, library-based mechanisms to support the widespread distribution and use of modeling software have been slow to develop. While distributed digital libraries (DLs) and the World Wide Web offer a natural infrastructure for such distribution, critical aspects of an effective infrastructure have yet to evolve. In particular, there are no generally accepted procedures for describing computational models in ways that support cataloging, search, selection, and use.

In this paper, we propose a content standard for describing computational models. This Content Standard for Computational Models (CSCM) was developed partly in response to the general need for such descriptions and partly in response to the immediate needs of the Alexandria Digital Earth Prototype (ADEPT) project at the University of California, Santa Barbara (UCSB). ADEPT is developing services that facilitate the construction of personalized digital collections that support learning in a variety of contexts. Since the project views computational models of environmental phenomena as critical DL resources for helping students understand and reason scientifically about natural and human-influenced phenomena, it is useful to provide a metadata framework that standardizes the way in which modeling software is described, so that models can be integrated into DLs with other types of information in support of education and learning.
Content Standards and Computational Modeling Software

The primary purpose of CSCM is to provide enough information that potential users of the model (other than its creators) have a reasonable chance of finding it in a distributed DL environment, evaluating its potential applicability for their purposes (e.g., research, education), obtaining it, running it successfully in some computational environment and with appropriate datasets, and understanding the results. Computational modeling software will process certain kinds of data and produce specified output; it will incorporate certain variables and parameters; it will have known limitations and will be more suitable for some uses than for others; it may operate only in some computational environments and may require that other software packages be simultaneously available; and its use may be subject to licensing agreements. It is important to provide potential users with an understanding of these aspects and also with a sense of the theoretical and computational choices made by the modeler to represent the real-world phenomenon. All of these characteristics need to be documented in metadata, along with contact information for obtaining the software or getting help in using it.

In relation to these primary goals, we note that the standard is not intended to define the manner in which the information is presented to a user, but to specify a description framework to support search, retrieval, and evaluation. The design of user interfaces and report presentations is an independent activity based on the metadata structure. We also note that the standard is not currently specified to the point of being able to fully support machine-machine analogs of such activities.

In developing a CSCM, one must resolve some issues that are generic to metadata standards and others that are specific to computational models.
We have adopted the metadata design framework of the International Organization for Standardization's TC 211 group for their metadata standard for geographic information (International Organization for Standardization (ISO), 2000), which is in turn based on the U.S. Federal Geographic Data Committee's Content Standard for Digital Geospatial Metadata (U.S. Federal Geographic Data Committee, 1998). Furthermore, we assume that metadata based on the CSCM will co-exist in DLs with other metadata structures. The use, for example, of a relatively small number of search buckets into which heterogeneous metadata descriptions can be mapped provides a useful mechanism for supporting interoperability among different metadata representations (Frew et al., 1999).

There exist many definitions of models in general and computational models in particular (Aris, 1978 (reprinted 1994); Benz, 1997; Chorley, 1967; Dee, 1994). An adequate core definition of computational models for current purposes is: a set of computational codes, executable in some software/hardware environment, that transform a set of input data into a set of output data, with the input, output, and transformation typically having some interpretation in terms of real-world phenomena.

Two specific examples of models satisfying this broad definition have been described using this initial version of the CSCM to test the design of the content standard (see <http://www.alexandria.ucsb.edu/doc>). The first model (Smith, 2001) takes the form of a set of C-language codes that transform two initial input datasets into two output datasets. The input datasets represent a land surface and a flow of water over the surface; the transformation represents a time-dependent erosion process; and the two output datasets represent the land surface and water flow field at later times.
The second example (Clarke, 2001) is a cellular automaton model of urban growth where multiple datasets showing a variety of land cover properties for at least four urban time periods are used for input, and the output visualizes and predicts urban growth into the future by using urban growth coefficients. Model metadata will continue to be created for the ADEPT project, with modifications to the CSCM as necessary.

Several general considerations arise in deciding how to structure a content standard for computational modeling software. First, the syntactic and semantic complexity of many models makes it difficult to provide a definitive metadata description of reasonable length, more difficult than in the case of many other classes of digital objects. Hence, a specific strategy has been to assume that search, evaluation, and use are typically iterative processes, requiring that the metadata contain pointers to more detailed information, which in turn may contain other pointers.

Second, it is useful to have a conceptual framework to help guide the design of a content standard for computational models. Drawing on various characterizations of models (Aris, 1978 (reprinted 1994); Benz, 1997; Chorley, 1967; Dee, 1994), we view models as generally having four increasingly specific levels of representation, in both syntactic and semantic terms. These are the conceptual, symbolic, algorithmic, and coding representations of the model.

The conceptual representation describes the model at the highest level. For the erosion model, for example, it would characterize the model in terms of land and water surfaces and the conservation of water flowing over a surface and the conservation of sediment eroded from the surface and transported by the water. The symbolic representation is typically, but not always, in terms of some mathematical or logical language with an interpretation of the symbols in terms of real-world phenomena.
In the case of the erosion model, this representation takes the form of two partial differential equations. The algorithmic representation provides a high-level view of how the symbolic representation is converted into a set of computations, while the coding representation of these algorithms provides codes that are, or can be compiled into, executables in some specific computing environment. The erosion model, for example, is specified at the algorithmic level by indicating that the water flow equation is transformed into a finite difference scheme using an upwind scheme and that the land surface erosion equation is transformed into a finite difference scheme using a Crank-Nicolson scheme. At the coding level, it is specified by a set of C-language programs and the environment in which they would run. Hence we may view the information represented in these four categories as moving from a high-level description of the model and its applicability to the details needed to execute it in a specific computational environment.

The ADEPT CSCM provides a structure for these levels of description through narrative elements and elements for specific details of input and output variables, parameters, datasets, and processing flow. The CSCM consists of approximately 165 elements divided into ten sections:

1. Identification Information
2. Intended Use
3. Description
4. Access or Availability
5. System Requirements
6. Input Data Requirements
7. Data Processing
8. Model Output
9. Calibration Efforts and Validation
10. Metadata Source

An outline of the elements (version dated May 2001) is included as an appendix to this article. The standard continues to evolve through interaction with the ADEPT metadata and collection building efforts. This version and the latest version of the CSCM are available for download from the Alexandria Digital Library Project's website (under Documentation) at <http://www.alexandria.ucsb.edu>.
Here you will also find examples of the use of the content standard, including those referenced above.

Content Standard Issues

Of the many issues related to the design of the CSCM, the following are highlighted because they were central to our internal discussions.

Scope of models to be covered by the CSCM: This content standard is designed to describe computational models that have adjustable variables and parameters. This includes two basic sets of computational models: (1) modeling software and (2) modeling software that is packaged with datasets. Packages that contain both software and datasets are often published to illustrate specific phenomena and to teach specific theoretical principles. Animations, simulations, and similar visualizations that do not include adjustable variables and parameters are not covered by the CSCM.

Metadata design: Our goal is to create a metadata structure comprehensive enough to describe a wide variety of computational models and similar enough to existing metadata structures for other types of objects (e.g., datasets, texts, photographs) to facilitate the incorporation of modeling software into DLs. Our design for element and entity definition is based on the metadata designs of the geospatial community (International Organization for Standardization (ISO), 2000; U.S. Federal Geographic Data Committee, 1998). The following section describes this approach in detail. As far as possible, we have tried to reuse sets of common elements for describing identification and descriptive aspects of models.

Order of metadata sections and elements: The beginning sections include elements for narrative statements that give overviews of the model to help a potential user develop an overall understanding of what the model does and how it does it. Details of variables, parameters, input datasets, operating environments, processing functions, and outputs are left for later sections. A report or presentation might choose to list the details and the narrative statements in a different order.

Links to external files: At several points in the CSCM, elements are provided to link to files of documentation about the model and to related information. In some cases, the metadata is written so that if the information covered by a set of elements is contained in an external file (e.g., descriptions of the variables, parameters, and processing flow), then the elements become "optional"; that is, the information does not have to be repeated in the metadata itself. This has implications for discovery and searching services that are based on metadata content.

Geospatial elements: The section of the CSCM describing geospatial locations in terms of latitude and longitude coordinates, place or event names, and vertical (altitude and depth) dimensions was created by the ADEPT Project in coordination with the Digital Library for Earth System Education (DLESE) (Digital Library for Earth System Education, 2001). This section is designed to support basic geospatial description for discovery, search, and evaluation and will be useful for other metadata applications as well.

CSCM Descriptive Design

Identification (CSCM sections 1 and 3)

Identification elements supply the basic citation information, such as title, responsible parties, version, date, and identification numbers. Description elements include conceptual and symbolic-algorithmic descriptions of the model, model typology, topic or field of study, geographic and temporal coverage, and links to related models and to additional information about the model being described.

Fitness for Use (CSCM sections 2 and 9)

Creators of models have information about the intended use of the model, in terms of the intended application and, if designed for an educational purpose, the intended educational level.
This information is useful, along with the description of the conceptual, symbolic, algorithmic, and processing details, to determine if a particular model is suited for a particular use.

One of the key purposes of model documentation is to provide information about the calibration and validation tests that have been used, the experiments that have been run, the peer reviews that have been published, and the current known uses. Particularly useful is a citation to a dataset that can be used to test the model. Some of this information will accumulate through time as the model is used and may exist independent of the metadata description of the model itself. However, to the extent possible, having citations to external sources containing reviews and experiments will be very valuable for evaluation of fitness for use.

Access and Constraints (CSCM section 4)

Metadata needs to clearly explain how to obtain the model and all administrative and legal considerations that might limit its use. Possible constraints include cost and ownership issues. Access information includes email and mail addresses for the access person or organization, ordering procedures, and the URL for direct download, if possible. Related access and use considerations are described in the Environment elements.

Environment (CSCM section 5)

Both human and system environments for the model need to be explained to a potential user. Human requirements include the expertise needed to obtain, install, and run the model, and to interpret the results. System requirements include the hardware and operating system for which the model was designed and auxiliary software required.

Functionality (CSCM sections 6, 7, and 8)

Functionality is described in terms of input datasets, modeling constructs (parameters and variables), data processing steps, and the characteristics of the output data. This also includes any post-processing procedures required on the data.
A potential user should be able to use this information to evaluate the model for fitness for use; a current user should be able to use this information to understand how to link data to the model and how the model uses the data. Eventually, a computer service should be able to use this information to evaluate candidate datasets for their suitability for use with a model.

Metadata Documentation (CSCM section 10)

The source of the metadata must also be documented to record the creator of the metadata, the creation and modification dates, and the metadata standard and version that was used. If the metadata was created by someone other than the model creator, it is useful to know what sources of information were used and how to contact this person in case there are corrections to make or questions to ask about the metadata itself.

Definition of Elements

The definitional format used by the CSCM is recognized internationally by the geospatial community. A data element is the logically primitive item of metadata. Compound data elements are called entities. They consist of groups of data elements and other compound elements. Each element and entity is defined by the following characteristics:

Full Name: Full names are expressive of the intent of the element and are not necessarily unique within the standard. For example, the element name "City" occurs multiple times in sections where contact information is documented.

Short Name: Each element and compound element is provided with a short name. These short names are unique within the standard and may be used with the Extensible Markup Language (XML), Unified Modeling Language (UML), or other similar implementation techniques. A naming convention similar to that used to create the full entity and element names was used; an attempt was made to use contractions of words consistently.

Definition: Definitions are short and designed to show the intent of the element within the standard.
Obligation / Condition: This descriptor specifies whether the element must always be documented (Mandatory) or not. If the element is not Mandatory, it can be either required under specified conditions (Conditional) or provided at the user's discretion (Optional). Conditional obligation is specified as an electronically manageable condition of one of three kinds: (1) a choice between two or more options, at least one of which is mandatory and must be documented; (2) mandatory if another element has been documented; or (3) mandatory if a specific value for another metadata element has been selected. Obligation applies within a nested set of elements. If an entity (compound element) is Optional, elements in the set that are Mandatory only apply if the entity itself is selected to be used in the description of a model.

Maximum Occurrence: Specifies the maximum number of instances the metadata entity or the metadata element may have. Single occurrences are shown by "1"; repeating occurrences are represented by "N". For an entity (compound element), the maximum occurrence specifies the repeatability of the set of elements as a whole.

Data Type: Information about the values for the data elements includes a description of the type of the value and a description of the domain of the valid values. The type of the data element describes the kind of value to be provided. The choices are integer for integer numbers, real for real numbers, text for ASCII characters, date for day of the year, and degrees for latitude and longitude coordinate values. Compound elements have the data type compound. If values are limited to a code list included with the Standard, the data type is class and the list is identified in the Domain descriptor.

Domain: The domain describes valid values that can be assigned to the data element. The domain may specify a code list of valid values or restrictions on the range of values that can be assigned to a data element.
If the values are to be entered according to a published standard, the standard reference is specified here (e.g., ISO 8601 for dates). In these circumstances and others, there will still be a need to provide "best practices" to further explain the conventions adopted for data entry. The domain also may note that the domain is free from restrictions, and any values that can be represented by the type of the data element can be assigned. These unrestricted domains are represented by the use of the word "free" followed by the type of the data element. Some domains can be partly, but not completely, specified with code lists. In these circumstances, the list includes the option of "other" as a valid value. When "other" is available for selection, a conditional element is provided where the other value can be entered. In cases where domain values can be selected from external sources (e.g., from a classification scheme or thesaurus), compound elements are used to document the source of the terminology or classification notation. For compound elements, the domain specifies the section and line numbers of the elements that make up the compound description.

Code Lists

This standard provides five code lists to be used for all metadata using this standard. Although requiring the use of code lists may restrict the metadata creator, it has several advantages. In the search and retrieval of a model, a limited vocabulary for key elements allows one to find models related to a general theme, narrowing the search for specific models and identifying a subset with the potential of containing models relevant to a task. Although some elements that require the use of a code list also allow for additional terms to be used (the "other" option), responsibility is placed on the metadata creator to select a universally understood phrase and not to duplicate a category already in the code list.
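The element characteristics described above (short name, obligation, maximum occurrence, data type, domain) lend themselves to a simple machine check. The following Python sketch is an illustration only and is not part of the CSCM: the element names, the short names, and the code list values are hypothetical, and it covers only the Mandatory rule, the "mandatory if another element is documented or has a specific value" conditions, the maximum occurrence, and code-list domains. It does not handle nested entities or the "choice between options" condition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ElementDef:
    full_name: str
    short_name: str                      # unique within the standard
    definition: str
    obligation: str                      # "M" (Mandatory), "C" (Conditional), "O" (Optional)
    max_occurrence: str                  # "1" or "N"
    data_type: str                       # e.g. "text" or "class"
    domain: Optional[frozenset] = None   # code list for "class"; None means a free domain
    # For Conditional elements: (short name of the other element, trigger value).
    # A trigger value of None means "mandatory if the other element is documented".
    required_if: Optional[tuple] = None


# Hypothetical code list and element definitions, for illustration only.
MODEL_TYPES = frozenset({"deterministic", "stochastic", "other"})
DEFS = [
    ElementDef("Title", "title", "Name of the model", "M", "1", "text"),
    ElementDef("Model Type", "modelType", "Kind of model", "M", "1", "class", MODEL_TYPES),
    ElementDef("Other Model Type", "modelTypeOther", "Type not in the code list",
               "C", "1", "text", required_if=("modelType", "other")),
    ElementDef("Keyword", "keyword", "Topic term", "O", "N", "text"),
]


def validate(record, defs):
    """Return a list of problems in record, a dict of short name -> list of values."""
    problems = []
    for d in defs:
        values = record.get(d.short_name, [])
        if d.obligation == "M" and not values:
            problems.append(f"{d.short_name}: mandatory element is missing")
        if d.obligation == "C" and d.required_if and not values:
            other, trigger = d.required_if
            other_values = record.get(other, [])
            triggered = bool(other_values) if trigger is None else trigger in other_values
            if triggered:
                problems.append(f"{d.short_name}: required by {other}")
        if d.max_occurrence == "1" and len(values) > 1:
            problems.append(f"{d.short_name}: may occur only once")
        if d.data_type == "class":
            for v in values:
                if v not in d.domain:
                    problems.append(f"{d.short_name}: '{v}' is not in the code list")
    return problems


if __name__ == "__main__":
    record = {"title": ["Erosion Model"], "modelType": ["other"],
              "keyword": ["erosion", "hydrology"]}
    print(validate(record, DEFS))
```

In this sketch, selecting "other" from the code list without filling in the companion conditional element is reported as a problem, which mirrors the rule that a conditional element is provided wherever "other" can be selected.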
The code lists are contained in an appendix to the standard so that changes can be made to them without disturbing the body of the standard. It is expected that these lists will evolve as the standard is used.

CSCM for the Modeling and Digital Library Communities

Digital library designers need to think in terms of all forms of scholarly knowledge. This includes more than text and data. Increasingly, image and geospatially oriented forms of information are being incorporated into DLs. Adding computational models is a natural extension and will facilitate awareness and use of modeling software for research and education. Programmatic services that "understand" the metadata of models and datasets, both existing in DLs, can begin assisting users in making good matches for experimentation. Digital library services that capture the output of modeling runs and facilitate documenting them will greatly enhance the potential re-use of model output for learning and training. The value of uniform descriptions of modeling software based on content standards will be recognized in this environment.

Researchers and instructors will recognize the value of well-formed metadata for comparing and contrasting models and understanding the thought processes and system elements behind the models. Good documentation will also encourage reliable calibration and validation efforts in order for models to gain recognition as accurate re-creations of natural systems. We can expect the development of tools and services designed to ease the creation of documentation and the use of models to follow the adoption of content standards for computational models.

Acknowledgements

This work is funded by a National Science Foundation grant for the Alexandria Digital Earth Prototype Project (IIS-9817432; Smith and Goodchild, University of California at Santa Barbara).
We also acknowledge the valuable discussions of CSCM design issues with members of the ADEPT research staff and with a group of professors and students who attended a half-day workshop on the UCSB campus. ADEPT staff member Tim Tierney has developed an XML metadata creation tool based on the CSCM.

References

Alexandria Digital Library Project. (2001). Alexandria Digital Earth Prototype (ADEPT). University of California, Santa Barbara. Available: <http://www.alexandria.ucsb.edu> [2001].

Aris, R. (1978 (reprinted 1994)). Mathematical modelling techniques. London; San Francisco: Pitman (Dover, New York).

Benz, J., et al. (1997). Documentation of mathematical models in ecology; unpopular task? Ecomod, December, 1997, 1-7.

Chorley, R. J., & Haggett, P. (1967). Integrated Models In Geography. Worshire & London: Ebenezer Baylis & Sons Ltd.

Clarke, K. (2001). SLEUTH Urban Growth Model (version 2.1). National Center for Geographic Information Analysis, Santa Barbara. Available: <http://www.ncgia.ucsb.edu/projects/gig/project_gig.htm> [2001].

Dee, D. P. (1994). Guidelines for Documenting the Validity of Computational Modeling Software (24 pp.). Delft, The Netherlands: International Association of Hydraulic Research.

Digital Library for Earth System Education. (2001). Homepage. Available: <http://www.dlese.org>.

Frew, J., Freeston, M., Hill, L., Janee, G., Larsgaard, M., & Zheng, Q. (1999). Generic query metadata for geospatial digital libraries. Proceedings of the Third IEEE Meta-Data Conference (Meta-Data '99), April 6-7, 1999, Bethesda, MD, sponsored by IEEE, NOAA, Raytheon ITSS Corp., and NIMA.

International Organization for Standardization (ISO). (2000). Geographic Information Metadata (CD 19115.3): International Organization for Standardization (ISO).

Smith, T. R. (2001). Erosion Model. Available: <http://www.alexandria.ucsb.edu/doc>.

U.S. Federal Geographic Data Committee. (1998). Content Standard for Digital Geospatial Metadata. Available: <http://fgdc.er.usgs.gov/metadata/contstan.html> [2001, May 11].

Appendix: Outline of Content Standard for Computational Models

Copyright 2001 Linda L. Hill, Scott J. Crosier, Terence R. Smith and Michael Goodchild

DOI: 10.1045/june2001-hill