Volume 23, Number 7/8
Ensuring and Improving Information Quality for Earth Science Data and Products
Science Systems and Applications, Inc. and NASA Goddard Space Flight Center
hampapuram.ramapriyan [at] ssaihq.com
Cooperative Institute for Climate and Satellites-North Carolina, North Carolina State University and NOAA's National Centers for Environmental Information
ge.peng [at] noaa.gov
Jet Propulsion Laboratory, California Institute of Technology
david.f.moroni [at] jpl.nasa.gov
NASA Goddard Space Flight Center and University of Maryland, Baltimore County
chung-lin.shie-1 [at] nasa.gov
Information about quality is always of concern to users, whether they are buying a car or other consumer goods, or using scientific data for research or an application. To facilitate consistent quality evaluation and description of quality information on data products for the Earth Science community, we formally introduce and define four constituents of information quality: scientific, product, stewardship and service. As requirements to ensure and improve information quality increase across government, industry and academia, there have been considerable efforts toward improving information quality during the last decade. Given this background, the Information Quality Cluster (IQC) of the Federation of Earth Science Information Partners (ESIP) has been active with membership from multiple organizations, participating voluntarily on a "best-effort" basis. This paper summarizes existing efforts on information quality with emphasis on Earth science data and outlines the current development and evaluation of relevant use cases. The IQC, with its open membership policy, is well positioned to bring together people from various disciplines and iteratively address the relevant challenges and needs of the Earth science data community. Moving forward, the IQC pledges to continue facilitating the development and implementation of data quality standards and best practices for the international Earth science community.
Keywords: Earth Science, Big Data, Stewardship, Provenance, Information Quality, Earth Science Information Partners, Use Cases, Life Cycle
1 Introduction
Information about quality is always of concern to users, whether they are buying cars, mobile phones, food or other consumer goods, or using scientific data for research or an application. Consistent descriptive information, including the quality of individual products, is very important to consumers in making informed and sound decisions to buy or use the products. In a world where nearly all observational data are now digitized, computational technologies for processing and analyzing data have made profound leaps in recent years, enabling an individual data user to take a foundational dataset and transform it into a large number of sub-products in relatively short order, such as derivatives, subsets, visualizations, fusions, and assimilations. This has led to a diversity of data products, stemming from homogeneous foundational datasets, which continues to increase at a rate that makes consistent quality assessment, characterization, management and stewardship a growing challenge.
Quality of data products has traditionally been measured by the accuracy and uncertainty of fundamental measurements and the manner in which these are addressed (e.g., bias corrections, improved calibrations, quality flagging, and uncertainty characterization) via the robustness of both upstream and downstream processing systems. However, there is increasing awareness of multiple dimensions of quality of data and derived products and information, depending on types and emphases of quality analysis and reporting metrics. For example, Miller (1996) proposed 10 dimensions of information quality defined by the needs of information customers: Relevance; Accuracy; Timeliness; Completeness; Coherence; Format; Accessibility; Compatibility; Security; and Validity. Wang and Strong (1996) have organized 15 data quality attributes into four dimensions that are important to data consumers, namely: Intrinsic (accuracy, objectivity, believability, reputation); Contextual (relevance, value-added, timeliness, completeness, appropriate amount of data); Representational (interpretability, ease of understanding, concise representation and representational consistency); and Accessibility (accessibility, access security). An overview of the research in the area of multi-dimensions of information quality can be found in Lee et al. (2002).
While business products tend to be better managed in terms of well-defined and well-analyzed criteria, scientific data products have tended to be inconsistent or more diverse in presenting information on quality. In addition, business product management, vis-à-vis the product life cycle, tends to be defined and implemented within a single company. By contrast, a given scientific data product is often developed, produced, preserved, stewarded, and distributed by different organizations, each with its own capabilities, including domain expertise, and its own priorities in meeting user requirements. Therefore, it is extremely difficult for any single entity to unilaterally capture and accurately describe a complete set of quality information for data users, not to mention doing so in a consistent way. Regardless of the inherent challenges posed and adverse side effects inherited, this quasi ad-hoc approach is generally how information quality has been addressed since the beginning of Earth science data stewardship and curation. Arguably, only in the last couple of decades has the full weight of these challenges been reasonably examined, thoughtfully characterized and communicated en masse.
The focus of this paper is on the quality of scientific data generated by Earth observation systems and that of data products and information derived from measurements. Therefore, information quality in this paper includes both data and information quality. While the data producers (usually in conjunction with a team of specific discipline science experts) are best able to assess the scientific quality of their products, conveying the information about the quality in a manner that is understandable and usable to data users is often a challenge. Thus, it is helpful to define a set of standards and "best practices" for collecting and conveying information about quality throughout the lifecycle of data products. The Earth Science Information Partners (ESIP) Information Quality Cluster (IQC) has been established for collecting such standards and best practices and assisting data providers and users in practicing them. The rest of this paper is organized as follows. In Section 2, we categorize quality of data products into four constituent aspects and formalize the definition of each constituent. In Section 3, we discuss the significant background work that has occurred over the last decade including several activities specific to Earth science. Section 4 provides an overview of the current activities of the ESIP IQC. Integral to the methodology used by the IQC for recommending and communicating standards and best practices is the definition and analysis of use cases. To help extract relevant details, special emphasis is placed on a discussion of use cases in the sections that follow. Section 5 provides the motivation for use cases and identifies use cases employed by the IQC for its analysis. Section 6 describes the evaluation methodology. Section 7 illustrates the analysis through details of an example use case. This is followed by conclusions and commentary on future work to be carried out by the IQC and its affiliates in Section 8.
Appendix A provides a table showing a summary of 20 use cases discussed in Sections 5 through 7. While the standards and best practices to be adopted have not yet been established, this paper discusses the methodology leading towards resolution and provides an open invitation for community participation.
2 Information Quality
We consider four different aspects of quality in close relation to different stages of data products in their life cycle. Activities involved in the data life cycle can be divided into four stages: 1. Define, develop, and validate; 2. Produce, assess, and deliver; 3. Maintain, preserve, and disseminate; and 4. Enable use, provide support, and service. Mapping the aspects of data product quality to these life cycle stages, we first define the scientific quality in terms of accuracy, precision, uncertainty, validity and suitability for use (fitness for purpose), which in various applications are considered paramount. The scientific quality is closely associated with the early part of the data product life cycle (stage 1). Second, for a given data product, the product quality takes the following into account: the degree to which the scientific quality is assessed and documented; how accurate, complete and up-to-date the metadata and documentation are; the manner in which the data and metadata are formatted; and the degree to which the associated information, including provenance, is published and traceable throughout the data lifecycle. Therefore, the product quality is closely related to stage 2 of the data product life cycle. Third, stewardship quality addresses questions such as how well data are being managed and preserved. The stewardship quality is most relevant to stage 3 in the life cycle. Fourth, service quality deals with how easy it is for users to discover, get, understand, trust, and use a given data product along with its metadata, as well as ensuring that an archive has the requisite knowledge base and people functioning as subject matter experts available to help its data users. Thus, the service quality is closely tied to stage 4 of the data product life cycle. In general, we refer to all these aspects of quality together as information quality. Table 1 summarizes the mapping of the aspects of information quality and the four life cycle stages.
Table 1: Different information quality aspects, associated data product life cycle stages and responsible groups.
|Information Quality Aspect|Life Cycle Stage|
|---|---|
|Scientific|1. Define, develop, and validate|
|Product|2. Produce, assess and deliver (to an archive or data distributor)|
|Stewardship|3. Maintain, preserve and disseminate|
|Service|4. Enable use, provide support and service|
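The one-to-one mapping between quality aspects and life cycle stages described in this section can be written down as a small lookup structure. The following Python sketch is illustrative only: the enum and function names are hypothetical and are not part of any IQC specification; only the four aspects, the four stages, and their pairing come from the text.

```python
from enum import Enum


class LifeCycleStage(Enum):
    """Life cycle stages as numbered in the text (member names are illustrative)."""
    DEFINE_DEVELOP_VALIDATE = 1
    PRODUCE_ASSESS_DELIVER = 2
    MAINTAIN_PRESERVE_DISSEMINATE = 3
    ENABLE_USE_SUPPORT_SERVICE = 4


class QualityAspect(Enum):
    """The four constituents of information quality."""
    SCIENTIFIC = "scientific"
    PRODUCT = "product"
    STEWARDSHIP = "stewardship"
    SERVICE = "service"


# Pairing taken from the section text: each aspect is most closely
# associated with exactly one life cycle stage.
ASPECT_TO_STAGE = {
    QualityAspect.SCIENTIFIC: LifeCycleStage.DEFINE_DEVELOP_VALIDATE,
    QualityAspect.PRODUCT: LifeCycleStage.PRODUCE_ASSESS_DELIVER,
    QualityAspect.STEWARDSHIP: LifeCycleStage.MAINTAIN_PRESERVE_DISSEMINATE,
    QualityAspect.SERVICE: LifeCycleStage.ENABLE_USE_SUPPORT_SERVICE,
}


def aspect_for_stage(stage: LifeCycleStage) -> QualityAspect:
    """Return the quality aspect most closely tied to a life cycle stage."""
    for aspect, mapped_stage in ASPECT_TO_STAGE.items():
        if mapped_stage is stage:
            return aspect
    raise KeyError(stage)


if __name__ == "__main__":
    for stage in LifeCycleStage:
        print(stage.value, stage.name, "->", aspect_for_stage(stage).value)
```

A structure like this could be used, for example, to tag a quality statement in a metadata record with the life cycle stage at which it was produced.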
The topic of data and information quality, in general, has received a lot of attention, as exemplified by an entire journal dedicated to this topic, namely, the Journal of Data and Information Quality (JDIQ), published by the Association for Computing Machinery (ACM). As a precursor to JDIQ, the Massachusetts Institute of Technology (MIT) launched the Total Data Quality Management (TDQM) program in 1992. Substantial results from the TDQM participants were the primary driving force in the creation of JDIQ. More than a decade later, JDIQ published an overview of data quality research by Madnick et al. (2009), who viewed the field through the lens of a more contemporary context and defined a two-dimensional framework for categorizing data quality research. The two dimensions are:
- Topics (4 categories, which are divided into 19 subcategories: Data quality impact, Database related technical solutions, Data quality in the context of computer science, and Data quality in curation), and
- Methods (14 categories, e.g., Artificial intelligence, Econometrics, Empirical, Experimental, Theory and formal proofs, etc.)
Some examples of recent topics covered by JDIQ are: data governance approaches to implement corporate accountability (Weber et al., 2009), methods for replacing missing values and their evaluation using data quality measures for data warehousing and data mining (Li, 2009), a model for propagating and processing data quality in sensor data streaming environments (Klein and Lehner, 2009), assessment of metadata quality in open data portals (Neumaier et al., 2016), and challenges associated with the quality of dynamic data (Labouseur and Matheus, 2017). An overview of the evolution and growth of research on data quality over the past 20 years is given by Shankaranarayanan and Blake (2017). They analyze 972 abstracts in peer-reviewed journals and conference publications using "Latent Semantic Analysis" and arrive at a hierarchy of 20 themes and eight core research topics. They analyze how recent research in these eight core topics maps into the dimensions defined by Wang and Strong (1996). They note a shift in data quality research from "content to context", meaning that the research has been shifting "from measuring and assessing data quality content toward a focus on usage and context". From this analysis, Shankaranarayanan and Blake (2017) conclude that "context" represents unforeseen dimensions in data quality that provide significant opportunity for exploratory research.
The articles cited above, and the JDIQ in general, cover a broad range of topics with basic research on data and information quality, and recent emphasis on big data, social media and corporate databases. While the broader research discussed above is certainly relevant for our purposes, there have also been significant activities in recent years concerning data and information quality pertaining to Earth science. Additionally, much of the recent work done in the domain of Earth science was driven by very specific user and stakeholder needs, rather than by the broader scope of historical and contemporaneous research on data and information quality as noted in the above JDIQ summary. Since the IQC was formed after many of these Earth science data and information quality initiatives had been completed, the IQC can take advantage of the progress made in the Earth science domain. Therefore, the focus of this paper is on Earth science data and information, and how their quality is being managed and represented by the data producers and repositories that serve the data to users. As a first-order approach toward this domain-specific continuity, this paper places primary emphasis on relating past Earth science data and information quality initiatives to current activities and deliverables of the IQC. As a starting point, a brief review of the work done thus far in past Earth science data and information quality initiatives is provided below:
1) The Group on Earth Observations (GEO) identified the need for an internationally harmonized strategy to enable interoperability and acceptance of quality of Earth observation data at "face value". In response to this, the Committee on Earth Observation Satellites (CEOS) established and endorsed the Quality Assurance Framework for Earth Observation (QA4EO). Following four international workshops (2007, 2008, 2009 and 2011), a framework and ten key guidelines were established (QA4EO task team, 2010). Examples are provided (QA4EO, 2013) to illustrate activities that are compliant with the QA4EO guidelines. Also, the GEO Data Management Principles (DMP) Task Force (2015) has established 10 principles. One of the principles under the heading of Usability is DMP-6: "Data will be quality-controlled and the results of quality control shall be indicated in metadata; data made available in advance of quality control will be flagged in metadata as unchecked." The principles are elaborated further in a "living document" by GEO (2015), giving guidance on implementation and metrics to measure adherence to principles and resource implications.
2) The standard ISO 19157:2013 was published in December 2013 (International Standards Organization, 2013). It establishes the principles for describing geographic data quality and defines a set of measures for evaluating and reporting data quality. It is useful for: 1. data producers providing information on data quality; 2. data distributors providing users data quality guidance; and 3. data users trying to decide whether or not a specific data product is suitable for their particular uses.
3) NOAA National Centers for Environmental Information (NCEI) has pioneered the approach of using a matrix to assess and document the maturity of individual Climate Data Records (CDRs) (Bates and Privette, 2012). The matrix defines six levels of maturity in each of the following six categories: Software Readiness, Metadata, Documentation, Product Validation, Public Access, and Utility. It provides a description, for each category, of what it means to be at various levels of maturity. Based on the CDR Maturity Matrix, EUMETSAT's CORE-CLIMAX matrix was developed to assess the maturity of product systems and contains guidance on uncertainty estimates.
4) NOAA Center for Satellite Applications and Research has defined an algorithm maturity matrix to measure the quality of the developed data product (Zhou et al., 2016). The algorithm maturity matrix defines five maturity levels for the product based on the state of validation, documentation, and utility of the product: Beta, Provisional, and Validated (Stages 1, 2, and 3) (Reed, 2013).
5) The NOAA NCEI and Cooperative Institute for Climate and Satellites-North Carolina (CICS-NC) have jointly developed a Data Stewardship Maturity Matrix (DSMM) (Peng et al., 2015). This matrix provides a unified framework for assessing the maturity of measurable stewardship practices applied to individual digital Earth Science data products that are publicly available. It assesses maturity in nine categories (e.g., preservability, accessibility, data quality assessment, and data integrity), each with five progressive maturity levels. It provides consistent dataset quality information to users, including scientists, and actionable information to management.
6) The National Center for Atmospheric Research (NCAR) maintains a data guide with contributions from the community at the "Climate Data Guide" web site (NCAR UCAR, 2017). This is a resource used for gathering inputs from the climate community on a variety of observational data products and models. It takes advantage of the community's expertise to provide an assessment of data products by users for the benefit of other users. Inputs can be from both data product developers and users, self-identified as either "Expert Developers" or "Expert Users". The inputs received from this community are reviewed for quality before publication. For more details see Schneider et al. (2013).
7) NASA's Making Earth System Data Records (ESDRs) for Use in Research Environments (MEaSUREs) Program uses product quality checklists, which were developed in 2011 in collaboration with the MEaSUREs data providers and distributors, and adopted in 2013 (Hunolt, 2013). The product quality is considered to be a combination of scientific quality of the data and the completeness of associated documentation and ancillary information. The checklists are used to gather information on the completeness of activities needed to ensure product quality. The questions in the checklists address science quality, documentation quality, usage, and user satisfaction.
8) NASA's Data Quality Working Group (DQWG), one of the Earth Science Data System Working Groups (ESDSWG), was established in March 2014. Its mission is to "assess existing data quality standards and practices in the interagency and international arena to determine a working solution relevant to the Earth Science Data and Information System (ESDIS) Project, Distributed Active Archive Centers (DAACs), and NASA-funded Data Producers." The DQWG analyzed 16 use cases pertinent to data distributed by the DAACs from the point of view of users in order to identify issues related to information quality and developed nearly 100 recommendations toward improving information quality across their data holdings. These were subsequently consolidated into 12 high-level recommendations, and 25 solutions to address these recommendations have been identified and assessed for operational maturity and readiness for implementation, with an initial focus on four "low-hanging fruit" recommendations. Solutions that exist as open source and in an operational environment were ranked as highest priority for implementation in NASA Earth science data systems environments.
The activities 1) through 5) discussed above provide principles, standards and maturity assessment frameworks, leveraging the variety and specificity of information quality artifacts to be gathered, conveyed and assessed. The activities 6) and 7) are particular implementations of presenting information quality. Activity 8) considers current practical implementations at NASA's EOSDIS DAACs to highlight issues from the information quality perspective and arrives at recommendations for a variety of implementations intended to improve the overall information quality. The use case analysis has proved beneficial to NASA Earth science data management. However, the use cases considered were specific to identifying issues within NASA's EOSDIS DAACs and recommending solutions to address them. The ESIP IQC, as discussed below, extended this approach to a broader community of data providers, adopting (and adapting) NASA's use-case study approach. In the spring of 2015, the NASA DQWG took its first steps to work closely with the ESIP IQC in sharing new use cases and co-participating in the use case evaluation process. This collaboration remains ongoing and provides a way for stakeholders outside of NASA to be directly engaged in the use case collection and evaluation process.
4 ESIP Information Quality Cluster
The ESIP is a US-based organization with international membership and "is an open, networked community that brings together science, data and information technology practitioners" (ESIP, 2016). The ESIP was founded in 1998 by NASA and currently consists of more than 180 member organizations from federal agencies, universities, and commercial as well as nonprofit entities. The ESIP initially formed the IQC in January 2011, led by Greg Leptoukh, who was also taking an active role in QA4EO. After his unfortunate passing in January 2012, the IQC activities became dormant until July 2014, when the cluster was rejuvenated (ESIP IQC, 2016). The current objectives of the IQC are to:
- Actively evaluate community data quality best practices and standards;
- Improve capture, description, discovery, and usability of information about data quality in Earth science data products;
- Ensure producers of data products are aware of standards and best practices for conveying data quality;
- Develop and disseminate recommendations for data providers/distributors/intermediaries to establish, improve and evolve mechanisms to assist users in discovering and understanding data quality information; and
- Consistently provide guidance to data managers and stewards on how best to implement data quality standards and best practices to ensure and improve maturity of their data products.
The activities of the IQC include:
- Identification of additional needs for consistently capturing, describing, and conveying quality information through use case studies with broad and diverse applications. (See a general description of the four use cases developed and an example of one use case evaluation in Sections 5 and 7, respectively);
- Establishing and providing community-wide guidance on roles and responsibilities of key players and stakeholders including users and management (e.g., Peng et al., 2016);
- Prototyping of conveyance mechanisms to ensure that information quality is properly disseminated to users in a more consistent, transparent, and digestible manner;
- Establishing a baseline of standards and best practices for data and information quality;
- Evaluating recommendations from NASA's DQWG in a broader context and proposing actionable implementations that extend beyond the NASA realm; and
- Engaging data providers, data managers, and data user communities as partners to improve our standards and best practices.
While progress has been made in all 6 areas, the present focus of the IQC is on developing and analyzing use cases beyond the 16 use cases considered by the NASA DQWG mentioned above. The following sections provide a brief discussion of the use case analysis.
5 Use Cases
In the context of this paper, the term "use cases" is employed to describe activities using Earth science data offered by data centers. Such activity descriptions are useful in highlighting any difficulties encountered by users in locating, understanding and/or using the information about the quality of data products. Analysis of use cases leads to one or more recommendations that may be implemented by data producers and data archivists/distributors to address the users' needs and resolve the difficulties. A diverse set of use cases covering a broad spectrum of data applications is therefore beneficial in improving information quality.
In the work of the NASA DQWG, described in Section 3 above, it became clear from the outset that, given how the data quality is currently being represented in various datasets held at the DAACs, it would be useful to start a knowledge base of the needs and challenges of conveying data quality information to users through a set of use cases. The use cases were needed to address datasets offered (even if only represented by a small subset) by the DAACs and had to cover a broad class of users. A total of 16 use cases were defined, and information about each use case was captured using a template. For the purposes of addressing the needs within NASA EOSDIS, the DQWG defined a use case structure that included the objectives/goals, context of the targeted user's data/information application, targeted user/stakeholder characterization, scope, chronology of use case elements (i.e., activities constituting the use case), and success criteria. Borrowing from the NASA EOSDIS use case template, the IQC followed a similar methodology for capturing additional use cases covering the broader Earth science data community represented by the members and affiliates of the IQC. Given the overlap in membership between the NASA DQWG and the ESIP IQC, the 16 DQWG use cases and their analyses were available to the IQC. The IQC developed four additional use cases in early 2016, capturing all of the items in the DQWG template identified above as well as additional details (i.e., rationale for scope, keywords, domain of interest, professional domain of user, user-stakeholder relationship, and contact information for the use case author).
Appendix A provides a tabular summary of all 20 use cases including those from the DQWG, showing the key issue(s) identified, the primary technical and science themes addressed, the information quality (IQ) aspect of the issue(s) and the data quality management phase(s) (DQMP, defined in Section 6 below) addressed by the recommendations resulting from the use case analysis. The titles and short narratives for the four use cases defined by the IQC are given below:
Use Case 1: Dataset "Rice Cooker" Theory (Shie, 2011)
An integrated data product contains inputs from both ancillary and science data sources, of which the sources of science data come from multiple sensors/platforms/projects. A user will need to understand the quality of the integrated data product before using it. However, this is difficult due to the multiple inputs contributing to the product, possibly derived from diverse sources such as satellite, in situ, airborne, and model/reanalysis data. A related issue is the communication to end users regarding any changes to input data. Examples of changes to input data include statistical changes, quality flag implementations, and data gaps. Such changes can affect the quality of the final product even though the final product algorithm does not change. Note: the "Rice Cooker" allusion illustrates the dependence of the quality of the end product (of cooking with multiple ingredients) on the quality of inputs.
Use Case 2: Appropriate Amount of Documentation for Data Use
Traditionally, information about a data product can be captured and provided to end-users in the form of collection metadata (e.g., via a Digital Object Identifier (DOI) landing page) and a descriptive document such as an Algorithm Theoretical Basis Document (ATBD). Different types of users need different levels of detail in the documentation, depending on their application requirements. Some end-users may find the detailed documentation difficult to understand and may just need information about how confident they can be about the adequacy of the data product for their purposes. Others may delve deeper into the scientific basis of the data product, understand complete details and want to make their own improvements to the product. The goal is to define levels of recommended documentation based on the data usage requirements, both to lead end-users to the documents at a level matching their needs and to help data centers decide which types of documentation they should prepare.
Use Case 3: Understanding and Identifying Datasets using Santa Barbara Coastal Long Term Ecological Research (SBC LTER) Data Portal
Using the SBC LTER's "Browse/Search Data" capability (SBC LTER, 2006), a scientist/researcher is able to identify the dataset(s) of interest from pre-defined collections based on the metadata/documentation provided and to retrieve the associated data file(s) accordingly.
Use Case 4: Citizen Science
A researcher working with the public (typically some community or a group of people with an interest in the researcher's topic) is accumulating data and information acquired by those people. Depending on the people gathering the data and information, the associated quality may vary; consequently, the researcher needs mechanisms for determining which data and information to trust.
Table 2 shows the coverage by the 20 use cases of the IQ aspects and the DQMP. The rows in this table correspond to IQ aspects and columns to DQMP. The numbers within the table indicate the number of use cases addressing a given combination of IQ aspects and DQMP. It is noted that the use cases cover all combinations of IQ aspects and DQMP, even though a few combinations are covered by a larger number of use cases than others.
Table 2: Numbers of use cases addressing combinations of IQ aspect (rows) and DQMP (columns).
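A coverage table of this kind can be derived mechanically once each use case is tagged with the IQ aspects and DQMP it addresses. The Python sketch below shows one way to compute such a cross-tabulation. It assumes that a use case counts once toward every (aspect, phase) combination it is tagged with, which is an assumption about the counting rule and not something stated in the paper. The two records are hypothetical placeholders, not entries from Appendix A.

```python
from collections import Counter
from itertools import product

# Row and column labels taken from the text.
IQ_ASPECTS = ("Scientific", "Product", "Stewardship", "Service")
DQMP = ("Capturing", "Describing", "Facilitating discovery", "Enabling use")


def coverage_table(use_cases):
    """Count use cases per (IQ aspect, DQMP) combination.

    Assumed counting rule: a use case contributes one count to each
    distinct pairing of its tagged aspects and phases.
    """
    counts = Counter()
    for use_case in use_cases:
        pairs = set(product(use_case["aspects"], use_case["phases"]))
        for pair in pairs:
            counts[pair] += 1
    # Rows follow IQ_ASPECTS, columns follow DQMP.
    return [[counts[(aspect, phase)] for phase in DQMP] for aspect in IQ_ASPECTS]


# Hypothetical records for illustration only.
example_use_cases = [
    {"id": "A", "aspects": ["Product"], "phases": ["Capturing", "Describing"]},
    {"id": "B", "aspects": ["Product", "Service"], "phases": ["Describing"]},
]

if __name__ == "__main__":
    table = coverage_table(example_use_cases)
    for aspect, row in zip(IQ_ASPECTS, table):
        print(f"{aspect:12s}", row)
```

Zero entries in the resulting matrix would point to combinations of aspect and phase that no use case yet addresses.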
6 Evaluation Methodology
The purpose of the use case analysis is to:
- Identify the issues and challenges recognized by the stakeholder;
- Determine why these issues and challenges exist;
- Define the success criteria for resolution (i.e., the steps required to rectify the issues and challenges). The success criteria collectively indicate the idealized workflow for resolving the issue, assuming that all of the requisite solutions exist to satisfy the criteria of each use case;
- Recommend what needs to be done by the data system and/or the data producers in order to meet the success criteria established when the use case was formulated; and
- Identify all known, working solutions to address the issues and challenges.
The recommendations that arise from the evaluation of use cases are grouped into phases in the management of information on data quality (a.k.a. Data Quality Management Phases, or DQMP). These are: 1) Capturing (i.e., deriving, collecting and organizing the information), 2) Describing (i.e., documenting the information for public consumption), 3) Facilitating discovery (i.e., publishing and providing access to the information), and 4) Enabling use (i.e., enhancing the utility of the information).
To facilitate evaluations, a template was developed, calling for entries in the following fields: Use case title, Author, Champion (note: this may or may not be the same person as the author; the champion is responsible for leading the evaluation and ensuring proper continuity of the author's intent), Data quality information management phase, Recommendation category, Relevant success criteria, Recommendations for data producer(s), data distributor(s), and user(s), Existing solution(s), Level of maturity/reusability of existing solutions, Organizations with existing solutions, and Justification for recommendations. An example of such an evaluation is given in the next section.
7 Evaluation of an Example Use Case
The evaluation of the use case titled "Appropriate Amount of Documentation for Data Use" is provided below as an example.
- Use case title: Appropriate Amount of Documentation for Data Use
- Author: Ge Peng
- Champion: Ge Peng
- Key Issue: Not all types of users need the same level of detail in documentation.
- DQMP: Capture, Describe, Facilitate Discovery, Enable Use
- Recommendation Category: Relevance to Application, Fitness for Intended Usage, Discoverability
- Relevant Success Criteria:
- Quantitative: Amount of time the user spends finding/determining what they need.
- Qualitative: Positive feedback from the user that they are satisfied with the use and information provided.
- Data Producers: Provide a publicly accessible Algorithm Theoretical Basis Document (ATBD) including data and processing flow diagrams and error estimates.
- Data Distributors:
- Provide publicly accessible data documentation in tiers: DOI landing page, User's Guide, ATBD.
- Provide a filtering method for users to decide how to find product/necessary documentation (a la Amazon shopping) based on product/data center-relevant criteria.
- Provide a feedback mechanism for users.
- Work with producers to create consistent User's Guides.
- Users: Provide feedback via mechanism established by data distributors.
- Existing Solution(s): Some tiered documentation is available; faceted searches have been implemented in some cases to help the users get started.
- Level of maturity/reusability of existing solutions: Prototype implementations have been deployed by one or more agencies/institutions.
- Organizations with existing solutions: NASA/ESDIS, NOAA/NCEI
- Justification for recommendations: Defining and communicating levels of recommended documentation will help guide end-users on what document to look for to get appropriate information commensurate with their needs, help data centers establish document templates that they need to curate, and help data providers supply information needed for data stewardship and use/service.
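The template fields and the evaluation above can also be captured as a structured record, which makes it easy to see which entries of an evaluation are still open. The following is a minimal sketch in Python under stated assumptions: the `Phase` enumeration, the `UseCaseEvaluation` class and its `missing_fields` helper are hypothetical names introduced here for illustration and are not defined by the IQC. The field values are copied from the example evaluation; a few fields are deliberately left empty to show the check.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Phase(Enum):
    """The four Data Quality Management Phases (DQMP)."""
    CAPTURE = 1
    DESCRIBE = 2
    FACILITATE_DISCOVERY = 3
    ENABLE_USE = 4


@dataclass
class UseCaseEvaluation:
    """Hypothetical record mirroring the evaluation template fields."""
    title: str
    author: str
    champion: str  # may or may not be the same person as the author
    phases: List[Phase]
    recommendation_categories: List[str]
    success_criteria: List[str]
    # Recommendations keyed by role, e.g. "producers", "distributors", "users".
    recommendations: Dict[str, List[str]] = field(default_factory=dict)
    existing_solutions: str = ""
    maturity: str = ""
    organizations: List[str] = field(default_factory=list)
    justification: str = ""

    def missing_fields(self) -> List[str]:
        """Return the names of template fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# Partially completed evaluation, using values from the example above.
draft = UseCaseEvaluation(
    title="Appropriate Amount of Documentation for Data Use",
    author="Ge Peng",
    champion="Ge Peng",
    phases=list(Phase),  # all four phases apply to this use case
    recommendation_categories=[
        "Relevance to Application",
        "Fitness for Intended Usage",
        "Discoverability",
    ],
    success_criteria=[
        "Quantitative: Amount of time the user spends finding/determining what they need.",
        "Qualitative: Positive feedback from the user.",
    ],
    organizations=["NASA/ESDIS", "NOAA/NCEI"],
)

print(draft.missing_fields())
# ['recommendations', 'existing_solutions', 'maturity', 'justification']
```

A flat record of this kind is only one possible encoding; it is chosen here because it maps one-to-one onto the template fields and keeps the phases as a closed set of four values.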
8 Conclusion and Future Work
In any field, the quality of products is always of concern to users. This paper provides a brief description of the efforts in various national and international agencies and programs in the area of data and information quality in general, with a primary focus on Earth science data. We formalize definitions of four aspects of information quality within the context of the Earth science data product life cycle. We also define four phases of data quality management. The ESIP IQC is making progress in several activities pertaining to information quality, with current efforts concentrated on defining and analyzing use cases. Most of the use cases are based on typical tasks that users perform in accessing, understanding and using data from existing archives, while a few use cases consider data producers submitting data to archives. The primary purpose of use cases is to identify issues and arrive at recommendations for improvements. The recommendations can be actions to be performed by producers, distributors (archives) or users of data and information. The use cases are broad in scope, and jointly cover all four constituents of information quality and all four data quality management phases. The breadth of use cases is enhanced by a combination of those defined by NASA's DQWG and by participants in the IQC representing other organizations.
Use case collection and evaluation is an ongoing activity that is largely dependent on volunteers (e.g., researchers, educators, citizen scientists, data curators, decision makers, and technologists) willing to directly interact with the ESIP IQC. While the ESIP IQC represents a broad subset of the Earth science data user community, the existing IQC membership is not necessarily a complete representation of the diversity of needs and applications of Earth science data, and consequently, information quality. We hope, however, that this paper will ignite a spark that leads to increased attention and interest, attracting: 1) more community involvement in publishing work in this field that has yet to reach the public domain, and 2) more volunteers from diverse backgrounds in Earth sciences to join the IQC through an open membership policy.
In addition to continued work with use cases, the following actions need to be pursued to fully realize IQC's potential in ensuring and improving information quality for Earth science data and products:
- Maintain an evolving inventory of use cases, identified issues and recommended actions.
- Develop a catalog of implemented solutions that address issues identified through use cases.
- Share recommendations via a multi-lateral feedback mechanism with data producers, distributors and users.
- Assess how the recommendations from the IQC map to existing standards such as ISO 19157:2013 (ISO, 2013) and the GEO DMP implementation guidelines (GEO DMP Task Force, 2015).
- Assess how the implementation of "use cases" discussed in the GEOSS Tutorial (GEOSS, 2012) maps to recommendations from the IQC use cases.
- Maintain a bibliography on information quality applicable to Earth science data.
Following the principles of openness of the ESIP Federation, the IQC invites all individuals interested in improving capture, description, discovery, and usability of information about data quality in Earth science data products to participate in its activities.
This work was a result of the authors' participation in the ESIP IQC. They would like to thank the members of the IQC for their contribution to the discussions at the cluster meetings as well as comments on a draft of this paper. Specifically, the use cases in the paper are summarized from the more detailed versions authored by Robert R. Downs (Columbia University) and Chung-Lin Shie, Ge Peng, Margaret O'Brien (UCSB) and Sophie Hou (NCAR), and Ruth Duerr (Ronin Institute). The evaluation of use case 2, summarized above, was conducted by Ge Peng, Stephen Olding (Science Systems and Applications, Inc.) and Lindsey Harriman (SGT, Inc.). Ramapriyan's work was supported by NASA under a contract with Science Systems and Applications, Inc. Peng's work was supported by NOAA under a grant with CICS-NC. Moroni's work was supported by NASA under a contract with the Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA. Shie's work was supported by NASA funding to the University of Maryland, Baltimore County. Government sponsorship is acknowledged. Any opinions, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of their employers or funders.
Bates, J J and Privette, J L. 2012. A maturity model for assessing the completeness of climate data records. Eos, Transactions of the AGU, 93(44):441. https://doi.org/10.1029/2012EO440006
ESIP. 2016. ESIP Federation.
ESIP IQC. 2016. ESIP Information Quality Cluster.
GEO. 2015. Data Management Principles Implementation Guidelines. GEO-XII 11-12 November 2015, Document 10.
GEO DMP Task Force. 2015. GEOSS Data Management Principles.
GEOSS. 2012. Data Quality Tutorial for GEOSS Providers.
Hunolt, G. 2013. Product Quality Metrics: Project and DAAC Checklists.
ISO. 2013. ISO 19157:2013 Geographic information - Data quality.
Klein, A and Lehner, W. 2009. Representing Data Quality in Sensor Data Streaming Environments. ACM Journal of Data and Information Quality (JDIQ), 1(2), September 2009. https://doi.org/10.1145/1577840.1577845
Labouseur, A G and Matheus, C C. 2017. An introduction to dynamic data quality challenges. ACM Journal of Data and Information Quality (JDIQ), 8(2), January 2017. https://doi.org/10.1145/2998575
Lee, Y W et al. 2002. AIMQ: a methodology for information quality assessment. Information & Management, 40, 133-146. https://doi.org/10.1016/S0378-7206(02)00043-5
Li, X. 2009. A Bayesian Approach for Estimating and Replacing Missing Categorical Data. ACM Journal of Data and Information Quality (JDIQ), 1(1), June 2009. https://doi.org/10.1145/1515693.1515695
Madnick, S E et al. 2009. Overview and framework for data and information quality research. ACM Journal of Data and Information Quality (JDIQ), 1(1), June 2009. https://doi.org/10.1145/1515693.1516680
Miller, H. 1996. The Multiple Dimensions of Information Quality. Information Systems Management, 13(2):79. https://doi.org/10.1080/10580539608906992
NCAR UCAR. 2017. Climate Data Guide.
Neumaier, S, Umbrich, J and Polleres, A. 2016. Automated quality assessment of metadata across open data portals. ACM Journal of Data and Information Quality (JDIQ), 8(1), October 2016. https://doi.org/10.1145/2964909
Peng, G et al. 2015. A unified framework for measuring stewardship practices applied to digital environmental datasets. Data Science Journal, 13:231. https://doi.org/10.2481/dsj.14-049
Peng, G et al. 2016. Scientific Stewardship in the Open Data and Big Data Era - Roles and Responsibilities of Stewards and Other Major Product Stakeholders. D-Lib Magazine, 22(5/6). https://doi.org/10.1045/may2016-peng
QA4EO. 2013. Case Studies.
QA4EO task team. 2010. A Quality Assurance Framework for Earth Observation: Principles, Version 4.0, January 14, 2010.
Reed, B. 2013. Status of Operational Suomi NPP Algorithms.
SBC LTER. 2006. Santa Barbara Coastal Long Term Ecological Research, Browse or Search Data.
Schneider, D P et al. 2013. Climate Data Guide Spurs Discovery and Understanding. Eos, Transactions of the AGU, 94(13):121. https://doi.org/10.1002/2013EO130001
Shankaranarayanan, G and Blake, R. 2017. From content to context: The evolution and growth of data quality research. ACM Journal of Data and Information Quality (JDIQ), 8(2), January 2017. https://doi.org/10.1145/2996198
Shie, C-L. 2011. Science background for the reprocessing and Goddard Satellite-based Surface Turbulent Fluxes (GSSTF2c) Data Set for Global Water and Energy Cycle Research. Science Document for the Distributed GSSTF2c via Goddard Earth Sciences (GES) Data and Information Services Center (DISC).
Wang, R Y and Strong, D M. 1996. Beyond accuracy: What data quality means to consumers. Journal of Management Information Systems, 12(4):5. https://doi.org/10.1080/07421222.1996.11518099
Weber, K, Otto, B and Osterle, H. 2009. One Size Does Not Fit All - A Contingency Approach to Data Governance. ACM Journal of Data and Information Quality (JDIQ), 1(1), June 2009. https://doi.org/10.1145/1515693.1515696
Zhou, L H, Divakarla, M and Liu, X P. 2016. An Overview of the Joint Polar Satellite System (JPSS) Science Data Product Calibration and Validation. Remote Sensing, 8(2). https://doi.org/10.3390/rs8020139
Appendix A Summary of the 20 Use Cases
|Use Case Number*
||Primary Technical Theme
||IQ Aspect relevant to issues**
||Quality Management Phases addressed by recommendations***
||Multiple inputs contributing to the product, derived from diverse sources. Some inputs may be of unknown/undocumented data quality.
||1, 2, 3, 4
||Not all types of users need the same level of detail in documentation.
||1, 2, 3, 4
||User needs to identify and retrieve dataset(s) of interest from predefined collections based on the metadata/documentation provided.
||1, 2 & 3, 4
||Varying quality of data depending on people collecting them.
||1, 2, 3, 4
||Guidance regarding how to use already available quality indicators.
||1, 2, 3
||1, 2, 3
||Large differences between buoy and satellite-derived data.
||1, 2, 4
||Need for a service to apply specific quality filtering levels or flags while extracting data values from a file.
||1, 2, 3, 4
||Selecting the most relevant and useful datasets among those containing similar geophysical parameters.
||Users need to know error propagation as higher level products are generated.
||1, 2, 3
||Use of data outside "normal" spatial coverage area.
||2, 3, 4
||Geometric error in land mask.
||1, 2, 3
||Guidance to Principal Investigators about proper level of data quality documentation.
||Conformance of netCDF or HDF files (granules) to the Climate Forecast (CF) and Attribute Convention for Dataset Discovery (ACDD) metadata models.
||Quality flag that marks questionable ice values, rather than filtering out such values.
||1, 2, 3
||Need improved identification and characterization of outliers.
||1, 2, 3, 4
||1, 2, 3
||Provide sufficient information to users such that they can judge and replicate our products.
||1, 2, 3
||Insurance company trying to assess the coastal region that is vulnerable to storm surge finds that only limited types of data are available.
||2, 3, 4
||1, 2, 3, 4
||Need to know how much of a pixel is composed of specific spaceborne sensor inputs and/or in situ measurements.
||1, 2, 3
||User needs data with spatial resolution under 10 km and maximum data coverage with minimal data dropouts.
||2, 3, 4
||1, 2, 3, 4
||Accuracy of product documentation vs. provisional product contents.
||1, 2, 4
* Use cases 1 through 4 were the new ones defined by ESIP IQC; 5 through 20 were carried over from NASA DQWG.
** 1 = Science; 2 = Product; 3 = Stewardship; 4 = Service
*** 1 = Capture; 2 = Describe; 3 = Facilitate Discovery; 4 = Enable Use
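The numeric codes in the two legends above, and the kind of tally reported in Table 2, can be illustrated with a short Python sketch. The function names `decode` and `cross_tabulate` are hypothetical, and the two sample records are placeholders chosen only to exercise the code; they are not rows taken from this appendix.

```python
from collections import Counter
from itertools import product

# Legends copied from the table footnotes.
IQ_ASPECTS = {1: "Science", 2: "Product", 3: "Stewardship", 4: "Service"}
DQMP = {1: "Capture", 2: "Describe", 3: "Facilitate Discovery", 4: "Enable Use"}


def decode(codes, legend):
    """Turn a cell such as '1, 2 & 3, 4' into a list of legend labels."""
    tokens = codes.replace("&", ",").split(",")
    return [legend[int(tok)] for tok in tokens if tok.strip()]


def cross_tabulate(records):
    """Count use cases for every (IQ aspect, phase) combination.

    Each record is an (aspect_codes, phase_codes) pair of strings.
    """
    counts = Counter()
    for aspect_codes, phase_codes in records:
        aspects = decode(aspect_codes, IQ_ASPECTS)
        phases = decode(phase_codes, DQMP)
        counts.update(product(aspects, phases))
    return counts


# Placeholder records for illustration only (not taken from Appendix A).
sample = [("1, 2", "1, 2, 3"), ("2", "2, 4")]
table = cross_tabulate(sample)

print(decode("1, 2 & 3, 4", DQMP))
# ['Capture', 'Describe', 'Facilitate Discovery', 'Enable Use']
print(table[("Product", "Describe")])
# 2
```

Because a use case can touch several aspects and several phases, each record contributes to every combination it covers, so the cell counts in such a tally can sum to more than the number of use cases.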
About the Authors
Hampapuram Ramapriyan is a Research Scientist/Subject Matter Expert at Science Systems and Applications, Incorporated (SSAI). He supports the Earth Science Data and Information System (ESDIS) Project at NASA Goddard Space Flight Center, which is responsible for archiving and distributing most of NASA's Earth science data. Prior to his employment with SSAI, he was the Assistant Project Manager of the ESDIS Project. An active member of the Federation of Earth Science Information Partners (ESIP) since its inception in 1998, he is currently a member of its Data Stewardship Committee and Chair of the Information Quality Cluster.
Ge Peng has over twenty years of technical experience in assessing and monitoring quality of Earth Science observational systems, data products and model output. Dr. Peng currently supports NOAA's Climate Data Record Program and OneStop Project. She is an active member of the Federation of Earth Science Information Partners (ESIP), a member of its Data Stewardship Committee, and co-chair of its Information Quality Cluster, where she leads the effort in defining roles and formalizing responsibilities of major product key players and stakeholders for ensuring data quality and improving usability, in collaboration with NOAA's National Centers for Environmental Information.
David Moroni has 9 years of experience in managing and providing technical support for data in the NASA Physical Oceanography Distributed Active Archive Center (PO.DAAC) at the Jet Propulsion Laboratory, conducting and facilitating research on characterizing data uncertainty using geospatial derivative techniques, developing/testing Earth science data-centric software and technologies, utilizing and deploying data/metadata and software interoperability standards, and implementing and developing best practices in data stewardship. Mr. Moroni currently serves as chair and co-chair (respectively) of the NASA Earth Science Data Systems Working Groups (ESDSWG) Data Quality Working Group (DQWG; 2014-present) and ESIP IQC (2015-present).
Chung-Lin Shie has been a member of the research faculty at University of Maryland, Baltimore County since 2001. He has worked on various scientific subjects such as Global Water and Energy Cycles, Air-Sea Interaction, and Cloud and Hurricane Simulations. He has been serving as Project Scientist of the Goddard Earth Sciences Data and Information Services Center (GES DISC) since 2013, where he provides scientific suggestions aimed at improving its data services, ensuring data quality and enhancing science applications. He has also participated in several Working Groups organized by the Earth Science Data and Information System (ESDIS) Project and the Federation of Earth Science Information Partners (ESIP).