D-Lib Magazine, September/October 2012

The Data Conservancy Instance: Infrastructure and Organizational Services for Research Data Curation
Matthew S. Mayernik

Abstract

Digital research data can only be managed and preserved over time through a sustained institutional commitment. Research data curation is a multi-faceted issue, requiring technologies, organizational structures, and human knowledge and skills to come together in complementary ways. This article provides a high-level description of the Data Conservancy Instance, an implementation of infrastructure and organizational services for data collection, storage, preservation, archiving, curation, and sharing. While comparable to institutional repository systems and disciplinary data repositories in some aspects, the DC Instance is distinguished by featuring a data-centric architecture, discipline-agnostic data model, and a data feature extraction framework that facilitates data integration and cross-disciplinary queries. The Data Conservancy Instance is intended to support, and be supported by, a skilled data curation staff, and to facilitate technical, financial, and human sustainability of organizational data curation services. The Johns Hopkins University Data Management Services (JHU DMS) are described as an example of how the Data Conservancy Instance can be deployed.

1 Introduction

Data management and curation feature prominently in the landscape of twenty-first century research. Digital technologies have become ubiquitous for the collection, analysis, and storage of research data in all disciplines. Digital research data, if curated and made broadly available, promise to enable researchers to ask new kinds of questions and use new kinds of analytical methods in the study of critical scientific and societal issues. Universities, research organizations, and federal funding agencies are all promoting the re-use potential of digital research data (Long-Lived Digital Data Collections, 2005; Changing the Conduct of Science in the Information Age, 2011).
Information institutions, such as libraries and data centers, are in a position to lead data management and curation efforts by developing tools and services that enable researchers to manage, preserve, find, access, and use data within and across institutions and disciplines (Walters & Skinner, 2011; UNC-CH, 2012). This article provides an overview of the Data Conservancy, a community promoting data preservation and re-use across disciplines through tools and services, and outlines the Data Conservancy Instance, an infrastructural and organizational data curation solution for research institutions.

1.1 What is the Data Conservancy?

The Data Conservancy is a community organized around data curation research, technology development, and community building (Treloar, Choudhury, & Michener, 2012). Initially funded by the National Science Foundation's DataNet program, the Data Conservancy is headquartered at the Sheridan Libraries, Johns Hopkins University. Data Conservancy community members include university libraries, national data centers, national research labs, and information science research and education programs. The Data Conservancy community is driven by a common theme: the need for institutional solutions to digital research data collection, curation, and preservation challenges. The four main activities of the Data Conservancy are:
This ongoing work led to the development of the concept of a Data Conservancy Instance.

1.2 The Data Conservancy Instance

Data curation solutions for research institutions must address both technical and organizational challenges. A Data Conservancy Instance is an installation of the Data Conservancy technical infrastructure embedded within a particular organization. Individual DC Instances are shaped by these considerations:
1.3 DC Instance service offerings and value

The Data Conservancy Instance has a number of features that are common to digital library systems, such as ingest, storage, and search/browse, and some features that are unique to the Data Conservancy Instance. Core features and values of a DC Instance include:
2 Data Conservancy Technical Framework

The technical infrastructure of a Data Conservancy Instance consists of the Data Conservancy software stack and appropriate web server and data storage hardware.

2.1 Software infrastructure

The Data Conservancy software stack is the core technical component of a DC Instance. The development of the DC software stack has used the Open Archival Information System (OAIS) reference model as a guide (CCSDS, 2012). The OAIS model defines the terminology, responsibilities, requirements, functional entities, and key actors for an open archival system.

2.1.1 Software architecture

The DC software architecture contains four layers. Each layer can communicate with the layers above or below it, but, by the design of the stack, communications cannot skip layers. Certain layers require external components to be installed, as discussed below.

1st layer: Application layer

The application layer consists of applications created and owned by the DC that access specific services through the APIs. Examples of application layer services include the user interfaces and the batch loading application. External entities, such as organizations and services that use the DC from outside of the stack, are also considered applications, as they can invoke the ingest and access layers via the API layer.

2nd layer: API layer

The DC software provides a set of APIs that may be invoked by clients (either human users or other programs). All DC system functions communicate with the DC software services through the APIs. The API layer provides the specifications for how ingest and search-and-access services are accessed and invoked. The APIs are invoked via HTTP requests (GET, POST, etc.). The purpose of the APIs is to insulate clients from the complexities of the internal system, allowing the system to evolve without requiring clients to change the ways that they use or invoke the Instance's features.
3rd layer: Services layer

The services layer consists of services that are invoked as needed by the applications via the APIs, including ingest, indexing, and search-and-access. The services in this layer are designed to be modular; potentially, they could be extracted and applied in another context. The DC services are distributed as a Java web application, while the search functionality uses the Apache Solr software.

4th layer: Archiving layer

The archival storage API is the interface to the archival services, and is used to deposit data into the archive and to retrieve data from the archive for users. The archival layer can be implemented using a range of archival storage frameworks, but the implementation recommended by the DC development team uses the Fedora Commons Repository Software.

2.1.2 Interaction through public Web service APIs

There are two principal APIs providing the interface to the underlying services: Ingest and Search-and-Access. Data requests are fulfilled by the appropriate API transferring data to and from the archival services. Because the APIs work via HTTP requests, they also allow external services to connect to the DC Instance. Thus, both internal and external services interact with the DC ingest, search-and-access, and archival services in the same manner. This section only provides a high-level overview of the APIs; please see the Data Conservancy web site for more detailed API documentation.

Ingest service and API

Search-and-access service and API

External service integration

A pilot project allowed the National Snow and Ice Data Center (NSIDC, Boulder, CO) to use the JHU DC Instance APIs to discover metadata about a number of glacier images held in a collection about volcanism in the Dry Valleys of Antarctica, and to include that metadata within NSIDC's own system. The NSIDC Glacier Photo Collection periodically harvests the images from the DC Instance and updates its collection if any changes are made to the photos.
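This periodic, read-only harvesting pattern can be sketched with plain HTTP requests. The sketch below is illustrative only: the base URL, the "/query" path, and the "q"/"max" parameters are assumptions made for this example, not the documented Data Conservancy API routes; consult the DC API documentation for the actual specifications.

```python
import urllib.parse
import urllib.request

# Hypothetical Instance address; real deployments define their own.
BASE = "https://dc-instance.example.edu"

def build_search_request(query: str, max_results: int = 25) -> urllib.request.Request:
    """Build (but do not send) an HTTP GET for the search-and-access API.

    The "/query" path and "q"/"max" parameters are illustrative assumptions.
    """
    params = urllib.parse.urlencode({"q": query, "max": max_results})
    return urllib.request.Request(url=f"{BASE}/query?{params}", method="GET")

def build_harvest_request(item_url: str, last_modified: str) -> urllib.request.Request:
    """Build a conditional GET that re-fetches an item only if it has changed.

    An If-Modified-Since header lets a harvester (like the NSIDC photo
    collection) skip unchanged items: the server answers 304 Not Modified.
    """
    return urllib.request.Request(
        url=item_url,
        method="GET",
        headers={"If-Modified-Since": last_modified},
    )

search = build_search_request("glacier photographs Antarctica")
print(search.get_method(), search.full_url)
```

Passing either request to urllib.request.urlopen() would execute it against a live Instance; a 304 Not Modified response to the conditional GET tells the harvester that the photo has not changed since the last harvest.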
The NSIDC uses the search-and-access APIs, which allow "read-only" functions.

2.2 Hardware infrastructure

The Data Conservancy software stack is hardware agnostic. The DC community is developing primarily Linux-based Instances, but the DC software stack can be installed on any hardware and operating system combination that supports a Java Runtime Environment. Fedora, the recommended archival storage framework, is also a hardware-agnostic, Java-based system. The hardware requirements for a DC Instance will depend on the scope of the data curation services being implemented. For example, data volumes vary from project to project and discipline to discipline. Correspondingly, the amount of hard disk drive (HDD) and tape storage space needed for any particular Instance will vary depending on the data to be managed and curated. As such, the Data Conservancy software has no explicit hardware requirements. DC Instances can be installed with varying amounts of RAM, storage space, and processor speeds. Hardware requirements should be assessed prior to installing an Instance, and can be re-assessed on an ongoing basis as the Instance's performance is evaluated and any bottlenecks are identified.

3 Organizational Framework

A Data Conservancy Instance is designed to be set within an organizational structure, with skilled staff involved in the installation, deployment, and ongoing upkeep of the Instance technology and services.

3.1 Staffing and skills

Skilled staff are central to any data curation service. Staffing needs for a DC Instance may change over time depending on the kinds of services that are developed around the system, but a typical DC Instance will require particular types of staff and skill sets.
Depending on the staffing and budget situation in which a DC Instance is deployed, multiple roles might be consolidated into a single person or embedded into other existing positions within the organization; in particular, the Instance administrator and data management consultant roles, and the system administrator and software developer roles, can be combined. Many of the skills that these roles require are already present in library and computing departments. For example, the data management consultation process emulates a reference interview (Garritano & Carlson, 2009). Data management consultants need to gather information about a researcher's data management needs, identify gaps in current plans and practices, help the researchers understand their data management options, and then help researchers to prepare and iterate on their data management plans. As with any reference work, consultants should adapt their recommendations and assistance to researchers' timeframes and deadlines. The more direct interactions that data management consultants have with researchers, the more they can encourage systemic change in academic data management cultures.

3.2 Organizational structure

The Data Conservancy Instance has been developed as a research support service. Research libraries are the main type of institution that has expressed interest in Data Conservancy services, but an Instance is not restricted to being a library service. Another group within a research institution, such as an academic computing group, might support an Instance, or two or more units within a single institution may work together to support one. Each DC Instance needs to define its own collection policy. Collection policies will vary depending on the institutional context in which an Instance is situated.
Important considerations include the communities to be served by the Instance, the scope of the collections to be included, the data types that will be supported, the criteria for inclusion, how and when to reappraise data collections, and the levels of service that will be provided. In addition, an Instance should define data retention policies that outline levels of support and commitment to maintaining data over time. A data retention policy does not require that the Instance commit to open-ended support for data; a policy can specify that data retention procedures revolve around defined projects and time periods. Collection policies and data retention policies help to guide data deposition agreements, and the management and curation of the data resources over time.

4 Sustainability Strategies

Data can only be managed and preserved over time through a sustained institutional commitment. Ensuring the sustainability of data curation efforts is a multi-faceted issue, requiring sustainable technologies, sustainable financial structures, and sustainable ways of ensuring the continuity of human knowledge and skills (Lavoie, 2012).

4.1 Technical sustainability

Technical sustainability has been an important consideration throughout the DC software design and development process. Data Conservancy technical sustainability arises from:
4.2 Financial sustainability

In order to ensure that data resources are accessible and usable in the future, data curation services must be backed by a sustainable financial model. Sustainable cost models for data curation services are not yet well understood; different data curation institutions have different financial models. Financial sustainability is, however, interconnected with technical and human sustainability issues. Successful implementation and operation of any digital data curation service will require a thorough analysis of all known or expected costs for the immediate future, coupled with strategies for continuing to cover those costs in sustainable ways.

4.2.1 Cost categories

The three main costs of running a data curation solution are:
4.2.2 Financial models and strategies

Different organizations use different financial models to support data curation services. These models range from direct funding from national governments to fee-for-service models. Organizations commonly draw on multiple sources of funds (Lavoie, 2012). The effectiveness and sustainability of any financial model must be continually evaluated and proactively managed so that adjustments can be made. Funding models include:
4.3 Human sustainability

Human sustainability is critical to ensuring continuity and consistency of data curation services over time. Staff develop knowledge and day-to-day practices for working with researchers, creating and implementing data management plans, and working effectively with technical systems. Because data collections are so variable, the most effective data management and curation environments are those that allow for cross-pollination of expertise, practices, and skills among staff members. Sharing of expertise plays a central role in the ongoing operation and development of any data curation solution. Over time, data management consultants will develop their own expertise working with the DC Instance, and will be able to provide training to new users and staff. Human sustainability feeds into technical sustainability. Each DC Instance has a designated "product owner" who serves as the prime contact and liaison to the broader DC community. The product owners group evaluates the current system and identifies development needs and functionalities to prioritize. Product owners should also feed new or customized DC services, such as new search interfaces, feature extraction profiles, or other API-based services, back to the broader DC community. As the DC community grows, new Instance product owners will be invited to join in the iterative feedback process.

5 Case Study: Johns Hopkins University Data Management Services

The Johns Hopkins University Data Management Services (JHU DMS) are the first example of how the DC Instance can be deployed. DMS has a full DC Instance installed, with the DC software stack running on a local server. The DMS went live in July 2011. The JHU DMS website can be found at http://dmp.data.jhu.edu/.

5.1 DMS services and financial model

JHU DMS is supported by the Deans of schools within JHU that receive NSF research grants.
The JHU DMS was proposed as a research support service to fill a need newly opened up by the NSF data management planning requirement (NSF, 2012a), but the scope of DMS has since expanded to other funding sponsors. The DMS proposal included a rationale for the DMS, the scope of the services the DMS would provide, and a budget for those services. As proposed (and now implemented), the DMS provides two services: 1) consultative services for researchers who are writing a data management plan for a research proposal, and 2) post-award services for researchers who receive a funding award, including developing data management procedures as a grant proceeds, and helping researchers to deposit their data into the DMS DC Instance. The pre- and post-award DMS services are financially distinct. The pre-award data management planning consultations are supported directly by JHU Deans, while post-award DMS services are written into proposal budgets by the researchers who wish to work with DMS after the grant is received. Thus, post-award DMS fees are charged to the individual grants that are being supported. As these two distinct financial models suggest, researchers can work with DMS consultants pre-award (to create a data management plan for a proposal) without working with DMS consultants post-award (to archive their data in the DMS DC Instance).

5.2 DMS staffing and operational activities

The DMS staff includes two data management consultants, a software developer, and a services manager who oversees the overall operation of the DMS and serves as the DMS representative in the DC Instance product owners group. A system administrator position is in the process of being filled. These positions were all part of the DMS proposal, and the two data management consultant positions were filled after the JHU Deans endorsed the DMS and agreed to provide financial support.
Domain specificity was not part of the data management consultant job search, but domain expertise was. The two initial data management consultants hired by DMS have graduate degrees in both library/information science and a science or social science domain. This domain expertise has proven to be very valuable in rolling out the DMS services. The DMS data management consultants also serve as the DC Instance administrators, administering user accounts and consulting on the use of the Instance.

5.3 Initial DMS challenges

All DC Instances will encounter challenges unique to their own environments, but the initial months of the DMS indicate the types of challenges that other Instances may face. Because the DMS is a cross-disciplinary service, responding to widely ranging domains requires flexibility and awareness. For example, effectively timing consultative support in order to meet grant submission deadlines can be challenging. Grant-writing and other deadlines vary widely from project to project, within and across disciplines. Data management consultants must be aware of these deadlines and prioritize work accordingly. Another cross-disciplinary challenge is the lack of a common vocabulary for data management activities: one PI's "storage" is another PI's "archiving" (Choudhury, 2012). Navigating different data retention policies is another challenge, as different kinds of data have different retention needs, and funding bodies vary in their retention policies. Working within the boundaries of the NSF data management planning policy is in and of itself a challenge. Condensing key data management planning information into two pages, regardless of the size of the project or expected data complexity, is a notable constraint. Finally, marketing the DMS is necessary but difficult. Building a usage base requires building awareness of the value of data management and curation, and of the DMS itself.
6 Conclusion

The Data Conservancy Instance provides data curation infrastructure for data management, discovery, and integration across disciplines. While comparable to institutional repository systems and disciplinary data repositories in some aspects, the DC Instance has capabilities beyond what either institutional repositories or disciplinary data repositories provide. Discipline-specific and institutional repositories address the data curation requirements of particular data communities, but in doing so often create the data "silo" problem: each disciplinary repository is an independent silo of data with little ability to connect to other repositories (Salo, 2010). Determining data relations across repositories is a difficult task. Bringing data together from multiple repositories requires knowing that multiple potentially related repositories exist, searching each repository individually, and compiling data sets manually. The Data Conservancy Instance contains a number of unique features that set it apart from discipline-specific or institutional repository systems:
With these features (data integration, external interoperability, and a discipline-agnostic infrastructure), the DC Instance is a tool that can lead to collaboration, by enabling researchers to find someone else's data products and assess the applicability of those data to their own research.

7 Acknowledgements

The Data Conservancy is funded by the National Science Foundation under grant number OCI-0830976. Funding for the Data Conservancy and the Johns Hopkins University Data Management Services is provided by the JHU Sheridan Libraries. We acknowledge contributions from our Data Conservancy colleagues and their remarks on earlier versions of this paper.

8 References

[1] Agre, P.E. (2003). Information and institutional change: The case of digital libraries. In A.P. Bishop, N.A. Van House, & B.P. Buttenfield (Eds.), Digital Library Use: Social Practice in Design and Evaluation (pp. 219-240). Cambridge, MA: MIT Press.
[2] Changing the Conduct of Science in the Information Age: Summary Report of Workshop Held on November 12, 2010. (2011). Washington, D.C.: National Science Foundation.
[3] Consultative Committee for Space Data Systems (CCSDS). (2012). Reference Model for an Open Archival Information System (OAIS). Recommendation for space data system standards, CCSDS 650.0-M-2.
[4] Choudhury, G.S. (2012). Data Conservancy Stack Model for Data Management. Council on Library and Information Resources (CLIR).
[5] Delserone, L.M. (2008). At the watershed: preparing for research data management and stewardship at the University of Minnesota Libraries. Library Trends, 57(2): 202-210. http://hdl.handle.net/2142/10670
[6] Duraspace. (2012). Fedora Commons Repository Software: Specsheet.
[7] Garritano, J.R. & Carlson, J.R. (2009). A Subject Librarian's Guide to Collaborating on e-Science Projects. Issues in Science and Technology Librarianship, 57.
[8] Lagoze, C. & Patzke, K. (2011). A research agenda for data curation cyberinfrastructure. Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (pp. 373-382). Ottawa, Ontario, Canada: ACM. http://dx.doi.org/10.1145/1998076.1998145
[9] Lavoie, B.F. (2012). Sustainable research data. In G. Pryor (Ed.), Managing Research Data (pp. 67-82). London: Facet Publishing.
[10] Li, Y. & Banach, M. (2011). Institutional Repositories and Digital Preservation: Assessing Current Practices at Research Libraries. D-Lib Magazine, 17(5/6). http://dx.doi.org/10.1045/may2011-yuanli
[11] Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century. (2005). Washington, D.C.: National Science Foundation, National Science Board.
[12] National Science Foundation (NSF). (2012a). Chapter II proposal preparation instructions: special information and supplementary documentation.
[13] National Science Foundation (NSF). (2012b). Award and Administration Guide, Chapter V: Allowability of Costs.
[14] Palmer, C.L., Weber, N.M., & Cragin, M.H. (2011). The Analytic Potential of Scientific Data: Understanding Re-use Value. Proceedings of the American Society for Information Science & Technology, 48(1): 1-10. http://dx.doi.org/10.1002/meet.2011.14504801174
[15] Renear, A.H., Sacchi, S., & Wickett, K.M. (2010). Definitions of dataset in the scientific and technical literature. Proceedings of the American Society for Information Science and Technology, 47(1): 1-4. http://dx.doi.org/10.1002/meet.14504701240
[16] Salo, D. (2010). Retooling Libraries for the Data Challenge. Ariadne, Issue 64.
[17] Treloar, A., Choudhury, G.S., & Michener, W. (2012). Contrasting national research data strategies: Australia and the USA. In G. Pryor (Ed.), Managing Research Data (pp. 173-203). London: Facet Publishing.
[18] University of California, San Diego (UCSD). (2009). Blueprint for the Digital University: A Report of the UCSD Research Cyberinfrastructure Design Team.
[19] University of North Carolina, Chapel Hill (UNC-CH). (2012). Research Data Stewardship at UNC: Recommendations for Scholarly Practice and Leadership.
[20] Varvel, V.E., Jr., Palmer, C.L., Chao, T., & Sacchi, S. (2011). Report from the Research Data Workforce Summit: Sponsored by the Data Conservancy. Champaign, IL: Center for Informatics Research in Science & Scholarship, University of Illinois. http://www.ideals.illinois.edu/handle/2142/25830
[21] Walters, T. & Skinner, K. (2011). New Roles for New Times: Digital Curation for Preservation. Washington, DC: Association of Research Libraries.
[22] Witt, M. (2012). Co-designing, Co-developing, and Co-implementing an Institutional Data Repository Service. Journal of Library Administration, 52(2). http://dx.doi.org/10.1080/01930826.2012.655607
[23] Wynholds, L., Fearon, D., Borgman, C.L., & Traweek, S. (2011). Awash in stardust: data practices in astronomy. Proceedings of the 2011 iConference (pp. 802-804). Seattle, Washington: ACM. http://dx.doi.org/10.1145/1940761.1940912

About the Authors