D-Lib Magazine
The Magazine of Digital Library Research


September/October 2013
Volume 19, Number 9/10


Multi-year Content Analysis of User Facility Related Publications

Robert M. Patton, Christopher G. Stahl, Jayson B. Hines, Thomas E. Potok, Jack C. Wells
Oak Ridge National Laboratory
{pattonrm, stahlcg, hinesjb, potokte, wellsjc}@ornl.gov






Scientific user facilities provide resources and support that enable scientists to conduct experiments or simulations pertinent to their respective research. Consequently, it is critical to have an informed understanding of the impact and contributions that these facilities have on scientific discoveries. Leveraging insight into scientific publications that acknowledge the use of these facilities enables more informed decisions by facility management and sponsors in regard to policy, resource allocation, and influencing the direction of science, as well as a more effective understanding of the impact of a scientific user facility. This work discusses preliminary results of mining scientific publications that utilized resources at the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL). These results show promise in identifying and leveraging multi-year trends and providing a higher resolution view of the impact that a scientific user facility may have on scientific discoveries.

Keywords: Scientific User Facility, Trend Analysis, Text Analysis, Algorithms, Design


1. Introduction

Scientific user facilities such as the Spallation Neutron Source (SNS) and the European Synchrotron Radiation Facility (ESRF) provide physical resources and technical support that enable scientists to conduct experiments or simulations pertinent to their respective research. Such facilities provide significant capabilities not found anywhere else. Consequently, both facility management and sponsors want to know what impact their facility has on scientific discovery. The need to justify the existence and funding of these facilities drives the need for appropriate performance metrics. One such metric is the number of publications that users produce as a direct result of using the facility.

Oak Ridge National Laboratory (ORNL) is currently home to eight user facilities. One of these is the Oak Ridge Leadership Computing Facility (OLCF), which was established in 2004 with the purpose of creating and supporting a supercomputer 100 times more powerful than the computer platforms of that time [4]. OLCF provides high performance computing resources to support breakthrough research and scientific discovery in a wide range of fields such as climate, chemistry, biology, physics, and energy. Unlike many scientific user facilities, OLCF impacts a wide range of scientific domains. As of 2013, OLCF is home to Titan, the world's fastest open science, unclassified supercomputer, which has a peak performance of more than 20 petaflops (20 quadrillion floating point operations per second). Titan consists of nearly 300k processor cores, 710 terabytes of system memory, and over 18k GPU (graphics processing unit) accelerators.

One of the ways that scientists around the world can access this user facility is through the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program [2]. Each year, proposals are requested for research that would utilize OLCF's computing resources [3]. These proposals describe the science that will be performed using OLCF resources as well as how many "core hours" will be needed to perform the project. For example, a proposal regarding simulation of climate change may request 10 million core hours of Titan access. As with funding, OLCF has a finite number of core hours each year that must be wisely distributed among the proposals. Consequently, it is imperative to define and measure a return on investment (ROI) of core hours into the selected proposals. The work described here is an initial attempt to harness scientific publications over a multi-year period as one aspect of defining and measuring that ROI.


2. Background

This work continues and leverages previous work described in [5]. A significant challenge faced by user facilities is the initial identification of authors who utilized their resources to produce publications. The solution in [5] enabled the automatic collection and filtering of five years' worth of publications for OLCF, spanning 2008 - 2012. The current work focuses on the analysis of the content and distribution of these publications. The goal is to begin using the results of this analysis as a feedback loop for decision-making regarding resource allocation.


3. Related Work

This work is focused on a multi-year content analysis of scientific publications. Many questions of interest to facility management and sponsors require a multi-year perspective, and the impact of a user facility on scientific research may not be seen immediately, but rather over a period of time. Other research has also studied scientific publications over multiple years.

In the work of [6], the national scientific output of Singapore over a 10-year period was evaluated with bibliometric analysis. A number of factors were analyzed, including citation counts, total publication counts by year, document type (e.g., conference, journal, book), and the impact of single-author publications versus multi-author publications. The business case for the work of [6] was to evaluate Singapore with respect to other countries and how its position relative to those countries would impact investment into specific areas of science. Our work is similar to this, but on a smaller scale and focused on just scientific user facilities.

In the work of [1], a generic approach using cluster views and time-zone views is discussed in order to identify emerging trends and transient patterns in scientific publications. Two specific areas are evaluated: mass extinction (1981 - 2004) and terrorism (1990 - 2003). By using both cluster and time views of the data, the speed and direction of a particular research field can be more easily seen. Our work is similar, but expanded to include analysis of authors and publishing venues.


4. Analysis

As mentioned previously, the data set consists of publications from 2008 - 2012 that clearly acknowledge the use of OLCF facilities. This data set is a subset of a larger set of publications that result from the use of OLCF resources; the larger set also includes publications that do not clearly acknowledge that use. The intention here is to limit the scope of the data and understand the value of the analysis prior to expanding to a more complete set of publications. Conclusions drawn here should be considered preliminary and representative of the analysis outcomes that may be accomplished, not final outcomes. In addition, some of the results of the analysis are still considered business sensitive and are not available for release.

Initial analysis seeks to answer who, what, where, and when questions. Consequently, we focused on three areas: authorship, topics, and publication venue. Authorship provides a view into "who" is having an impact on the scientific community. The topics provide insight into "what" scientific areas are being impacted. Publication venue exposes "where" the scientific impact is occurring. Publication venue is significant for this business case in that it defines the audience that may be impacted by research conducted using OLCF resources. This gives a first indication as to whether the audience is small, large, homogeneous, heterogeneous, or multi-disciplinary. Finally, performing the content analysis over multiple years provides insight into the "when" questions.


4.1 Authors

Authorship for OLCF related papers presents several challenges. First, some authors did not have direct access to OLCF resources but were collaborators. Next, authors may not necessarily be the principal investigator (PI) of an OLCF project. In addition, some authors may be neither a direct user of OLCF resources nor a PI of an OLCF project, but used results that others generated using OLCF resources. Other challenges also exist in regard to authorship.

For this work, initial analysis focused strictly on the first author of the paper, how many OLCF related publications that author published, and whether or not that author published at least one OLCF related paper over multiple years. Tables 1 through 5 show these initial results. Of particular interest in these results is that most of the prolific authors are also multi-year authors. While not shown here due to space constraints, many authors who are not prolific (only 1 publication in a particular year) are also not multi-year authors.


Table 1: Most frequent authors for 2008

| Author | Number of Publications | Multi-Year |
|---|---|---|
| Joost VandeVondele | 4 | Yes |
| Robert Harrison | 3 | Yes |
| Wenchang Lu | 3 | Yes |
| Liping Huang | 2 | Yes |
| Michael Kuhlen | 2 | No |
| John Mellor-Crummey | 2 | No |
| Cho Ng | 2 | Yes |
| Xiaohua Zhang | 2 | No |

Table 2: Most frequent authors for 2009

| Author | Number of Publications | Multi-Year |
|---|---|---|
| Nikolai Pogorelov | 3 | Yes |
| Leopold Grinberg | 2 | Yes |
| Robert Harrison | 2 | Yes |
| Liping Huang | 2 | Yes |
| Cheng Liu | 2 | Yes |
| Sujata Paul | 2 | No |
| Jun Zhou | 2 | Yes |

Table 3: Most frequent authors for 2010

| Author | Number of Publications | Multi-Year |
|---|---|---|
| Lin-Wang Wang | 4 | Yes |
| David Bowler | 3 | No |
| Mathieu Luisier | 3 | Yes |
| Di Wang | 3 | Yes |
| Ye Xu | 3 | Yes |

Table 4: Most frequent authors for 2011

| Author | Number of Publications | Multi-Year |
|---|---|---|
| Lin-Wang Wang | 5 | Yes |
| Pablo Carrica | 3 | Yes |
| Christopher Mundy | 3 | Yes |
| Alexey Volkov | 3 | Yes |

Table 5: Most frequent authors for 2012

| Author | Number of Publications | Multi-Year |
|---|---|---|
| Bobby Sumpter | 7 | Yes |
| Rong Yu | 5 | Yes |
| Leonid Zhigilei | 3 | No |
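
The tallies in Tables 1 through 5 can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for this work: the record layout, the function name `first_author_summary`, and the sample authors are all hypothetical, and the sketch assumes each publication has already been reduced to a (year, first author) pair.

```python
from collections import Counter, defaultdict

# Hypothetical records: (publication year, first author).
# Illustrative values only; this is not the OLCF data set.
records = [
    (2008, "A. Author"),
    (2008, "A. Author"),
    (2008, "B. Author"),
    (2009, "A. Author"),
    (2009, "C. Author"),
    (2009, "C. Author"),
]


def first_author_summary(records):
    """Return {year: [(author, count, is_multi_year), ...]}, most prolific first."""
    per_year = defaultdict(Counter)  # year -> publication count per first author
    years_seen = defaultdict(set)    # author -> years with at least one paper
    for year, author in records:
        per_year[year][author] += 1
        years_seen[author].add(year)

    summary = {}
    for year, counts in per_year.items():
        rows = [(a, n, len(years_seen[a]) > 1) for a, n in counts.items()]
        # Highest count first; ties are broken alphabetically by author name.
        rows.sort(key=lambda r: (-r[1], r[0]))
        summary[year] = rows
    return summary


summary = first_author_summary(records)
for year in sorted(summary):
    for author, count, multi in summary[year]:
        print(year, author, count, "Yes" if multi else "No")
```

An author is flagged as multi-year here when the same name string appears as first author in more than one year, so in practice name variants would need to be reconciled before counting.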

4.2 Topics

In regard to topics, various levels of granularity can be shown. In this particular work, topics were analyzed by year. At a high level, the question is what overall direction OLCF related publications focused on in a particular year. Future work will investigate finer granularities of topics with respect to specific science areas. Figures 1 through 5 show word clouds generated using an online tool called Wordle. Word clouds use the frequency of a word as a means of showing how some words are more prominent or significant than other words. For each publication in a given year, the Term Frequency - Inverse Corpus Frequency (TF-ICF) term weighting scheme [7] was used to identify the most significant terms for that publication. Then, for each significant term, the frequency across all papers was computed. The most frequent significant terms across all publications for each year were then visualized using Wordle. Of particular interest in the word clouds is the increase in high performance computing related terms (e.g., multicore, gpu, mpi, cuda, scalability) in years 2011 and 2012. This increase appears prior to Titan becoming operational.
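
The term-selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for this work. The weighting form log(1 + tf) * log((N + 1) / (n + 1)) is an assumption about the scheme in [7] and should be checked against that paper; the reference corpus statistics, the token lists, the cutoff `k`, and all function names are hypothetical. The sketch also reads "frequency across all papers" as the number of papers in which a term ranks among the most significant, which is one possible interpretation.

```python
import math
from collections import Counter


def tf_icf(doc_tokens, corpus_doc_freq, corpus_size):
    """Weight each term of one document against a fixed reference corpus.

    Assumed form: log(1 + tf) * log((N + 1) / (n + 1)), where tf is the
    term count in the document, N is the reference corpus size, and n is
    the number of reference documents that contain the term.
    """
    tf = Counter(doc_tokens)
    return {
        term: math.log(1 + count)
        * math.log((corpus_size + 1) / (corpus_doc_freq.get(term, 0) + 1))
        for term, count in tf.items()
    }


def significant_terms(doc_tokens, corpus_doc_freq, corpus_size, k):
    """Return the k highest-weighted terms of a single document."""
    weights = tf_icf(doc_tokens, corpus_doc_freq, corpus_size)
    # Highest weight first; ties are broken alphabetically.
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


def yearly_term_frequency(docs, corpus_doc_freq, corpus_size, k):
    """Count in how many papers each term ranks among the top k."""
    totals = Counter()
    for tokens in docs:
        totals.update(significant_terms(tokens, corpus_doc_freq, corpus_size, k))
    return totals


# Hypothetical reference corpus statistics: term -> number of documents.
corpus_size = 1000
corpus_doc_freq = {"the": 990, "results": 400, "simulation": 50, "gpu": 5, "cuda": 3}

# Hypothetical tokenized papers from a single year.
docs = [
    ["the", "gpu", "simulation", "the", "gpu"],
    ["the", "cuda", "gpu", "results"],
]

totals = yearly_term_frequency(docs, corpus_doc_freq, corpus_size, k=2)
print(totals.most_common())
```

Because the inverse frequency comes from a fixed reference corpus rather than from the papers being analyzed, each paper can be weighted independently of the others; the resulting per-year counts are what a word cloud tool would then size by.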


Figure 1: Significant terms from 2008

Figure 2: Significant terms from 2009


Figure 3: Significant terms from 2010


Figure 4: Significant terms from 2011


Figure 5: Significant terms from 2012


4.3 Publication Venue

Publication venue provides another significant aspect of measuring impact for a user facility. Where a paper is published can significantly affect the citation count. In addition, much like topics, analyzing the venues provides similar insight into the scientific domains that may be impacted by the user facility. Figure 6 shows the most frequently used venues for OLCF related papers from 2008 - 2012. One interesting aspect is that a significant number of venues for 2012 are related to high performance computing, prior to Titan becoming operational.


Figure 6: Most frequent publishing venues from 2008 - 2012


5. Future Work & Summary

This work shows several promising opportunities to leverage scientific publications as one possible measure for return on investment into proposals that utilize a user facility's resources. One possible extension to this work is the introduction of social network analysis via co-authorship. Many of the co-authors of these publications are not direct users of the user facility. Consequently, this represents a secondary impact that user facilities may have on the science community. Another area of opportunity is investigating the impact of specific OLCF projects over a period of more than five years and how a specific project may produce future proposals to OLCF. Additional tool support such as ParsCit will also be considered. ParsCit provides the ability to extract author names, affiliated institutions, and references from a publication.

This current effort provides an initial start in harnessing scientific publications as one aspect of defining and measuring the ROI of a scientific user facility's resources. After collecting a set of publications related to the Oak Ridge Leadership Computing Facility, we investigated three aspects: publication venues, content, and authorship. These aspects provide additional insight into the outcomes and impacts of existing OLCF projects that can then be used by facility management and sponsors to make more informed decisions with regard to policy, resource allocation, and influencing the direction of science.


6. Acknowledgements

This manuscript has been authored by Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, Tennessee 37831-6285; managed by UT-Battelle, LLC, and used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory under contract DE-AC05-00OR22725 for the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.


7. References

[1] Chen, C. (2006). CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the American Society for Information Science and Technology, 57(3), 359-377.

[2] Innovative and Novel Computational Impact on Theory and Experiment (INCITE) Program

[3] 2014 INCITE Call for Proposals

[4] Oak Ridge Leadership Computing Facility Overview

[5] Patton, R. M.; Stahl, C. G.; Potok, T. E. & Wells, J. C. (2012). "Identification of User Facility Related Publications", D-Lib Magazine 18(7/8). http://doi.org/10.1045/july2012-patton.

[6] Rana, S. (2012). "Bibliometric analysis of output and visibility of science and technology in Singapore during 2000-2009." Webology, 9(1), Article 96.

[7] Reed, J. W., Jiao, Y., Potok, T. E., Klump, B. A., Elmore, M. T., and Hurson, A. R. (2006). "TF-ICF: A new term weighting scheme for clustering dynamic data streams", In Proc. of the 5th International Conference on Machine Learning and Applications, pp. 258-263.


About the Authors

Robert M. Patton received his PhD in Computer Engineering with emphasis on Software Engineering from the University of Central Florida in 2002. He joined the Computational Data Analytics group at Oak Ridge National Laboratory (ORNL) in 2003. His research at ORNL has focused on nature-inspired analytic techniques to enable knowledge discovery from large and complex data sets, and has resulted in approximately 30 publications pertaining to nature-inspired analytics and 3 patent applications. He has developed several software tools for the purposes of data mining, text analyses, temporal analyses, and data fusion, and has developed a genetic algorithm to implement a maximum variation sampling approach that identifies unique characteristics within large data sets.


Christopher G. Stahl recently graduated with a Bachelor of Science from Florida Southern College. For the past year he has been participating in the Higher Education Research Experiences (HERE) program at Oak Ridge National Laboratory. His major research focuses on data mining and data analytics. In the future he plans on pursuing a PhD in Computer Science with a focus on Software Engineering.


Jayson B. Hines is a Project Manager for the Computing and Computational Sciences Directorate at Oak Ridge National Laboratory (ORNL). Prior to this he worked for 7 years as the Outreach Task Lead for the National Center for Computational Sciences at ORNL, overseeing communications and outreach activities, user assistance, and user publication tracking. He received his MBA from Liberty University and his Bachelor of Science from the University of Tennessee, Knoxville.


Thomas E. Potok is the founder and leader of the Computational Data Analytics Group at the Oak Ridge National Laboratory, and an adjunct professor at University of Tennessee in Computer Science. He is currently a principal investigator on a number of projects involving large scale data mining and agent technology. Prior to this he worked for 14 years at IBM's Software Solutions Laboratory in Research Triangle Park, North Carolina, where he conducted research in software engineering productivity. He has a BS, MS, and Ph.D. in Computer Engineering, all from North Carolina State University. He has published 100+ papers, received 10 issued (approved) patents and an R&D 100 Award in 2007, and serves on a number of journal editorial boards and conference organizing and program committees.


Jack C. Wells is the director of science for the National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL). He is responsible for devising a strategy to ensure cost-effective, state-of-the-art scientific computing at the NCCS, which houses the Department of Energy's Oak Ridge Leadership Computing Facility (OLCF). In ORNL's Computing and Computational Sciences Directorate, Wells has worked as group leader of both the Computational Materials Sciences group in the Computer Science and Mathematics Division and the Nanomaterials Theory Institute in the Center for Nanophase Materials Sciences. During a sabbatical, he served as a legislative fellow for Senator Lamar Alexander, providing information about high-performance computing, energy technology, and science, technology, engineering, and mathematics education issues. Wells began his ORNL career in 1990 for resident research on his Ph.D. in Physics from Vanderbilt University. Following a three-year postdoctoral fellowship at Harvard University, he returned to ORNL as a staff scientist in 1997 as a Wigner postdoctoral fellow. Jack is an accomplished practitioner of computational physics and has been supported by the Department of Energy's Office of Basic Energy Sciences. Jack has authored or co-authored over 70 scientific papers and edited 1 book, spanning nanoscience, materials science and engineering, nuclear and atomic physics, computational science, and applied mathematics.
