Scientific Publications: Gathering Data, Extracting Information, and Following Trends
Petr Knoth and Zdenek Zdrahal, Knowledge Media Institute, The Open University
Digital libraries that store scientific publications are becoming increasingly important in research. They are used not only for the traditional tasks of finding and storing research outputs, but also as data sources for mass automated processing to discover new knowledge, explore new research trends and evaluate research excellence. With the rapid growth in the number of scientific publications deposited in digital libraries, it is no longer sufficient to provide access to content to human readers only. It is equally important to allow machines to analyze this information and, by doing so, to enable new applications that facilitate the processes by which research is accomplished.
Developing infrastructures, aggregating research outputs, text-mining and analyzing large volumes of research papers, discovering and staying current with new research trends, and understanding how to recognize excellence are only some of the hot topics the digital libraries community deals with. Addressing them requires an array of tools that must continue to evolve to take advantage of the latest research and technical developments.
The articles in this issue of D-Lib Magazine were selected from among papers presented at the 2nd International Workshop on Mining Scientific Publications, held in conjunction with JCDL 2013, in Indianapolis, Indiana, USA. The main topics of the workshop were infrastructures for gathering and analyzing large volumes of scientific publications, methods and tools for extracting information from research papers, and other areas related to the tracking of trends and evaluation of research excellence. The articles in this issue pursue two research objectives: (1) extracting selected types of metadata from scientific publications, and (2) analyzing scientific publications to shed light on the use and focus of those collections in a wider research context. Three of the articles focus on the first objective, and two focus on the second.
In their paper entitled "A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries", Muhammad Imran from Qatar Computing Research Institute and his colleagues at the University of Trento in Italy propose a method for disambiguating authors' names, assigning names to the correct individual authors, and correctly mapping authors to publications. The multi-layer hierarchical clustering algorithm described in the paper makes use of a number of metadata attributes, among them co-authorship, authors' affiliation, and publication title. Additionally, users can intervene in the query to improve the quality of the result. Roman Kern from Graz University of Technology and Stefan Klampfl from Know-Center in Graz approach the problem of reference metadata extraction from a different angle in their paper "Extraction of References Using Layout and Formatting Information from Scientific Articles". Their approach supplies additional formatting information, extracted from PDF files, to the existing reference string parsing package ParsCit. Nicolai Erbs, Iryna Gurevych and Marc Rittberger, of the Technische Universität Darmstadt and the German Institute for Educational Research and Educational Information, address the task of extracting keyphrases and assigning index terms to scientific papers in their article "Bringing Order to Digital Libraries: From Keyphrase Extraction to Index Term Assignment". They conclude that keyphrase extraction provides better recall but lower precision, and they therefore propose a combined approach. Experiments were run mainly with German text already indexed by professional indexers, but for comparison, the method was also applied to English documents.
Two papers aim to answer questions of wider research interest by analyzing scientific publications. Francesco Osborne and Enrico Motta, of the Knowledge Media Institute at The Open University, investigate the changing professional interests of researchers, current trends in research topics, and networks of collaborating researchers in their article "Exploring Research Trends with Rexplore". Their results are supported by network visualizations. In "Multi-year Content Analysis of User Facility Related Publications", Robert M. Patton and his colleagues from Oak Ridge National Laboratory extend their research, described in a previous issue of D-Lib Magazine, on evaluating the impact of the sophisticated facilities their laboratory provides to the scientific community. Their analysis of publications from the last five years that refer to these facilities can help inform the lab's decisions on allocating facilities and help justify resource allocation.
We believe that the articles included in this special issue of D-Lib Magazine, which report on some of the latest development work in the area of mining scientific publications, will help to motivate further research in this important domain. We hope readers will find them informative and useful.
About the Guest Editors