D-Lib Magazine

September/October 2013
Volume 19, Number 9/10
Guest Editorial

Scientific Publications: Gathering Data, Extracting Information, and Following Trends

Petr Knoth and Zdenek Zdrahal, Knowledge Media Institute, The Open University
Nuno Freire and Markus Muhr, The European Library, Europeana

Point of contact for this editorial: Petr Knoth, p.knoth@open.ac.uk

doi:10.1045/september2013-guest_editorial

Digital libraries that store scientific publications are increasingly important in research. They are used not only for the traditional tasks of finding and storing research outputs, but also as data sources for mass automated processing to discover new knowledge, explore new research trends and evaluate research excellence. With the rapid growth in the number of scientific publications being deposited in digital libraries, it is no longer sufficient to provide access to content for human readers alone. It is equally important to allow machines to analyze this information and, by doing so, enable new applications that facilitate the processes by which research is accomplished.

Developing infrastructures, aggregating research outputs, text-mining and analyzing large volumes of research papers, discovering and staying current with new research trends and understanding how to recognize excellence are only some of the hot topics the digital libraries community deals with. Solving these tasks requires an array of tools that must continue to evolve to take advantage of the latest research and technical developments.

The articles in this issue of D-Lib Magazine were selected from among papers presented at the 2nd International Workshop on Mining Scientific Publications, held in conjunction with JCDL 2013 in Indianapolis, Indiana, USA. The main topics of the workshop were infrastructures for gathering and analyzing large volumes of scientific publications, methods and tools for extracting information from research papers, and other areas related to the tracking of trends and evaluation of research excellence. The two research objectives described in the articles in this issue are (1) extracting selected types of metadata from scientific publications, and (2) analyzing scientific publications to shed light on the use and focus of those collections in a wider research context. Three of the articles focus on the first objective, and two focus on the second.

In their paper entitled "A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries", Muhammad Imran from Qatar Computing Research and his colleagues at the University of Trento in Italy propose a method for disambiguating authors' names, assigning names to the correct individual authors, and correctly mapping authors to publications. The multi-layer hierarchical clustering algorithm described in the paper makes use of a number of metadata attributes, among them co-authorship, authors' affiliation, and publication title. Additionally, users can intervene in the query to improve the quality of the result. Roman Kern from Graz University of Technology and Stefan Klampfl from Know-Center in Graz approach the problem of reference metadata extraction from a different angle in their paper "Extraction of References Using Layout and Formatting Information from Scientific Articles". Their approach extracts additional formatting information from PDF files, which is then used by the existing reference string parsing package ParsCit. Nicolai Erbs, Iryna Gurevych and Marc Rittberger, of the Technische Universität Darmstadt and the German Institute for Educational Research and Educational Information, address the task of extracting keyphrases and assigning index terms to scientific papers in their article "Bringing Order to Digital Libraries: From Keyphrase Extraction to Index Term Assignment". They conclude that keyphrase extraction provides better recall but lower precision, and therefore they propose a combined approach. Experiments were run mainly with German text already indexed by professional indexers, but for comparison, the method was also applied to English documents.

Two papers aim to answer questions of wider research interest by analyzing scientific publications. Francesco Osborne and Enrico Motta, of the Knowledge Media Institute at the Open University, investigate the changing professional interests of researchers, current trends in research topics and networks of collaborating researchers in their article "Exploring Research Trends with Rexplore". Their results are supported by network visualizations. In "Multi-year Content Analysis of User Facility Related Publications", Robert M. Patton and his colleagues from Oak Ridge National Laboratory extend their research, described in a previous issue of D-Lib Magazine, on evaluating the impact of the sophisticated facilities their laboratory provides to the scientific community. The analysis of publications that refer to their facilities over the last five years can help inform the lab's decisions on allocating facilities and help justify resource allocation.

We believe that the articles included in this special issue of D-Lib Magazine, which report on some of the latest development work in the area of mining scientific publications, will help to motivate further research in this important domain. We hope readers will find them informative and useful.

About the Guest Editors

Petr Knoth is a researcher in the Knowledge Media Institute, Open University, interested in topics in Natural Language Processing, Information Retrieval, Digital Libraries and the Semantic Web. He is a huge Open Access enthusiast, believing in free access to knowledge for everybody. He acknowledges the necessity of migrating towards better research practices and criticizes narrow-minded methods for evaluating research excellence. Petr is the founder of the CORE system for aggregating open access content and has worked as the lead architect, developer and manager for the CORE family of projects (CORE, ServiceCORE, DiggiCORE). He has also been involved in a number of European Commission funded projects (Europeana Cloud, KiWi, Eurogene, Tech-IT-Easy, Decipher) as well as UK national projects (RETAIN, OARR).

Zdenek Zdrahal is a Senior Research Fellow at the Knowledge Media Institute of the Open University and Associate Professor at the Faculty of Electrical Engineering, Czech Technical University. He has been a project leader and principal investigator in a number of research projects in the UK, Czech Republic, and Mexico. His research interests include knowledge modelling and management, reasoning, KBS in engineering design, and Web technology. He is an Associate Editor of IEEE Transactions on Systems, Man and Cybernetics.

Nuno Freire is a Senior Researcher at The European Library. He holds a PhD in Informatics and Computer Engineering from the Instituto Superior Técnico of the Technical University of Lisbon. Throughout his career he has been involved in research projects in the area of digital libraries. His areas of interest include information systems, information retrieval, information extraction, data quality, and knowledge representation, particularly in their application to digital libraries and bibliographic data.

Markus Muhr is technical manager at The European Library. Prior to joining The European Library team he worked at the Austrian Competence Center for Knowledge Management in the field of Machine Learning and Data Mining. He studied Telematics at Technical University Graz and has published and presented several papers at high-level conferences. His primary interests are in Information Retrieval, Data Mining and Digital Libraries. He was one of the core developers leading up to a complete redesign of the portal and the ingestion processes at The European Library and is now managing technical developments there. He has been involved in multiple European Commission funded projects, including Europeana Libraries, Europeana Newspaper and Europeana Cloud.