
D-Lib Magazine

November/December 2014
Volume 20, Number 11/12


Guest Editorial

New Opportunities, Methods and Tools for Mining Scientific Publications

Petr Knoth, Drahomira Herrmannova, Lucas Anastasiou and Zdenek Zdrahal
Knowledge Media Institute, The Open University, UK

Kris Jack
Mendeley, Ltd., UK

Nuno Freire
The European Library, The Netherlands

Stelios Piperidis
Athena Research Center, Greece

Corresponding Author: Petr Knoth, p.knoth@open.ac.uk



Aggregating (Open Access) research outputs, developing infrastructures for text-mining, extracting entities and relations from research literature, analysing trends from large volumes of research papers and assessing research impact are only some of the key challenges our community deals with. Solving these tasks requires an array of tools that must continue to evolve to take advantage of the latest research and technical developments.

The articles in this issue of D-Lib Magazine were selected from papers submitted to the 3rd International Workshop on Mining Scientific Publications (WOSP 2014), organised by the Open University, Mendeley Ltd., The European Library and Athena Research Centre, and held in conjunction with the Digital Libraries 2014 conference (DL 2014) in London, UK. This year saw significant growth in the number of both submissions and workshop participants. Through a peer-review process, the Programme Committee selected 6 long, 5 short and 3 tool papers to be part of this D-Lib issue. The papers can be divided into five general topics: infrastructures (2 papers), semantic enrichment (6 papers), text-mining tools (3 papers), research impact (2 papers) and social & legal aspects (1 paper).

We believe the papers in this issue present a number of novel ideas, demonstrating the progress made in this domain. A significant proportion of the new approaches presented in this issue address a wide range of problems in extracting structured information, and even detailed semantics, from research papers. They start with efforts to recover valuable information that is lost during the publishing process, such as extracting tables (A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles), citations (Efficient Blocking Method for a Large Scale Citation Matching) and mathematical formulas (Extracting Textual Descriptions of Mathematical Expressions in Scientific Papers). The article titled GROTOAP2 — The methodology of Creating a Large Ground Truth Dataset of Scientific Articles discusses an approach to creating an annotated dataset of research papers, which is crucial for evaluating the continuous advancement of these types of methods. The enrichment methods also include efforts to automatically extract semantics from images (AMI-diagram: Mining Facts from Images), the induction of subject classification systems (A Keyquery-Based Classification System for CORE) as well as the detection and visualisation of interdisciplinary research relations (Discovering and Visualizing Interdisciplinary Content Classes in Scientific Publications) and the collaborative annotation of research papers (Annota: Towards Enriching Scientific Publications with Semantics and User Annotations).

In the past, these efforts were significantly limited by the ability of researchers to access and mine large quantities of papers copyrighted by a handful of the largest commercial publishers. With the Open Access movement now stronger than ever (thanks to recent funder, institutional and governmental policies, and to text-mining copyright exceptions), it is now possible to deliver tangible benefits to society, such as facilitating navigation across research papers (The Architecture and Datasets of Docear's Research Paper Recommender System). While some technical (The ContentMine Scraping Stack: Literature-scale Content Mining with Community-maintained Collections of Declarative Scrapers) as well as organisational challenges (Social, Political and Legal Aspects of Text and Data Mining) still have to be overcome, it is clearly now much easier for publishers to deliver the benefits of the work described here to researchers.

Consequently, as the data are now becoming more open to researchers than ever before, it can be expected that researchers will soon be able to join forces by openly sharing a significant set of extraction tools, tools which can be used to develop real-world applications. The paper titled Towards a Marketplace for the Scientific Community: Accessing Knowledge from the Computer Science Domain analyses the types of information that can be extracted from research literature and discusses how it can benefit different stakeholders.

Additionally, this wide open dissemination of research data also provides new opportunities for research evaluation. Such evaluation can be carried out at multiple levels, for example at the level of conferences (Experiments on Rating Conferences with CORE and DBLP), to the benefit of researchers and papers. The article Towards Semantometrics: A New Semantic Similarity Based Measure for Assessing a Research Publication's Contribution, which won the best paper award at WOSP 2014, argues for the need to make use of the availability of full texts to move beyond Bibliometrics, Altmetrics and Webometrics. These metrics are based on a false premise that the quality of research can be measured as the number of interactions in the scholarly network, be it citations, tweets, clicks or downloads. The paper presents a new method for assessing research contribution based on the assumption that the contribution of a publication is characterised by the semantic shift a publication encourages, and provides a formula to calculate it.

We believe that the articles included in this special issue of D-Lib Magazine will help to motivate further research in this important domain. We hope readers will enjoy reading them and will find them useful.


About the Guest Editors


Petr Knoth is a Research Fellow at the Knowledge Media Institute, Open University. He is interested in topics in Natural Language Processing, Information Retrieval and Digital Libraries. He is an Open Access and Open Science enthusiast — believing in free access to knowledge for everybody and in the development of more effective means of exploiting this knowledge. He acknowledges the necessity of migrating towards better research practices and criticises narrow-minded methods for evaluating research excellence. Petr is the founder of the CORE system for aggregating and mining Open Access content and has led the CORE family of projects (CORE, ServiceCORE, DiggiCORE, UK Aggregation). He was also involved in a number of European Commission funded (Europeana Cloud, KiWi, Eurogene, Tech-IT-Easy, Decipher, FOSTER) as well as UK national (RETAIN, OARR) projects.


Kris Jack is the Chief Data Scientist at Mendeley and is responsible for the development of their data science technologies. He has over ten years of experience in both academia (PhD in AI, University of Dundee; Research Associate, NaCTeM, UK) and industry (Expert R&D Engineer in Orange Labs and the CEA) solving complex large-scale data problems.


Nuno Freire holds a PhD in Informatics and Computer Engineering and is currently the Chief Data Officer at The European Library. Throughout his career he has been involved in data-oriented projects in the area of digital libraries. His areas of interest include information systems, information retrieval, information extraction, data quality, and knowledge representation, particularly in their application to digital libraries and bibliographic data. In his recent work at The European Library he has overseen the governance and utilization of the aggregated data from European national bibliographies and special library collections via data integration, processing, analysis, and data mining.


Stelios Piperidis is a senior researcher and Head of the Natural Language and Knowledge Extraction Department at the Institute for Language and Speech Processing (ILSP), "Athena" Research Centre. He is the national scientific coordinator for the CLARIN Research Infrastructure in Greece, a member of the FLaReNet Steering Committee and the META-NET Executive Board, and supervisor of the META-SHARE infrastructure. From 2008 to 2012 he served as the President of the European Language Resources Association. He is a lecturer for Language Technology and Logic at the postgraduate programmes of the University of Athens and the National Technical University of Athens. He holds degrees in Electrical Engineering and Computer Science from the National Technical University of Athens and Imperial College of Science, Technology and Medicine, University of London. His research interests include statistical and deductive methods in natural language processing and understanding, language resources and automatic linguistic knowledge elicitation, machine translation and philosophy of language.


Drahomira Herrmannova is a Research Student at the Knowledge Media Institute, Open University, working under the supervision of Professor Zdenek Zdrahal and Mr Petr Knoth. Her research interests include bibliometrics, citation analysis, research evaluation and natural language processing. She completed her BS and MS degrees in Computer Science at Brno University of Technology, Czech Republic. Aside from her PhD work, she participated in research projects at the Knowledge Media Institute (CORE, OU Analyse).


Lucas Anastasiou is a Research Assistant at the Knowledge Media Institute, Open University. His work involves developing software to harvest open access academic publications stored in Open Access repositories or Open Access journals, and using text mining techniques to process the collected information and discover unique features across publications (e.g. citation and impact factor analysis, finding similar documents, identifying research areas). He holds degrees in Electrical and Computer Engineering from the National Technical University of Athens (NTUA) and Information Security from University College London (UCL). He has participated in research projects funded by the European Commission (Stellar, Europeana Cloud) and the UK (Edukapp, Crunch, DiggiCORE).


Zdenek Zdrahal is a Senior Research Fellow at the Knowledge Media Institute of the Open University and Associate Professor at the Faculty of Electrical Engineering, Czech Technical University. He has been a project leader and principal investigator in a number of research projects in the UK, Czech Republic, and Mexico. His research interests include knowledge modelling and management, reasoning, KBS in engineering design, and Web technology. He is an Associate Editor of IEEE Transactions on Systems, Man and Cybernetics.
