Science, Publishing, and Digital Libraries (Again)
by Laurence Lannom, Corporation for National Research Initiatives


Scientific Publications: Gathering Data, Extracting Information, and Following Trends
Guest Editorial by Petr Knoth and Zdenek Zdrahal, Knowledge Media Institute, The Open University; Nuno Freire and Markus Muhr, The European Library, Europeana



A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries
Article by Muhammad Imran, Qatar Computing Research Institute, Doha, Qatar; Syed Zeeshan Haider Gillani and Maurizio Marchese, University of Trento, Trento, Italy

Abstract: This paper addresses the problem of name disambiguation in the context of digital libraries that administer bibliographic citations. The problem occurs when multiple authors share a common name or when multiple name variations for an author appear in citation records. Name disambiguation is not a trivial task, and most digital libraries do not provide an efficient way to accurately identify the citation records for an author. Furthermore, the lack of complete meta-data in digital libraries hinders the development of a generic algorithm that is applicable to any dataset. We propose a heuristic-based, unsupervised and adaptive method that also examines users' interactions in order to include users' feedback in the disambiguation process. Moreover, the method exploits important features associated with author and citation records, such as co-authors, affiliation, publication title, venue, etc., creating a multilayered hierarchical clustering algorithm which transforms itself according to the available information, and forms clusters of unambiguous records. Our experiments on a set of researchers' names considered to be highly ambiguous produced high precision and recall results, and decisively affirmed the viability of our algorithm.

Extraction of References Using Layout and Formatting Information from Scientific Articles
Article by Roman Kern, Knowledge Technologies Institute, Graz University of Technology, Graz, Austria; and Stefan Klampfl, Know-Center, Graz, Austria

Abstract: The automatic extraction of reference meta-data is an important requirement for the efficient management of collections of scientific literature. An existing powerful state-of-the-art system for extracting references from a scientific article is ParsCit; however, it requires the input document to be converted into plain text, thereby ignoring most of the formatting and layout information. In this paper, we quantify the contribution of this additional information to the reference extraction performance through improved preprocessing that uses the information contained in PDF files and by retraining sequence classifiers on an enhanced feature set. We found that the detection of columns, reading order, and decorations, as well as the inclusion of layout information, improves the retrieval of reference strings, and that the classification of reference token types can be improved using additional font information. These results emphasize the importance of layout and formatting information for the extraction of meta-data from scientific articles.

Bringing Order to Digital Libraries: From Keyphrase Extraction to Index Term Assignment
Article by Nicolai Erbs, Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt; Iryna Gurevych, Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt and German Institute for Educational Research and Educational Information; and Marc Rittberger, German Institute for Educational Research and Educational Information

Abstract: Collections of topically related documents held by digital libraries are valuable resources for users; however, as collections grow, it becomes more difficult to search them for specific information. Structure needs to be introduced to facilitate searching. Assigning index terms is helpful, but it is a tedious task even for professional indexers, requiring knowledge about the collection in general, and the document in particular. Automatic index term assignment (ITA) is considered to be a great improvement. In this paper we present a hybrid approach to index term assignment, using a combination of keyphrase extraction and multi-label classification. Keyphrase extraction efficiently assigns infrequently used index terms, while multi-label classification assigns frequently used index terms. We compare results to other state-of-the-art approaches for related tasks. The assigned index terms allow for a clustering of the document collection. Using hybrid and individual approaches, we evaluate a dataset consisting of German educational documents that was created by professional indexers, and is the first one with German data that allows estimating performance of ITA on languages other than English.

Exploring Research Trends with Rexplore
Article by Francesco Osborne and Enrico Motta, Knowledge Media Institute, The Open University

Abstract: Current systems for exploring scholarly data exhibit a number of shortcomings in their ability to facilitate the identification of research trends and identify 'interesting' connections between researchers. To address these issues we have developed Rexplore, a novel system which combines statistics, human-computer interaction, and semantic technologies, to support knowledge-based exploration and visualization of scholarly data. In this paper we focus on the functionalities provided by Rexplore for visualizing research trends and we use as an example research in "Social Networks", which has experienced dramatic growth in the years 2000-2010.

Multi-year Content Analysis of User Facility Related Publications
Article by Robert M. Patton, Christopher G. Stahl, Jayson B. Hines, Thomas E. Potok, Jack C. Wells, Oak Ridge National Laboratory

Abstract: Scientific user facilities provide resources and support that enable scientists to conduct experiments or simulations pertinent to their respective research. Consequently, it is critical to have an informed understanding of the impact and contributions that these facilities have on scientific discoveries. Leveraging insight into scientific publications that acknowledge the use of these facilities enables more informed decisions by facility management and sponsors in regard to policy, resource allocation, and influencing the direction of science, as well as a more effective understanding of the impact of a scientific user facility. This work discusses preliminary results of mining scientific publications that utilized resources at the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL). These results show promise in identifying and leveraging multi-year trends and providing a higher resolution view of the impact that a scientific user facility may have on scientific discoveries.


C O N F E R E N C E  R E P O R T

2013 Open Repositories Conference Highlights: Repository Island in Sea of Research Data
Conference Report by Carol Minton Morris, DuraSpace

Abstract: The Eighth International Conference on Open Repositories 2013 was held July 8 - 12, 2013 on Prince Edward Island, Canada. The annual conference offers attendees an opportunity to learn about new ways to access information, innovative repository tools, and emerging community initiatives. More than 300 attendees came to OR2013 to meet with colleagues, keep up with fast-paced development goals, and hear expert speakers who are attuned to current repository issues.


F E A T U R E D   D I G I T A L


Digital Public Library of America


Still from a home movie of a baseball game between African American employees of the Pebble Hill Plantation and another neighboring plantation, Thomas County, Georgia, circa 1919.
[Pebble Hill Plantation Film Collection, Walter J. Brown Media Archives and Peabody Awards Collection, University of Georgia Libraries.]

Benjamin Sewall Blake jumping, ca. 1888. Francis Blake, photographer.
[From the Massachusetts Historical Society, part of the Digital Commonwealth. Used with permission.]


The Digital Public Library of America (DPLA) offers a single point of access to over 4.5 million digital items — photographs, manuscripts, books, sounds, moving images, and more — from libraries, archives, historical societies, and museums around the United States. (The collection is scheduled to expand in the fall to over 5 million.) Highlights from the current DPLA collection include portraits and daguerreotypes of former US presidents; news film clips of the Freedom Riders during the Civil Rights movement; the Book of Hours, an illuminated manuscript from 1514; Notes on the State of Virginia, written by Thomas Jefferson; paintings by Winslow Homer; and nearly 400,000 historic photographs from the Mountain West, South, Midwest, and Northeast dating back to the earliest days of photography. The most recent additions to the collection include WPA household census records mapped to street level from the University of Southern California, and over a million and a half books, serials, atlases, government documents, and more from HathiTrust's public domain content.

At the core of the DPLA is its data store of metadata records from over 650 institutions around the US. Each DPLA provider, or partner, agrees to share its metadata openly, and to express this openness with a Creative Commons Zero ("CC0") dedication. These partners make up the DPLA's Digital Hubs Program, a national pilot to construct a network out of the more than 40 state or regional digital collaboratives, numerous large content repositories, and other promising digital initiatives currently in operation throughout the US. Current hubs include the Smithsonian Institution, the National Archives and Records Administration, the David Rumsey Map Collection, ARTstor, the Mountain West Digital Library, the Minnesota Digital Library, the Digital Library of Georgia, and many more.

DPLA Stats 2013
Partners: Approx. 650
Number of Items: Over 4.5 million
    Photos/Visual Materials: 1 million
    Books & Serials: Over 750,000
    Maps: Approx. 65,000
    Obituaries: 50,000
    Videos: 10,000
    Oral Histories: 5,000
    Newspaper Titles: Hundreds

