Report on the 2nd International Workshop on Historical Document Imaging and Processing (HIP'13)

D-Lib Magazine

March/April 2014
Volume 20, Number 3/4

George V. Landon
Eastern Kentucky University
george.landon@eku.edu

doi:10.1045/march2014-landon

Abstract

The second International Workshop on Historical Document Imaging and Processing (HIP'13) was held August 24, 2013 in Washington DC, USA, in conjunction with the International Conference on Document Analysis and Recognition (ICDAR 2013). The workshop, which brought together an international group of researchers who work with historical documents, was complementary and synergistic to the work in analysis and recognition featured in the main ICDAR sessions. Technical areas covered in the workshop included information extraction and retrieval; reconstruction and degradation; text and image recognition and segmentation; and layout analysis and databases. The researchers, many with computer and engineering backgrounds, shared their ongoing work in building tools and methods to handle the digitization of historical documents.

Introduction

Great strides in digitizing and indexing the world's physical documents have been made in recent years. In parallel, preservation of and access to born-digital materials are getting the academic focus necessary to handle coming centuries of collections. However, as digital repositories become the primary source for future scholars, we are in danger of limiting access to the pieces that can be digitized and indexed using currently available technology. This has motivated groups of scholars around the world to find new methods to digitize and index documents that are currently inaccessible. In 2011, researchers who had been working and presenting papers on novel methods to scan, index, and provide access to historical documents held the first workshop on the topic: the first International Workshop on Historical Document Imaging and Processing (HIP'11), held in conjunction with the International Conference on Document Analysis and Recognition (ICDAR 2011).

This past year, the second International Workshop on Historical Document Imaging and Processing (HIP'13) was held in conjunction with ICDAR 2013. At HIP'13 researchers from around the globe met in Washington DC in August 2013 to discuss their work toward building tools and methods to handle digitizing historical documents. The majority of the participants came from computer science and electrical engineering. As those in the digital libraries fields already understand, the open problems in digital libraries are also open problems in these other domains. For researchers coming from computer science and engineering, work with historical documents has the added benefit of making culturally significant information accessible.

Opportunities Abound

The more documents we digitize, the more we realize that our current software and methods are unable to cope with the vast array of historical documentation stored in the world's memory institutions. Archivists and librarians at institutions previously or currently involved in digitization projects can readily point to cases where current tools fail. This was highlighted the day before the workshop, when participants were invited to tour the National Archives in Washington, DC. The tour showcased numerous successful projects to digitize historical documents while also giving guests additional examples of where digitization is not currently possible.

Workshop attendees toured the National Archives and were given an up-close demonstration of specialized digitization equipment.

The workshop began the next day, and the session titles themselves provide a glimpse of the breadth of the issues affecting digitization attempts for historical documents. "Information Extraction and Retrieval", "Reconstruction and Degradation", "Text and Image Recognition", and "Segmentation, Layout Analysis and Databases" are all critical areas that need improvement when handling historical documents.

Workshop Technical Program

Information Extraction and Retrieval

Extracting information from historical documents for follow-on retrieval remains an area of active research. In the first session, new methods for word spotting, populating ontologies, and feature detection were presented. This session included the paper "Contextual Word Spotting in Historical Manuscripts using Markov Logic Networks" by David Fernández, Simone Marinai, Josep Llados, and Alicia Fornés, which won the International Association for Pattern Recognition (IAPR) best paper award for the workshop. (The full proceedings are available in the ACM Digital Library.)

Authors of "Contextual Word Spotting in Historical Manuscripts using Markov Logic Networks" receive the IAPR Best Paper Award.

Reconstruction and Degradation

We are currently seeing a bias in many digitization projects, which exclude fragile or damaged documents. This is certainly not because these documents lack importance, but because of the limits of the digitization technologies that currently exist. Researchers recognize these gaps in digitization technology and are working to develop new methods designed for specific classes of fragile historical documents. There are still many document types that are difficult or impossible to digitize. Novel methods to address digitizing deteriorated negatives, excessively curved pages from bound documents, warped pages, and broken wooden documents were all presented.

Text and Image Recognition

As more and more documents are digitized, performing recognition across very large collections is becoming a necessity. Papers in this session presented methods for improving recognition of handwritten text in historical documents, improving OCR, and even modeling and comparing art styles across Renaissance face portraits to determine unknown arts.

Segmentation, Layout Analysis and Databases

Large collections need additional metadata to aid in indexing and retrieval; however, this information is difficult to manually extract for most documents and this is especially true for historical documents. In the last session, two unique techniques were presented for automatic segmentation of text and drawings within digitized images of historical documents. Another technique moved to segment individual Japanese characters. All of these solutions are made more difficult by intrinsic characteristics of historical documents such as handwritten text or wood block printing. The variations in all of these documents were highlighted by the advanced metadata introduced and implemented in the European Union's IMPACT repository.

Global Participation

All aspects of historical document processing are certainly global issues. Memory institutions across the world are working to find ways to digitize historical documents or at least preserve them until digitization is possible. The global nature of this research is highlighted by the diverse attendance at HIP'13. There were 70 attendees, up from 58 attendees at HIP'11. Thirty-one papers were submitted covering all areas of historical document processing. Each paper had three reviewers, and 18 of the 31 papers were accepted, a 58% acceptance rate (an improvement in selectivity over the 71% acceptance rate at HIP'11).

Country	# of Attendees	Country	# of Attendees
Canada	3	Japan	6
China	2	Qatar	1
France	7	Russian Federation	2
Germany	5	Spain	3
Greece	2	Sweden	1
Ireland	1	Switzerland	7
Israel	4	United Kingdom	5
Italy	1	United States	15

HIP'13 Competition

FamilySearch International (FSI) hosted a workshop-affiliated competition to extract information from a large number of handwritten Mexican marriage records. Participants were asked to group a scrambled collection of these records by the contents of certain sub-regions of the document. These sub-regions contained geographic and chronological information. Competition participants were evaluated based on correct classification of these images relative to ground truth.
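As a rough illustration of this kind of evaluation, the sketch below scores a set of predicted group labels against ground truth. It is not the competition's actual scoring procedure, which this report does not specify: it assumes plain accuracy (the fraction of records whose predicted label matches the ground truth), and the function name, record identifiers, and labels are all hypothetical.

```python
def classification_accuracy(predicted, ground_truth):
    """Return the fraction of ground-truth records whose predicted label matches.

    Both arguments map a record identifier to a label. Records missing
    from `predicted` are counted as incorrect. Plain accuracy is an
    assumption here, not the documented competition metric.
    """
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = sum(
        1
        for record_id, label in ground_truth.items()
        if predicted.get(record_id) == label
    )
    return correct / len(ground_truth)


# Hypothetical data for illustration only: each label pairs a place with a year,
# standing in for the geographic and chronological sub-regions.
ground_truth = {
    "record_1": ("place_A", 1890),
    "record_2": ("place_B", 1902),
    "record_3": ("place_A", 1890),
    "record_4": ("place_C", 1911),
}
predicted = {
    "record_1": ("place_A", 1890),
    "record_2": ("place_B", 1902),
    "record_3": ("place_B", 1902),  # wrong group
    "record_4": ("place_C", 1911),
}

score = classification_accuracy(predicted, ground_truth)
print(f"accuracy: {score:.2f}")
```

Treating each (place, year) pair as a single label means a record counts as correct only when both parts match; scoring the two sub-regions separately would be an equally plausible design.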

The Next HIP Workshop

New unsolved problems seem to arise every day when dealing with digitizing historical documents. We expect to see many novel methods to handle these problems in future HIP sessions. There is also a strong interest in increasing participation by librarians and archivists. Currently, HIP'15 is being planned to occur with ICDAR 2015 in Tunis, Tunisia, September 26-30, 2015. Please consider submitting your research for presentation, or just attending to interact with other researchers working at the intersection of historical documents and digital libraries.

References

2nd International Workshop on Historical Document Imaging and Processing (HIP'13), Washington, DC, USA, August 24, 2013 (website).

ACM Digital Library. (2013). "Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing". ACM, New York.

About the Author

George V. Landon is an Associate Professor in Computer Science at Eastern Kentucky University. He received a Ph.D. in Computer Science from the University of Kentucky. His research focus is in computer vision and image processing with particular applications in the digital humanities. He is particularly interested in developing new methods to virtually restore documents and photographs.