George V. Landon
Large-scale book scanning projects are giving users unprecedented access to vast collections of library holdings. However, these efforts have focused on typical bound books, which leaves many other forms of documentation out of the digitization framework. We present techniques that we have developed to digitize several of these other forms, including deteriorated manuscripts and photographs. These technologies are engineered within the budgetary and physical constraints often placed on digitization projects.
Recent large-scale document digitization initiatives have created new modes of access to modern library collections with the development of new hardware and software technologies. Most commonly, these digitization projects focus on accurately scanning bound texts, some reaching an efficiency of more than one million volumes per year. While vast digital collections are changing the way users access texts, current scanning paradigms cannot handle many non-standard materials. Documentation forms such as manuscripts, scrolls, codices, deteriorated film, epigraphy, and rock art all contain a wealth of human knowledge in physical forms that are not accessible by standard book scanning technologies. This great omission motivates our development of new technology for digitizing deteriorated bound works, damaged manuscripts, and disintegrating photographic negatives that remains cost-effective and easily utilized by non-technical staff.
After a brief overview of the current progress that large-scale digitization initiatives are making, we will address particular issues that arise when scanning damaged and/or deteriorating works. Furthermore, we demonstrate new methods we have developed for digitizing and virtually restoring two disparate documentation forms: manuscripts and photographic negatives.
Large-Scale Digitization Initiatives
The quest to amass all of the printed works in the world is nothing new. The famous Royal Library of Alexandria, founded around 300 BC, began with the goal of acquiring all of the known world's knowledge (Rollin, 1815). With an estimated 400,000 to 700,000 volumes in the collection before the library's destruction, this goal seemed attainable.
According to the Online Computer Library Center (OCLC) WorldCat bibliographic database, there are 32 million records describing printed books in American libraries (Lavoie, Connaway, & Dempsey, 2005). This translates into at least 32 million unique books that must be digitized in the United States alone.
From a global perspective, attempts have been made to estimate the number of books printed in the world. Through methodical research and a fair number of extrapolations, it is thought that the world inventory of original books might be between 74 million and 175 million books (Lyman & Varian, 2003). However, this approximation does not include scrolls, manuscripts, and other documentation forms.
For the millions of books currently on library shelves, recent projects have begun the daunting task of providing access to these volumes through new digitization and delivery methods. These projects have successfully started a trend toward providing the world easy access to all printed materials. However, the world's digitization efforts should not focus solely on the textual information of written or typed volumes. To fully understand what is lacking, one must first recognize the accomplishments that have already been made.
Google Book Search
In 2004, Google Inc. began Google Book Search, an effort to digitize and provide online access to virtually all of the world's books. The project claims to be able to digitize one million books per year (Google, Inc., 2007) with approximately 7 million books digitized to date (Rich, 2009).
The goal of Google Book Search is to make the textual content of books searchable. This requires a great deal of image processing to accurately segment text from the background or illustrations. However, in converting these document images into easily segmented text, much of the contextual detail, such as graphics and illustrations, is greatly compromised.
While the omission of this content is understandable in practice, it does limit the overall presentation of a volume. Moreover, it should be noted that Google's digitization technology is not without restrictions. From the Google Book Search Policy (Google, Inc., 2007):
Google developed innovative technology to scan the content without harming the books. Any book deemed too fragile will not be scanned by Google, but may be treated by expert library staff. Once scanned, all print volumes are returned to the library collections.
Unfortunately, it appears that Google's digital library will not include many fragile works, which includes older volumes that have deteriorated with age.
Open Content Alliance
Started as a venture between the Internet Archive and Yahoo!, Inc. in early 2005, the Open Content Alliance is also working to make digital collections available. Focusing on improving the availability of content, the OCA has attracted contributors who are willing to donate their already digitized collections. Therefore, much of the content of the collections is determined by the contributors, as is the digitization technology employed.
Europeana
The European digital library, Europeana, was started in July 2007. While the initial digital collection contains roughly 2 million texts, images, audio files, and movies, the number is projected to reach 6 million items by 2010. The goal of Europeana is to amass all of Europe's documented cultural heritage by partnering with the content holders. Consequently, this initiative relies mainly on partner organizations to provide the digitized documents and metadata.
Digitization is not a One-Size-Fits-All Process
As the Google Book Search Project suggests, some texts are "deemed too fragile" to be handled and scanned by technicians, but "may be treated by expert library staff" (Google, Inc., 2007). To remove this extra distinction between documents, it would be more appropriate to develop systems accessible to that expert library staff. It is in the best interest of all parties involved in digitization projects to leave the handling of priceless works to those charged with their care. Therefore, digitization is a task that should be available to librarians and curators.
Expanding our notion of which documents need digitization and preservation also changes our perception of the content of standard textual documents. Projects that focus mainly on document text lose much of the meaning and contextual information. The importance of this context was addressed in the Library of Congress Manuscript Digitization Demonstration Project: Final Report (Library of Congress, 1998):
The committee discussed the degree to which an image need replicate the look and feel of an original manuscript (or other archival) artifact. Although most members agreed that Library of Congress treasures, e.g., drafts of the Gettysburg address, warranted a kind of museum-quality facsimile, there was less consensus that such treatment was warranted for routine documents, especially the kind of twentieth-century typescripts that form the greatest portion of the Federal Theater Project collection.
This argument resonates across the digitization of all historical documents. Document digitization therefore cannot be viewed as a standard procedure that excludes many document types and compromises which content is preserved. This drives our research to accurately capture, and in many cases restore, documentation in all its multifarious forms. A discussion of this research follows.
Effects of Digitization Error
To engage in the discussion of document digitization, one must first understand the main problem that arises in digitizing collections: due to deterioration and damage, it is impossible to place many older texts in a flatbed or book scanner. New technology must therefore focus on scanning documents that are not compatible with normal digitization equipment. These degradations not only defeat standard digitization technology but also greatly reduce the quality of the acquired image.
There are two main sources of error commonly encountered during damaged document digitization: photometric and geometric. We will now discuss recent developments to provide corrections for these types of digitization errors.
Photometric Correction of Documents
Photometry measures the intensity of light as perceived by the human eye. The term photometric correction, within the scope of document digitization, applies directly to the perceived color of a document. A photometrically correct document image is therefore one whose intensities, or colors, across the entire surface exactly replicate the physical document. However, when a document has become warped from deterioration or damage, creating this photometrically correct image requires uniform illumination across the document, an impractical requirement to say the least.
Current conventions for archival imaging of non-planar documents require expertise that is beyond the reach of many institutions. Consequently, many digitization efforts are unable to produce accurate recordings of deteriorated texts. For example, the Rossano Gospels, shown above, contain high-frequency wrinkles that greatly reduce the quality of the text and illuminations.
An Automatic Correction System for Damaged Documents
To address the photometric error often introduced in document scanning, without reducing the contextual information, we have developed an automated system to scan and correct these documents with accurate color information. Previous techniques have generated very accurate results by limiting documents to a certain format. For example, the photometric error introduced when scanning bound books, which causes much darker areas to appear near the binding, can be automatically removed (Wada & Matsuyama, 1992). Other works assume a known border around the document and interpolate the shading through the text to correct the error (Brown & Tsoi, 2006). To provide a photometric correction technique that works in the general case, we introduce a simple light-calibration method (Landon, Lin, & Seales, 2007). This approach bypasses the need for professional photographers and expensive lighting equipment.
Our work produces a method that corrects lighting and shading in document images without regard for particularities in the document content. Any type of book or manuscript can therefore be restored with consistent results, whether the document contains text, illustrations, or photographs. The process uses simple lighting equipment and a novel paper light probe to remove the adverse imaging effects caused by inadequate illumination.
The paper light probe is a simple piece of white letter paper with uniform folds, as shown at the bottom of the left figure. We introduce this form of light probe because of its ease of construction and readily available material. The 3D shape information of the probe provides enough data to calculate the 3D position and color of the light source illuminating the scene.
A 3D scanner is used to acquire the shape and color information of both the light probe and the document to be digitized. To restore the color information, the light position and color are calculated using the known surface normals, imaged color, and known color of the paper light probe to solve the general irradiance equation. Once the light source color and position are known, the intrinsic color of the document can be restored using the document's surface normals and imaged color.
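As a concrete sketch of this step: under a simple Lambertian assumption, the imaged intensity equals the albedo times the dot product of the surface normal and the light vector, so the probe's known white albedo and scanned normals let the light be solved in a least-squares sense, after which the shading can be divided out of the document image. The function names and array layout below are illustrative assumptions, not the actual implementation of (Landon, Lin, & Seales, 2007).

```python
import numpy as np

def estimate_light(probe_normals, probe_colors, probe_albedo=1.0):
    """Estimate the light from the paper light probe (Lambertian sketch).

    Assumes every probe point is lit, so imaged color = albedo * (n . L).
    probe_normals: (n, 3) unit normals; probe_colors: (n, 3) RGB samples.
    Returns a (3, 3) matrix with one scaled light vector per color channel.
    """
    L, *_ = np.linalg.lstsq(probe_normals, probe_colors / probe_albedo, rcond=None)
    return L

def restore_albedo(doc_normals, doc_colors, L, eps=1e-6):
    """Divide the estimated shading out to recover intrinsic document color."""
    shading = np.clip(doc_normals @ L, eps, None)  # avoid division by zero
    return doc_colors / shading
```

In practice the normals would come from the 3D scans of the probe and document, and the recovered colors would be clamped to the displayable range.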
The entire system uses an off-the-shelf 3D scanner, a single light source, and the paper light probe. The process is performed automatically with a total equipment cost of less than $5,000. Therefore, library staff can acquire digital scans of a warped document and easily restore its color information without intensive additional training or organizational resources.
When the document is corrected photometrically, the color distortions perceivable by the observer are diminished. However, for many document applications, the uncorrected geometric distortions still reduce content detection and legibility.
Geometric Correction of Documents
Many groups have focused on the acquisition and restoration of general document types, making a great deal of progress in creating fast and accurate digitization systems.
While others have relied on assumed document shapes to provide photometric and geometric corrections for objects such as bound books (Cao, Ding, & Liu, 2003) and folded documents (Brown & Tsoi, 2006), some works have focused mainly on geometric correction of arbitrarily distorted documents (Zhang, Zhang, & Tan, 2008) (Brown & Seales, 2004).
The Latin Bible, shown above, contains low-frequency warpage, which reduces legibility for some areas of the text.
The geometry of a document refers simply to its physical shape. Digitizing non-planar documents with standard methods introduces shearing and skewing distortions of the content. For modern texts, damaged volumes can be physically manipulated to reduce these geometric distortions. However, for conserved texts, this type of handling is not an option.
There have been two main approaches to correcting geometric distortions in document images.
First, image-based methods rely on a combination of assumptions about the content and context with known constraints on the document. These methods look either at the actual content of the document to determine where distortions occur or at the document's composition, such as distorted paper edges or bindings. A model is fit to these distortions and then inverted, which effectively reduces the distortions for many documents.
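As a minimal, hypothetical instance of this approach, suppose the curved top edge of a page has been detected. Fitting a low-order polynomial to the edge models the distortion, and shifting each image column by the modeled curve inverts it. The cubic model and column-shift inversion here are illustrative choices, not any one published method.

```python
import numpy as np

def dewarp_by_edge(img, edge_y, deg=3):
    """Straighten an image given a detected (curved) top edge.

    img: 2-D grayscale array; edge_y[x]: detected edge row in column x.
    A degree-`deg` polynomial is fit to the edge (the distortion model),
    then inverted by shifting each column so the edge maps to one row.
    """
    h, w = img.shape
    xs = np.arange(w)
    model = np.polyval(np.polyfit(xs, edge_y, deg), xs)
    shifts = np.round(model - model.min()).astype(int)
    out = np.full_like(img, 255)          # white background
    for x in range(w):
        s = shifts[x]
        out[:h - s, x] = img[s:, x]       # shift column upward by s rows
    return out
```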
Second are physically based simulations of document unwarping. Using the complete three-dimensional geometry of the document, the surface is modeled as a malleable material that deforms when interacting with external forces. Many examples exist where the geometry and composition of the document surface are complex enough to prohibit accurate three-dimensional and image-based acquisition. Moreover, many commercial 3D modeling applications now include these advanced simulations under the guise of cloth modeling. We have had particularly good results flattening a document with the cloth simulation tools in Autodesk 3ds Max.
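A toy stand-in for such a simulation, assuming a mesh of scanned 3-D vertices with known edges: drop the surface onto the plane, then iteratively relax spring forces until the 2-D edge lengths match the measured 3-D lengths. Real cloth solvers add bending resistance, collision handling, and better integrators; this sketch only illustrates the principle.

```python
import numpy as np

def relax_flatten(verts3d, edges, iters=500, step=0.1):
    """Flatten a 3-D surface mesh by simple spring relaxation.

    verts3d: (n, 3) vertex positions; edges: iterable of (i, j) pairs.
    Starts from the z-dropped projection and nudges vertices until the
    flattened edge lengths approach their original 3-D lengths.
    """
    rest = {(i, j): np.linalg.norm(verts3d[i] - verts3d[j]) for i, j in edges}
    pts = np.array(verts3d, dtype=float)[:, :2]  # initial guess: drop z
    for _ in range(iters):
        forces = np.zeros_like(pts)
        for (i, j), r in rest.items():
            d = pts[j] - pts[i]
            length = np.linalg.norm(d) + 1e-12
            f = (length - r) * d / length        # spring toward rest length
            forces[i] += f
            forces[j] -= f
        pts += step * forces
    return pts
```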
The combined application of the photometric and geometric restoration techniques corrects imaging errors for many documents. However, as technology stands today, there are a large number of documents excluded from this acquisition and restoration technique. For example, paper with fine folds and ridges is often difficult to accurately digitize with scanning equipment available to the practitioner.
A New Paradigm for Document Scanning
Many historical documents contain surface deteriorations, such as wrinkles and creases, whose fine details are undetectable by many 3D imaging devices. For these cases, we have developed an advanced model for each digitized pixel in the scanned image, which provides a technique to compensate for the lack of resolution in the document sampling.
This pixel-wise, image-based acquisition and restoration methodology exploits the transmissiveness of documents to obtain their intrinsic color and distorted shape (Landon, Seales, & Clarke, 2008). The digitization process uses a digital camera and a flat-panel display, both controlled by the same computer, to effectively scan the object.
The document is placed on top of the display, and light patterns illuminate the document from behind. As these patterns are displayed, the camera observes the illumination distortions introduced by the document, and an overall model is generated to compensate for these discrepancies. We assume that the scanned document transmits light diffusely, meaning that light leaves the surface of the document with uniform intensity at all angles.
An initial, one-time-calibration scan is performed to build a baseline dataset for the scanner. Then documents are placed on the laptop display, and the light patterns are repeated. It is the change in light transmission when the document is in place that generates the model for the document content.
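The heart of this change-in-transmission idea can be sketched in a few lines: dividing the scan by the one-time calibration scan cancels the display's illumination pattern and leaves the document's own attenuation, i.e., its content. This single-image version is an illustrative simplification of the per-pixel model built from a sequence of light patterns.

```python
import numpy as np

def transmissive_restore(baseline, scan, eps=1e-6):
    """Estimate intrinsic content from a backlit scan (diffuse assumption).

    baseline: image of the bare display (one-time calibration scan).
    scan: image with the document in place under the same light pattern.
    The per-pixel ratio is the document's transmittance: dark content
    attenuates more light, so it stays dark regardless of the backlight.
    """
    ratio = scan.astype(float) / np.maximum(baseline.astype(float), eps)
    return np.clip(ratio, 0.0, 1.0)
```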
The generated model for each pixel allows the photometrically correct content to be extracted directly. In addition, the model provides document shape information that can be used to restore the shape of a geometrically warped substrate with geometric restoration procedures.
Our image-based technique provides a direct way to generate restored images without knowing any additional information about the document or the scene set-up. Consequently, the scanning technology is fully accessible in terms of both required skill set and cost. The operator needs only to roughly align the computer display and camera, then place each document on the display and digitize it with a single mouse click.
The applications of this scanner go beyond standard manuscript scanning. Since the scanning technology works independently of the document type, more complex documents have also been validated for this digitization technique. In fact, even multi-layered documents that remain semi-transmissive, such as photographic negatives, can be digitized and restored with this same approach.
Moving Beyond Standard Document Archetypes
The earliest human documents, such as rock paintings and carvings, are records of heritage and beliefs, the same motivations that compel many modern authors to write. For example, petroglyphs were often used to record religious icons. See the example from the Taíno people below.
Even in early written recordings, stone continued to be the canvas of choice for many types of legal documentation. This is evident in the vast amount of epigraphic recordings created in the Greek and Roman empires. Massive pieces of stone were used to record information, ranging from historical chronologies to listing donors for building projects.
As new materials such as papyrus and parchment emerged, the creation of documents increased in speed and extent. These writings took roll, codex, and leaf forms. Even as the number of books grew, there were an estimated 30,000 books in Europe before the creation of the moveable-type printing press (Harry Ransom Center: The University of Texas at Austin).

Much like early rock art and paintings, original photographs provide another source of historical perspective. Commonly used for many recording purposes, early photographs were often commissioned as a documentation technique. The subjects of these recordings range from family portraits made for genealogical purposes to architectural images used to capture economic and locale information. From early photography to the present, these visual records have provided an irreplaceable glimpse into the past.
Thus far, document restoration research has focused on acquiring virtual copies of bound books or manuscript pages. Consequently, restoration techniques for other document types have been grossly underdeveloped. With the technologies described above, it is now possible to move to other forms of documentation. Two pieces of information provide nearly everything needed for almost all document types: the document's intrinsic color and its three-dimensional shape.
An obvious example of these limitations is deteriorated photographic negatives. Many photographic negatives dating from 1925 to 1955 have deteriorated in various layers, making the negatives unprintable (Horvath, 1987). This deterioration has affected many photographic archives, including the Library of Congress, the National Archives, and many state and university collections. The broader implications for these historic collections are immeasurable.
For an overview of photograph restoration techniques, the reader may refer to (Stanco, Ramponi, & de Polo, 2003). However, these methods generally focus on a scanned image of a photograph and usually only handle standard photographic prints. One project that does work directly with glass plate negatives (Stanco, Tenze, Ramponi, & de Polo, 2004) uses rigid transformations to assemble broken photographs.
The standard restoration technique for acetate negatives requires a chemical-based physical separation of layers that is permanent and destructive. While this technique recovers the image, it requires a vast amount of expert time and leaves the photograph forever altered. It should also be noted that the degradation of these negatives cannot be stopped, only slowed. Therefore, digitizing these historic photo-negatives for virtual restoration is a technology that must be researched and developed.
Present digitization procedures cannot be deployed for these multi-layer documents. Because of the great deal of non-uniform separation between layers, the final photographic image contains many channeled photometric distortions. However, our transmissive document scanner, already discussed, has been demonstrated to effectively digitize these non-standard materials. Since this scanner works in a fully image-based modality, it avoids many complications that arise when complex materials are stacked layer upon layer, such as accurate three-dimensional acquisition or feature detection. We show examples of negatives from the University of Kentucky Libraries Special Collections and Digital Programs.
As we show in the figure on the left below, damage from the so-called vinegar syndrome of acetate negatives greatly reduces the quality of the negative print. However, the results of our scanning and restoration method move much closer to providing the undistorted content of the original emulsion layer. By analyzing the difference in light attenuation between the empty scanner and the scanner with a negative in place, we are able to extract two unique forms of data in addition to the typical digitized print. First, the ratio of light transmitted through the empty medium to light transmitted through the negative gives an estimate of the intrinsic photographic content of the emulsion layer. Second, the shift in the perceived light source through the differing mediums gives an estimate of the surface angle. The figure on the right below shows all three distinct datasets acquired from a single negative scan using our scanner.
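The two datasets described here can be sketched as follows, under the diffuse-transmission assumption: the attenuation ratio (computed below as negative over empty, i.e. a transmittance) approximates the emulsion content, while a per-pixel shift in the dominant light pattern serves as a crude stand-in for the perceived-light-source shift that indicates surface angle. The function name and the argmax-based shift are illustrative simplifications, not the published algorithm.

```python
import numpy as np

def split_negative_scan(empty_scan, negative_scan, eps=1e-6):
    """Separate a transmissive negative scan into the two datasets above.

    empty_scan / negative_scan: (n_patterns, h, w) stacks captured with
    the same backlight patterns, without and with the negative in place.
    Returns (emulsion estimate, per-pixel pattern-shift slope proxy).
    """
    # 1) Intrinsic photographic content: per-pixel attenuation ratio.
    total_empty = np.maximum(empty_scan.sum(axis=0), eps)
    emulsion = np.clip(negative_scan.sum(axis=0) / total_empty, 0.0, 1.0)

    # 2) Crude surface-slope proxy: shift of the brightest pattern index.
    shift = negative_scan.argmax(axis=0) - empty_scan.argmax(axis=0)
    return emulsion, shift
```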
Once these datasets are acquired, we are left with data similar to those produced by the previously discussed method that uses an off-the-shelf 3D scanner. Therefore, we can apply the same geometric correction technique to remove shape distortions from the photometrically correct photographic content.
This work presents our research into new methods for document acquisition and restoration. These technologies will enable both large-scale and intra-institutional digitization initiatives to include deteriorating and damaged document types while retaining high-quality representations of the content. Documents that were once deemed too fragile or damaged to be digitized now have a path into digitization pipelines.
The University of Kentucky Libraries Special Collections and Digital Programs generously provided access to the deteriorating acetate negatives shown here.
Cao, H., Ding, X. Q., & Liu, C. (2003). A cylindrical surface model to rectify the bound document image. ICCV '03: Proceedings of the Ninth IEEE International Conference on Computer Vision, (pp. 228-233). <doi:10.1109/ICCV.2003.1238346>.
Google, Inc. (2007) CIC/Google Book Search Project: Frequently Asked Questions <http://www.news-releases.uiowa.edu/2007/june/060607google-faqs.pdf>.
Harry Ransom Center: The University of Texas at Austin. (n.d.). The Digital Gutenberg Project. Retrieved January 15, 2009, from <http://www.hrc.utexas.edu/educator/modules/gutenberg/books/legacy/>.
Horvath, D. G. (1987). The Acetate Negative Survey: Final Report. The University of Louisville, Ekstrom Library, Photographic Archives, Louisville, KY 40292, February 1987.
Landon, G. V., Seales, W. B., & Clarke, D. (2008). A new system to acquire and restore document shape and content. Proceedings of the 5th ACM/IEEE international Workshop on Projector Camera Systems (pp. 1-8). Marina del Rey: ACM. <doi:10.1145/1394622.1394638>.
Landon, G., Lin, Y., & Seales, W. (2007). Towards Automatic Photometric Correction of Casually Illuminated Documents. IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR '07., (pp. 1-8). <doi:10.1109/CVPR.2007.383254>.
Rich, M. (2009, January 5). Google Hopes to Open a Trove of Little-Seen Books. Retrieved January 15, 2009, from The New York Times: <http://www.nytimes.com/2009/01/05/technology/internet/05google.html>.
Rollin, C. (1815). The ancient history of the Egyptians, Carthaginians, Assyrians, Babylonians, Medes & Persians, Macedonians, and Grecians. Silas Andrus.
Stanco, F., Ramponi, G., & de Polo, A. (2003). Towards the automated restoration of old photographic prints: a survey. EUROCON 2003. Computer as a Tool., (pp. 370-37). <doi:10.1109/EURCON.2003.1248221>.
Stanco, F., Tenze, L., Ramponi, G., & de Polo, A. (2004). Virtual restoration of fragmented glass plate photographs. 12th IEEE Mediterranean Electrotechnical Conference, 2004. MELECON 2004, (pp. 243-246). <doi:10.1109/MELCON.2004.1346819>.
Wada, T., & Matsuyama, T. (1992). Shape from Shading on Textured Cylindrical Surface: Restoring Distorted Scanner Images of Unfolded Book Surfaces. IAPR Workshop on Machine Vision Applications, (pp. 591-594). Retrieved February 3, 2008 from <http://b2.cvl.iis.u-tokyo.ac.jp/mva/proceedings/CommemorativeDVD/1992/papers/1992591.pdf>.
Zhang, L., Zhang, Y., & Tan, C. (2008). An improved physically-based method for geometric restoration of distorted document images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30 (4), 728-734. <doi:10.1109/TPAMI.2007.70831>.
Copyright © 2009 George V. Landon