D-Lib Magazine
June 2000

Volume 6 Number 6

ISSN 1082-9873

Lessons Learned

Digitization of Special Collections at The University of Iowa Libraries

Carol Ann Hughes
Director, Collections Management
Questia Media, Inc.
Formerly, Interim Director, Information and Research Services
The University of Iowa Libraries
[email protected]

Introduction

In 1998, The University of Iowa Libraries was the recipient of a grant from The Library of Congress/Ameritech National Digital Library Competition in support of a project to digitize a portion of its unique Redpath Chautauqua Collection. The competition, which was funded for three years by a $2-million partnership between The Library of Congress (LC) and the Ameritech Corporation, made awards to over 30 institutions with the aim of supporting the digitization of a range of materials of significance to the social and cultural history of the United States.

One of the goals of the competition was to encourage U.S. libraries, archives, museums, and historical societies to assist in the development of the American Memory electronic collections. In making the awards, however, the LC/Ameritech program also gave many institutions that would not otherwise have had sufficient resources the opportunity to explore, by "getting their hands dirty," the sometimes confounding issues involved in planning and carrying out a complex digitization project.

The Redpath Chautauqua Collection

The collection chosen for the award is a series of flyers promoting performers who could be booked for performances across North America through the Redpath Bureau, Chicago, the foremost Chautauqua booking agency in the U.S. The circuit Chautauqua Movement represents an early embodiment of the American drive for self-realization and self-improvement. Programs included discussions on political, scientific, and moral topics, as well as entertainment for all ages drawn from both high and popular culture. Thus, the project was named "Traveling Culture: Circuit Chautauqua in the Twentieth Century."

Comprising some 648 linear feet of materials dating between 1890 and 1940, the Redpath Collection is considered to be the most extensive holding of circuit Chautauqua materials in existence. The talent portion is the largest series (about 139 linear feet) in the collection. It consists of nearly 8,000 flyers, averaging four pages in length, promoting the talents of over 4,500 performers. As a record of the "business of culture" at the turn of the century, it is a natural resource to select for presentation to the widest possible audience via the web.

The spirited language and vivid artwork used in the promotional flyers reflect the emotions and ideals of the Chautauqua movement. Presentation of the flyers in digital form makes the possibly dry topic of U.S. cultural history come alive to the casual and serious researcher alike. For instance, young students can discover that the public's fascination with pet tricks did not begin with David Letterman. Rather, they would be able to see that Pamahasika's 50 Highly Educated Pets were a popular source of entertainment in the early part of the century. The general public of today would be just as curious as their great-grandparents about tales of hardship and misfortune, such as the experiences of Mrs. Florence E. Maybrick. Unjustly convicted of poisoning her husband, she was sentenced by a judge who went insane shortly thereafter. Yet she served 14 years of her punishment and then went on the circuit to lecture on prison life and the need for judicial and prison reform.

Initial Plan for Workflow

Because the text and the graphic layout of the flyers are highly valuable for the study of American popular culture, the digitization strategy had to accommodate both. As with most advertising material, there is no standard presentation, design, or typeface, although it is rare that any flyer is more than four pages in length. The physical state of the flyers was good, but the interfiled correspondence was in such a state that a preservation effort (funded internally) simultaneous with the digitization effort was essential. The project therefore had four distinct workflows that had to be managed separately within the grant's timeframe (18 months), with an eye toward re-assembling all the pieces in both physical and digital form in the final phase. The workflows were preservation photocopying, keying/encoding the textual information on the flyers for full-text searching, cataloging the individual flyers, and imaging the flyers. The workflow for preservation photocopying will not be addressed here, but it was the topic of significant attention for much of the grant period.

The flyers are irreplaceable, so no consideration was given to outsourcing the imaging process. Nor is Iowa City near enough to a major urban area to allow a vendor to come on-site to perform outsourced tasks. Therefore, the budget and workflow plan had to accommodate library staff for the imaging workflow, which meant that full-time staff would be needed to train, supervise, and provide quality control for the student workers who would do the bulk of the work. There would be no possibility of also funding full-time staff to supervise students who might be digitizing and SGML-encoding the text, especially considering the timeframe of the grant. The task of keying and "light" SGML encoding, based on a TEI Lite-compatible DTD, was therefore outsourced to a vendor through a bid process managed by the university's purchasing department.

Over the years, researchers and staff have not been uniformly attentive to the need to maintain the materials in the original order. It was anticipated that we would need to update our description of the collection's size and shape as part of the grant development process. The number of flyers in the collection and the number of unique performers were estimated by a random sampling of one folder in each of the 256 boxes of folders comprising the series. This process indicated that there were 9,600 flyers representing 7,600 different performers. However, as we launched into the workflow we discovered that there were fewer unique performers (about 4,500), represented by about 8,000 unique flyers. Lesson: Study carefully the dimensions of the project you are undertaking and then add in sufficient resources to accommodate the unexpected, especially if you are working in a special collection of substantial size and variety.

Text Keying/Encoding

As mentioned above, the actual number of flyers to be handled turned out to be lower than the initial sample-based estimate. Regrettably, the keying/SGML encoding estimates erred in the opposite direction.

The Library of Congress encourages SGML encoding of textual materials in the American Memory project in support of full-text searching and to serve as the basis for more sophisticated search and discovery strategies that might be available in the future. Although the talent flyers are highly graphical and the nature of a promotional flyer might not seem to warrant full text searching, the quality of the text and the richness it provides as context to the images led the project team to concur with a plan for encoding the flyers. However, the team also decided that deep encoding was not necessary and only nineteen elements were identified for the vendor's tagging work. The elements selected were primarily structural, like <figure> and <cell>. Other elements focused on the types of text that might be likely to provide special insight into contemporary culture and style, like citations and quotes from verse. It was reasoned that if use of the collection warranted more encoding, it could be inserted at a later date when more money was available for such activities.

Initial estimates of 28 million characters were based on counting the number of characters (including spaces) in a relatively small sample of flyers. Unfortunately, the sample proved to be skewed and the number of characters required for keying and encoding was underestimated by about 100 percent, even considering the fact that there were fewer flyers than was first assumed. Part of the underestimation was due to inexperience in estimating the number of keystrokes that would be involved in inserting the SGML tagging. This miscalculation was compounded by the discovery that quite a few flyers had substantial textual material. Thus, the team severely underestimated the number of characters to be keyed, which led in turn to an underestimation of the budget required to support this dimension of the project. Lesson: Take the time to do a substantial sampling of the textual materials to ensure that you have a solid sense of the amount of text to be keyed, and then add at least 25% for tagging keystrokes.
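The arithmetic behind this lesson can be sketched in a few lines of Python. This is only an illustration: the function name, the made-up sample counts, and the way the 25% tagging allowance is applied are assumptions, not the project's actual estimating procedure.

```python
def estimate_keying_characters(sample_char_counts, total_flyers,
                               tagging_allowance=0.25):
    """Extrapolate the total characters to key from a sample of flyers.

    sample_char_counts: characters (including spaces) counted on each
        sampled flyer.
    total_flyers: estimated number of flyers in the collection.
    tagging_allowance: extra fraction added for SGML tag keystrokes
        (0.25 follows the "at least 25%" rule of thumb in the lesson).
    """
    if not sample_char_counts:
        raise ValueError("sample must contain at least one flyer")
    mean_chars = sum(sample_char_counts) / len(sample_char_counts)
    text_chars = mean_chars * total_flyers
    return round(text_chars * (1 + tagging_allowance))


# Hypothetical sample: character counts from five flyers (invented numbers).
sample = [2500, 3100, 4200, 2900, 3800]
estimate = estimate_keying_characters(sample, total_flyers=8000)
print(estimate)
```

A small or unrepresentative sample shifts the mean directly, so text-heavy flyers that are missing from the sample translate one-for-one into an underestimated keying budget.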

Fortunately, the commitment of the University Libraries administrative team to this project was strong, and the commitment of the vendor to the project was equally deep. Negotiation of both the fee and the logistics (with both sides accommodating some of the extra work) made it possible to finish the keying. Lesson: Find a vendor who considers the venture to be a partnership and with whom you can communicate frequently. And prepare the library administration for the unexpected throughout the project.

One other unanticipated wrinkle in the workflow occurred during the keying process. The photocopies supplied to the vendor for keying and encoding were often difficult to decipher. The flyers make great use of shading for background effect, and the text is artistically displayed in a variety of fonts around and across multiple images. This is one reason that OCR was not considered for the keying process; another is the poor quality of affordable OCR software, which necessitates a great deal of proofreading and correction.

Photocopies of text printed in black font on a gray or colored background were in many instances illegible to the keying staff. It is especially important to have completely legible photocopies for keying staff who may not be proficient enough in the language of the text to decipher a poor reproduction. Lesson: The quality control procedures were sufficient for the preservation photocopying, but we did not employ the same procedures for photocopies to be used by the keying vendor. Now the team needs to devise a method for filling in the <gap> and <illegible> tags.

Intellectual Access

Intellectual access to "Traveling Culture" is provided through a number of avenues. The finding aid for the collection has been encoded in both HTML and SGML according to the EAD DTD and it provides a box list of performer names. Each flyer is individually cataloged according to MARC compatible standards by carefully trained students. Bibliographic data are included in the TEI Header and will also reside in a database at the Library of Congress. The homepage for the project will provide both full text searching and access to flyers through a browse list of subject terms based on LC Subject Headings and the LC Thesaurus for Graphic Materials I & II. Local subject terms have also been applied when appropriate, especially to provide description of the context of the piece as well as its content.

The Redpath Collection has been cataloged at the collection level for quite a while, but cataloging at the flyer level was a requirement of the grant. Workstation-based packages for simple MARC cataloging did not suit our needs. We wanted to catalog each item once and then automatically export both a MARC file for LC and a TEI header with the bibliographic data inserted in the correct elements. So we developed a form-based entry system that enters data into a FileMaker Pro database, using Tango for the web interface.

After cataloging each flyer, staff have an additional task that provides the linchpin for the entire process: assigning the file naming structure to be used as the unique identifier for linking together the TEI headers with the text files and the page images of the flyer. The basic directories are based first upon file type: "sgm" for the text, "gif" for the thumbnail image, "jpg" for the JPEG images. The second level of the directory path indicates the performer, sequential number of unique flyer in the collection by that performer, and the individual page image. Therefore:

sdrc/traveling-culture/chau1/img/stagg/3/5.gif

indicates the Traveling Culture collection within the University Library's Scholarly Digital Resources Center site (SDRC), the talent flyer series (chau1), the image database, the performer named Alonso Stagg, the third unique talent flyer in the box, and the gif image of page 5 of that flyer.
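As a rough illustration of how such an identifier can be built and checked mechanically, the following Python sketch constructs and parses paths shaped like the example above. The class, field, and function names are hypothetical and are not taken from the project's own tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlyerPage:
    series: str        # e.g. "chau1", the talent flyer series
    file_type: str     # file-type directory, e.g. "img" in the example path
    performer: str     # performer name stem, e.g. "stagg"
    flyer_number: int  # sequential number of the flyer for that performer
    page: int          # page number within the flyer
    extension: str     # e.g. "gif"

    def path(self, root="sdrc/traveling-culture"):
        """Assemble the path that serves as the unique identifier."""
        return (f"{root}/{self.series}/{self.file_type}/{self.performer}/"
                f"{self.flyer_number}/{self.page}.{self.extension}")


def parse_flyer_path(path, root="sdrc/traveling-culture"):
    """Split a path into its components, rejecting malformed names."""
    prefix = root + "/"
    if not path.startswith(prefix):
        raise ValueError(f"path does not start with {prefix!r}")
    parts = path[len(prefix):].split("/")
    if len(parts) != 5:
        raise ValueError("expected series/type/performer/flyer/page.ext")
    series, file_type, performer, flyer_number, filename = parts
    page, dot, extension = filename.rpartition(".")
    if not dot or not page.isdigit() or not flyer_number.isdigit():
        raise ValueError("flyer and page must be numeric, with an extension")
    return FlyerPage(series, file_type, performer,
                     int(flyer_number), int(page), extension)


page = parse_flyer_path("sdrc/traveling-culture/chau1/img/stagg/3/5.gif")
print(page.performer, page.flyer_number, page.page)
```

Validating names this way at the moment they are assigned is one possible guard against the transcription errors that hand-built identifiers invite.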

Requiring that the file naming structure reflect a particular performer has both advantages and drawbacks. One advantage is that it reflects the traditional structure of the collection, which has always been based upon performer name. The naming procedure makes sense to human beings. This assists in the construction of the unique identifier for the files and may assist humans in quality control when linking the page image, text, and header files together.

However, as more flyers by the same performer are found or as more flyers by performers with similar name stems are cataloged, the system begins to be subject to transcription/construction errors. No change in the file naming process is planned, but these considerations will inform our choices in the next project. Lesson: Plan for as much extensibility in naming structures as possible, even when dealing with collections that are no longer growing in size. It may be the case that machine-assigned/random file names provide the most flexibility in the long run for large text collections, especially those that may expand unpredictably in coverage.

Imaging

Imaging has been accomplished in-house by student staff with equipment provided by the Office of the Vice-President for Research. To minimize training time and error rates, the project team has paid special attention to simplifying and automating procedures as much as possible.

The general procedure is as follows: the original flyers arrive from the catalogers with a photocopy of the title page indicating the unique identifier/file name stem for the flyer. Students scan the flyer pages creating 600dpi 32-bit color TIFF images (a time-consuming process) that are saved to Jaz disks. The Jaz disks are used to burn archival CDs. Each CD is labeled sequentially with a note indicating which performer's flyers/pages are included. The uncompressed color TIFF files are so large that each CD averages only 6 images.
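A back-of-the-envelope calculation shows why so few master images fit on a disc. The sketch below is illustrative only: the 7.5 x 10 inch page size and the 650 MB disc capacity are assumptions chosen for the example, not figures from the project.

```python
def uncompressed_image_bytes(width_in, height_in, dpi=600, bits_per_pixel=32):
    """Approximate size of an uncompressed scan, ignoring file headers."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed capacity of a 650 MB recordable CD.
CD_CAPACITY_BYTES = 650 * 1024 * 1024

# Hypothetical page size; actual flyer dimensions vary.
page_bytes = uncompressed_image_bytes(7.5, 10)
images_per_cd = CD_CAPACITY_BYTES // page_bytes

print(page_bytes, images_per_cd)
```

Under these assumed dimensions a single page image occupies 108,000,000 bytes and six fit on a disc, which is in line with the average reported above.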

An automated program creates a 300dpi 32-bit color file on the workstation desktop from the 600dpi TIFF image after it is saved to the Jaz disk. The 300dpi color file is again downsampled into one 300dpi 1-bit black/white PDF file, one 150dpi color JPEG, one 72dpi color GIF image and one 6dpi color GIF thumbnail. All images for a flyer are saved in one folder on the server. The 300dpi color file is then discarded.
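The set of derivatives can be summarized as a small table of target resolutions. The Python sketch below computes the pixel dimensions of each derivative from a 600dpi master; the labels, the function name, and the example master size are assumptions for illustration and do not reproduce the project's own scripts.

```python
# (label, dpi, color mode, file format); labels are hypothetical.
DERIVATIVES = [
    ("print", 300, "1-bit black/white", "pdf"),
    ("high-quality", 150, "color", "jpg"),
    ("display", 72, "color", "gif"),
    ("thumbnail", 6, "color", "gif"),
]


def derivative_sizes(master_width_px, master_height_px, master_dpi=600):
    """Return {label: (width_px, height_px)} for each derivative."""
    sizes = {}
    for label, dpi, _mode, _fmt in DERIVATIVES:
        sizes[label] = (
            round(master_width_px * dpi / master_dpi),
            round(master_height_px * dpi / master_dpi),
        )
    return sizes


# Example: a master of 4500 x 6000 pixels (an assumed size).
for label, size in derivative_sizes(4500, 6000).items():
    print(label, size)
```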

We originally tried to use DeBabelizer for the downsampling, but that software could not rename the files according to our naming scheme as they were downsampled. So we use a semi-automated procedure based on PhotoScripter. Quality control on the imaging initially identified a "waffle effect" in the GIFs that was caused by a script error, so we have had to repeat the downsampling process, but that is an automated batch process run from the 150dpi color JPEGs residing on the image server.

GIF images at 6dpi are used as thumbnail navigational aids in the search interface; 72dpi GIFs are the default display images. The JPEG image is an option that may be selected by the user who wishes to view or download a higher quality image. PDF images are provided for ease of printing.

Many talent flyers contain images that span two pages, or even three. The project team made a strategic decision that images that span pages should be presented to the public as coherent whole images. This requires that the image be "stitched together" from the individual page images after they are downsampled, a job that requires patience, time, and a keen artistic eye. This is a task for one highly skilled team member. Lesson: Consider carefully whether one can create automated routines to prevent the need for extensive training and lots of "hand-work." The budget impact of the need for special handling can be substantial.

Putting it All Together

The strategy for presenting the collection to the public was the subject of much discussion during the grant writing phase. Should "live" SGML be presented to the public? That would involve customized scripting to convert SGML to HTML for display purposes. This is possible but it requires staff expertise that the University Libraries does not ordinarily have available. After deep consideration, it was decided that the highly visual nature of the flyers would be equally well served by presenting only page images of the flyers to the public and using the encoded text as the basis for full-text searching behind the scenes. Lesson: This was the right decision! As the project progressed there were many other substantial issues to address. The time required to test and perfect the SGML encoding and to write scripts to translate text on the fly to HTML would have been over-burdensome and too costly.

General Lessons Learned: This project was accomplished largely through the efforts of student staff, especially undergraduate student staff. All of the imaging, and much of the cataloging, was done by students with hourly appointments. Although hiring, training, and scheduling student staff is time consuming, the quality of their work and their facility with the technology have been quite satisfactory. The final budget report for the grant indicates that we came within $20,000 of what we anticipated spending for the cost-sharing portion of the work. This would not have been possible without the use of hourly staff. We would not hesitate to employ students in the next digitization project.

Technical and work-schedule "hiccups" have occasionally delayed work, but the people and equipment have generally worked well. We could use more students, more equipment, and more hours in the day, but the experience gained in this project has been invaluable as a library-wide development and enrichment activity. Nothing replaces the opportunity to "get one's hands dirty."

In August 1999, the project staff gave a presentation to the Libraries about the project. Our colleagues were impressed with the complexity and scope of the work and were glad to learn more about exactly what steps were involved in the process. But what impressed us was their equal conviction that the Redpath Chautauqua Collection is indeed a treasure that should be available to the world. We share that conviction and look forward to offering "Traveling Culture" to people around the world.

URLs of Interest

Library of Congress American Memory: < http://memory.loc.gov >.
The University of Iowa Libraries Information Arcade: < http://www.lib.uiowa.edu/arcade/ >.
Traveling Culture: < http://sdrc.lib.uiowa.edu/traveling-culture/ >.

Copyright © 2000 Carol Ann Hughes


DOI: 10.1045/june2000-hughes