Muriel Foulonneau, Thomas G. Habing, Timothy W. Cole
The practice of including thumbnails in short record displays, increasingly common in local implementations, is being adopted by metadata aggregation service providers as well. In addition, thumbnails and Web thumbshots have begun appearing as part of Web search results. This article reports on a project at the University of Illinois at Urbana-Champaign (UIUC) to make the heterogeneous resources available on the UIUC CIC metadata portal more comprehensible by incorporating thumbnails and thumbshots of image and Webpage resources in the context of the OAI Protocol for Metadata Harvesting. In addition to using thumbnails provided by partner data providers, UIUC has developed an automated process to generate thumbnails and thumbshots from the Webpage resources pointed to by the metadata records.
Metadata describing digital resources are most often created with local applications in mind. The heterogeneity and limited scope of such metadata create challenges for metadata aggregation services. Not all of these challenges are confined to the search process. A service provider will often have occasion to list results from multiple data providers' collections, whether in the context of an OAI-based aggregation (Open Archives Initiative Protocol for Metadata Harvesting), in some other cross-repository searching context, or even in the context of more traditional search engines. Users must obtain enough information from such displays to select the resources of interest. (The "SELECT" task alluded to here is described more fully in the FRBR model.) A service provider must present adequate information in a "snippet" or brief record display to allow a user to assess relevance of resources quickly and efficiently.
A study of how student teachers used an early OAI metadata aggregation providing access to digital cultural heritage resources suggested that users expect such portals to portray the promise and potential usefulness of resources in search result displays. These users reported considerable frustration, because too often links presented in result lists turned out not to be useful or relevant. Experience with Web search engines and image databases suggests that including a downgraded version of the resource itself on search result displays may help users assess the relevance of resources. Michelle Chang et al. note the common practice of displaying thumbnails in results from image database searches. Similarly, a user study on the Charles W. Cushman image collection at Indiana University reported that, "All participants felt that the default display results format for the brief view should be thumbnails and captions sorted in descending chronological order".
The practice of including thumbnails in short record displays, increasingly common in local implementations, is now being adopted by metadata aggregation service providers as well. Picture Australia, an OAI-based portal for images of Australiana, uses snippets that consist solely of title and thumbnail. The Research Libraries Group Cultural Materials portal provides thumbnails of aggregated resources. Several ongoing OAI aggregators have expressed interest in presenting graphic surrogates for the resources they harvest, notably the CIC metadata portal (our project, described below), the American West project, and the 24 Hour Museum portal in the UK.
As an analog to image thumbnails, Webpage thumbshots also have begun appearing as part of Web search results. Thumbshots are thumbnail-sized images created from Webpage snapshots and are useful surrogates for non-image resources. Studies demonstrate that thumbshots, especially when coupled with textual extracts, can improve the selection of relevant resources without significantly increasing user time to evaluate the relevance of a resource. The Exalead engine and the BetterSearch Firefox extension include thumbshots in search results. Thumbshots.org generates thumbshots for homepages and allows Web authors to include these thumbshots when presenting links to the associated homepages.
The OAI Protocol for Metadata Harvesting provides no explicit guidance regarding the use of thumbnails and thumbshots. The Museums and Images JISC-FAIR Cluster Group in its report on Images in the Harvesting Model affirmed the desirability of using thumbnails in the OAI context. To make use of thumbnails and thumbshots, service providers must consider methods to collect thumbnails, publish them, integrate them in the overall aggregation process, and then assess their impact. Not all data providers have thumbnails to make available. For those that do, there is not yet consensus on how to make metadata aggregators aware of thumbnail availability.
In this article we report on our experiences at the University of Illinois at Urbana-Champaign (UIUC) using thumbnails and thumbshots within an OAI context. Our objective is to make the heterogeneous resources available on the CIC metadata portal more comprehensible, facilitating selection of resources by end-users. In collaboration with participating CIC libraries, we have developed and implemented strategies to collect existing thumbnails and to automatically generate thumbnails and thumbshots of image and Webpage resources.
2 Strategies to obtain thumbnails and thumbshots for the CIC metadata portal
The CIC metadata portal aggregates and makes searchable metadata describing over 500,000 mostly digital resources from 11 Midwestern research universities. Resources are images, sound, video, theses and dissertations, ePrints, online ebooks, Websites, finding-aids, etc. Each participating institution has implemented at least one OAI data provider from which the UIUC service provider harvests metadata. None of the metadata originally included links or references to thumbnails, although some institutions had created thumbnails for their local Websites. Three possibilities were considered:
The first two approaches require additional effort from the data providers; however, the data providers retain control of the thumbnail generation process, its quality, and its coordination with the actual resources. The last approach relies on the service provider. For the CIC project we implemented approaches 1 and 3. An open source application, Thumbgrabber, was developed to capture and maintain thumbnails and thumbshots. Figure 1 illustrates how thumbnails are used in listings of results from the CIC-OAI Metadata Search Portal.
3 Remote thumbnail capture as part of the metadata enhancement process
Thumbgrabber was developed to generate thumbnails and thumbshots from Web resources referenced in descriptive metadata. Thumbgrabber is currently used to maintain about 38,000 thumbnails for items from 16 collections held by 6 CIC metadata repositories. Tests have been performed on 8 other collections from 7 additional repositories. Thumbgrabber also has been used to capture thumbshots. Thumbshot capture is performed as a way to create graphic surrogates for Websites, digitized books, and other multi-part Web resources treated as individual items within the CIC metadata aggregation. The process flow relating to Thumbgrabber as used for the CIC project is depicted in Figure 2.
3.1 Identification of the URL to spider
Data providers make choices concerning the URL they provide in the descriptive metadata made available for harvesting. An ePrints technical report by Tourte highlights the difficulties in trying to identify from harvested metadata the correct URL to use to get to primary ePrints resources.
To accommodate this problem, an XSLT is applied to CIC harvested metadata records (harvested in MODS, qualified Dublin Core, or unqualified Dublin Core) in order to identify the source URL from which the thumbnail should be captured (a Thumbsource element).
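The URL-selection step can be illustrated with a minimal sketch. The project itself uses an XSLT whose rules are not reproduced here; the Python function below is only an approximation for unqualified Dublin Core, and both the function name find_thumbsource and the heuristic (take the first dc:identifier that looks like an HTTP URL) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace of unqualified Dublin Core elements
DC_NS = "http://purl.org/dc/elements/1.1/"


def find_thumbsource(record_xml: str) -> Optional[str]:
    """Return the first dc:identifier that looks like an HTTP(S) URL.

    Hypothetical heuristic: the real XSLT may apply per-repository rules.
    """
    root = ET.fromstring(record_xml)
    for ident in root.iter(f"{{{DC_NS}}}identifier"):
        value = (ident.text or "").strip()
        if value.lower().startswith(("http://", "https://")):
            return value
    return None


# Illustrative record; identifiers and URL are made up
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example item</dc:title>
  <dc:identifier>local-id-42</dc:identifier>
  <dc:identifier>http://example.edu/items/42</dc:identifier>
</oai_dc:dc>"""

print(find_thumbsource(SAMPLE))
```

A real selection rule would probably need to vary by repository, since data providers differ in which identifier leads to the primary resource.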
3.2 The thumbnail capture process
From a starting directory of harvested metadata files, Thumbgrabber loads each metadata file, looking for the <Thumbsource> element URL.
Thumbgrabber supports three capture methods. The "Capture largest image only" method looks at all images on the Webpage, including across frames, and generates a thumbnail using the largest image found. This method is also used to create thumbnails from URLs pointing directly to image resources. The "Capture HTML Webpage only" method captures a thumbshot of the visible portion of the Webpage as it is rendered by the Microsoft Internet Explorer Web browser in a window sized according to the "Web Page Size" input parameter. The "Capture largest image or HTML Webpage" method first looks for the largest image. If there is no image of adequate dimension, then a thumbshot of the entire Webpage is created instead.
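The choice among the three methods can be summarized as a small decision function. This is a sketch in Python rather than the Visual Basic of the actual tool; the names CaptureMethod and choose_capture are hypothetical, and the argument largest_image stands for whatever the image search returned (None when no image of adequate dimension was found).

```python
from enum import Enum
from typing import Optional


class CaptureMethod(Enum):
    LARGEST_IMAGE_ONLY = "Capture largest image only"
    HTML_PAGE_ONLY = "Capture HTML Webpage only"
    LARGEST_IMAGE_OR_PAGE = "Capture largest image or HTML Webpage"


def choose_capture(method: CaptureMethod, largest_image: Optional[str]) -> Optional[str]:
    """Return 'thumbnail', 'thumbshot', or None when nothing is captured."""
    if method is CaptureMethod.HTML_PAGE_ONLY:
        return "thumbshot"  # always render the page itself
    if largest_image is not None:
        return "thumbnail"  # an adequate image was found
    if method is CaptureMethod.LARGEST_IMAGE_OR_PAGE:
        return "thumbshot"  # fall back to the rendered page
    return None  # image-only method and no adequate image
```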
The behavior of the Thumbgrabber can be customized according to which collection an individual metadata record belongs. (The CIC metadata transformation process includes the creation of an <isPartOf> element for each metadata record, which contains a unique collection code within the CIC environment [15, 16].) Collection-specific inputs to Thumbgrabber are specified in an XML file (example shown below in Figure 3). For collections of Websites and for collection homepages, the "Capture HTML Webpage only" method is applied. For mixed collections such as the UIUC "Teaching with Digital content" collection, the "Capture largest image or HTML Webpage" method is applied. Various fields in the collections XML file, such as <Capture>, <ThumbWidth>, <PageWidth>, etc., provide additional inputs to the Thumbgrabber.
Figure 3 - Section from the collections.xml file
If a thumbnail image is successfully captured for a specific metadata record, a new element called <thumbnail> is added to the record. In order for this element to point to a Web addressable location, a "Base URL" is provided to the program, giving the URL equivalent for the machine-specific "Dest Directory" path specifying where the thumbnail should be saved. For example, if the full path to a saved thumbnail is "c:\thumbs\repo1\thumb0001.png", the Dest Directory parameter for the run is "c:\thumbs\", and the base URL is "http://some.edu/oai/thumbs/", then the URL stored in the thumbnail element of the updated metadata record will be "http://some.edu/oai/thumbs/repo1/thumb0001.png".
Thumbgrabber can be controlled either through a graphical user interface (GUI), as shown in Figure 4, or via parameters passed to the program through the command line. (Additional details on all input fields can be found in the documentation provided with the open source distribution on SourceForge.)
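The mapping from a saved file path to a Web-addressable URL amounts to replacing the "Dest Directory" prefix with the "Base URL". The Python sketch below illustrates the idea; the function name thumbnail_url is hypothetical, and the example assumes the thumbnail was saved in a per-repository subdirectory named repo1 beneath the destination directory.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def thumbnail_url(saved_path: str, dest_dir: str, base_url: str) -> str:
    """Map a thumbnail's Windows file path to its public URL.

    Raises ValueError if saved_path is not located under dest_dir.
    """
    relative = PureWindowsPath(saved_path).relative_to(PureWindowsPath(dest_dir))
    # Percent-encode each path segment and join with forward slashes
    url_path = "/".join(quote(part) for part in relative.parts)
    return base_url.rstrip("/") + "/" + url_path


print(thumbnail_url("c:\\thumbs\\repo1\\thumb0001.png",
                    "c:\\thumbs\\",
                    "http://some.edu/oai/thumbs/"))
```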
4 Technical description of the capture and conversion process
The challenge of thumbnail and thumbshot capture is to reliably create consistent image surrogates from URLs for both Webpages and images as variously referenced in harvested metadata records. Thumbgrabber must therefore handle the diversity of technologies and potential instabilities encountered in the Web environment.
4.1 The rendering of resources using Internet Explorer
Thumbgrabber is designed to create thumbnails that preview the resource as it is intended to be viewed by the end-user. Therefore, Thumbgrabber requires a means to render images and Webpages as typically viewed by end-users, independent of technology used by the Website (e.g., scripting, CSS, frames, etc.). To facilitate this, Thumbgrabber makes use of Microsoft Internet Explorer® (MSIE) to open and render Web resources. MSIE was chosen both in recognition of its large market share, and because of the ease of programming MSIE components. Thumbgrabber is written in Visual Basic to facilitate interaction with the MSIE programming interface. MSIE also automatically handles details such as HTTP redirects, secure Websites, the caching of images, and event notification (e.g., page load complete). MSIE also exposes the HTML Document Object Model (DOM) facilitating application development. The programmatic accessibility of MSIE's Web cache is especially useful. The MSIE WebBrowser control allows a complete, scriptable version of the MSIE Web browser to be embedded within a Visual Basic application.
4.2 The "Capture largest image only" method
When this method is specified, Thumbgrabber creates a thumbnail from the largest image of a Webpage that is at least as large as the "Maximum Thumbnail Size" parameter in pixels. Images considered include HTML <img> and <input type='image'> elements, but do not include background images specified using the background attribute. They also do not include images that might be embedded within the page via Java applets or other plug-ins. The HTML DOM conveniently provides height and width properties for these elements that can be used to determine their size. The area of the images (height times width) is used to determine the largest image. If the Webpage is composed of multiple frames, each frame will be recursively examined to find the largest image across all frames. If the Webpage contains no images, or if its images are all too small, Thumbgrabber generates a warning to that effect in the log file, and it then moves on to the next metadata record. The src, alt, and title attribute values are extracted for the largest image using the HTML DOM.
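The search for the largest image across nested frames can be sketched as a recursive walk. The Python below models a page as plain data rather than using the MSIE DOM, so the Image and Page classes are stand-ins, and the reading of the size threshold (both width and height must meet a minimum) is an assumption; the article does not spell out how the threshold is compared.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Image:
    src: str
    width: int
    height: int


@dataclass
class Page:
    images: List[Image] = field(default_factory=list)
    frames: List["Page"] = field(default_factory=list)  # nested frame documents


def area(img: Image) -> int:
    return img.width * img.height


def largest_image(page: Page, min_width: int, min_height: int) -> Optional[Image]:
    """Return the largest-area image on the page or in any nested frame.

    Images smaller than the minimum dimensions are ignored; None means
    no adequate image was found anywhere.
    """
    best: Optional[Image] = None
    for img in page.images:
        if img.width < min_width or img.height < min_height:
            continue
        if best is None or area(img) > area(best):
            best = img
    for frame in page.frames:
        candidate = largest_image(frame, min_width, min_height)
        if candidate is not None and (best is None or area(candidate) > area(best)):
            best = candidate
    return best
```

When the function returns None, the behavior described above applies: a warning is logged and processing moves on to the next record.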
The Web cache maintained by Internet Explorer is utilized to capture images; the location on local disk to which MSIE has saved a cached copy of the largest image is determined via exposed WebBrowser control methods and properties.
4.3 The "Capture HTML Webpage only" method
The "Web Page Size" fields define the size of the Internet Explorer Web browser window that is used to create thumbshots. Typically, a Webpage is larger than can be displayed in a single Web browser window. However, Thumbgrabber is constrained to only capture the portion of the Webpage that can be displayed in a window of the size specified. This size may not be larger than the visible screen being used by the executing program. Using low-level Windows API functions, the region of the screen containing the window is copied to an in-memory raster image that is then saved to disk as a BMP file (before then being converted to PNG for long-term storage).
4.4 Capture Issues
If there are no HTTP errors, such as 404 Not Found, a thumbnail or thumbshot is captured and saved to disk. A complication arises if a URL is redirected by the destination Web server, typically for an image, so a special routine tests the source URL by issuing an HTTP HEAD request and checking for the actual URL after any redirects have occurred. This actual URL is what is used by the program to create the thumbnail.
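The redirect test can be approximated with a HEAD request that reports the URL finally reached. The sketch below uses Python's standard urllib rather than the routine in Thumbgrabber; the function name resolve_final_url and the injectable opener parameter (added so the function can be exercised without network access) are assumptions.

```python
import urllib.request


def resolve_final_url(url: str, opener=urllib.request.urlopen, timeout: float = 10.0) -> str:
    """Issue an HTTP HEAD request and return the URL after any redirects.

    urllib follows redirects automatically; geturl() reports the URL of the
    resource finally retrieved. HTTP errors such as 404 surface as
    urllib.error.HTTPError raised by the opener.
    """
    request = urllib.request.Request(url, method="HEAD")
    with opener(request, timeout=timeout) as response:
        return response.geturl()
```

A caller would typically wrap this in a try/except block, log the failure, and skip the record when the source URL cannot be resolved.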
4.5 PNG image creation
To create thumbnails, Thumbgrabber relies on the GNU NetPBM graphic conversion and manipulation toolkit. This toolkit consists of a collection of command line utilities for converting various image formats to and from a common portable format called PNM. The toolkit includes numerous utilities for manipulating PNM image files, such as scaling, rotating, cropping, etc. These various command line utilities are called by the Visual Basic program as needed.
In the case of image capture, the Internet MIME Type of the largest image needs to be determined, such as JPEG, GIF or PNG. This is accomplished by querying the HTML DOM mimeType property or by guessing the type of image based on the file extension. Based on the MIME Type of the image one of several functions is called to convert the image taken from the local cache into a PNG thumbnail. For Webpage thumbshots, the format of the original screen capture is always a BMP file. The BMP file is converted to a PNM before being scaled down to thumbnail size, converted to a PNG image, and saved to disk.
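The type detection and conversion chain might look like the following Python sketch, which only builds the command lines and does not run them. The utility names (jpegtopnm, giftopnm, pngtopnm, bmptopnm, pnmscale, pnmtopng) come from the Netpbm distribution, but the article does not say exactly which utilities or options Thumbgrabber invokes, so the particular pipeline is an assumption. The sketch also assumes the browser-reported type, when present, is already a MIME string.

```python
import mimetypes
from typing import List, Optional

# Netpbm converter used to bring each source format into PNM
TO_PNM = {
    "image/jpeg": "jpegtopnm",
    "image/gif": "giftopnm",
    "image/png": "pngtopnm",
    "image/bmp": "bmptopnm",  # screen captures for thumbshots
}


def detect_mime(reported_type: Optional[str], url: str) -> Optional[str]:
    """Prefer the type reported by the browser; else guess from the extension."""
    if reported_type:
        return reported_type.lower()
    guessed, _encoding = mimetypes.guess_type(url)
    return guessed


def thumbnail_pipeline(mime: str, source_file: str, max_size: int) -> List[List[str]]:
    """Return the commands to pipe together: convert, scale, write PNG.

    Raises KeyError for an unsupported MIME type.
    """
    return [
        [TO_PNM[mime], source_file],
        # Fit within a max_size x max_size box, preserving aspect ratio
        ["pnmscale", "-xysize", str(max_size), str(max_size)],
        ["pnmtopng"],
    ]
```

Each command reads the previous command's standard output, so the three stages could be chained with pipes, with the last stage redirected to the destination PNG file.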
All thumb images are saved as PNG (Portable Network Graphics) files. After the PNG thumbnail is created, extended properties are added both as ancillary text strings directly in the PNG file and also as extended NTFS file properties (see Figure 5). These values allow the image to be tied back to the source Webpage or the source OAI record if needed. The extra properties are stored in both places mostly as a convenience.
Figure 5 - Metadata properties embedded in the PNG files
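Embedding a text property in a PNG file means adding an ancillary tEXt chunk, which holds a keyword, a null separator, and a Latin-1 string, followed by a CRC. The Python sketch below inserts such a chunk just before the closing IEND chunk. It is not the mechanism Thumbgrabber uses, and the keyword shown in the usage line (SourceURL) is hypothetical rather than one of the property names the tool writes.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: 4-byte length, type, data, CRC over type+data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def add_text(png: bytes, keyword: str, text: str) -> bytes:
    """Return a copy of the PNG with a tEXt chunk inserted before IEND."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    # IEND is the final chunk and is always 12 bytes (empty data)
    iend_start = len(png) - 12
    if png[iend_start + 4:iend_start + 8] != b"IEND":
        raise ValueError("PNG does not end with an IEND chunk")
    data = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return png[:iend_start] + make_chunk(b"tEXt", data) + png[iend_start:]


# Usage (file name and values are illustrative):
# with open("thumb0001.png", "rb") as f:
#     tagged = add_text(f.read(), "SourceURL", "http://example.edu/page")
```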
The quality of the thumbnails created has proven comparable to locally generated ones made available by data providers.
5 Thumbnails as a type of digital library metadata
Thumbnails and thumbshots have the potential to improve the "SELECT" functionality of an aggregation; however, there are limitations to the utility of thumbnails and thumbshots in such contexts. Service providers and data providers must work together to mitigate these problems as much as possible.
5.1 Coordination between metadata, thumbnails and resources
In the CIC project, the thumbnails are an external document stored in a distinct location. The information contained in the metadata records, the resources they point to and the thumbnails must be synchronized (see Figure 6), even while they are created and maintained by different actors (data provider and service provider). This raises the question of how best to link metadata, resources, and captured thumbnails/thumbshots, and how best to asynchronously handle updates.
The frequency for updating thumbnails is typically different from the frequency for updating metadata. Even in the case of a full metadata reharvest, the thumbnail capture process might remain incremental (only capturing thumbnails that do not already exist).
Thumbgrabber checks the disk to determine whether an image has already been captured for a given record; if it has, then it is not recaptured. This allows Thumbgrabber to incrementally build up its collection of images. It also facilitates recovery from cases where a particular Webpage might have been temporarily unavailable. An "exclude list" of URLs for each repository can also be created to exclude URLs known to generate irrelevant thumbnails / thumbshots.
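The incremental behavior can be sketched as a filter applied before capture. In the Python below, the function name records_to_capture and the convention of naming each thumbnail after its record identifier are assumptions; the article does not describe how Thumbgrabber names its files.

```python
from pathlib import Path
from typing import Iterable, List, Set, Tuple


def records_to_capture(
    records: Iterable[Tuple[str, str]],  # (record_id, source_url) pairs
    dest_dir: Path,
    exclude: Set[str],
) -> List[Tuple[str, str]]:
    """Return the records that still need a thumbnail.

    Skips URLs on the exclude list and records whose thumbnail file
    already exists on disk, so interrupted or partial runs can resume.
    """
    todo = []
    for record_id, url in records:
        if url in exclude:
            continue  # known to yield an irrelevant thumbnail or thumbshot
        if (dest_dir / f"{record_id}.png").exists():
            continue  # already captured on an earlier run
        todo.append((record_id, url))
    return todo
```

Because a failed capture leaves no file behind, a page that was temporarily unavailable is simply retried on the next run.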
5.2 Thumbnails to market resources
The first reaction of CIC collaborators to the possibility of adding thumbnails to the CIC-OAI metadata portal was to wonder how to represent textual resources. Inclusion of thumbnails for some resources but not for others might improve attractiveness of those resources with thumbnails to the disadvantage of resources not having thumbnails. (Similarly, in regard to thumbshots, Web designers have suggested that users viewing a search results listing can perceive sites lacking a preview image as less important than sites with such images.) This suggests a role for metadata as part of the marketing of resources.
5.3 Information accuracy
While thumbnails can give a sense of an image, thumbshots are generally not of sufficient resolution to allow the user to read the words on the Webpage. The primary information that thumbshots carry is the Webpage structure (see for example, Figure 7). The general organization of a Webpage (e.g., the proportion of image vs. text) does convey useful information. Users can to some degree identify "the layout, genre and style of the page". However, such hints from a thumbshot are filtered by user expectations. Woodruff et al. differentiate the impact of thumbshots according to the type of query. For example, a search for a picture "requires identification of a graphical element" or a homepage "requires genre classification (correct pages somewhat textual, many incorrect pages entirely textual)".
Research has suggested that adjustments can be made to make the text of a thumbshot partially readable. Towards this end, thumbshot capture for CIC records was tested with a text-enlarging stylesheet. However, to make the text readable, the enlargement is done at the expense of completeness of information captured. Typically, only a part of the screenshot will be captured and the information on what the page as a whole looks like will be lost. Other techniques allow tweaking the image (Webpage caricatures such as proposed by Wynblatt and Benson), but at the expense of information accuracy. This may conflict with the mission a digital library has to present trustworthy representations of data provider resources.
6 Shared responsibility for creating graphic surrogates
For a metadata aggregation or distributed digital library system, the present article describes tools and a model to automatically capture and create thumbnails.
An unexpected consequence of the thumbnail remote capture was the increased interest of three CIC institutions in generating thumbnails locally, though to date, only the University of Wisconsin-Madison (see Figure 8) and the University of Iowa are providing thumbnail URLs in their OAI metadata (approach 1 described above). To allow contribution of data provider-generated thumbnails by reference, a thumbnail element from the namespace maintained by the National Library of Australia for use in the Picture Australia project was imported into the CIC project's Qualified DC schema.
Alternatively, data providers can help ensure the correct capture of thumbnails and thumbshots by indicating a preferred URL to use for that purpose. An adequate way of conveying this information is uncertain at this time. One possible strategy is to create a "jump-off page" giving an added URL in a LINK element embedded in the HTML HEAD node (i.e., as a link to an alternate manifestation of the resource). This approach was suggested by the ePrints technical report.
Figure 8 - Examples of thumbnails provided by the content provider and thumbnails generated by the Thumbgrabber
Ultimately, implementation of thumbnails and thumbshots in the context of metadata aggregation should be a collaborative process between data providers and service providers. Data providers need to be cognizant of the usefulness of thumbnails and thumbshots in the context of metadata aggregation. They may either create or facilitate the creation of graphic surrogates for their resources. Service providers need to guarantee an adequate level of completeness of metadata and a maximal level of the "attractiveness" of records presented in a cross-repository list of distributed resources. The creation, update and maintenance of thumbnails and thumbshots for digital library resources represents a useful service to data providers, exactly the sort of value-added effort service providers should be encouraged to undertake. The academic digital library community should use graphic information to make scientific and cultural resources more visible on the Web.
The level of effort that content providers are willing to go to in order to facilitate inclusion of their resources in larger aggregations varies considerably (for a variety of reasons). To enhance services built on top of metadata aggregations, OAI service providers can take advantage of information extracted directly from the resources described by the metadata they harvest. As outlined above, service providers can use the URLs found in harvested metadata records to generate thumbnail and thumbshot images for display in search result listings. Especially in the cultural heritage sector, this can be seen as service not only to the end-user, but also to content providers. The use of thumbnails and thumbshots is an example of one of the ways in which digital library services are evolving to take greater advantage of graphical presentation of resources. Additional community agreements on the labeling and dissemination of different views of resources held in content provider repositories would facilitate processing and encourage wider use of thumbnails, thumbshots, and similar forms of extended metadata.
The work described in this article was supported by a grant from the Committee on Institutional Cooperation's Center for Library Initiatives. We acknowledge the libraries of the following participating CIC member institutions for providing thumbnails or allowing the University of Illinois at Urbana-Champaign to remotely capture them: University of Chicago, University of Illinois at Chicago, University of Illinois at Urbana-Champaign, Indiana University, University of Michigan, Michigan State University and the University of Wisconsin-Madison.
This article also benefited from the authors' discussions with Peter Gorman from the University of Wisconsin-Madison regarding strategies for adding thumbnails to the CIC metadata portal.
We would also like to acknowledge the participation of Jian Bai, Princeton University Library, for his assistance with the initial study of CIC partners' Websites and the beta testing of the Thumbgrabber application.
 Functional Requirements for Bibliographic Records: final report / recommended by the IFLA Study Group on the Functional Requirements for Bibliographic Records; International Federation of Library Associations and Institutions, IFLA Universal Bibliographic Control and International MARC Programme. Frankfurt Am Main: IFLA UBCIM, 1997. Available at <http://www.ifla.org/VII/s13/frbr/frbr.htm>.
 Sarah L. Shreeves, C.M. Kirkham (2004). Experiences of educators using a portal of aggregated metadata. Journal of Digital Information, 5(3). Article No. 290, 2004-09-09. Available at <http://jodi.ecs.soton.ac.uk/Articles/v05/i03/Shreeves/>.
 Michelle Chang, John J. Leggett, Richard Furuta, Andruid Kerne, J. Patrick Williams, Samuel A. Burns, Randolph G. Bias, Interacting with collections: Collection understanding. In Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries, June 2004. Available at <http://doi.ieeecomputersociety.org/10.1109/JCDL.2004.1336144>.
 Michelle Dalmau, Charles W. Cushman Photograph Collection: Report on the Group and Individual Walkthrough, 2003. Available at <http://www.letrs.indiana.edu/~mdalmau/cushman/prototype/designDocs/cushWalkFinalReport.pdf>.
 Personal communications with David Dawson, Museums, Libraries and Archives Council, UK, and Bill Landis from the California Digital Library. Expressions of interest in thumbnails and thumbshots have also surfaced within the Digital Library Federation - National Science Digital Libraries expert group developing Best Practices for OAI and shareable metadata, in part under the auspices of an IMLS-funded DLF project creating a precursor portal prototype for a planned DLF OAI-based metadata aggregation.
 Susan Dziadosz, Raman Chandrasekar, Do Thumbnail Previews Help Users Make Better Relevance Decisions about Web Search Results? In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, 2002. Available at <http://doi.acm.org/10.1145/564376.564446>.
 Exalead search engine, <http://www.exalead.com/>.
 "An extension for Firefox which enhances Google (all international flavours, too), MSN Search, Yahoo Search, A9, Answers.com (web results), AllTheWeb, del.icio.us and Simpy.com by adding previews (thumbnails) [...]" <http://bettersearch.g-blog.net/>.
 For more information, see <http://www.thumbshots.org/>.
 Shaun Osborne, Museums and Images JISC-FAIR Cluster Group Images in the Harvesting Model, July 2004. Available at <http://www.fitzmuseum.cam.ac.uk/projects/htf/docs/M&I_IP_Images_jul04.doc>.
 CIC Metadata Portal, <http://cicharvest.grainger.uiuc.edu/>.
 Committee on Institutional Cooperation (CIC), <http://www.cic.uiuc.edu>.
 Herbert van de Sompel, Michael L. Nelson, Carl Lagoze and Simeon Warner Resource Harvesting within the OAI-PMH Framework, in D-Lib Magazine, Dec 2004. Available at <doi:10.1045/december2004-vandesompel>.
 Gregory J. L. Tourte, ePrints UK Technical Documentation, <http://www.rdn.ac.uk/projects/eprints-uk/docs/technical/eprints-tech-report.pdf>.
 Foulonneau, Muriel and Timothy W. Cole (in press). Strategies for reprocessing aggregated metadata. In 9th European Conference on Digital Libraries, ECDL 2005, September 18-23, 2005, Vienna, Austria. (Proceedings Series: Lecture Notes in Computer Science.) Heidelberg: Springer-Verlag. <http://www.springerlink.com/openurl.asp?genre=article&id=doi:10.1007/11551362_26>
 Muriel Foulonneau, Tim W. Cole, Thomas G. Habing, Sarah L. Shreeves, Using collection descriptions to enhance an aggregation of harvested item-level metadata. In Proceedings of the 5th ACM/IEEE-CS joint conference on Digital Libraries, 2005. Available at <http://portal.acm.org/citation.cfm?doid=1065385.1065393>.
 Thumbgrabber project on SourceForge, <http://sourceforge.net/project/showfiles.php?group_id=47963&package_id=159364>.
 Netpbm home page, <http://netpbm.sourceforge.net/>.
 PNG (Portable Network Graphics), <http://www.w3.org/Graphics/PNG/>.
 Ed Zivkovic, Web design tutorial, Display Web Page ThumbShots without Hosting Images, 2004, <http://developers.evrsoft.com/article/
 Allison Woodruff, Ruth Rosenholtz, Julie B. Morrison, Andrew Faulring, Peter Pirolli, A comparison of the use of text summaries, plain thumbnails, and enhanced thumbnails for Web search tasks. Journal of the American Society for Information Science and Technology, Volume 53, Issue 2. Pages 172 - 185, 2002.
 Michael Wynblatt, Dan Benson, Web Page Caricatures: Multimedia Summaries for WWW Documents. IEEE International Conference on Multimedia Computing and Systems, 1998. Available at <http://ieeexplore.ieee.org/iel4/5648/15134/00693639.pdf>.
 PictureAustralia picture schema, <http://www.pictureaustralia.org/schemas/pa/picture.xsd>.
 A Qualified DC application XML Schema for CIC OAI Metadata Harvesting Service Project Created 2004-07-14, <http://cicharvest.grainger.uiuc.edu/schemas/QDC/2004/07/14/CICQualifiedDC.xsd>.
Copyright © 2006 Muriel Foulonneau, Thomas G. Habing, and Timothy W. Cole