A Low Cost, Low Memory Footprint, SQL and Servlet-based Solution for Searching Archived Images and Documents in Digital Collections

D-Lib Magazine
November/December 2009

Volume 15 Number 11/12

ISSN 1082-9873

Cristina Tofan, Web Specialist
Crabbe Library
Eastern Kentucky University
521 Lancaster Ave, Main Library
Richmond KY 40475
<cristina.tofan@eku.edu>

Daniel Tofan, Assistant Professor
Chemistry Department
Eastern Kentucky University
521 Lancaster Ave, Moore 337
Richmond KY 40475
<daniel.tofan@eku.edu>

Introduction

Easy online access to digital documents in special collections is a must for any library. Many of the resources in special collections are unique and irreplaceable. Because of their singular characteristics, their preservation, digitization, and online availability are a high priority for the library, and in many cases they are part of the institution's strategic plan.

Both vendor products and open source options are available. As a commercial example, CONTENTdm® is used at the University of Southern Mississippi Libraries (Capell and Ginn, 2009), at the University of Washington (Lally, 2007), at Washington State University (Bond, 2004), at IUPUI Libraries (Kramer, 2005), and at many other places. On the open source side, Fedora is digital repository software that supports the University of Maryland Libraries Digital Collections (Schreibman, 2008), among others. Because of their richness of features, these software options were too complex to implement at the institution featured in this article: dedicated personnel are necessary in both cases, and financial support is required for the purchase and maintenance of any commercial product. Neither requirement could be met.

In this article, we demonstrate a simple, elegant solution created in-house, with no additional monetary commitment, that meets the needs of the institution. The implementation uses the Java programming language and the Structured Query Language (SQL); it is low cost, has a low memory footprint, and is very fast. Although using a programming language (Java, PHP, etc.) in conjunction with a database management system (MySQL, Oracle, etc.) is a common mechanism for making the records stored in a database web-accessible, a literature search¹ in the Library Hi Tech journal and the Library Literature & Information Science database did not yield other examples of a Java/MS Access implementation.

The library owns a number of collections of documents that are specific to the institution. These special collections consist of photographs and postcards, historical maps, magazine articles about the school's activities, meeting minutes, etc. Patrons can access these resources by coming into the building and making requests to see the originals, but they need a way to search for the items of interest remotely, to see if the library owns them. In order to provide online access to these documents, the library needs to have two major components in place:

A data repository
A search and display mechanism

The data repository part was addressed several years ago, when the Archives staff started to digitize the materials. Descriptive information was provided for each document, based on its type, and a few databases were the result of this process. Individual databases were created in Microsoft Access for each of the following categories:

Images, film, video and audio tapes, negatives
Maps
Indexed articles and meeting minutes

Once the database design was completed, the problem of offering access to patrons led to an inquiry into what mechanism could be used to search and display the data and metadata. At the time these databases were created, MS Access was chosen because of its availability and popularity in universities. The metadata was recorded in fields chosen by the Archives staff rather than in fields that follow the Dublin Core standard. For this reason, when the issue of online searchability was raised later, there was no easy way to import these databases, including the metadata stored in them, into open-source digital library software such as Greenstone, which would have provided the search and display mechanism. Commercial options, such as CONTENTdm® for images, were outside the budget, which made in-house programming the best available option. The other reason for developing custom software was that, although the collections already contain thousands of records and continue to grow, the requirements for displaying metadata in online searches are modest.

Solution architecture

We chose to implement a custom solution using powerful but inexpensive programming techniques. MS Access has the advantage of providing an easy-to-use interface and offering search capabilities through SQL. The queries needed to retrieve information from the databases described above are relatively simple, so the search engine could be built around these SQL queries executed in Access. The main tool needed was an interface between the user and the database. The interface needed to be simple, and the mechanism for retrieving the results had to be fast and transparent to the user. Java Web programming seemed to be the ideal solution.

We designed the search engine around the Java servlet architecture. Apache Tomcat, a free web server, was used as the servlet engine; it was set up by installing Sun Java Web Services version 2.0. Tomcat implements the Java Servlet specification, and creating and deploying servlets for Tomcat requires only minimal effort. Servlets are small Java programs that reside on the server and handle HTTP requests. A servlet is loaded in the server's memory and waits for requests. When a user sends a request to the server (through a web form), the server (Tomcat) activates the appropriate servlet (predefined in configuration files) and passes control to it. The servlet's job is to analyze the web request, to separate the variables submitted (such as search keywords), to send appropriate queries to the database residing on the same server, to retrieve and process the results, and finally to display them to the user in HTML format. The servlet is truly a one-stop solution for the entire search and display process. The database is only the repository of data, and the web server is just the dispatcher. It is the servlet that does all the work.
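The step in which the servlet separates the submitted variables can be illustrated outside a servlet container. The following is a minimal sketch, not the code used in this project: in a real servlet the container performs this decoding and exposes the values through HttpServletRequest.getParameter. The class name QueryStringParser and the parameter names "keywords" and "collection" are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: splits a form-encoded query string into decoded name/value pairs. */
class QueryStringParser {

    /** Returns the parameters in submission order; a repeated name keeps its last value. */
    static Map<String, String> parse(String queryString) {
        Map<String, String> params = new LinkedHashMap<>();
        if (queryString == null || queryString.isEmpty()) {
            return params;
        }
        for (String pair : queryString.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate stray separators such as "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            String name = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            params.put(decode(name), decode(value));
        }
        return params;
    }

    /** Decodes "+" and %XX escapes as browsers encode them in form submissions. */
    private static String decode(String text) {
        try {
            return URLDecoder.decode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }
}
```

For example, parsing "keywords=campus+map&collection=Maps" yields the keyword string "campus map", which the servlet would then pass on to the database query.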

Database implementation

The specific implementation depends on the details of the database. The structure of each table is different, containing dedicated fields that reflect the metadata specific to each type of document stored.

Table 1 shows the main database tables used, their generic field structure, and the total number of records in each at the time of writing.

Table 1: Structure of the database hosting documents in the special collections.

Table	Fields	Number of records
Images	Collection, Image number, Date, Medium, Size, Type, Subject, Caption, Notes, Hyperlink	36450
Film and video	Collection, Event, Description, Date, Format, Shelf location, Number of volumes, Size, Notes	3750
Tapes	Project name, Tape ID, Interviewer, Interviewee, Occupation, Date, Location, Tape format, Tape length, Keywords, Transcript, Owner, Restrictions	3980
Negatives	Subject, Description, Date, Negative number	38200
Maps	Category, Title, Description, Date, Scale, Cartographer, Publication info, Source, Notes, Location	680
Indexed documents	Source, Topic, Subtopic, Page numbers, Date	84750
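To illustrate how simple the queries against these tables can be, the sketch below builds a parameterized keyword search for the Images table. It is an illustration under stated assumptions, not the query used in the production servlets: the column identifiers (for example ImageNumber for the "Image number" field), the choice of searching Subject, Caption, and Notes, and the use of the standard SQL % wildcard are all assumptions. The resulting text and parameter list would be handed to a JDBC PreparedStatement.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder for a keyword query against the Images table of Table 1. */
class ImageSearchQuery {
    final String sql;              // SQL text with "?" placeholders
    final List<String> parameters; // one LIKE pattern per placeholder, in order

    private ImageSearchQuery(String sql, List<String> parameters) {
        this.sql = sql;
        this.parameters = Collections.unmodifiableList(parameters);
    }

    /** Every whitespace-separated keyword must match at least one of the searched fields. */
    static ImageSearchQuery forKeywords(String keywords) {
        List<String> params = new ArrayList<>();
        StringBuilder where = new StringBuilder();
        for (String word : keywords.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // blank input produces no conditions
            }
            if (where.length() > 0) {
                where.append(" AND ");
            }
            // Assumed searchable columns; placeholders keep user input out of the SQL text.
            where.append("(Subject LIKE ? OR Caption LIKE ? OR Notes LIKE ?)");
            String pattern = "%" + word + "%";
            params.add(pattern);
            params.add(pattern);
            params.add(pattern);
        }
        String sql = "SELECT Collection, ImageNumber, Subject, Caption, Hyperlink FROM Images"
                + (where.length() > 0 ? " WHERE " + where : "");
        return new ImageSearchQuery(sql, params);
    }
}
```

Binding the keywords as parameters rather than concatenating them into the SQL text avoids quoting problems and SQL injection from the public search form.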

Servlet implementation

The search engine was implemented using the paradigm "one servlet per database table". This ensures that each type of search has its own dedicated servlet. Each servlet handles HTTP requests, which are well supported in Java. Multiple searches are possible at the same time through Java's multithreading features, managed by the servlet engine (Tomcat). When a search request is submitted through the web interface, the servlet engine dispatches it to the appropriate servlet on its own thread. The servlet processes the request and returns the results as a dynamic web page, after which the resources used by that request are released in order to free memory. The servlet engine manages the servlet life cycle and memory allocation. This mechanism is very efficient and runs on a very inexpensive server with less than 1 GB of memory. Searches are fast: a search returning 500 results takes a couple of seconds.
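The way simultaneous searches share one server can be imitated outside Tomcat with a small thread pool. This is only a sketch of the idea: the class ConcurrentSearchDemo, the in-memory list of captions, and the pool size of four are assumptions made for illustration, whereas in the deployed system the threads are created and managed by the servlet engine and the data lives in the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical demo: several keyword searches run at once over shared, read-only data. */
class ConcurrentSearchDemo {

    /** Counts the captions that contain the keyword, ignoring case. */
    static int countMatches(List<String> captions, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String caption : captions) {
            if (caption.toLowerCase(Locale.ROOT).contains(needle)) {
                count++;
            }
        }
        return count;
    }

    /** Runs one search per keyword on a pool of worker threads; results keep keyword order. */
    static List<Integer> searchAll(List<String> captions, List<String> keywords) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // assumed pool size
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String keyword : keywords) {
                Callable<Integer> task = () -> countMatches(captions, keyword);
                pending.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> result : pending) {
                counts.add(result.get()); // blocks until that search has finished
            }
            return counts;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Search interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Search failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each search only reads the shared data and keeps its intermediate state in local variables, no locking is needed; the same property is what lets a servlet serve several users at once.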

Table 2 shows the structure of the servlet architecture and the size of each servlet in kilobytes.

Table 2: Servlet architecture for the search engine.

Collection	Servlet name	Size (kB)
Images	ImagesDatabaseServlet	8.6
Maps	MapsDatabaseServlet	7.8
Indexed documents	IndexesDatabaseServlet	7.8

As Table 2 shows, the footprint of each servlet is very small. The bulk of each servlet consists of the code that displays search results using the same look and feel as the rest of the library website. The search itself is done with just a few lines of code.
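The display code mentioned above essentially turns database rows into HTML. A minimal sketch of that step follows; the class ResultPageRenderer, the two-column layout, and the column headings are hypothetical, and the library's actual page template is not reproduced here.

```java
import java.util.List;

/** Hypothetical renderer: converts result rows into an HTML table fragment. */
class ResultPageRenderer {

    /** Escapes the characters that have special meaning in HTML. */
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Renders one table row per record; each record is an array of cell values. */
    static String renderTable(List<String[]> rows) {
        StringBuilder html = new StringBuilder();
        html.append("<table>\n<tr><th>Caption</th><th>Date</th></tr>\n"); // assumed headings
        for (String[] row : rows) {
            html.append("<tr>");
            for (String cell : row) {
                html.append("<td>").append(escapeHtml(cell)).append("</td>");
            }
            html.append("</tr>\n");
        }
        return html.append("</table>\n").toString();
    }
}
```

Escaping every cell matters because captions and notes are free text entered by staff and may contain characters such as & or <, which would otherwise corrupt the generated page.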

Statistics

Once this solution was in place, we wanted to collect statistics about the usage of the new resource. The image and map collections were fitted with two additional tables that recorded information about each search: the IP address of the computer where the search originated, the keywords submitted through the search form, and the number of results returned. The same servlets were used to record this information. Every time a valid search was submitted, the information about the search was stored in a separate table. Over the course of one year, the statistics shown in Table 3 were collected.

Table 3: Usage statistics for the image and map collections gathered over a 12 month period.

Collection	Number of user searches	Average number of results returned	Campus searches	Off campus searches
Images	19098	46	3311 (17%)	15787 (83%)
Maps	6675	45	135 (2%)	6540 (98%)
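The per-search logging that produced Table 3 can be sketched as follows. This is an illustration, not the deployed code: the log is kept in memory instead of in an Access table, a "valid" search is assumed to mean one with non-blank keywords, and the campus/off-campus split is assumed to be derived from the IP address using a hypothetical address prefix (taken here from a range reserved for documentation).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for the table that records each search. */
class SearchLog {
    /** Assumed campus address prefix; the real campus range is not given in the article. */
    static final String CAMPUS_PREFIX = "192.0.2.";

    /** One logged search: originating IP, submitted keywords, number of results. */
    static final class Entry {
        final String ip;
        final String keywords;
        final int resultCount;

        Entry(String ip, String keywords, int resultCount) {
            this.ip = ip;
            this.keywords = keywords;
            this.resultCount = resultCount;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Stores the search if it is valid (assumed: non-blank keywords); returns whether it was stored. */
    boolean record(String ip, String keywords, int resultCount) {
        if (keywords == null || keywords.trim().isEmpty()) {
            return false;
        }
        entries.add(new Entry(ip, keywords.trim(), resultCount));
        return true;
    }

    int size() {
        return entries.size();
    }

    /** Number of logged searches whose IP starts with the campus prefix. */
    int campusCount() {
        int count = 0;
        for (Entry entry : entries) {
            if (entry.ip.startsWith(CAMPUS_PREFIX)) {
                count++;
            }
        }
        return count;
    }

    /** Mean number of results per logged search, or 0.0 when nothing has been logged. */
    double averageResults() {
        if (entries.isEmpty()) {
            return 0.0;
        }
        long total = 0;
        for (Entry entry : entries) {
            total += entry.resultCount;
        }
        return (double) total / entries.size();
    }
}
```

The three aggregates correspond to the columns of Table 3: the number of searches, the average number of results returned, and the campus versus off-campus counts.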

Most of the searches originated from outside campus. In the case of the maps index, about 98% of the searches were conducted from off-campus locations. There was an average of 18 map searches per day and 52 image searches per day during the first year. This is indicative of relatively heavy usage for a newly offered search capability. It also shows high interest in the library's digital collections, which are not typically a resource that students use extensively.
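The daily averages and percentages quoted above follow directly from the counts in Table 3. The short sketch below reproduces the arithmetic, assuming a 365-day collection period and rounding to the nearest whole number; the class and method names are hypothetical.

```java
/** Hypothetical helper that reproduces the figures derived from Table 3. */
class UsageSummary {

    /** Average searches per day, rounded to the nearest whole number. */
    static long perDay(int searches, int days) {
        return Math.round((double) searches / days);
    }

    /** Share of the total as a whole-number percentage. */
    static long percent(int part, int total) {
        return Math.round(100.0 * part / total);
    }

    /** One summary line for a collection, built from its Table 3 counts. */
    static String summarize(String collection, int campus, int offCampus) {
        int total = campus + offCampus;
        return collection + ": " + perDay(total, 365) + " searches/day, "
                + percent(campus, total) + "% campus, "
                + percent(offCampus, total) + "% off campus";
    }
}
```

Calling summarize("Images", 3311, 15787) gives "Images: 52 searches/day, 17% campus, 83% off campus", and summarize("Maps", 135, 6540) gives "Maps: 18 searches/day, 2% campus, 98% off campus", matching the values reported in the text and in Table 3.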

Conclusions

We described a simple, efficient, elegant, and fast solution for searching electronic documents indexed by a library. The solution is very cost effective: the hardware and software resources needed to keep this search engine running are modest. Free software from Sun Microsystems was used to develop powerful servlets in Java. The only costs associated with this system are the licensing of MS Access (inexpensive through academic licensing), the purchase of hardware (inexpensive due to the low memory footprint of the servlet solution), and the development of the Java and SQL code. By not going with a commercial program, we were able to customize the search engine as desired and avoid significant costs. Online access to digital collections was thus provided at minimal cost.

This implementation is easily adaptable to other systems. The solution we developed supports databases containing hundreds of thousands of records, and even larger databases can be used with faster servers. This particular type of implementation is recommended for low-budget libraries in smaller institutions, which may not have the resources to buy commercial packages or hire programmers. The source code of the Java servlet described in this article can be provided upon request.

References

Bond, Trevor J. "Visual image repositories at the Washington State University Libraries." Library Hi Tech, 22.2 (2004): 198-208. <doi:10.1108/07378830410543511>.

Capell, Laura, and Linda Ginn. "Digital Collections: Design and Practice." Mississippi Libraries, 73.1 (2009): 3-7. <http://www.misslib.org/publications/ml/spr09/Libraries_Spring_09.pdf>.

Kramer, Elsa F. "IUPUI image collection: a usability survey." OCLC Systems & Services, 21.4 (2005): 346-59. <doi:10.1108/10650750510631712>.

Lally, Ann. "University of Washington Libraries Digital Collections." D-Lib Magazine, September/October 2007. <doi:10.1045/september2007-featured.collection>; accessed May 2009.

Schreibman, Susan. "University of Maryland Libraries Digital Collections." D-Lib Magazine, May/June 2008. <doi:10.1045/may2008-featured.collection>; accessed November 2009.

Note

1. Keywords searched: "java sql", "java mysql", "java", "sql", "mysql", "php", "digital collections", "special collections", "image collections", "CONTENTdm®".

doi:10.1045/november2009-tofan


Copyright © 2009 Cristina Tofan and Daniel Tofan
