
D-Lib Magazine
June 1998

ISSN 1082-9873

Physical Review Online Archives (PROLA)

An Image Archive for the Journal Physical Review


Timothy Thomas
Computer Research and Applications Group
Los Alamos National Laboratory
Los Alamos, New Mexico

Introduction

Under contract with the American Physical Society (APS) [1], the Computer Research and Applications Group at Los Alamos National Laboratory has developed, deployed, and tested an electronic journal archive system called PROLA. It is intended to be an image-based, complete, full-service on-line archive of the existing issues of the journal Physical Review [Phys. Rev.] from its inception in 1894 to the present. It is presently being moved from its test platform at Los Alamos to an on-site installation at the APS in Ridge, NY. Currently, the archive contains Phys. Rev. A, B, C, D, E, and Letters for the most recent 12 years, 1985 to 1996, consisting of about 130,000 articles. The fundamental goals of PROLA are to provide screen-viewable and printable images of every article, full-text and fielded search capability, good browsing features, direct article retrieval tools, and hyperlinking to all references, errata, and comments. At present, the research focus is on designing a system that will make the transition from a massive, existing paper archive to a modern electronic journal transparent to the user.

Background

Los Alamos supported this project primarily to advance the state of the art in scientific information distribution in physics. PROLA was seen as one leg of a three-part system consisting of an electronic archive, a rapid-response pre-print server, and an authoritative, reviewed electronic journal. It was hoped that these new technical tools, working in conjunction with each other, would produce a revolution in scientific information usage. It was also hoped that the new capabilities, when fully exploited, would make a significant contribution to accelerating the rate of discovery in physics.

From the beginning, the archive was viewed as an image archive. A cooperative arrangement was made with the TORPEDO project at the Naval Research Laboratory to carry out the scanning of the paper journals. No format other than an image was ever considered, since an archive, by definition, must contain an accurate version of the originally published document. Functions such as searching, browsing, printing, displaying, and navigating are implemented differently depending on the availability of electronic material, but ultimately the goal of PROLA is for the user to find, display, and print a completely authoritative image of any Physical Review article of interest.

Basic Design

As designed at Los Alamos [1], PROLA is a page-based system, with each article organized around a central document information (doc-info) page that contains the full bibliographic data. Off the doc-info page hang all the associated materials attached to a particular article, generally the following: (1) the abstract, (2) the page images, (3) the printable versions, (4) the pdf versions, (5) the various ASCII versions, (6) a hyperlinked list of articles that reference the current article, (7) a hyperlinked list of articles referenced by the current article, and (8) links to any relevant comments or errata articles. All pages are constructed on the fly and distributed over the World Wide Web.

To speed this process, each article has a short, separate header file containing the most frequently required information. The data are stored in compressed flat files on a BoxHill RAID disk system. To access an article, users go to the doc-info page, decide what further information, if any, is needed, and then hyperlink to it from there. As technology advances, links to different forms of information are added or removed as needed. For example, initially we did not offer pdf versions, but links to this format were added to the doc-info page when demand warranted. Similarly, anti-aliased screen-viewable images for black and white monitors were deleted when the demand for this version evaporated. Neither of these changes required any redesign of the system. We merely edited the script that creates the doc-info page.
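
The point that adding or dropping a format only meant editing the doc-info script can be illustrated with a small sketch. This is not the PROLA code, which is not reproduced in this article. The field names, link labels, and URL layout below are all assumptions made for illustration.

```python
from dataclasses import dataclass
from html import escape

# Hypothetical table of materials offered from the doc-info page.
# Adding or dropping a format means editing this list and nothing else.
LINK_TYPES = [
    ("abstract", "Abstract"),
    ("images", "Page images"),
    ("print", "Printable version"),
    ("pdf", "PDF version"),
    ("refs", "Articles referenced by this article"),
    ("citedby", "Articles that reference this article"),
]


@dataclass
class Header:
    """Short per-article header record (field names are assumed)."""
    doc_id: str
    journal: str
    volume: int
    page: int
    year: int
    title: str
    authors: list


def doc_info_page(h, link_types=LINK_TYPES):
    """Build the HTML for one doc-info page on the fly from a header record."""
    lines = [
        f"<h1>{escape(h.title)}</h1>",
        f"<p>{escape(', '.join(h.authors))}</p>",
        f"<p>{escape(h.journal)} {h.volume}, {h.page} ({h.year})</p>",
        "<ul>",
    ]
    for key, label in link_types:
        # Assumed URL layout: /<material type>/<document id>
        href = f"/{key}/{escape(h.doc_id)}"
        lines.append(f'<li><a href="{href}">{escape(label)}</a></li>')
    lines.append("</ul>")
    return "\n".join(lines)
```

In this sketch, withdrawing a format amounts to deleting one entry from `LINK_TYPES`; the header records and the page-building function are untouched.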

In general, PROLA is based on free UNIX software. One major exception is the search engine, which is an old commercial version of WAIS. Our experience shows that an electronic archive is by no means static. PROLA requires constant modification to keep up with the current high rate of technical change. Free software facilitates this process by giving access to source code, and by keeping costs down and options open. In practice, the need for constant modification means that a full-time technical support person must always be at work, a conclusion that we resisted, but have now come to accept as inevitable.

Basic Functions

Finding things on a web site with over a million pages is, of course, the critical functionality that determines the success of an electronic archive. PROLA approaches this problem using four types of actions: (1) Browsing, (2) Searching, (3) Retrieving, and (4) Navigating.

Browsing:

This method is borrowed from the paper system. It is based on the idea of a table of contents (toc). In PROLA, the toc is a set of pages that organizes the articles by year, issue, and page number, and it is the primary browsing tool. The toc pages are created from the header information for each article, and are not copies of the paper toc. This browsing tool is primarily for organizing the material in a manner that is familiar to the user. If a person knows the reference, it is faster to use the retrieve function to get to the correct doc-info page than to browse through the toc. Currently, a more effective way to browse is to follow the paths created by the "reference to" and "referenced by" lists attached to the doc-info page, or to use relevance feedback. An additional browser based on the subject code attached to each article is planned, but not yet implemented.
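
Building the toc from header records rather than from the paper toc can be sketched as a simple grouping step. The record keys (`year`, `issue`, `page`) and the function name are assumptions for illustration, not the actual PROLA header format.

```python
from collections import defaultdict


def build_toc(headers):
    """Group header records into {year: {issue: [records sorted by page]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for h in headers:
        grouped[h["year"]][h["issue"]].append(h)
    # Emit plain dicts with years and issues in ascending order,
    # and the articles of each issue in page order.
    return {
        year: {
            issue: sorted(articles, key=lambda a: a["page"])
            for issue, articles in sorted(issues.items())
        }
        for year, issues in sorted(grouped.items())
    }
```

Each toc page could then be rendered from one `toc[year][issue]` list, so a corrected header automatically corrects the toc.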

Searching:

Full Boolean text searching, based on the various ASCII versions (mostly troff, TeX, and SGML), is the fundamental method used to search for relevant words in the collection. Fielded searches based on author and/or title, restricted by year or by journal (Phys. Rev. A, B, C, D, E, or Letters), are also enabled. The author search is perhaps the most useful, since it allows authors to check that all of their articles are correctly listed, and it offers a useful form of browsing for related articles. For our use, we have found that stop-words cause more trouble than they are worth. They are particularly nasty for words like "A" and "I", since these are also name initials used to disambiguate authors with the same last name. Stop-words do help during relevance feedback, but the inverse weighting by term frequency is, in our opinion, sufficient to generate adequate precision levels without them.
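
To see why inverse weighting can stand in for a stop-word list, consider one common textbook form of the idea, inverse document frequency. The article does not say which formula the WAIS engine uses, so the formula and the function name below are assumptions; the sketch only shows that a word appearing in every document receives zero weight while still remaining searchable.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return {term: log(N / df)}, where df is the number of docs containing the term."""
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it occurs there.
        doc_freq.update(set(doc.lower().split()))
    return {term: math.log(n / count) for term, count in doc_freq.items()}
```

With three short documents that all contain the word "a", that word gets weight log(3/3) = 0 and contributes nothing to relevance ranking, yet it is still indexed and can be matched as an author initial.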

Retrieving:

Retrieving is analogous to going to the stacks to pull the article you want when you know the exact reference. In an electronic archive, the same result can be obtained by searching with the title or author strings, but it is much more efficient to simply enter the reference in a form and link directly to the correct doc-info page. By using a form for this purpose, we can also check the accuracy of the reference and permute it to determine whether a predictable error in page number, volume number, or journal has been made. We can then suggest alternatives, rather than simply reporting that the reference is in error. We hope this functionality will help authors check their reference lists more carefully for errors.
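
The idea of permuting a failed reference can be sketched as follows. The article does not list which errors PROLA tests for, so the three used here (wrong journal letter, volume off by one, two adjacent digits transposed) are assumptions, as are the tuple layout and the journal code "L" for Letters.

```python
JOURNALS = ("A", "B", "C", "D", "E", "L")  # "L" is a hypothetical code for Letters


def adjacent_swaps(number):
    """Yield the integers formed by swapping two adjacent digits of number."""
    s = str(number)
    for i in range(len(s) - 1):
        swapped = s[:i] + s[i + 1] + s[i] + s[i + 2:]
        if swapped != s and swapped[0] != "0":
            yield int(swapped)


def candidates(ref):
    """Yield plausible corrections of a (journal, volume, page) reference."""
    journal, volume, page = ref
    for j in JOURNALS:
        if j != journal:
            yield (j, volume, page)          # wrong journal section
    for v in (volume - 1, volume + 1):
        yield (journal, v, page)             # volume off by one
    for v in adjacent_swaps(volume):
        yield (journal, v, page)             # transposed volume digits
    for p in adjacent_swaps(page):
        yield (journal, volume, p)           # transposed page digits


def resolve(ref, catalog):
    """Return ("exact", [ref]) on a hit, else ("suggest", existing alternatives)."""
    if ref in catalog:
        return "exact", [ref]
    found = []
    for c in candidates(ref):
        if c in catalog and c not in found:
            found.append(c)
    return "suggest", found
```

Here `catalog` is any set of known `(journal, volume, page)` tuples. An empty suggestion list means none of the assumed error patterns leads to a real article.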

Navigating:

PROLA follows good Web practice by placing navigation bars at the top and bottom of every page. One set of links appears on every page (home, help, search, retrieve, and browse), while some pages have additional links, such as next, previous, thumbnails, and print. Mail-to links in many places let users easily report any problems. This is an invaluable aid in maintaining and improving the system. A full set of usage statistics also helps us judge how well the system is functioning, and lets us objectively measure actual usage.

Images:

To cover the entire collection from 1894 to the present will require more than 1,600,000 images. This number will continue to grow until the APS is ready to define the authoritative archival version of newly published articles to be the electronic version, rather than the page image, as is current practice. When we began this project, more than 1,600,000 seemed like a huge number of images to imagine distributing over the Internet. Now, with the tremendous reduction in storage costs and the rapid growth of the Internet, that number seems easily manageable. With our recent migration to the BoxHill RAID technology (from an earlier tape/robotics system), the speed of image delivery is no longer a problem from our end. As the Web evolves, more and more users will have sufficient bandwidth to make PROLA an instantaneous presence on their desktop. At that point (assuming we can solve the financial problems discussed below), we expect that PROLA will be the preferred method of accessing the archives of the journal Physical Review, and the usage of print-based library collections should decline.

Availability

PROLA has been available for about one year at Los Alamos National Laboratory. During that time, its usefulness and acceptance have been validated. The mechanism for making the system available to the intended larger global audience has not been easy to develop. There is no technical difficulty with global distribution -- it is much more difficult to restrict than to allow access on the Web. Rather, the problem is how to implement a scheme that will generate sufficient financial support to pay for the system without, at the same time, disrupting the existing financing of Physical Review.

The full legal right and responsibility for solving these problems resides with the owner of the material, the publisher, the American Physical Society. The process of moving PROLA from Los Alamos to the publisher's home office should be completed sometime in the summer of 1998. I believe that APS will then make PROLA available on a trial basis as a free add-on to existing subscriptions. A stripped-down version may initially be offered, with the more advanced features added as demand and experience dictate. Building the prototype was challenging and fun, but the real interest is in seeing how this system will evolve and what effect it will have on the flow of information that makes up the science of physics.


[1] The author is the Program Manager at Los Alamos, and can in no way speak for the American Physical Society. APS is currently in the process of a major redesign of PROLA to make it more compatible with their existing systems. Los Alamos's responsibility was to design and implement a prototype system. The system that will eventually be deployed, and how it will be financially supported, is entirely the responsibility of the American Physical Society.

The views and opinions expressed herein are those of the Author and do not necessarily reflect those of Los Alamos National Laboratory (LANL) or the Government.


hdl:cnri.dlib/june98-thomas