Edward A. Fox, John L. Eaton, Gail McMillan,
Neill A. Kipp, Laura Weiss, Emilio Arce, and Scott Guyer
Project Director: Edward A. Fox
D-Lib Magazine, September 1996
As of September 1, 1996, the U.S. Department of Education provided grant support for a three-year, Virginia Tech-led project to Improve Graduate Education with a National Digital Library of Theses and Dissertations (NDLTD), adding to 1996 funding from the Southeastern Universities Research Association (SURA) for Development and Beta Testing of the Monticello Electronic Library Thesis and Dissertation Program. True success in these projects will potentially mean a permanent change in graduate education and scholarly publishing, with digital libraries playing a more dominant role in supporting and disseminating research.
This article serves as an overview of the project, indicating what benefits are likely, what roles various partners (including, we hope, you, the reader) may play, and what related work has occurred in the past. It is also an invitation to universities to unlock their resources in connection with this collaborative project.
If many in the international community join in, the project could lead to a multilingual corpus of vast proportion and significance. Our collection focus is on doctoral dissertations and master's theses, so we will repeatedly refer to TDs (theses and dissertations) or ETDs (electronic theses and dissertations). However, we also welcome special reports (especially those prepared by graduate students) and bachelor's theses. Since there are over 40,000 doctoral and 360,000 master's degrees awarded in the U.S. alone each year, and since our aim is for all graduate students to learn how to publish electronically, the annual growth rate of the collection could exceed 100,000 new works per year by the turn of the century. If there is a fair amount of multimedia content included, as we expect will be the case, the collection might increase in size at the rate of about a terabyte each year.
The NDLTD should help almost everyone, and so, through broad cooperation and participation, should be a sustainable effort. Let's take a moment to consider its likely effects on the key parties involved: students, universities, the research community, and the publishing world.
Our project is primarily an effort to improve graduate education. We will work so that graduate students become information literate, learning how to become electronic publishers and knowing how to use digital libraries in their research. The overall process is shown in our diagram of the Life Cycle of an ETD. Toward this end we continue to develop written documents, extensive WWW materials, and a distributed education and evaluation program in which universities accept responsibility for local support.
With access to the NDLTD, graduate students will be able to find the full texts of related works easily, to read literature reviews prepared by their peers, and to follow hypertext links to relevant data and findings. Their professors will be able to point to the best examples of research in their area, even to the level of an interesting table, an illustrative figure, or an enlightening visualization. Also, students can benefit by learning how to become electronic publishers, preparing them for their future work. Since this educational initiative targets all graduate students, it is unique in its potential to train future generations of scholars, researchers, and professors. If they can publish electronically and add to digital libraries, future works they write will not have to be scanned or re-keyed.
Graduate students also may be empowered to be more expressive as they prepare their submissions for the NDLTD, if such is allowed by their committee, department, and university. Some students have already prepared hypertexts as literature, included color images or graphics, illustrated concepts with animations, explained processes with video, or used audio when dealing with musical studies. One masters project about training students to use Authorware included an Authorware tutorial in the appendix. This has already helped people in South America learn more about multimedia technology.
Access begets access, so having more graduate works in the NDLTD is likely to stimulate greater interest in theses and dissertations (TDs). Studies at Virginia Tech of the average number of times a paper TD circulates per year indicate a steady growth from 0.55 to 0.85 circulations between years 2 and 4 for dissertations, with a roughly parallel line for theses reflecting an increase from 0.4 to 0.68 circulations over the same period. Based on the increases we have seen in numbers of accesses to electronic journals as they became available on WWW, we expect that there will be a dramatic increase in the average number of accesses to TDs when they shift from paper to NDLTD availability. Most students are eager for their works to be cited, and we plan for our monitoring and evaluation system to record such accesses. With bibliographies on-line too, a citation index among NDLTD entries will be possible as well, helping students keep track of new studies related to their investigations.
Finally, students are likely to benefit financially from the NDLTD. Publishing electronically should save them the costs of preparing at least some of the paper copies now required. There also may be lower fees from their university and other parties for filing their final copy.
Few universities have a university press, and many of those are not profitable. Yet, through the NDLTD, every university can publish the results of its graduate students with a minimal investment. This should increase university prestige, and interest outsiders in the research work undertaken.
University libraries can save shelf space that would otherwise be taken up by TDs, and the costly handling of paper TDs by personnel in the graduate school and library can be reduced or eliminated. At Virginia Tech, for example, the catalogers decided to have students assist with cataloging by adding keywords to the cover page, thus reducing processing in the library. It seems likely that at least one person in each large university can be freed to work on other tasks if proper automation takes place, resulting in simplification of the work flow related to TDs. In addition, library on-line catalogs can provide fuller information by including the abstract from the electronic text.
Student research should be aided by the NDLTD since graduate students will have a single repository for the work of their peers, supported by full-text search. Other researchers, including people in companies interested in opportunities for technology transfer, can look to the NDLTD as a way to quickly learn of new findings.
Suppliers of electronic publishing software already have found it valuable to participate in the NDLTD. Adobe is making generous donations of their Acrobat software, in part because they realize that having all graduate students exposed to the Acrobat line of tools will ensure a large base of future users. Associations like ACM (the First Society in Computing) are supportive, in part because having their members learn to publish electronically in graduate school can help reduce the anticipated cost of shifting to electronic publishing, when authors will be expected to submit final copies of acceptable articles in proper forms (e.g., using SGML) for publication and electronic archiving.
Indeed, it is likely that future shifts in publication practices will be facilitated by the effort to build the NDLTD. This is of particular interest to universities, which now cannot control what happens to the research publications they support, and later spend large sums to buy back research publications from commercial publishers. Through the NDLTD, universities can control one important class of the intellectual property they produce, and can share it freely with other universities to reduce overall costs.
Since almost everyone stands to benefit from the development of the NDLTD, we encourage you to help in this process in a way that fits the mission of your institution. For example, if you are engaged in the development of software or systems for digital libraries, or helping with standards efforts, you can help directly with building the NDLTD. If you are at a university, you can help build local consensus and devise a supporting infrastructure so the NDLTD is a key part of graduate education.
The NDLTD presents unique challenges on many fronts, and help is needed in various technical areas. On the one hand, it is desirable for graduate students to be expressive, using multimedia representations, but this can lead to very large works, even when compression is used. While we observe that many ETDs require only on the order of a megabyte, we expect that with images and other media forms, the average size will approach 5-10 megabytes. A single video file can consume one or two orders of magnitude more space; it is fortunate that a computer system with two terabytes of hierarchical storage is available at Virginia Tech to support this project! But even more important will be good software to undertake content analysis of multimedia information. Other software is needed to help with electronic publishing, and other aspects of digital library operations.
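The size figures quoted here can be cross-checked against the earlier projection of 100,000 new works per year with a little arithmetic. The following is only a rough sketch: the function name is our own, and decimal units (1 TB = 1,000,000 MB) are assumed for simplicity.

```python
# Back-of-envelope estimate of yearly collection growth.
# Assumption: decimal units, 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def annual_growth_tb(works_per_year: int, avg_size_mb: float) -> float:
    """Return yearly collection growth in terabytes."""
    return works_per_year * avg_size_mb / MB_PER_TB


works_per_year = 100_000  # projected annual submissions quoted in this article

# 1 MB is the text-only case; 5 and 10 MB bracket the expected average.
for avg_mb in (1, 5, 10):
    growth = annual_growth_tb(works_per_year, avg_mb)
    print(f"{avg_mb:>2} MB average -> {growth:.1f} TB/year")
```

At the upper end of the expected range (10 megabytes per work), the estimate comes to about one terabyte per year, which agrees with the growth rate suggested earlier.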
Standards also are essential for the success of the NDLTD. If the archive will last for decades, hopefully centuries, its content must be usable many years after publication. If authors work with standard representations, those are more likely to be understood than are representations that are unpublished and proprietary in nature. If the number of standards supported is kept to a minimum, there will be less work in refreshing the archive as technologies and standards change, calling for conversion to more modern storage and representation schemes.
The NDLTD must operate as a production service if it is to replace current library approaches to handling TDs. Thus, reliable, commercially supported digital library systems are needed for long term success. Companies like University Microfilms International (UMI), IBM, and Online Computer Library Center (OCLC) are participating in the unfolding of the NDLTD. For example, IBM donated a large SMP computer that will serve as the central host for this effort and can run IBM digital library software. Various IBM products for handling databases, image collections, searching on image content, and rights management have great potential for helping with the NDLTD.
At universities, while moving toward the NDLTD is clearly advantageous, such a shift requires many changes in policies and practices. The best way to accomplish this seems to be to develop a local plan, with guidance from key staff in the graduate school(s), the library, and the computing or information technology operations, as is illustrated in our diagram of ETD Site Implementation. Then, workflow changes and automation opportunities are likely to be grasped and become practice. With leadership from these three groups, students and faculty can be consulted and involved in detailed planning. It appears likely that a transition period of about a year is required to effect the change from introduction of concept to widespread acceptance and participation in the NDLTD. Note that real benefits of workflow improvement and universal access to online graduate research require a nearly complete shift to electronic submissions of all TDs.
If every graduate student is to submit an ETD, enhancement to the campus infrastructure is required in most institutions. Usually, this is more a matter of will and coordination than large expense, and most would agree that the end result is highly desirable. Let us consider several possible scenarios.
First, if Adobe's Portable Document Format (PDF) is the target representation, most PC, Mac and Unix systems can run the software required to prepare PDF files. Though there are minor complexities related to fonts and special formatters like LaTeX, these can be worked out, and the investment by students or labs in Adobe software is not high (e.g., about $40 per copy of Exchange).
Second, if SGML is the target representation, there are various solutions. One is to use a standard editor, inserting tags, much like what is done by many HTML authors. While possible, the number of tags (see our illustration of the Parts of an ETD) needed makes this cumbersome. Thus, it is better to use an SGML editor, but those are expensive. Microsoft is assisting with the investigation of the SGML Author extension to Word as an appropriate tool, which could be made available in small numbers in campus labs. Virginia Tech is working on conversion software and templates to enable students to use preferred environments like LaTeX, and to automatically make a 100% accurate conversion to SGML.
Third, there is the question of images. Since many TDs have some type of artwork, color scanners with high quality capture capability at 600 dpi must be available, along with computers, adequate RAM and disk storage, software (e.g., Adobe Photoshop), technical assistance, and network transfer capabilities to move the results to locations students can more easily access.
Finally, there is the higher end requirement of supporting special multimedia forms. Special systems for audio and video capture and compression are required for these media types. Note, however, that if there are no special multimedia laboratories available on campus, students can pay for such services themselves.
Though the NDLTD is new to many readers, work on it actually began late in 1987! A brief history is in order.
Nick Altair, then at UMI, who had recently worked on the Electronic Manuscript Project, convened a meeting in 1987 in Ann Arbor, Michigan. Representatives from University of Michigan, ArborText, SoftQuad, and Virginia Tech participated.
Soon after, Yuri Rubinsky of SoftQuad worked with Virginia Tech to develop the first SGML Document Type Definition (DTD) for TDs. (This was only revised in 1996, in connection with recent efforts supported by SURA - see below.) Virginia Tech continued work in 1989 and 1990, experimenting with conversion of TDs that were obtained on diskette from student volunteers.
In 1992, the Coalition for Networked Information sponsored a project discovery workshop with 11 invited universities, each of which had documented the interest of their graduate school, library and computing/information technology groups. This meeting was planned by representatives of UMI, Council of Graduate Schools, and Virginia Tech. Subsequently, a number of further discussions were held at CNI meetings. In connection with one of these, representatives from UMI and Virginia Tech visited Adobe, to learn about plans for the Adobe Acrobat family of tools.
In 1993, SURA and SOLINET (Southeastern Libraries Network) joined forces to work toward the Monticello Electronic Library. At the first open meeting, Edward Fox of Virginia Tech was invited to give a presentation, re-introduced the idea of the ETDs, and subsequently became co-chair of the working group on theses, dissertations, and technical reports. There was widespread interest in this and subsequent meetings, and University presidents saw the potential benefits as well at various SURA discussions. Consequently, a group of interested universities sent representatives to a workshop at Virginia Tech in August 1994, hoping to develop specific plans for ETDs. One key decision from that meeting was to work toward a dual representation scheme. Thus, two copies of each TD would be archived, one using Adobe PDF and the other using SGML. Virginia Tech and UMI agreed to explore the SGML conversion problem in more detail. Virginia Tech began to convert some of the TDs it received to PDF.
Late in 1995, Virginia Tech prepared a pre-proposal to the U.S. Department of Education regarding a three-year effort to build the NDLTD, and also requested that SURA fund initial work on establishing a part of the Monticello Electronic Library for ETDs for the Southeast. The first of these led to funding beginning September 1, 1996, and the latter covered calendar year 1996 pilot efforts in the Southeast. North Carolina State University was the first institution seeking to join the initiative, and initial electronic submissions are expected there in October. The first regional workshop for Southeastern universities was held August 1-2, 1996, hosted by University of North Carolina, Charlotte. Many discussions have been held, and presentations given, in the region, nation, and even internationally. There appears to be interest at such institutions as Auburn, Clemson, Georgia Tech, Michigan State, Mississippi State, MIT, Oklahoma State, University of Georgia, University of Utah, University of Virginia, and Vanderbilt.
Interest in ETDs has continued and spread since 1987.
While SGML has always seemed the logical choice for an archive of TDs, there have been serious technical and economic problems that have delayed its usage. First, few graduate students had heard about SGML, and it seemed unlikely that we could educate them about it. However, with the growth in interest in HTML, this problem has been largely eliminated. Second, there are few freely available editors for SGML. While this continues to be the case, discussions are underway with a number of vendors to work out economically feasible solutions in the context of the NDLTD. Third, there has been the problem of how to find an acceptable DTD that would be suitable for authors, technically sound, and could be adopted nationwide. We believe we have solved this problem - see the DTD and related documentation at our WWW site (http://etd.vt.edu/etd/). While it may evolve as comments are received, we hope some version of it will be universally adopted so that TDs are tagged to facilitate searching and formatting. In particular, we have developed software to convert from documents prepared according to the DTD to HTML (for WWW delivery - see our illustration of the ETD Hyper-Text Structure) or LaTeX (for rendering to paper or page images).
Finally, there is the outstanding problem of conversion from word processors and formatters to SGML. We are developing a set of LaTeX macros to ensure reliable conversion from LaTeX to SGML. Similar efforts are planned for Microsoft Word, but may be simplified if SGML Author for Word will fit into the plans.
In the last several years, PDF has matured and been more widely supported, with freeware tools like xpdf aiming to round out the ability to read such documents on UNIX systems. Any computer with Adobe Exchange can write to the PDFwriter instead of a laser printer, and create a PDF file. PostScript files can be converted to PDF using Adobe Distiller. Since almost every tool used in document creation can either work with the PDFwriter or yield a PostScript file, electronic publishing to PDF is relatively straightforward and can be taught during a 1-2 hour training session.
One technical problem with PDF that troubled our early efforts was solved in 1996. That is, there are publicly available outline fonts that allow authors who work with LaTeX to prepare PDF files without including bitmap fonts (which increase file size, make display and reading on screen difficult, and restrict text searching options). We are developing automated services on Sun systems to allow authors to prepare PDF files with the proper outline fonts included.
Automation is the key to increasing efficiency in handling TDs. The Library and Graduate School have completely redone the flow of work at Virginia Tech so as to eliminate steps and carryovers from the world of paper. For example, authors now are encouraged to submit single-spaced documents, which are easier to read on the screen than double-spaced documents. Authors assign keywords to their documents, since catalogers have trouble assigning categories to new works like TDs. Authors directly upload their submissions to a library server, where the graduate school can check for proper form; there is no longer a need to deliver to the graduate school and have them move completed works to the library.
Central to our automation is a WWW submission page, which is filled in by the author, and leads to uploading and archiving of the TD. This operation includes students authorizing the university to handle access (non-exclusively) to their works, classifying the work (e.g., thesis or dissertation), and providing email information about them and their chair (so completion of processing can lead to automatic notification).
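The information gathered by such a submission page can be pictured as a small record that is checked before archiving and then used to build the automatic notification. The sketch below is purely illustrative: the field names, the validation rules, and the `example.edu` addresses are our assumptions, not a description of the actual Virginia Tech form.

```python
from dataclasses import dataclass
from email.message import EmailMessage

VALID_TYPES = {"thesis", "dissertation"}


@dataclass
class Submission:
    """Hypothetical record of what an author enters on the submission page."""
    author_name: str
    author_email: str
    chair_email: str
    work_type: str        # "thesis" or "dissertation"
    title: str
    access_granted: bool  # author authorizes the university to handle access


def validate(sub: Submission) -> list[str]:
    """Return a list of problems; an empty list means the form is acceptable."""
    problems = []
    if sub.work_type not in VALID_TYPES:
        problems.append(f"unknown work type: {sub.work_type!r}")
    if not sub.access_granted:
        problems.append("author has not authorized the university to handle access")
    for label, addr in (("author", sub.author_email), ("chair", sub.chair_email)):
        if "@" not in addr:  # deliberately crude check, for illustration only
            problems.append(f"{label} email looks invalid: {addr!r}")
    if not sub.title.strip():
        problems.append("title is empty")
    return problems


def completion_notice(sub: Submission, sender: str) -> EmailMessage:
    """Build (but do not send) the notice issued when processing completes."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = sub.author_email
    msg["Cc"] = sub.chair_email
    msg["Subject"] = f"ETD processing complete: {sub.title}"
    msg.set_content(
        f"Dear {sub.author_name},\n\n"
        f"Processing of your {sub.work_type} has been completed.\n"
    )
    return msg
```

Keeping validation separate from notification means the same checks could be run again by graduate school staff when they review the uploaded files for proper form.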
When SGML submissions are easily accomplished, they will be the basis for a variety of derivatives. One is the HTML version already mentioned. Another is the MARC record needed for cataloging. Third is the entry for Dissertation Abstracts. Once these can be produced, the submission process will be simplified even further.
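To illustrate the idea of deriving several products from one structured source, here is a minimal sketch that starts from a dictionary standing in for the tagged front matter of an ETD. The record contents are placeholders and the field names are hypothetical; they are not taken from the project's DTD. The MARC-like output uses real tag numbers (100 author, 245 title, 502 dissertation note, 520 summary, 653 uncontrolled index term) but omits indicators and subfields, and the abstract entry is a simplified stand-in rather than the actual Dissertation Abstracts format.

```python
from html import escape

# Placeholder record; every value here is invented for illustration.
record = {
    "author": "Doe, Jane",
    "title": "A Sample Study of Widgets & Gadgets",
    "degree": "Ph.D.",
    "institution": "Example University",
    "year": 1996,
    "abstract": "A placeholder abstract used only for illustration.",
    "keywords": ["widgets", "gadgets"],
}


def to_html(rec: dict) -> str:
    """Render a simple HTML title page, escaping special characters."""
    keywords = ", ".join(escape(k) for k in rec["keywords"])
    return (
        f"<html><head><title>{escape(rec['title'])}</title></head><body>\n"
        f"<h1>{escape(rec['title'])}</h1>\n"
        f"<p>{escape(rec['author'])}</p>\n"
        f"<p>{escape(rec['abstract'])}</p>\n"
        f"<p>Keywords: {keywords}</p>\n"
        "</body></html>"
    )


def to_marc_like(rec: dict) -> list[tuple[str, str]]:
    """Return (tag, value) pairs loosely modeled on MARC bibliographic fields."""
    fields = [
        ("100", rec["author"]),
        ("245", rec["title"]),
        ("502", f"{rec['degree']}, {rec['institution']}, {rec['year']}"),
        ("520", rec["abstract"]),
    ]
    fields += [("653", keyword) for keyword in rec["keywords"]]
    return fields


def to_abstract_entry(rec: dict) -> str:
    """Return a plain-text entry of the kind an abstracting service might use."""
    return (
        f"{rec['author']}. {rec['title']}. "
        f"{rec['degree']}, {rec['institution']}, {rec['year']}.\n"
        f"{rec['abstract']}"
    )
```

Because all three functions read the same record, a correction made once by the author would propagate to every derivative, which is the main attraction of working from a single tagged source.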
Since Spring 1996, there have been a variety of workshops to train students regarding electronic publishing (using PDF, tools like Word and Exchange, LaTeX) and the automated submission effort. By holding events every few months, handling email questions, making special visits to interested groups, and providing on-line FAQ files, the needs of graduate students are being addressed.
The Faculty Development Initiative at Virginia Tech involves training the entire faculty over a four year period about electronic publishing, workstations, networked computing, and educational technologies. A regular part of the FDI is for faculty to learn about Adobe Acrobat and the handling of ETDs - thus over 600 have been trained about this initiative. Others have been exposed in College meetings, through newspaper explanations, or through the open workshops.
In Spring 1996, the Commission on Graduate Affairs agreed to require ETDs at the start of 1997. Thus, all students will prepare an electronic submission, and the Library and Graduate school will not accept or receive paper submissions. This is a serious plan! We hope that after months or perhaps a year of working with the NDLTD, other universities will follow this scheme, so that students really will learn how to publish electronically and use digital libraries.
Development of the NDLTD fits in with other digital library and other electronic publishing efforts. Some of the most closely related ones are as follows.
Beginning in 1992, with the Wide Area Technical Report Service (WATERS), Virginia Tech has been involved in digital library efforts related to computer science technical reports. In 1995, the WATERS group joined the CSTR group to form the Networked Computer Science Technical Report Library (NCSTRL). Virginia Tech is a regular member. Fox is a member of the NCSTRL Working Group, and the NCSTRL backup server runs in the Virginia Tech Computing Center.
The software used with NCSTRL is available for use with NDLTD and can support a distributed system including situations in which UMI and Virginia Tech, for example, serve sites that do not wish to maintain their own servers.
NCSTRL is one of the early adopters of the CNRI handle system. Virginia Tech has obtained permission for a top-level naming authority for theses and will run a local handle server for TDs so that persistent names can be guaranteed.
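The value of persistent names can be shown with a toy resolver: a handle pairs a naming authority with a local name, and only the resolver's table changes when a document moves. This is an in-memory sketch of the concept, not the CNRI handle protocol; the class, the naming authority `example.theses`, and the URLs are all hypothetical.

```python
class HandleServer:
    """Toy stand-in for a local handle server (illustration only)."""

    def __init__(self, naming_authority: str) -> None:
        self.naming_authority = naming_authority
        self._locations: dict[str, str] = {}

    def register(self, local_name: str, url: str) -> str:
        """Assign a persistent handle to a document and record its location."""
        handle = f"{self.naming_authority}/{local_name}"
        self._locations[handle] = url
        return handle

    def relocate(self, handle: str, new_url: str) -> None:
        """Update the stored location; the handle itself never changes."""
        if handle not in self._locations:
            raise KeyError(handle)
        self._locations[handle] = new_url

    def resolve(self, handle: str) -> str:
        """Return the current location for a handle."""
        return self._locations[handle]


server = HandleServer("example.theses")
handle = server.register("etd-0001", "http://old.example.edu/etd-0001.pdf")
server.relocate(handle, "http://new.example.edu/archive/etd-0001.pdf")
print(handle, "->", server.resolve(handle))
```

A citation that records the handle rather than the URL continues to work after the move, which is the property the archive needs if its contents are to remain usable for decades.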
Since 1991, Virginia Tech has worked with ACM and others to develop a prototype digital library for computer science and to apply it to improve related educational efforts. Some of the software developed may be of use for NDLTD. The methods and tools used for monitoring WWW use and analyzing that data will be a part of the evaluation component for NDLTD.
IBM has collaborated with Virginia Tech in several ways regarding digital libraries. The central server for the NDLTD will run IBM digital libraries software. Where possible, local development will be reduced when commercially available solutions apply.
For digital libraries to be successful, they must be sustainable, scalable and usable. With a world-class Center for Human-Computer Interaction at Virginia Tech, and with a Department of Computer Science whose main focus is HCI, working toward a usable system will be an ongoing and central concern for our efforts. Usability labs and research in remote usability evaluation should help our efforts, as will related projects for WWW monitoring and analysis. So, we turn our attention to the other two legs of successful digital libraries, starting with sustainability.
Every university with a graduate program is obliged to deal with TDs and to ensure that graduate students are properly educated. As argued above, the NDLTD is in the best interest of students and universities. Thus, to carry out the mission of educating graduate students and handling their TDs, universities should ensure that they know how to publish electronically and how to use digital libraries, which can be accomplished most efficiently by joining the NDLTD effort.
Similarly, many university libraries and/or archives have assumed the responsibility of having copies of works written by local faculty, staff and students. This has been a particularly strong tradition in the arena of theses and dissertations. On many campuses the library is committed to maintaining such works indefinitely, which fits into the long term goals of the NDLTD.
Universities support students in their roles of publishers and researchers. Having the right infrastructure to support local involvement in the NDLTD fits in with the general type of support that universities need to provide.
Because they save the costs of copying and submitting paper versions of their TDs, we believe students have an economic incentive to participate in the NDLTD. Similar savings are expected for universities, in particular their graduate school and library. Since students still will provide some payment to their university when submitting the TD, there is an economic foundation for continuing the effort as a self-sustaining enterprise.
The NDLTD effort is scalable by its very nature. First, it builds upon a system of higher education (in the United States) that has demonstrated its ability to scale to meet needs throughout the twentieth century. Second, it makes use of technology that is modular and distributed, and which is addressing needs of a growing number of computer science departments. Further, this effort piggybacks upon other normal activities of universities, relating to education, scholarly communication, and libraries - each of which demonstrates a fair degree of scalability.
Each university has responsibility for its own TD collection, but can handle that locally or assign it to others. At the level of a university the problems are not terribly large - even if a thousand ETDs are submitted in a year, the disk space required to store them probably would cost less than $3,000.
In some cases, there are statewide consortia for library information sharing. Thus, the VIrtual library of VirginiA (VIVA) initiative could allow for a statewide coordination of part of the NDLTD, supporting the needs of small universities where running suitable servers is not warranted.
As in the case of the Monticello Electronic Library, having a regional consortium for NDLTD is quite sensible and feasible. There are parallel groups to SURA, SOLINET, and the Conference of Southern Graduate Schools in other regions of the U.S.
In the U.S., the NDLTD represents the national effort. Researchers in other countries like Korea are looking into a similar effort in their country that would connect with NDLTD.
For NDLTD to be successful, there must be long term support. UMI already has an archive of 1.3 million TDs, in microform, and is willing and able to provide long term electronic archive services. Other groups also are interested in this opportunity. Negotiations between universities and UMI are needed to work out the proper arrangement for all parties, in the context of the growth of NDLTD.
Future work on the NDLTD is laid out in the proposal to the U.S. Department of Education, which is included in PDF form on the WWW pages for the project. Collaboration with UMI is expected on all fronts. Some of the other highlights are as follows.
The NDLTD effort will involve collaboration with the Cornell Digital Library Research Group, which has developed the software used with NCSTRL, and with CNRI, which has developed the handle system and other digital library services. There also is collaboration with IBM regarding their digital library systems and software. OCLC has promised support from its Office of Research, especially regarding useful tools. Other collaboration will take place in the context of electronic publishing work, such as with Adobe.
The NDLTD has support from many groups interested in universities, graduate education, libraries and networked information. There will be close coordination with the national and regional graduate school groups, presentations supported by CNI, and of course ongoing work with SURA and SOLINET, as well as similar associations in other regions.
Since we aim to improve graduate education, we must afford equal access and undertake a careful evaluation. A detailed evaluation plan is given in the proposal, to include surveys, logging, focus groups, and other efforts. Usability studies will help with detailed analysis and improvements of interfaces. It is expected that about one-third of the project will relate to evaluation issues, both formative and summative.
Thus, we hope to not only develop a large and valuable digital library to support graduate education and research, but also to show that it has proved to be of benefit, and that graduate students indeed know how to publish electronically and how to use digital libraries.
The U.S. Department of Education's Fund for the Improvement of Post Secondary Education (FIPSE) supports NDLTD. Authorized funding for the first year is in the amount of $69,762. Anticipated future funding for years 2 and 3 is $69,337 and $68,941, respectively. If all federal funding is received as planned, the total would be $208,040. Virginia Tech will provide institutional support, which gives federal/nongovernmental percentages of 53.3/46.7. Additional in-kind support for the FIPSE proposal has been promised by: ACM, Adobe, Council of Graduate Schools, Coalition for Networked Information, Cornell Digital Library Research Group, Council of Southern Graduate Schools, IBM, OCLC, State Council of Higher Education for Virginia, SOLINET, SURA, UMI, and University of Utah Graduate School.