Jonathan T. Hujsak
Digital Library Systems Group
D-Lib Magazine, January 1996
Digital libraries represent a powerful new set of tools for preserving, disseminating, and reusing technology on a large scale within distributed corporate organizations. Drawing from disciplines including information retrieval, the Web and library science, digital libraries provide a means for capturing and preserving design, manufacturing and quality information for complex products and systems during their lifecycles.
Our research effort is focused on the utilization of digital libraries for the storage and retrieval of reusable product design data within closed, corporate environments. By fostering technology reuse, digital libraries produce significant returns in the form of cost savings, productivity gains, and quality improvements that are achievable on a large scale in multidivision, multinational corporations. Digital library technologies are directly applicable to many different manufacturing industries including aerospace, defense, biotech, chemical, automotive, consumer electronics, and healthcare.
Corporate digital libraries differ from public digital libraries in the nature of their collections. Corporations today maintain many of their key technology assets in electronic form during the lifecycle of a given product or program. These assets consist of highly structured collections of memos, reports, drawings, manuals, specifications, etc., stored in formats such as page description languages, markup languages, word processor files, spreadsheets and CAD/CAM files. Corporate collections also include highly unusual objects such as telemetry recordings, 3D models, simulation results, x-ray inspection images, etc., generally not found in the public library environment. These electronic forms of design data are readily imported, translated and stored in digital libraries using commercially available data conversion utilities. Corporate digital libraries must also reference a wide variety of on-line services and tools available from internal support groups such as thermal analysis, simulation and modeling, structural analysis, vibration analysis, and a host of others that support the different phases of the design process. They must be able to interface with on-line databases of critical information such as military, environmental, and safety design specifications. Public research library holdings, however, consist largely of hardcopy books and research journals that must be captured, converted and processed into computer readable form, an expensive and time consuming process.
Corporate digital libraries also differ considerably from public digital libraries in how they handle intellectual property. Corporations generally work with their own intellectual property and are free to distribute it internally without legal constraints. Public digital libraries must provide a potentially complicated legal, accounting, and revenue collection infrastructure for compensating intellectual property rights owners and/or holders when their information is made available on-line through licensing arrangements or other fee structures. The compensation itself may vary with the type of user involved, adding still more complexity. Corporations are free of these requirements and need only concern themselves with issues such as the protection of trade secrets and the licensing of technology between internal business units.
Tyrian Corporation was founded in 1987 to develop systems and technologies for preserving and enhancing corporate memory. Its staff includes scientists and engineers drawn from various backgrounds including manufacturing information systems, computer-based training, systems engineering, defense and intelligence-related information systems, and information retrieval research.
In 1993 Tyrian proposed and won a Phase I digital library development contract through the National Aeronautics and Space Administration (NASA) Goddard Space Flight Center in Greenbelt, Maryland (NAS5-38035). This six-month project investigated a system for data mining of software archives across wide area networks and the subsequent cataloging and storage of reusable software components in a digital library. The study produced a concept for a digital library composed of distributed HTTP servers, each backed by a combination of a relational database engine and a vector-space/Boolean information retrieval engine. Authorized users can access the stored resources via conventional Web browsers using dynamically generated HTML pages and forms. A set of data mining tools is used for discovering, classifying, evaluating and cataloging software assets by launching agents that are capable of accessing information over the network. The agents descend file system hierarchies searching for information patterns specified by the human cataloger. Objects matching desired patterns are indexed and summarized for later, more detailed, analysis by human software librarians.
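The agent behavior described above can be illustrated with a minimal sketch. It is not the project's actual tooling: it assumes a locally mounted file system rather than network access, treats the cataloger's "information patterns" as regular expressions, and uses the first few characters of a file as a stand-in for a summary. The function name `crawl` and the sample files are hypothetical.

```python
import os
import re
import tempfile


def crawl(root, patterns, summary_chars=80):
    """Descend the tree under `root` and return one catalog entry per file
    whose text matches at least one cataloger-supplied pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8", errors="replace") as fh:
                    text = fh.read()
            except OSError:
                continue  # unreadable objects are skipped, not fatal
            matched = [c.pattern for c in compiled if c.search(text)]
            if matched:
                hits.append({
                    "path": os.path.relpath(path, root),
                    "matched": matched,
                    # Crude summary (assumption): leading text of the file.
                    "summary": text[:summary_chars].strip(),
                })
    return hits


# Illustrative run against a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "nav"))
    with open(os.path.join(root, "nav", "kalman.c"), "w") as fh:
        fh.write("/* Kalman filter for attitude estimation */\n")
    with open(os.path.join(root, "README"), "w") as fh:
        fh.write("Build notes only.\n")
    catalog = crawl(root, [r"kalman\s+filter"])
```

Each entry in `catalog` records where the object was found, which patterns it matched, and a short summary, which is roughly the information a human librarian would need in order to decide whether a closer look is worthwhile.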
The development of the Phase I digital library included the task of dealing with design dependencies between the various components of a product or system to be stored. Software is invariably a component of a much larger product or system, and the reuse of a software module requires the identification of dependencies between the module and other related hardware/software modules. The digital library must be able to capture and preserve these dependencies for reference by future designers. We addressed this task by generalizing the Phase I architecture to capture and preserve the entire interrelated structure of products or systems of which software is a part.
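One simple way to picture the dependency problem is as a directed graph in which each component points to the components it relies on. The sketch below is an assumption made for illustration, not the Phase I data model: the component names are invented, and the function `reuse_closure` simply collects everything a candidate module depends on, directly or indirectly.

```python
from collections import deque


def reuse_closure(deps, module):
    """Return every component that `module` depends on, directly or
    indirectly, using a breadth-first walk of the dependency graph."""
    seen = set()
    queue = deque(deps.get(module, ()))
    while queue:
        item = queue.popleft()
        if item in seen or item == module:
            continue  # already recorded, or a cycle back to the start
        seen.add(item)
        queue.extend(deps.get(item, ()))
    return sorted(seen)


# Hypothetical dependency records for a spacecraft software module.
deps = {
    "attitude_control_sw": ["star_tracker_driver", "flight_computer"],
    "star_tracker_driver": ["star_tracker_hw", "serial_bus"],
    "flight_computer": ["serial_bus"],
}
needed = reuse_closure(deps, "attitude_control_sw")
```

A designer considering `attitude_control_sw` for reuse would see from `needed` that the module carries hardware assumptions (a particular star tracker and bus) along with it.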
The Phase I project also sought to produce a digital library architecture capable of capturing the empirical side of the design process as well as the analytical. Design changes are sometimes made because they work, not because there is a clean analytical explanation available. For example, the cause of high altitude catastrophic failures of an ICBM in the late 1950's was never revealed by the analysis of downlinked telemetry. The problem was solved empirically at the cost of several vehicles. The reasons behind the design changes were documented only in the notes of the engineers at the time and were not preserved for future design engineers. Years later, design engineers removed one of these early design "fixes" in a modernization effort intended to improve vehicle performance. The result was the loss of a vehicle and satellite worth hundreds of millions of dollars. The Phase I effort developed methods for capturing and representing fragments of design data and will be extended in future releases to allow creation of hypertext annotation layers by successive generations of engineers.
Tyrian is currently working on a 2-year Phase II follow-on effort (NAS5-32821) that will produce a significantly more advanced digital library for NASA project library data. The prototype system will be delivered to NASA in early 1996 and will publish on the NASA intranet (a private, closed version of the Internet) a project library from one of GSFC's current spacecraft efforts.
Our design philosophy is based on embracing three closely related disciplines that all too often are employed independently of one another: information retrieval, library science and hypertext.
The study of information retrieval has produced a number of methods over the last several decades for accessing information stored within digital libraries. Information retrieval systems provide user-centric retrieval methods. The user, through the formulation of complex queries and relevance feedback, imposes his or her own structure on the collection. Information retrieval systems are capable of extracting the underlying concepts representing documents (or document sections) and using them to represent the documents in the search space. They are capable of suggesting to the user alternative terms to search for when the user query terms do not match those used to index a collection. Information retrieval systems can also be used to analyze unfamiliar collections to reveal hidden structure existing within them. All of these capabilities are useful for revealing the relationships between pieces of design information that are tied together by common terminology.
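To make the vector-space idea concrete, the following sketch ranks a handful of documents against a query using TF-IDF weights and cosine similarity. It is a textbook formulation offered under stated assumptions, not a description of the retrieval engine used in the project; the document identifiers and texts are invented, and the smoothed IDF formula is one common choice among several.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Return (vectors, idf) where each document is a dict of TF-IDF weights."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n = len(docs)
    df = Counter()
    for c in counts.values():
        df.update(c.keys())  # document frequency: one count per document
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = {
        doc_id: {t: tf * idf[t] for t, tf in c.items()}
        for doc_id, c in counts.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def search(query, vectors, idf):
    """Return document ids with nonzero similarity, best match first."""
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    ranked = sorted(((cosine(q, v), d) for d, v in vectors.items()), reverse=True)
    return [d for score, d in ranked if score > 0]


docs = {
    "memo-1": "thermal analysis of the battery enclosure",
    "spec-7": "vibration test specification for the antenna boom",
    "rpt-3": "structural analysis report for the antenna boom",
}
vectors, idf = build_index(docs)
results = search("thermal analysis", vectors, idf)
```

Documents sharing more of the query's distinctive terms rank higher, and documents with no terms in common drop out entirely, which is the failure mode that motivates the thesaurus discussion that follows.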
Information retrieval by itself is highly effective if the user has some foreknowledge of the domain of discourse. It is less helpful if the user is unfamiliar with the terminology, multiple domains are represented, or the collection is highly dynamic. Problems have arisen in the design of on-line catalogs which utilize information retrieval methods to match customer requests with large collections of stored product information. The catalog systems fail when customers ask for products using terminology not found in the product descriptions. The solution, especially for highly dynamic domains such as consumer electronics, consists of augmenting information retrieval methods with human-constructed and maintained editorial thesauri. Although commercially available tools do exist for generating computational thesauri, the results in practice have often been found to be too contaminated with non-related terms to be useful. In the future, this situation should improve as new technologies emerge from research efforts in natural language and computational linguistics.
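The thesaurus augmentation described above can be sketched as a query expansion step that runs before retrieval. The entries below are invented for illustration, and the flat term-to-synonyms dictionary is an assumed simplification of what a human-maintained editorial thesaurus would contain.

```python
def expand_query(terms, thesaurus):
    """Add editor-approved equivalent terms to a customer's query terms,
    preserving order and dropping duplicates."""
    expanded = []
    for term in terms:
        key = term.lower()
        for candidate in [key] + thesaurus.get(key, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Hypothetical editorial thesaurus for a consumer electronics catalog.
THESAURUS = {
    "boombox": ["portable stereo", "radio cassette player"],
}

expanded = expand_query(["Boombox", "batteries"], THESAURUS)
```

Here a customer's colloquial term is mapped to the vocabulary actually used in the product descriptions, while terms with no thesaurus entry pass through unchanged.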
Library science has also contributed significantly to our system architecture by providing structured methods for descriptive cataloging of the multitude of documents associated with highly engineered, complex products and systems. For thousands of years, librarians have addressed the problem of making information retrievable, focusing much of their efforts on the task of cataloging. Cataloging is usually performed by librarians who are generalists as opposed to narrow domain experts. Librarians are thus faced with answering the fundamental question of "what is it?" for each object to be cataloged. To answer this question librarians evolved elaborate sets of rules for classifying and cataloging documents. Some of these rule sets have been standardized, such as the Anglo-American Cataloguing Rules, Second Edition (AACR2). Specialized collections, such as those found in a corporate library, often require the development of a collection-specific set of cataloging rules. Even with sets of rules to guide them, however, the cataloging of new materials (original cataloging) is a slow and expensive process.
There are two ways to offset the high cost of original cataloging. Libraries can purchase catalog records in a standard, machine-readable format or employ automated tools for the extraction of cataloging information from original documents to generate the records. Over the years librarians have formed organizations to standardize the interchange of computer-readable cataloging information so that they could share cataloging results. An example of such a standard is the MARC effort administered by the Library of Congress. The MARC record has been easily adapted to unconventional materials such as computer software, multimedia and data files and is being extended in a multitude of ways to represent resources available from on-line services.
Product design documents, however, are usually generated within a closed corporate environment. Cataloging records for these documents are unlikely to be available from outside cataloging sources. This leaves the alternative of automating the cataloging process, the approach taken by this project. Cataloging rules are being incorporated into automated tools that go beyond basic term extraction or phrase recognition to performing rule-driven reasoning about the structure and content of the document. Documents are automatically compared to identified document clusters in the existing collection to suggest areas where the document should be referenced. Information extracted from the document is used to automatically generate and update hypertext library indexes ordered by subject, author, title, product or project, etc. Some of these techniques can be provided to a librarian for use on site. Others require complex reasoning and analysis and are more appropriately delivered by a service bureau where farms of high performance processors can efficiently process large numbers of documents. Service bureaus spread the high cost of this infrastructure across large numbers of customers.
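The index generation step lends itself to a short illustration. The sketch below assumes that cataloging has already produced simple records with `id`, `title`, `author` and `subject` fields (an invented record layout, far simpler than MARC) and shows how browsable hypertext indexes might be regenerated from them. The `/doc/` URL scheme is likewise hypothetical.

```python
import html
from collections import defaultdict


def build_indexes(records, fields=("subject", "author", "title")):
    """Group document ids under each heading, one index per field."""
    indexes = {f: defaultdict(list) for f in fields}
    for rec in records:
        for f in fields:
            values = rec.get(f, [])
            if isinstance(values, str):
                values = [values]  # allow single-valued fields
            for v in values:
                indexes[f][v].append(rec["id"])
    # Sort headings and ids so regenerated pages are stable.
    return {
        f: {k: sorted(ids) for k, ids in sorted(idx.items())}
        for f, idx in indexes.items()
    }


def render_html(field, index):
    """Render one index as a simple HTML list of links."""
    lines = [f"<h2>Index by {html.escape(field)}</h2>", "<ul>"]
    for heading, ids in index.items():
        links = ", ".join(
            f'<a href="/doc/{html.escape(i)}">{html.escape(i)}</a>' for i in ids
        )
        lines.append(f"<li>{html.escape(heading)}: {links}</li>")
    lines.append("</ul>")
    return "\n".join(lines)


# Hypothetical catalog records.
records = [
    {"id": "D-002", "title": "Radiator Panel Thermal Model",
     "author": "Lee", "subject": ["thermal", "structures"]},
    {"id": "D-001", "title": "Battery Thermal Analysis",
     "author": "Ortiz", "subject": "thermal"},
]
indexes = build_indexes(records)
page = render_html("subject", indexes["subject"])
```

Because the indexes are derived entirely from the records, they can be rebuilt whenever a document is added or recataloged, with no hand editing of the pages.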
The hypertext component of our system architecture is drawn directly from the Web. Future releases will include a number of capabilities emerging from the current W3C working groups. In particular, annotation capability will be added as a means of continually enriching the collection. As pieces of technology are reused for different purposes, "annotation sets" (ref. Terry Winograd and the Stanford University Integrated Support Services for Digital Libraries
Digital libraries will probably never be an off-the-shelf shrink-wrapped technology. Digital library providers must be prepared to quickly adapt their product configurations to a wide variety of industry requirements. The structure of a library is highly dependent on the specific type of information it contains and the access methods that are required for retrieving the information. A biotechnology firm seeking to manage large collections of clinical information and research results would structure the information largely to facilitate reporting requirements for the FDA approval process. An auto manufacturer seeking to store product design information, however, would structure its information to facilitate reuse by internal design groups.
Corporations implementing digital libraries are often faced with the initial problem of organizing large existing collections of historical or legacy data. At the end of a product or project lifecycle, materials that are preserved are often left in a disorganized heap. The process of organizing the material often overwhelms the limited resources of corporate libraries and requires the assistance of outside archivists to help define a structure for the information and hence the design for the digital library. Digital library vendors must be prepared to support their customers in organizing existing collections and participate in the iterative process of designing a library structure to hold them.
Once a library structure has been defined for the collection, the materials must be captured and converted. This process bears some resemblance to document automation. In digital library applications targeted at capturing product design information, however, document formats have much greater variation in size, format, logical type and content. In addition to standard document page formats, there are many different sizes and configurations of engineering documents such as drawings, strip charts, data plots, color photos, screen dumps, briefing slides, artwork and so on. In many cases the bitonal TIFF/CCITT-G4 file format of document automation is entirely inadequate for preserving the information. Color encoding of captured graphics data may require up to 24 bits per pixel to preserve the original content. The conversion of existing hardcopy documents into full text searchable form requires segmentation of documents into text and graphics, OCR/ICR of the text segments, and reintegration of the document into a hybrid, searchable form. Eventually captured and converted materials must be classified and cataloged in accordance with the structure of the library.
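The storage implication of 24-bit color can be seen with some simple arithmetic. The sketch below assumes a letter-size page scanned at 300 dots per inch; both figures are illustrative choices, not values taken from the project. It compares uncompressed sizes only, so the real gap is wider still once CCITT Group 4 compression is applied to the bitonal image.

```python
def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return (pixels * bits_per_pixel + 7) // 8  # round up to whole bytes


# Assumed scan parameters: 8.5 x 11 inch page at 300 dpi.
bitonal = raw_image_bytes(8.5, 11, 300, 1)     # 1 bit per pixel
full_color = raw_image_bytes(8.5, 11, 300, 24)  # 24 bits per pixel
ratio = full_color // bitonal
```

Under these assumptions a single color page occupies about 25 MB before compression, against roughly 1 MB for the same page captured as a bitonal image.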
The capture, conversion, classification, and cataloging of large existing collections requires a significant investment in infrastructure, personnel, and training. Implementing such an infrastructure for an initial loading phase of a digital library is typically not cost effective for most organizations. Outsourcing these functions to service bureaus that spread the cost of the infrastructure over many different customers is often a more practical approach.
Digital libraries offer potential solutions for managing quality data and reusable product information for emerging standardization efforts such as STEP, CALS and ISO9000.
The STEP (Standard for the Exchange of Product Model Data) effort, backed by the international ISO 10303 standard, seeks initially to establish standard ways of preserving and exchanging product data that will foster intra-organizational processes and inter-organizational relationships. The standard is intended to provide a neutral format to which design engineering tools can interface. The resulting infrastructure will allow engineers from different disciplines such as mechanical, electrical, software, and manufacturing to interact in the design process. One of the key players in the STEP arena, the National Initiative for Product and Data Exchange, has suggested the possibility of a Product Data Exchange Highway that ties together all aspects of product design, analysis, manufacturing and support processes. Digital libraries could potentially provide information repositories located on such a highway that preserve and maintain current and historical design data.
The DoD CALS (Continuous Acquisition and Lifecycle Support) program represents a comprehensive strategy directed at accelerating the transition from paper-based product development, design, manufacturing, and support to a highly integrated, automated system based on electronic document management. As with STEP, CALS is working towards standardized forms for product data exchange. The program is also actively pursuing automated systems for the management of product lifecycle data. The federal government has strong motivations for pursuing the CALS objectives. The United States Navy is tasked with maintaining over 237 million drawings and over 15 million technical manuals for its various systems. The annual cost of this maintenance is approximately $4 billion. Here, again, is a place for digital libraries in the management of large collections of current and historical design data and the dissemination of the data throughout a large worldwide enterprise.
Compliance with the ISO9000 quality standard is currently a high priority among many different manufacturing enterprises. One of the basic tenets of ISO9000 is the recording and preservation of quality data and reports during the product lifecycle. Quality data is simply another component of the overall lifecycle record of a product or system. Digital libraries can potentially be used not only as a recording mechanism but also as a reporting mechanism for achieving and maintaining ISO9000 certification.
Clearly, digital libraries will play an important part in future standardization efforts through their ability to capture, preserve, and make accessible the complete lifecycle record of products and systems.
In this article, we have attempted to show that digital libraries have significant potential for the capture, storage, and dissemination of reusable product design data. Corporate digital libraries are far easier to implement than their public counterparts and are free from the burden of managing complex intellectual property issues. They produce significant rewards in terms of cost reduction, productivity improvement and quality enhancement. As traditional corporate libraries begin implementing digital library systems, the librarians who operate them will find themselves with an important new role as guardians of corporate memory within the distributed corporate environment.