An Integrated System for Distributed Information Services

George H. Brett II
Internet Consultant, Boulder Public Library
Instructor, University of Colorado at Boulder
[email protected]

D-Lib Magazine, December 1996

ISSN 1082-9873

Abstract
1.0 Introduction
2.0 The Distributed Information Services Model
3.0 Process Elements and Interactions
4.0 DIPP Taskforce Work Plan
5.0 Summary

Abstract

There are many ways to publish information on the Internet: through the World Wide Web, ftp archives, or on-line databases. However, converting or transferring content from the point of origin (e.g., a word processing file on a desktop computer) to the point of dissemination (e.g., a popular World Wide Web page) has so far not been easy to do, much less simple to replicate. The University of Colorado at Boulder, in conjunction with Enfo (a telecommunications company), has begun work on a new digital publishing system known as the Integrated System for Distributed Information Services (ISDIS). ISDIS implements the proposed Distributed Information Processing Protocol (DIPP) so that publishing digital information in distributed environments becomes as simple as sending e-mail to a colleague or saving a file to a local directory.

DIPP has been used for the past year as the core technology for publishing and maintaining web content at public information sites, including the Boulder Community Network, some of the State of Colorado World Wide Web pages, and several other sites in Colorado. Up to this point, each application of DIPP had to be "hand wired," or customized, for its particular input and output. Now, with funding from the National Science Foundation (NSF), the DIPP project team will be working to develop a public standard that uses a prescribed set of API's and "agents". This "work in progress" can be found at the following web address: http://www.colorado.edu/DIPP/. (NB: The site is new and content is being added as it becomes available. Currently there are links to recent papers and presentations as well as links to other sites using early versions of the "hand wired" DIPP.)

1.0 Introduction

One of the more evident features of the emerging public on-line information space is the importance of aggregator and integrator services that collect and combine information. Examples of these services include: academic campus-wide information servers (CWIS), scholarly and government on-line information sources, community networks, and other such on-line service providers.

Acting as digital editors, these services collect information from disparate sources and organize it into content that is easy to use. By managing and structuring information, such services act as beacons in the multi-dimensional Internet ocean: they serve communities of end-users by helping them find what they need among the disparate streams of information that characterize the distributed environment.

Essential to the continued growth of public information services is development of effective means to manage information whose ownership extends across a broad range of autonomous units. Currently, the operation of such services is a highly manual and fragile task, requiring extensive hands-on labor at the server side as well as considerable technical expertise from information providers. These requirements are often in conflict with the limited resources of these organizations. The inability to maintain existing services as the information volume increases exponentially is perhaps the single greatest weakness in the effective management of networked information. There is a strong need to automate these information management processes.

It should be noted that public sector environments face issues in integrating and presenting information that differ from those of corporate or commercial environments. These include:

a distributed authority where multiple persons must be able to interact with or manage information,
a need for dynamic chronological, topical, and geographic aggregation of information on a central server,
a reliance on open and public-domain solutions, chosen more often than not for their financial advantage,
the accommodation of a wide variety of file or data formats from originating sources, and
a variety of information transmission mechanisms that go beyond electronic mail (e-mail).

Despite these differences, network tools developed for the public sector environment are broadly applicable: they have immediate application in corporate environments as well as in scholarly (research and education) settings.

Working within a large and complex network of information servers, a task force centered around the University of Colorado at Boulder has developed a fundamental understanding of the requirements and constraints that shape the distributed information publishing environment. Based on experience and extensive discussion with research and education organizations, the task force has developed a Distributed Information Services model which includes prototypes of a Distributed Information Processing Protocol (DIPP) and associated application programmer interfaces (API's).

The initial goal of this project was to automate the acquisition, conversion, and mounting of information from a variety of autonomous sources onto a common Web platform. Ultimately, the Integrated System for Distributed Information Services will provide a basis to implement additional information services across any distributed environment.

Specifically, the project seeks to:

create a reference model for a distributed information services environment;
develop an open protocol, DIPP, along with open application programmer interfaces (API's) between functional elements of the model;
produce a freely-available software suite, DIPPswitch 1.0, for Unix and Windows-NT platforms, along with a collection of "agents" for common information processing tasks, that together provide an immediate, effective, and automated distributed information system; and
promote these standards widely in order to establish a competitive marketplace for future developments in the area.

Building on an environment supported by the National Telecommunications and Information Administration (NTIA), and with direct support from the National Science Foundation (NSF), the project has made steady progress over the last year. DIPP has proven to be a valuable asset for the persons and agencies responsible for providing timely content for public information sites.

This paper describes a process model and identifies roles of the Distributed Information Processing Protocol (DIPP) and associated Application Program Interface modules. Future papers and technical reports (Requests for Comments [RFC]) will advance the technical specifications and bindings.

It is anticipated that DIPPswitch 1.0 will be freely available to non-profit agencies by the end of the first quarter of 1997.

2.0 The Distributed Information Services Model

2.1 Motivation

As indicated above, public sector information providers need simple tools that provide flexibility in submitting material for posting to online services. The tools that they use should invoke standard network utilities and require minimal additional operations. In addition, there is a need to accommodate a variety of network applications for the transmission of information such as e-mail, file transfer protocol (ftp), and World Wide Web. These software tools are on the client side of the client/server metaphor that is often used to describe networking and distributed information services.

On the server side, an automated environment is needed to assist in the operation and the management of information services. Such an environment includes applications to convert information, processes to index and archive content, and methods to audit system activities.

2.2 Context

As we view the distributed information services model, it is useful to reflect upon the roles of participants within the traditional print environment. Typically, there are authors, editors, and publishers. In keeping with this hierarchy, we think of these roles in the following way in our model: Authors create information. Editors collect information and organize it into coherent blocks. Publishers choose the distribution form, market, and audience. In information space, these relationships more often form a heterarchy, which is defined as "a form of organization resembling a network or fishnet. Authority is determined by knowledge and function." [ref. http://pespmc1.vub.ac.be/ASC/HETERARCHY.html].

In the DIPP model, the role of the author is assigned to those persons who create content that is to be published. For example, an administrative staff person produces a weekly agenda that traditionally had been printed and distributed. In the DIPP environment, this individual can e-mail the final document to the DIPP server. The outcome will be one or more documents ready to be used on-line.

The DIPP editor works with the authors to make sure that the DIPP server functions smoothly. Editors have technical authority to adjust certain parameters of the control database sets in order to accommodate changes in authors or to create new branches of information.

Finally, the DIPP publisher manages editors and authors as well as the DIPP server. The publisher is the final authority for technical decisions and changes. In effect, this person operates as the system administrator for the host computer and network as well as for the DIPP system itself.

In considering an automated system that dynamically integrates and presents information from various content providers, it is useful to distinguish several steps:

(1) Registration -- a component, involving the system operator and an information provider, that establishes the key parameters and processing patterns for documents that will be submitted on a regular basis. In other words: authorization and authentication of submitted materials, determination of the processing methods required, and certification that the submitted materials are complete and authentic.

(2) Processing -- the conversion of submitted documents, operation of automated tools, response to diagnostic and error messages, and interface with central server operating system. In other words: converting submitted materials into edited output contents with repeated checks to ensure quality control.

(3) Maintenance -- modification of registration data, which includes information about those who are approved to submit information as well as other data on the server used to manage the system. In other words: some method of system administration that allows updates and changes to various controlling elements by the appropriate levels of staff members.

2.3 The Logical Model

The following logical model describes an analysis of the typical activities in operating a distributed information server. These activities include major elements and interactions of the processing systems based on an open protocol standard and an open application program interface process (API).

Figure 1.

2.3.1 Control Block

The control block consists of data sets that direct the processing pattern applied to individual documents. These data sets reside on the server, contain pre-defined parameters, and control the processing through the application of selected elements, including "agents," the parameters provided to those "agents," and special instructions. The control block also contains authorization information and other management information.
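
To make the idea concrete, the following is a minimal sketch of what a single control block record might hold. The field names, agent names, and addresses are all hypothetical and chosen only for illustration; the actual DIPP data sets are not specified in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class AgentStep:
    """One agent to invoke, plus the parameters handed to it."""
    agent: str
    params: dict = field(default_factory=dict)


@dataclass
class ControlEntry:
    """Hypothetical control block record for one registered kind of document."""
    job_name: str
    authorized_submitters: set   # addresses allowed to submit this job
    pipeline: list               # ordered AgentStep items
    special_instructions: str = ""

    def is_authorized(self, submitter: str) -> bool:
        # Compare case-insensitively; the stored addresses are lower case.
        return submitter.lower() in self.authorized_submitters


# Illustrative entry for the weekly agenda example used earlier in this paper.
agenda_entry = ControlEntry(
    job_name="weekly-agenda",
    authorized_submitters={"clerk@example.org"},
    pipeline=[
        AgentStep("scribe.to_html", {"title": "Weekly Agenda"}),
        AgentStep("clerk.mount", {"directory": "agendas/"}),
    ],
)
```

Keeping the agent list and its parameters in data rather than in program code is what would let an editor change the processing pattern without touching the agents themselves.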

2.3.2 Agent Set

Agents are specific program segments that receive input, act on the data provided, and transform the information space. They also help administer the operation of the DIPP engine. The three major categories are:

Initialization Agents - Initialization agents form the primary action pipeline for a DIPP job. The agents read the data and protocol statements from the client. They authenticate and authorize the user who submitted the information. They verify that the content of the files is original and unchanged. The initialization agents establish a sequence for the job context based on stored server-side files. This set of agents is invoked on every DIPP job and therefore makes up the administrative overhead of DIPP. Following execution of these agents, the data is passed on to the Processing Agents.

Processing Agents - Processing agents regulate the automated processes used to convert documents, integrate information, and to maintain or manipulate information space. In keeping with the publishing metaphor, three classes of processing agents have been established: Scribes, Clerks, and Integrators.
- Scribe agents convert the format and structure of a document into posting formats and files. A number of such agents have already been developed to convert materials to a predetermined format (e.g., HTML, SGML, GILS), to include associated files (e.g., button bars, images, tables), to archive expired materials, and to notify information providers when work is completed.
- Clerk agents mount converted information to predetermined locations and manipulate the general structure of the information space (directories).
- Integrator agents merge information from a variety of providers into a navigable framework which may be chronologically, topically, or geographically organized.

Management Agents - Management agents install and remove agent modules. They update information stored in the control block data set. Management agents also control the server engine. These are specialized processing agents which can manipulate server side data files if invoked by authorized users such as system administrators.
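
The three classes of processing agents can be illustrated with a small sketch. It assumes, purely for illustration, that the information space is an in-memory mapping from paths to documents; the function names and the HTML produced are hypothetical and do not describe the DIPPswitch agents themselves.

```python
import html


def scribe_to_html(text: str, title: str) -> str:
    """Scribe: convert plain text into a minimal HTML posting format."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    head = f"<head><title>{html.escape(title)}</title></head>"
    return f"<html>{head}\n<body>\n{body}\n</body></html>"


def clerk_mount(space: dict, path: str, document: str) -> None:
    """Clerk: place a converted document at a predetermined location."""
    space[path] = document


def integrator_index(space: dict, prefix: str) -> str:
    """Integrator: build a navigable list of the documents under a prefix."""
    paths = sorted(
        p for p in space
        if p.startswith(prefix) and not p.endswith("index.html")
    )
    links = [f'<li><a href="{p}">{p}</a></li>' for p in paths]
    return "<ul>\n" + "\n".join(links) + "\n</ul>"


# A stand-in information space: path -> document.
space = {}
page = scribe_to_html("Call to order.\n\nBudget & finance report.", "Agenda")
clerk_mount(space, "agendas/1996-12-02.html", page)
clerk_mount(space, "agendas/index.html", integrator_index(space, "agendas/"))
```

Here the Scribe only changes format, the Clerk only decides placement, and the Integrator only builds navigation, which mirrors the division of labor described above.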

2.3.3 Delivery Clients

Delivery Clients receive the information sent from the user by a variety of methods (e.g., electronic mail or file transfer protocol (ftp)). The delivery clients then deliver both the raw data and information about the data (meta data) to the DIPP server using the DIPP protocol.
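
As an illustration, an e-mail delivery client might separate a message into raw data and meta data along the following lines. This is only a sketch: the choice of meta data fields, and the use of the Subject line to name the job, are assumptions made for this example and are not part of any DIPP specification.

```python
from email import message_from_string


def email_to_submission(raw_email: str) -> dict:
    """Split an e-mail message into meta data and raw document data."""
    msg = message_from_string(raw_email)
    metadata = {
        "submitter": msg["From"],
        "job": msg["Subject"],        # assumption: the subject names the job
        "transport": "e-mail",
        "content_type": msg.get_content_type(),
    }
    return {"metadata": metadata, "data": msg.get_payload()}


raw = (
    "From: clerk@example.org\n"
    "To: dipp@example.org\n"
    "Subject: weekly-agenda\n"
    "\n"
    "Call to order.\n"
)
submission = email_to_submission(raw)
```

A delivery client for ftp would fill in the same meta data from a different source, so that the server sees one uniform submission regardless of how the document arrived.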

2.3.4 DIPP Engine

The DIPP Engine invokes agents as directed by the parameters set in the control block. This series of processes serves to automate every step of publishing, from the initial receipt of the original document to the final publishing of the information to a prescribed location.
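
The core of such an engine can be sketched as a loop that looks up each agent named in a pipeline and passes the document from one agent to the next. The registry, the agent names, and the pipeline layout below are hypothetical stand-ins for the control block and agent set.

```python
def run_job(document, pipeline, registry):
    """Invoke each agent named in the pipeline, in order, on the document."""
    log = []
    for name, params in pipeline:
        agent = registry.get(name)
        if agent is None:
            raise KeyError(f"no agent registered as {name!r}")
        document = agent(document, **params)   # output feeds the next agent
        log.append(name)                       # simple audit trail
    return document, log


# Hypothetical agents; real agents would convert, mount, or integrate content.
registry = {
    "strip": lambda doc: doc.strip(),
    "wrap": lambda doc, tag: f"<{tag}>{doc}</{tag}>",
}

# In this sketch the pipeline plays the part of a control block entry.
pipeline = [("strip", {}), ("wrap", {"tag": "pre"})]
result, log = run_job("  Call to order.\n", pipeline, registry)
```

The engine itself knows nothing about document formats; all of the processing knowledge lives in the agents and in the parameters that select them.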

2.3.5 Management Program

Administrators use the management program to access and edit the various parameters controlling the DIPPswitch program. The management program is the system administrator's interface, and it must therefore be able to communicate with the control block data sets, with the agent sets, and directly with the DIPPswitch program.

3.0 Process Elements and Interactions

Functional elements of the logical model have a set of interactions among them. For example, submitted material must include authenticating information; DIPP software agents must negotiate parameters with the control block data sets; and the management interface has to manipulate content in the control block data sets. These interactions are built on two processing standards: DIPP and the DIPP API.

DIPP: The Distributed Information Processing Protocol negotiates data and parameters between originator (delivery client) and central server (DIPP engine).

DIPP API: The DIPP Application Program Interfaces define the parameters passed between the DIPP engine and elements of the administrative pipe. These include the control block database set, the agent pool, server data, and the access control language.

Figure 2.

3.1 The Distributed Information Processing Protocol

DIPP, the Distributed Information Processing Protocol, comprises a set of message formats and reserved words that encapsulate the meta data passed from the delivery client (originator) to the DIPP engine. The protocol prescribes the initiating data that the delivery client sends to the DIPP engine. This initiating information is parsed to identify and authenticate the submitter and to define the "job" to be done. DIPP permits author and server to pass securely both the documents and the meta-information needed to manage the processing of the data.
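
The general shape of such a message can be sketched as a short header of reserved words followed by the document itself. Everything specific below is an assumption made for illustration: the reserved words SUBMITTER, JOB, and DIGEST, the "WORD: value" layout, and the use of an HMAC-SHA256 digest with a shared key (a modern stand-in for whatever authentication mechanism DIPP adopts) are not taken from the DIPP specification.

```python
import hashlib
import hmac

RESERVED = {"SUBMITTER", "JOB", "DIGEST"}   # hypothetical reserved words


def _digest(body: str, key: bytes) -> str:
    return hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()


def build_message(submitter: str, job: str, body: str, key: bytes) -> str:
    """Delivery client side: wrap a document in a header of reserved words."""
    header = (
        f"SUBMITTER: {submitter}\n"
        f"JOB: {job}\n"
        f"DIGEST: {_digest(body, key)}\n"
    )
    return header + "\n" + body


def parse_message(message: str, key: bytes):
    """Engine side: parse the header, then check that the body is unchanged."""
    head, _, body = message.partition("\n\n")
    fields = {}
    for line in head.splitlines():
        word, _, value = line.partition(":")
        if word not in RESERVED:
            raise ValueError(f"unknown reserved word: {word!r}")
        fields[word] = value.strip()
    missing = RESERVED - fields.keys()
    if missing:
        raise ValueError(f"missing reserved words: {sorted(missing)}")
    if not hmac.compare_digest(_digest(body, key), fields["DIGEST"]):
        raise ValueError("document does not match its digest")
    return fields, body


key = b"shared-secret-for-illustration"
message = build_message("clerk@example.org", "weekly-agenda", "Call to order.\n", key)
fields, body = parse_message(message, key)
```

In this sketch the digest covers only the document body; a fuller design would presumably also protect the header fields that identify the submitter and the job.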

3.2 The Distributed Information Processing API

As a self-contained system on the server, the set of cooperating processes in the DIPP system needs to exchange information through an established set of rules. These rules will include a set of subroutines or executable programs with a common call syntax (also known as the DIPP API and language bindings) which may be executed by the DIPP Engine.

Because the API's are open standards, other programmers can construct additional features and enhancements for the DIPP system. The DIPPswitch software is built upon an open standard intended to encourage broad development in the future.
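
One way to picture a common call syntax is a registry that accepts only agents with an agreed signature, so that the engine can call any agent, including one written by a third party, in exactly the same way. The signature (data, params), the registry, and the agent names below are hypothetical; the actual DIPP API and language bindings are left to future technical reports.

```python
import inspect

AGENTS = {}   # name -> callable following the common call syntax


def register(name):
    """Register an agent, insisting on the common signature (data, params)."""
    def decorator(func):
        names = list(inspect.signature(func).parameters)
        if names != ["data", "params"]:
            raise TypeError(f"{func.__name__} must accept (data, params)")
        AGENTS[name] = func
        return func
    return decorator


@register("scribe.upper")
def upper(data, params):
    # Illustrative agent: ignores its parameters.
    return data.upper()


@register("scribe.prefix")
def prefix(data, params):
    # Illustrative agent: reads one optional parameter.
    return params.get("text", "") + data


def call(name, data, params=None):
    """Invoke any registered agent in the same way."""
    return AGENTS[name](data, params or {})
```

For example, call("scribe.prefix", "agenda", {"text": "weekly "}) returns "weekly agenda", and a new agent only has to honor the same two-argument convention to be usable.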

4.0 DIPP Taskforce Work Plan

The DIPP Taskforce is working to produce a stable beta environment as well as begin to develop the necessary documentation to introduce the DIPP system in the standards process of the Internet Engineering Task Force (IETF).

The project team has been working with pre-alpha versions of DIPP that were used to support various agencies in the Boulder-Denver, Colorado region. These pre-alpha versions served as proof-of-concept for the DIPP system. Currently these early elements are being revised and codified to function as core elements of the DIPP system. The project team has a target delivery date of mid-November for the initial beta releases to a group of pre-determined sites on the Internet. The first public release of DIPPswitch 1.0 is planned for later in the first quarter of 1997.

Meanwhile, other members of the taskforce are working with the core materials to document and present them for review as an Internet standard with the Internet Engineering Task Force.

5.0 Summary

The DIPP system, through the use of a standard protocol and API's, is a model that parallels established procedures for producing, publishing, and disseminating printed information. Transferring those concepts from print to digital environments will permit academic institutions and other information providers to create and manage information much more easily than they have been able to previously. The Distributed Information Processing Protocol will enable us to produce information in a timely manner without the technical expertise that is currently required. We can hope that DIPP will lower the thresholds for Internet users and foster easier information exchange.

hdl:cnri.dlib/december96-brett

Copyright © 1996 George Brett