The Digital Libraries Initiative (DLI) is a federally sponsored
research program to understand and foster our use of digital information
at home, at school, and at work, now and in the future. It spans
a broad range of questions.
Six university-led public/private partnerships are
examining these and related issues. In this exhibit, we invite
you to explore some of the questions motivating this research
and to examine some of the findings. But these are dynamic projects
in a fast-changing world; take a moment to visit the web pages
of each project to see what's new.
Three Sponsoring Agencies
National Science Foundation <http://www.nsf.gov/>
Defense Advanced Research Projects Agency <http://www.darpa.mil/>
National Aeronautics and Space Administration <http://www.nasa.gov/>
Almost any kind of information can exist in digital
form - music, images, text, motion pictures, speech, and so on.
And the universe of this information is expanding. Some of it is created
digitally - satellite images or remote sensing data, for example. Other
material must be converted - the purpose of large projects to scan historical
collections, corporate archives, and technical journals. But different
kinds of digital data have different storage and other requirements,
and some, like video, pose particularly complex retrieval problems.
A "document" can take many forms but is characterized by three properties: content, or what you are trying to say; structure, or how the content is organized; and format, how the content and structure are encoded so that we can store, find, and use these documents. The "California Dams" research shows how structure can be independent of content. In this demo, you can look at the same content, information about all the dams in the state of California, as either an image of a page, as text which can be searched, or as a tabular display of a subset of the information, depending on a query. Try it!
The formats in which we store documents can differ substantially. For example, images are large; text is small.
Retrospective conversion is the process by which information in print format can be expressed in digital form - either as a sequence of characters or as a digitized image. Many rare and unique materials already exist in print on paper, and one important aspect of building large digital libraries is scanning, or digitizing, these materials and then indexing them for storage and future retrieval. Scanning is a lot harder than it looks: Researchers at the Alexandria Project have found that reliably scanning an aerial photograph can require 12 separate steps and 15-20 minutes. The item must then be indexed for future use - a separate process requiring several additional steps. How to automate these processes, and reduce the need for intensive human involvement, is an area of research.
Video embodies several media that can be taken apart, interpreted separately - letting us employ different tools for different components - and then re-integrated. For example, some of the research at Carnegie Mellon is devoted to automatic speech recognition: converting speech to text so that we can search the text with one set of tools while searching the accompanying images with other strategies. Storing and searching different media are separate research issues; a first step is to take the complex "document" apart so that we can differentiate among the components.
Standard Generalized Markup Language (SGML) is a set of codes that lets us subdivide a document into components (like chapters and paragraphs). Recognizing this underlying structure means that we can partition documents in consistent ways, store them efficiently, and retrieve only the relevant parts. SGML also lets us preserve presentation, so that a page from an engineering journal displays on the screen the same way it appears in print.
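A small sketch of why explicit markup matters: once structure is tagged, software can pull out just one component without touching the rest. The tag names and the XML (SGML-like) fragment here are invented for illustration:

```python
# Sketch: explicit markup lets software partition a document into
# components and retrieve only the relevant part. The tag names and the
# XML (an SGML-like notation) fragment are invented for illustration.
import xml.etree.ElementTree as ET

doc = """<article>
  <title>Dams of California</title>
  <chapter n="1"><para>Oroville is the tallest dam in the state.</para></chapter>
  <chapter n="2"><para>Shasta impounds the Sacramento River.</para></chapter>
</article>"""

tree = ET.fromstring(doc)
# Retrieve just chapter 2, leaving the rest of the document alone.
chapter2 = tree.find("chapter[@n='2']")
print(chapter2.findtext("para"))
```

Without the tags, "chapter 2" would be just a run of characters; with them, it is an addressable part of the document.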
Digital information is stored on computers across
the country and throughout the world. Advances in computing and
communications technologies mean that separate computers can be
networked, and users can find and use information any place and
at any time, restricted only by conditions applied by the owners
of the information. But some kinds of documents, like images,
are so large that simply downloading them can pose performance
problems. And in a rapidly changing environment, not all networks,
computers, and collections can or will "speak" the same
language.
So two important groups of questions are: How do
we build systems that interoperate - that is, let users work across
heterogeneous collections and systems without worrying about compatibility
or learning different procedures? And how can we store material
so that people can find what they want more easily and efficiently
- either by partitioning it or by describing it?
What goes on behind the scenes?
Heterogeneity exists at many levels - from search
systems that end-users see down to the switches and routers that
process and manage the flow of bits and bytes over the network.
To cope with rapid change and increasing variety, researchers
at Stanford are devising sets of computing specifications or rules,
called "protocols". Protocols, like zoning codes in
architecture or grammar in language, do not require similar programs
and systems to be identical; they do establish a design "envelope"
or framework that permits variation in specifics to co-exist.
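One way to picture a protocol in code is as a shared interface that very different systems each implement in their own way. The interface and the two back ends below are invented for illustration, not any project's actual design:

```python
# Sketch: a "protocol" as a shared set of rules that dissimilar systems
# can each implement differently. All names here are invented.
from abc import ABC, abstractmethod

class SearchProtocol(ABC):
    """Every conforming system must answer a query with a list of titles."""
    @abstractmethod
    def search(self, query: str) -> list[str]: ...

class ImageArchive(SearchProtocol):
    def search(self, query):
        return [f"[image] {query} photo"]

class TextRepository(SearchProtocol):
    def search(self, query):
        return [f"[text] article on {query}"]

# A client can query heterogeneous systems through the one protocol,
# without knowing how each one works internally.
systems = [ImageArchive(), TextRepository()]
results = [hit for s in systems for hit in s.search("dams")]
print(results)
```

The two systems are not identical, but both stay inside the "envelope" the protocol defines.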
Another approach to coping with rapid change is the use
of "agents". We can think of an "agent" as
a program that provides a service and can adapt to new information
without significant - or any - re-programming. Like protocols,
agents exist behind the scenes - we need never see them. Researchers
at the University of Michigan are working on the notion of societies
of agents, collections of computer programs each providing
a specific service that team up to achieve a goal.
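A society of agents can be sketched as a handful of small programs, each providing one service, that hand work to one another. The agents and catalog below are invented for illustration:

```python
# Sketch: a "society of agents" - small programs, each providing one
# service, that team up to achieve a goal. All names are invented.

def query_agent(raw):
    """Normalize a user's request."""
    return raw.strip().lower()

def search_agent(term, catalog):
    """Find matching items in a catalog."""
    return [item for item in catalog if term in item.lower()]

def summary_agent(hits):
    """Report the result of the team's work."""
    return f"{len(hits)} item(s) found"

catalog = ["Aerial Photos of California", "Texas Road Maps"]
term = query_agent("  California ")
hits = search_agent(term, catalog)
print(summary_agent(hits))
```

No single program does everything; each agent could be replaced or improved without re-programming the others.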
What's stored where? What's processed where?
Using agents also means that demand on the network
can be reduced - and a crowded network is increasingly an issue.
Even without the congestion that comes from rapid growth in the number
of users, some kinds of documents are so large and used so infrequently
that we only want to store them once and download them as needed.
But retrieving them in their entirety would tie up lines unnecessarily
and might result in a document at the desktop that is larger than
we need.
Researchers at the Alexandria Project at the University
of California, Santa Barbara, are experimenting with a set of
mathematical techniques called "wavelets" for storing,
partitioning, and retrieving extremely large images. These techniques
support progressive resolution, so that users can browse a coarse
version of an image or zoom in on a detail. This has several
implications for performance: In terms of storage, lower-resolution
data, which are accessed more frequently than the higher-resolution
information, can be stored on faster devices for efficient browsing.
At the desktop, only some of the data need be transferred for local
reconstruction of an image stored elsewhere.
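The simplest wavelet, the Haar wavelet, shows the idea in one dimension: pairwise averages give a coarse, browsable version of a signal, and pairwise differences are the "details" needed to restore it exactly. This is a minimal sketch of the principle, not the projects' actual image code:

```python
# Sketch: the Haar wavelet in one dimension. Averages give a coarse,
# half-resolution version; differences restore the full signal exactly.

def haar_decompose(signal):
    avgs = [(a + b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    details = [(a - b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    return avgs, details

def haar_reconstruct(avgs, details):
    out = []
    for m, d in zip(avgs, details):
        out += [m + d, m - d]
    return out

pixels = [9, 7, 3, 5]          # a tiny 4-sample "scanline"
coarse, details = haar_decompose(pixels)
print(coarse)                  # half-resolution version: [8.0, 4.0]
# Transfer the details only when the user zooms in:
print(haar_reconstruct(coarse, details))  # [9.0, 7.0, 3.0, 5.0]
```

The coarse version can live on fast storage and travel first; the details follow only when the user zooms.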
Engineers at the Alexandria Project, like researchers
at Berkeley and Carnegie Mellon, are investigating retrieval of
images. Alexandria's approach is based on characteristics of texture
and color. Images can be segmented and segments compared so that
someday, not too far off, we can ask a collection of aerial photographs
this question: "Show me all the images with cornfields in
them."
Describing Data: What is Metadata?
Metadata is data about data: A metadata record can
describe a collection or an individual item - image, text, database,
video clip, and so on. We can store the metadata records separately
from the material they describe so that when users request information
about images, documents, or collections, less data is sent. This
means that the system responds more quickly and the demand on
the network is reduced.
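A metadata record can be sketched as a small, fixed description that points at a much larger item stored elsewhere. The fields and values here are invented for illustration:

```python
# Sketch: metadata records stored separately from the (much larger)
# items they describe, so a catalog search moves only a few bytes.
# The fields and values are invented for illustration.
from dataclasses import dataclass

@dataclass
class MetadataRecord:
    title: str
    media_type: str     # "image", "text", "video", ...
    location: str       # where the full item actually lives
    size_bytes: int

catalog = [
    MetadataRecord("Aerial photo, Santa Barbara", "image",
                   "server-a/photos/001", 40_000_000),
    MetadataRecord("Dams of California (text)", "text",
                   "server-b/docs/dams", 120_000),
]

# Answering "what images do you have?" touches only the small records,
# never the 40 MB image itself.
images = [r.title for r in catalog if r.media_type == "image"]
print(images)
```

The 40 MB photograph crosses the network only when someone actually asks for it.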
Not all metadata records are the same because what
we need to describe a video is different from what we need to
describe a database. But we need enough similarity among records
so that we can find related material across different media. So, one
cluster of research questions is: "What are the minimum requirements
for all items? And what are additional requirements for collections
of related items?" Because creating metadata is labor-intensive,
a second set of questions asks: "What can we automate? And
how do we do it?"
Metadata can let us improve performance. But like
progressive retrieval of images, it is also a tool for rapid browsing
of materials. Browsing is one way to select information. Before
we can browse a handful of images, we have to find relevant material
in the first place. In the expanding universe of heterogeneous
information, finding relevant material is a problem with many
facets. How do we look for information - what concepts and words
do we use? And once we bring resources to the desktop, what tools
can we use to work with them?
Human/Computer Interfaces
Human-computer interface design deals with how the
display is organized. DLITE,
developed at Stanford, helps users integrate the results of many,
disparate services, supports sharing and reuse of information,
and is "extensible", meaning that it is designed to
encourage others to build additional capabilities onto it.
PAD++ was originally developed with funding from
DARPA but is being integrated into the research program at the
University of Michigan. The Highly-Interactive Computing Research
Group at Michigan is also studying use of digital libraries in
schools. With support from the University of Michigan, the Michigan
Department of Education, and the Ann Arbor Public Schools as well
as from NSF, NASA, and the DLI, these researchers are undertaking
a broad range of investigations in learning and technology.
When is meaning the same? And when is it different?
We usually look for information by submitting a "query".
Problems frequently arise when the same word can have different
meanings, depending on the context of the word or the intent of
the searcher. The interface to the spatial collections of the
Alexandria Digital Library shows you the notion of location as
a way to search spatially organized collections without resorting
to words. But the Interspace
shows you that the same idea, "California", can also
be a way to find related environmental information.
They're the same - but they're also different.
How do we find what's useful?
Stanford University's SenseMaker
is one way to help users find related materials. The program is
designed to run over a group of documents stored locally or on
the web and find the ones that are similar.
Another way to select what's useful from what's not
is through iterative searching - taking words and concepts from
one set of documents and asking the system to search again. This
is the approach embodied in IODyne, an experimental system developed
at the University of Illinois, Urbana-Champaign, which uses traditional
tools like subject thesauri and new tools like concept spaces
to get at documents or parts of documents where the concepts are
the same, but the words may be different.
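The co-occurrence idea behind a concept space can be sketched very simply: terms that appear together in documents are suggested as related search terms. This is a much-reduced illustration in the spirit of the Illinois work, with invented documents, not the actual system:

```python
# Sketch: a tiny "concept space" built from term co-occurrence - terms
# that appear together in documents are offered as related search terms.
# The documents are invented for illustration.
from collections import Counter
from itertools import combinations

docs = [
    {"dam", "reservoir", "flood"},
    {"dam", "hydroelectric", "reservoir"},
    {"flood", "levee", "river"},
]

cooccur = Counter()
for doc in docs:
    for a, b in combinations(sorted(doc), 2):
        cooccur[(a, b)] += 1

def related(term):
    """Terms that co-occur with `term`, strongest first."""
    pairs = [(p, n) for p, n in cooccur.items() if term in p]
    pairs.sort(key=lambda x: -x[1])
    return [a if b == term else b for (a, b), _ in pairs]

print(related("dam"))  # "reservoir" ranks first (it co-occurs twice)
```

A search for "dam" can thus be broadened to "reservoir" even in documents that never use the word "dam" - the concepts match where the words differ.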
We're used to notions of abstracts and summaries
as tools to help us find documents that are useful. We're less
used to the notion of visual abstracts, but researchers at Carnegie
Mellon's Informedia project have devised a way to provide visual abstracts.
Try it!
One of the powerful advantages of digital materials is that once we find relevant information, we can work with it at the desktop without resorting to yellow markers, scissors, and re-typing. But because it is so easy to manipulate the information, we need tools that will help us authenticate materials as well as manage them. The SCAM program, developed by researchers at Stanford University, asks the question: Are these documents similar? SCAM has already proved useful in identifying instances of plagiarism. Try it!
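One standard way to ask "are these documents similar?" is to compare overlapping word n-grams ("shingles") - a common copy-detection idea, though not necessarily SCAM's exact algorithm. The sample sentences are invented:

```python
# Sketch: measuring document overlap with word "shingles" (n-grams) and
# the Jaccard coefficient - a standard copy-detection idea, though not
# necessarily SCAM's exact algorithm. The sample texts are invented.

def shingles(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Shared shingles as a fraction of all shingles: 1.0 = identical."""
    return len(a & b) / len(a | b)

original  = "the dam holds back the river in spring"
suspect   = "the dam holds back the river in winter"
unrelated = "metadata records describe items in a collection"

print(round(jaccard(shingles(original), shingles(suspect)), 2))    # high
print(round(jaccard(shingles(original), shingles(unrelated)), 2))  # zero
```

A high score flags a near-copy even when a few words have been changed; unrelated documents score near zero.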
At the desktop, at home, at school, and at work, relevant information in digital form promises to enable us to explore relationships among different kinds of information. One example of these tools is Berkeley's "multivalent document model", which supports annotation, overlays of different kinds of information, zooming in on details, and backing off for a broader view. It's already been used in flood recovery efforts in California. Try it!
Learn more about the DLI!
©1997 Corporation for National Research Initiatives