Relaxing Assumptions . . . Stretching the Vision

A Modest View of Some Technical Issues

Ronald L. Larsen
U.S. Defense Advanced Research Projects Agency
(DARPA)
Arlington, Virginia
[email protected]

D-Lib Magazine, April 1997

ISSN 1082-9873

On March 9-11, 1997, the National Science Foundation (NSF) sponsored a "Planning Workshop for Research in Distributed Knowledge Environments (DKE's)." This story is based on one of two plenary papers given on March 10, 1997. The second was given by William Y. Arms and also appears in this issue. All slides, transcripts, and workshop notes will be made available shortly by the University of Michigan, School of Information.

Herein I present a modest examination of seven technical assumptions which I believe have historically and substantially influenced the development of networked information systems. I suggest that these assumptions still influence our ability to conceive of innovative designs and services, while their validity is increasingly open to challenge. Especially in planning long-term fundamental research, we need to step beyond the constraints of the past and the present, and examine afresh the options for the future. My intent here is to consider "challengeable" assumptions. I make no claim of completeness, nor even of appropriateness. Instead, I hope to challenge you to refine my list or to come up with your own. Regardless of the specific list, the underlying purpose here is to explore the effects of relaxing some of our long-held assumptions, and to consider also the potential counter-effects, or unanticipated outcomes, which may result.

Assumption 1: Computing and communication resources are scarce and inflexible

When resources are scarce or costly, their utilization is carefully managed, monitored, and often mediated. Access to commercial information services from university and corporate libraries, for example, is frequently mediated by a reference librarian. This is primarily due to a combination of cost, complexity, and variety of the underlying services available. Mediation by a trained professional is seen as the means of providing value-added, patron-oriented services while controlling the costs of on-line services. Queries to these systems can be quite cryptic and laborious to construct, and responses may be voluminous and costly (particularly for the ill-formed or ill-informed query).

Relaxing this assumption enables rethinking the manner of interaction between the user and the information source, with the potential for removing the need for a professional mediator. These older systems are built on an underlying assumption of a narrowband, text-based query interface. The typical query is a sequence of, perhaps, 20 - 50 characters which has its roots in dial-up technologies capable of delivering a few hundred characters per second. But today's networks and the networks of the future are many orders of magnitude faster than this. Broadband, active networks accessed through high performance workstations offer the potential of semantically and contextually rich query expression and interaction with the information space.

The corollary to this relaxed assumption is that "more is better," that more bandwidth and higher levels of performance will necessarily improve one's information access. But common experience on the Web suggests the situation is more complex. Duplex bandwidth not only expands the user's access to potentially useful information, but also expands the user's availability (and potential vulnerability) to others. The risks include increased exposure to materials of marginal interest, as well as materials with no enduring value whatsoever, such as junk mail.

As John Cherniavsky indicated in his remarks, it is not hard to envision a world 20 years from now in which computing and communication resources will be essentially unlimited, particularly in comparison to that which is commonly available today. In 1980, I participated in a National Aeronautics and Space Administration/American Society for Engineering Education (NASA/ASEE) summer study considering the progress that could be achieved in space exploration and utilization over a 50-year time period, unconstrained by the fiscal realities of the time. The only constraints imposed were those of scientific and engineering discipline. Namely, anything proposed had to be rigorously investigated and substantiated for feasibility (not affordability). At first blush, this may seem a somewhat ludicrous approach. But the strategy served to clear our collective consciences of traditional resource constraints, particularly time and money. The only remaining constraint was the intellectual power required to envision a different future. What emerged was an extraordinarily creative examination of interstellar exploration, lunar and asteroidal mining, and earth resources monitoring. It is this kind of attitude and approach which may shed the greater light on the future of distributed knowledge environments.

I suggest that computing and communication resources are not scarce in the future we envision. I also suggest that money is not the constraining resource we typically assume. (Hmmm, have I really gone off the deep end?) I contend that this world is idea-constrained, not resource-constrained. If we put a dynamite idea on the table, one that sweeps others away, the resources will be there to support it.

What kinds of ideas have this potential? Resource constraints introduce intermediation in information retrieval -- intermediation to create viable queries that result in manageable lists of "hits." What happens if we are as unconstrained in our ability to state a query as we are in our ability to get the material back? What if we could pose queries that contain not only descriptions of the subject matter being sought, but also the context of the inquiry and the type of information being sought? If one could do that in a much richer way than we are able to do today, then, perhaps, our information retrieval systems would be sufficiently well-informed as to our needs to avoid 200,000 responses to a simple query.

Assumption 2: Metrics focus research productively.

The effect of this assumption is that incremental advances dominate community attention, leaving qualitative breakthroughs at risk -- tantamount to buying research as commodity yard goods. The result is that once-useful metrics, such as precision and recall, bias continuing research in information retrieval, despite the fact that these metrics arose in the context of batch processing.
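As a reminder of what these two metrics measure, here is a minimal sketch of how precision and recall are conventionally computed for a single query against a fixed collection with known relevance judgments. The function name and the document identifiers are illustrative assumptions, not drawn from any system discussed here.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: identifiers the system returned
    relevant:  identifiers judged relevant in advance (the batch-style
               assumption: the judged set is fixed and known)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    # Precision: what fraction of the returned items were relevant?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the relevant items were returned?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judged collection and result list.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Both measures presuppose a complete, pre-judged set of relevant documents, which is the kind of condition that is hard to meet in an open, interactive, heterogeneous environment.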

The challenge here is to relax our traditional approaches to metrics, to seek ways of transforming these familiar points of light into expanded fields of dreams, to find new metrics more appropriate to global, heterogeneous, interactive environments. The objective is to look beyond the familiar community-wide measures of performance and to qualitatively expand potential areas for exploration. But the risk is that inadequate charts of the new territory leave both explorers and pioneers at risk. Metrics are required, but to rigidly bind a community to metrics grounded in prior generations is to foreclose serious exploration of qualitative breakthroughs.

Assumption 3: Better search engines yield better search.

Search engines owe much of their historic development to an implicit assumption of a well-organized, relatively homogeneous collection (the type of collection one would typically find in a library or commercial abstracting & indexing database, for example). The Web violates this assumption. Information sources and resources on the Web are highly diverse, distributed, and heterogeneous, with greatly varying content and quality. The "end-game strategy" of search (alternatively viewed as the "hunter/gatherer" model of information seeking) loses its effectiveness as information volume and source heterogeneity grow. Increased document and information density resists discrimination by traditional search technologies.

Relaxing this assumption suggests considering other "orthogonal" attributes of the information space, such as context-based value and trans-media semantic similarity measures. Understanding search as the end-game exposes the assumption that the user has gotten to the point where specific results can be specified, sought, and identified. It raises the question about what the opening game and mid-game might be. Making a stronger search engine merely focuses more intently on the back end of the information seeking process, when the more striking contemporary problems may exist at the front end.

Metaphorically speaking, if we think of search engines as magnifying lenses passing over piles of sand, looking for just the right grains of sand, the inevitable result of dramatic increases in the quantity of sand is that increasing quantities of sand will meet the selection attributes of the lens, resulting in potentially many more grains of sand of marginal relevance within the field of view. As information density increases, there is little more that a search engine can do than to register all of the objects which share a terminological attribute with the stated query.
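The sand metaphor can be illustrated with a toy sketch, assuming the simplest possible matcher: one that returns every document sharing at least one term with the query. The function and the sample titles are hypothetical, and real engines rank and weight their results, but the underlying effect is the same: as the pile grows, so does the number of marginal matches.

```python
def term_match(query, documents):
    """Return every document that shares at least one term with the query."""
    query_terms = set(query.lower().split())
    return [doc for doc in documents
            if query_terms & set(doc.lower().split())]


# A small, fairly homogeneous collection (hypothetical titles).
small_pile = [
    "lunar mining survey",
    "asteroid mining economics",
    "library cataloging rules",
]

# The same collection after heterogeneous material is added.
large_pile = small_pile + [
    "data mining for retail sales",
    "mining town history",
    "coal mining safety regulations",
    "garden planting calendar",
]

small_hits = term_match("mining", small_pile)
large_hits = term_match("mining", large_pile)
print(len(small_hits), "matches in the small pile")
print(len(large_hits), "matches in the large pile")
```

The query has not changed, yet the matcher now registers several documents that share only a word with the original need.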

So, what is the opening or mid game? I don't claim extraordinary insight here, but I do look for orthogonal dimensions to the problem. Consider value-based measures, for example. Can we imagine an information retrieval environment which considers the context of the user's needs? Can we envision trans-media semantic similarity measures, in which the intellectual content of an image, a graph, or a formula would weigh as heavily as the words used in the text? Can we deploy a network-based peer review process comparable to that upon which our traditional scholarly journals depend?

What do we risk in considering such factors? That information seekers will need to confront increasing complexity with a degree of increased sophistication. The well-known paradigm of search would give way to a rich toolbox of filters on orthogonal measures of content, context, and value.
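One way to picture such a toolbox is as a set of small, independent filters that can be combined at will. The sketch below is only an illustration under stated assumptions: the Item fields, the filter names, and the use of a peer-review flag as a stand-in for "value" are all hypothetical choices, not a description of any existing system.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Item:
    title: str
    topic: str            # content dimension
    audience: str         # context dimension (who the item was written for)
    peer_reviewed: bool   # value dimension (a crude stand-in for quality)


Filter = Callable[[Item], bool]


def by_content(topic: str) -> Filter:
    return lambda item: item.topic == topic


def by_context(audience: str) -> Filter:
    return lambda item: item.audience == audience


def by_value() -> Filter:
    return lambda item: item.peer_reviewed


def apply_filters(items: Iterable[Item], *filters: Filter) -> List[Item]:
    """Keep only the items that pass every supplied filter."""
    return [item for item in items if all(f(item) for f in filters)]


# Hypothetical collection.
collection = [
    Item("Wavelets for engineers", "signal processing", "practitioner", True),
    Item("Wavelets in one afternoon", "signal processing", "novice", False),
    Item("Cataloging rules", "librarianship", "practitioner", True),
]

content_only = apply_filters(collection, by_content("signal processing"))
all_three = apply_filters(
    collection,
    by_content("signal processing"),
    by_context("practitioner"),
    by_value(),
)
print([item.title for item in content_only])
print([item.title for item in all_three])
```

Filtering on content alone leaves two candidates; adding the context and value filters narrows the result to one. The cost, as noted above, is that the seeker must now understand and choose among several tools rather than type a word into a box.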

Assumption 4: The objective is to find the correct answer.

If one assumes that the typical user is seeking the answer to a well-formed question in a global information space, then one can be led down a path of ever-increasing complexity involving content- and context-sensitive multi-dimensional, trans-lingual search among semantically interoperable heterogeneous repositories with result ranking, relevance feedback, and so on. The inevitable result is that system complexity, intended to serve the user's needs for more refined information tools, instead confounds all but the most sophisticated users in their well-intentioned search for information. Increased complexity of search tools is not likely to significantly assist the average Web searcher, whose queries rarely include more than two terms.

Recasting the objective to perceiving information spaces at variable resolutions and levels of abstraction may serve the needs of many Web-based information seekers more effectively. Such an approach recasts the objective from one of finding an answer to one of understanding an information space. The risk in such an approach is that the focus may shift to browsing haystacks when the requirement demands seeking needles.

So is the typical user of networked information seeking the "right answer"? Perhaps, but in most cases, I would surmise not. Assuming so leads to complexity beyond anything we ever had before (recall that the typical query on the Web is only one or two words).

Relaxing this assumption involves recasting the objective and the practice of seeking information as a process of working through levels of abstraction, rather than attempting to zero in and drill down to a particular piece of information. The alternative is to present to the user a Gestalt view of the information space, and to provide a sense of the way it is laid out, rather than jumping right to the end game of search. "Fly-through" metaphors come to mind as an alternative, but these raise immediate questions of the dimensions and character of the field of view, in addition to the means by which the user does, indeed, drill down to the materials most relevant to the problem being addressed.

Assumption 5: The correct answer lies in the information.

If one assumes that an answer exists, and that answer can be found in the body of information being searched, then this leads to a focus on information artifacts. As a result, correlations that require collaborative expertise among individuals interacting with information may be missed. Relaxing this assumption leads to a requirement for seamless interoperability among searching, authoring, and collaboration facilities, with the derivative requirement for these capabilities to be integral to Distributed Knowledge Environments (DKEs). One of the open problems here is to satisfy the complex quality of service (QoS) requirements for fixed and mobile, synchronous and asynchronous interoperation. Some of these requirements are being explored further in DARPA's programs in advanced networking, global-mobile communication systems, and intelligent collaboration and visualization.

Distributed knowledge environments are composed, at least, of collaboration, information analysis, and authoring facilities. If a user is to find, interact, and collaborate with both information resources and people in a network environment, then quality of service issues become very significant, as the system must seamlessly integrate synchronous as well as asynchronous sources of information and services.

Assumption 6: Search is the place to start.

The effect of this (historically valid) assumption is to focus, prematurely, on inadequate analytic tools for global, distributed, heterogeneous information sources. The result is that despite its potential, the Web remains largely unusable for vast numbers of serious professionals. Relaxing this assumption requires exploring new metaphors and algorithms for hierarchical abstraction, analysis, and visualization. Rigorously ill-defined but instinctively appealing concepts such as semantic signal processing suggest directions to pursue here. Early investigations in this direction indicate the need for new mathematical concepts and constructs. An emerging view is to consider the expert information analyst as a master craftsman, armed with a library of analytic, discrimination, and visualization tools to explore n-dimensional information spaces, seeking appropriate chunking, correlation, and visualization primitives. The risk is obvious: the theoretical foundations may be too weak. This is not cause to be daunted; it is, instead, cause to develop the necessary foundations, theories, and techniques.

Relaxing this assumption requires being open to new metaphors and algorithms for hierarchical abstraction, analysis, and visualization. The search metaphor (which I will characterize as a one-size-fits-all solution), for example, may need to make room for a more flexible metaphor (such as the toolbox metaphor used in signal processing). A richer set of tools gives the analyst a richer opportunity for discrimination along dimensions relevant to the immediate problem. A good expert, using the right kind of tools, could discriminate the signal (and, hence, the information) they need from the vast and diverse resources available.

So, we have begun wrestling with the question of how some of these ideas from signal processing and related disciplines might influence our thinking on the future of digital libraries and information retrieval, and have begun playing with concepts like "semantic wavelets" and "semantic signal processing", with little rigor behind what these terms actually mean. Casual reflection on the process we are engaged in suggests an evolutionary process, for which the immediate next steps are to transform the information professional from a hunter/gatherer to a master craftsman.

Assumption 7: Distributed Knowledge Environments (DKEs) are for everyone.

This assumption is very attractive to those seeking justification for federal investment. The result, however, is an inordinate emphasis on near-term results for low-end users, at the expense of long-term progress derived from the strategic opportunities represented by high-end users. Progress at the high end enables the longer-term mass deployment of new capability, and truly challenges the technology. Relaxing this assumption opens the way for building DKEs for elite teams which are highly mobile and distributed. While these are the technical challenges which are strategically relevant, they also risk a perception that DKEs focus on the elite requirements of the highly trained, rather than the broad-based requirements of the masses. The oft-neglected secondary effect, however, is that those capabilities developed for elite teams quickly become part of the infrastructure available to all.

Conclusion: Out of the Box

We are in Santa Fe because we need people who can think out of the box. And we need your help to identify the opportunities, clarify the challenges, and define the "generation after next" tools. I recall Bill's comments on "the book" - of course, we all still carry books. On the way out here I was reading Geoffrey Nunberg's The Future of the Book. Despite the title, little is said about the technological future of that artifact which we now know as "the book." Why, for example, must it be embodied exclusively in paper? What precludes a book from having digitally-rendered representations of its content between its covers? Just as a network-delivered digital rendition of an intellectual work can contain a wide diversity of materials, from text through multi-media, what precludes a more traditional print text from including materials in an appropriately encapsulated rendering, say, on the inside of the cover? Or on a digitally active paper? Ultimately, can we envision a book-like artifact that is, in fact, independent of paper?

I firmly believe that the future has room for digitally rendered, as well as physically rendered (as in "books") containers for the intellectual output of humans. But whereas some would cast the digital artifact in counterpoint and in competition with the physical artifacts, it seems more likely to me that these will increasingly comprise a broad spectrum of information resources, more blurred by similarity than distinguished by difference. Our distributed knowledge environments must inevitably reflect both this continuum and this diversity.

hdl:cnri.dlib/april97-larsen