
D-Lib Magazine
November 1999

Volume 5 Number 11

ISSN 1082-9873

BibRelEx

Exploring Bibliographic Databases by Visualization of Annotated Content-Based Relations


Anne Brüggemann-Klein
Technische Universität München, Germany
[email protected]

Rolf Klein
FernUniversität Hagen, Germany
[email protected]

Britta Landgraf
FernUniversität Hagen, Germany
[email protected]


Abstract

Traditional searching and browsing functions for bibliographic databases no longer enable users to deal efficiently with the rapidly growing number of scientific publications. The main goal of our project BibRelEx is to develop a new method based on the visualization of content-based relations between documents such as cites, succeeds, improves with respect to. BibRelEx will therefore use these relationships for effective exploration. In addition, BibRelEx will take advantage of the additional insights into the area that can result from the aggregation of expert knowledge, which complements the specialized knowledge represented in the documents themselves. We are preparing to test this approach using a bibliographic database in a specific area of computer science.

1. Introduction

It is well known how difficult it is to find relevant literature using only conventional retrieval methods. Often one needs a survey article to get quick access to a certain area or to discover the central publications in that area. Here content-based relations between documents, such as the citation relation, are helpful. For example, one can detect survey articles by the fact that they cite many papers in a given area, whereas fundamental contributions are cited by many. Also, thematically related documents can easily be determined with the help of the citation relation since they have similar citations.
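
As an illustration of these heuristics, the following minimal Python sketch counts outgoing and incoming citations and measures how similar two reference lists are. The function names, the toy data, and the use of Jaccard similarity as the overlap measure are assumptions made for this example; they are not taken from BibRelEx.

```python
from collections import defaultdict


def citation_degrees(cites):
    """Return (out_degree, in_degree) for a citation relation.

    `cites` maps a document key to the set of keys it cites.
    A high out-degree hints at a survey article, a high in-degree
    at a fundamental contribution.
    """
    out_degree = {doc: len(refs) for doc, refs in cites.items()}
    in_degree = defaultdict(int)
    for refs in cites.values():
        for ref in refs:
            in_degree[ref] += 1
    return out_degree, dict(in_degree)


def reference_overlap(cites, a, b):
    """Jaccard similarity of the reference lists of documents a and b."""
    refs_a = cites.get(a, set())
    refs_b = cites.get(b, set())
    union = refs_a | refs_b
    return len(refs_a & refs_b) / len(union) if union else 0.0


# Toy data with hypothetical keys.
cites = {
    "survey": {"p1", "p2", "p3", "p4"},
    "p2": {"p1"},
    "p3": {"p1", "p2"},
    "p4": {"p1", "p2"},
}
out_degree, in_degree = citation_degrees(cites)
```

In this toy data, "survey" has the largest out-degree and "p1" the largest in-degree, while "p3" and "p4" have identical reference lists.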

The answers to queries concerning content-based relations could, in principle, be output in textual form as lists. However, the following query demonstrates the value of a graphical visualization of relationships: What further work has been influenced by a given publication A? The result set for this query could be visualized -- as shown in Figure 1 -- by a graph whose nodes represent the documents that directly or indirectly cite document A, and whose edges express the citation relation, resulting in an at-a-glance picture of A's sphere of influence. In this example, documents are represented by keys like b-ct-93 that are derived from the authors' names, the title of the document, and the year of appearance.

Figure 1:  Influence of a publication
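
The result set of such a query can be computed by a breadth-first search over the reversed citation relation. The sketch below is one possible formulation in Python; the function name and the sample keys are hypothetical.

```python
from collections import defaultdict, deque


def influenced_by(cites, source):
    """Return (documents, edges) of the influence graph of `source`.

    `cites` maps a document key to the set of keys it cites.
    `documents` are all keys that cite `source` directly or indirectly;
    `edges` are the (citing, cited) pairs connecting them.
    """
    # Invert the relation: for every document, who cites it?
    cited_by = defaultdict(set)
    for doc, refs in cites.items():
        for ref in refs:
            cited_by[ref].add(doc)

    documents, edges = set(), set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for citing in cited_by.get(current, ()):
            edges.add((citing, current))
            if citing != source and citing not in documents:
                documents.add(citing)
                queue.append(citing)
    return documents, edges


# Hypothetical keys: B cites A, C cites B, D cites A and C, E is unrelated.
cites = {"B": {"A"}, "C": {"B"}, "D": {"A", "C"}, "E": {"X"}}
documents, edges = influenced_by(cites, "A")
```

The returned nodes and edges are exactly what a graph drawing component needs in order to display a picture like Figure 1.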

Under the citation relation, documents are arranged by fields of knowledge, independently of their language and of the words that happen to occur in titles or keywords. Thus, two substantial problems that are common with classical retrieval methods are avoided: the selection of suitable search words and the consideration of the context.

When searching for information, the advice of human experts in the respective area is valuable. For example, an expert can point to the few best publications in an area without providing an exhaustive list of less valuable material. Since human experts are not always around, it is highly desirable that such expert knowledge be integrated into the database. We suggest enriching the database by public annotations contributed by expert users, that can be attached to documents or to relations between documents. The relations annotated may be formal, e.g., cites, or more generally content-based, so that it would be hard to automatically extract them from the documents. For instance, a new document B could in a specific way generalize a document A that is already available in the database; see Figure 2.

 
Figure 2:  Incorporation of a generalization relation into the database

Not only does BibRelEx provide the means for entering public annotations; our system also allows all users actively working in a scientific area to customize their information spaces by adding their own views as private information. This could be achieved, for example, by entering additional publications into the database that are frequently referred to by the user, or by subjective annotations like the following:

There exist many individual solutions for particular aspects of visualization, knowledge management, and collaboration. With our project, BibRelEx, we will for the first time integrate most of these techniques. To summarize: With BibRelEx, content-based relations between documents such as cites, succeeds, improves with respect to can be visualized and used for a more efficient exploration. In addition, we encourage users to attach their own annotations to documents or to pairs of related documents. Such annotations may be either private or public. The aggregation of public annotations contributed by expert users represents insight into the area that exceeds the knowledge that is represented in the documents themselves.
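
One way to picture the information involved is a small data model in which a relation connects two document keys and an annotation is attached either to a document or to a relation. The Python sketch below rests on assumed names and fields; it does not describe the actual BibRelEx data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relation:
    kind: str    # e.g. "cites", "succeeds", "generalizes"
    source: str  # key of the document the relation starts from
    target: str  # key of the document the relation points to


@dataclass
class Annotation:
    author: str   # user who wrote the annotation
    text: str
    public: bool  # public annotations are visible to every user
    document: Optional[str] = None       # set if attached to a document
    relation: Optional[Relation] = None  # set if attached to a relation


def visible_annotations(annotations, user):
    """Annotations a given user may see: all public ones plus their own."""
    return [a for a in annotations if a.public or a.author == user]


# Hypothetical example: document B generalizes document A.
generalizes = Relation("generalizes", "B", "A")
annotations = [
    Annotation("expert", "B extends the result of A.", True, relation=generalizes),
    Annotation("alice", "Reread before the seminar.", False, document="A"),
]
```

With this data, a user other than "alice" sees only the public annotation on the relation, while "alice" sees both.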

In order to test our ideas in a real application, we are building BibRelEx on top of the bibliographic database geombib, which is described in Section 3. The restriction to this particular field enables us to enter -- in addition to the citation relation -- some content-based relations based on expert knowledge. Although our focus in this project is on a particular bibliographic database in a specific field, BibRelEx is developed in such a way that it can easily be applied to other fields of knowledge and other databases, or to digital libraries.

2. Related Work

Following up the references contained in a scientific document has always been an essential part of systematic study. It is not surprising that several systems or projects exist that aim to support some of the features built into BibRelEx.

Well known is the Science Citation Index (SCI) [ISI], which lists all publications cited by a document. The SCI also offers further search options. There is a register of all cited authors (citation index), which enables the following type of forward search: One can ask in which publications a selected author is cited. In addition to the bibliographic citation of a paper, SCI also shows the bibliography written by the author. Thus, a backward type of search is also possible.

D.M. Jones [Jon95] has originated the Hypertext Bibliography Project. A Web-based system, it employs hypertext links to establish citations, and sets up one Web page per document containing a list of links to papers that cite it. This project is already well under way. It covers a selection of major journals and conferences. Keyword search and access by author name are supported. The abstracts for many documents are available on-line.

The Trier Computer Science Bibliography DBLP is a related project. Ley [Ley97] took over parts of the data from the Hypertext Bibliography Project to supplement his database. So far, all references contained in the conference proceedings of PODS, SIGMOD, VLDB, and in the magazine The Data Engineering Bulletin, and part of the references from the TODS magazine, have been entered.

R.D. Cameron [Cam97] suggests building a universal citation database that links all publications ever written through their citations. He discusses a distributed approach for such a database, using different servers for different sources such as journals, conferences, etc. In order to make searching over the Internet as efficient as possible, each server has to manage both the bibliographic data and citation references from and to the publications stored.

For the implementation of relation networks, hypertext systems like HyperWave [Mau96] are well-suited; they can manage both annotations and relations between documents.

Although representing relations explicitly by hyperlinks is a flexible method, it creates a few problems. Hyperlinks can be unstable and need to be conscientiously maintained. There may be a number of copies or versions of the same document in different formats, e.g., Postscript, PDF, HTML. These problems, and some basic approaches, are discussed in a D-Lib Magazine article by Caplan and Arms [CA99]. The International DOI Foundation proposes to use Digital Object Identifiers (DOI) [DD98, Pas99] instead of URLs. A DOI is a persistent identifier for managing copyrighted material. SLinkS [Hel99] is an intermediary service which uses a URL templating language to construct a URL based on bibliographic data and metadata, to give information about what the URL leads to and how it may be used.

Furthermore, there are some projects concerning automatic citation indexing; see, for instance, Bollacker, et al., and Hitchcock, et al., [BLG98, HCH+98]. BibRelEx differs from these projects in that our focus is not on the creation of citation indexes, but on their effective use in literature search. In addition, we will exploit more flexible content-based relations that result from the aggregation of expert knowledge.

To our knowledge, there is, at present, only one system that enables a global overview based on literature references. With VxInsight of Sandia [DHJ+98] fields of knowledge are represented as landscapes, over which the user can fly using virtual reality techniques. The deeper one flies, the more sub-areas become visible. On the lowest level, the titles of the journal articles are represented. The mapping algorithm clusters documents by the number of common citations the papers list. The citation data were taken from the Science Citation Index mentioned above.

3. The Bibliographic Database Geombib

Before we present the design of BibRelEx, a brief description of geombib [EJS98], the sample bibliographic database that we are currently using, is in order. At the time our project started, geombib contained about 8700 bibliographic entries in the area of computational geometry. It references magazine articles, conference contributions, and technical reports. All entries are in BibTeX format; this is one of the reasons why geombib became so popular, because this format helps users create their own bibliographies in LaTeX documents.

Geombib is being maintained by B. Jones of the University of Saskatchewan, and it is updated by a community effort. Typically, users download the whole database which consists of a single file. To the local copy, users can add new entries of recent publications, or correct incomplete or erroneous entries. After four months, the resulting file is compared against the original one, and the differences are submitted to the administrator. From the data submitted, a new release is compiled and distributed one month later.

This approach works surprisingly well, thanks to the effort of many researchers who do not shrink from entering whole proceedings volumes. As a result, the database covers most of the existing publications in this field, including technical reports and workshop proceedings.

4. BibRelEx

4.1 BibConsist and BibManage

Once the decision for building BibRelEx on top of an existing database had been made, two consequences were immediate. First, the system had to be designed in such a way that geombib in its current form and BibRelEx could coexist; geombib as it currently exists has to remain fully operational. Second, we also had to provide a critical mass of citation information and annotations, so that geometers can use the new exploration methods to their advantage.

In order to exploit content-based relations among documents for navigation, the relations themselves must be present in the database. Geombib already provides, for each database entry, optional fields named cites, precedes, succeeds, annote, but less than ten percent of them have so far been filled in.

Therefore, we started our project by entering the citations contained in the proceedings of the large conferences (SoCG, CCCG, etc.). While most of the papers published in the proceedings were already represented in geombib, much of the cited work was not; together, about 1500 new entries were generated and about 4850 links filled in.

When inserting new data records, we encountered a problem. In geombib, each entry is identified by a unique key, which is generated according to fixed rules from the authors' names, the initial characters of the significant words in the title, and the year of appearance. Upon insertion into geombib, duplicate keys are detected automatically. However, it is not uncommon that data-entry errors result in duplicates that give rise to different keys. This happens, for example, if the authors' names have been entered in different order, so that the keys generated by geombib differ. Also, misspelling the paper's title can result in different keys.
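
The effect of author order on the key can be illustrated with a toy key generator. The exact geombib rules are not reproduced here: the stop-word list, the use of surname initials, and the two-digit year are assumptions chosen only to yield keys of a similar shape.

```python
import re

# Assumed list of insignificant title words; the real rules may differ.
STOP_WORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to", "with"}


def make_key(surnames, title, year):
    """Build a key from author initials, title-word initials, and year."""
    author_part = "".join(name[0].lower() for name in surnames)
    words = re.findall(r"[A-Za-z]+", title)
    title_part = "".join(w[0].lower() for w in words if w.lower() not in STOP_WORDS)
    return f"{author_part}-{title_part}-{year % 100:02d}"


# The same hypothetical paper, entered with the authors in different order.
key_1 = make_key(["Smith", "Jones"], "A Note on Planar Graphs", 1999)
key_2 = make_key(["Jones", "Smith"], "A Note on Planar Graphs", 1999)
```

Here key_1 is "sj-npg-99" and key_2 is "js-npg-99", so a check for identical keys would not recognize the two records as duplicates.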

In order to detect such duplicates, we have developed the tool BibConsist [Lan97]. It checks if the corresponding fields of two entries have similar contents. The Soundex code described by Knuth [Knu73] is used for measuring the phonetic similarity of words. Using BibConsist we were able to detect many duplicates in geombib.
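
For readers unfamiliar with Soundex, the following Python sketch implements the common four-character variant in which h and w do not separate letters with equal codes. The field comparison `similar_fields` is a hypothetical helper that compares two strings word by word; it is not the comparison rule of BibConsist, which is not specified here.

```python
# Letter groups and their Soundex digits; vowels, h, w, and y have no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the four-character Soundex code of a word ('' if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not reset the previous code
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code to ""
    return (result + "000")[:4]


def similar_fields(a, b):
    """Hypothetical check: do two fields sound alike word by word?"""
    return [soundex(w) for w in a.split()] == [soundex(w) for w in b.split()]
```

For example, "Robert" and "Rupert" both map to R163, and the misspelled title "Voronoy diagramms" is recognized as similar to "Voronoi diagrams".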

BibConsist can be used twice in the update cycle of geombib. First, at merging time, the administrator can use it to detect duplicates in all updates suggested by the users and to guarantee the consistency of the new geombib release. Second, while creating and formatting entries, BibConsist can be used to prevent users from entering faulty records.

In the four-month period between two geombib releases the user typically maintains a local version of the database that contains all recent updates suggested by the user. In addition, many users are maintaining their own bibliographic databases containing the data of such publications that are interesting to them, and that are likely to be cited in their own work, but do not necessarily belong to the area of computational geometry (for example, a textbook on topology from which but one theorem is quoted).

In order to keep these different databases consistent, we are currently developing the tool BibManage for maintaining public and private local versions of the database under periodic updates. Each time a new release of geombib appears, this tool checks whether it contains all updates suggested by the user. In case an update was suggested but has not been performed by the administrator, the users can choose whether to submit their suggestions again or insert a corresponding entry into their personal bibliographies. BibManage also checks whether some documents contained in the personal bibliography do now appear in geombib, due to some other user's request. Such entries are then removed. Furthermore, the tool guarantees transparency between updates, in that all local databases are treated like a single one. Internally, however, the personal bibliography and the update database have priority over the current geombib version. New update suggestions to be submitted to the administrator are automatically created by BibManage.
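
The checks performed after a new release can be sketched as follows in Python. Each database is modeled as a dictionary from key to entry. The function name, the sample data, and the relative order of the personal bibliography and the update database in the combined view are assumptions for this example.

```python
from collections import ChainMap


def reconcile(release, suggested, personal):
    """Compare local databases against a new release.

    Returns (pending, remaining_personal):
    - pending: suggested updates that the release does not contain yet,
    - remaining_personal: personal entries not covered by the release.
    """
    pending = {k: e for k, e in suggested.items() if release.get(k) != e}
    remaining_personal = {k: e for k, e in personal.items() if k not in release}
    return pending, remaining_personal


# Hypothetical keys and entries.
release = {"x-aa-98": {"title": "A"}, "y-bb-99": {"title": "B"}}
suggested = {"y-bb-99": {"title": "B"}, "z-cc-99": {"title": "C"}}
personal = {"x-aa-98": {"title": "A"}, "t-top-70": {"title": "Topology"}}

pending, remaining_personal = reconcile(release, suggested, personal)

# A single view over all databases; earlier mappings take priority.
view = ChainMap(remaining_personal, pending, release)
```

In this example, "z-cc-99" remains pending and can be resubmitted or moved to the personal bibliography, "x-aa-98" is dropped from the personal bibliography because the release now contains it, and `view` lets the user look up all four entries as if they were stored in one database.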

In addition, a user interface for editing existing records and entering new records, annotations and relations into either database is provided.

Altogether BibManage supports the update process of geombib in the following three situations:

Figure 3 shows the use of BibConsist and BibManage in the updating cycle of geombib.

 
Figure 3:  Update cycle of geombib

4.2 Visualization

As was pointed out in the introduction, visualizing the content-based relations among documents is a key issue in our approach. We have examined a number of visualization systems in order to find a system suitable for BibRelEx. Such a system should enable a three-dimensional representation of relation networks and offer comfortable navigation methods. Moreover, it should be possible for the user to select nodes and edges by mouse-clicks, thus displaying, in a text window, further information like the full bibliographic entry or annotations attached to the object.

Force-directed algorithms such as the spring embedder [KF94, Sim96] are well suited as layout algorithms for our purposes. This algorithm is based on the simulation of a mechanical process. Nodes carry mutually repulsive charges, and edges are modeled by springs that strive to contract. Initially, all nodes are placed randomly. Then a series of iterations is performed in which the nodes move in space under the influence of these forces until a minimum energy state of the system is reached. This state corresponds to a proper layout. In the resulting graph, highly-connected nodes are placed close to each other. Applied to citation networks, this means that clusters of nodes represent documents that have similar references. Thus the user can easily detect which documents are related by content.
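
The principle can be made concrete with a small three-dimensional force-directed layout in Python. This is a sketch, not the algorithm of [KF94]: the force laws (repulsion k*k/d, attraction d*d/k), the linearly decreasing step limit, and all parameter values are assumptions borrowed from common textbook formulations.

```python
import math
import random


def spring_layout_3d(nodes, edges, iterations=300, k=1.0, start_temp=0.1, seed=0):
    """Return a dict node -> [x, y, z] computed by a simple spring embedder."""
    nodes = list(nodes)
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0) for _ in range(3)] for n in nodes}
    for it in range(iterations):
        force = {n: [0.0, 0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                delta = [a - b for a, b in zip(pos[u], pos[v])]
                dist = max(math.sqrt(sum(c * c for c in delta)), 1e-9)
                push = k * k / dist
                for axis in range(3):
                    force[u][axis] += push * delta[axis] / dist
                    force[v][axis] -= push * delta[axis] / dist
        # Attraction along every edge (the contracting springs).
        for u, v in edges:
            delta = [a - b for a, b in zip(pos[u], pos[v])]
            dist = max(math.sqrt(sum(c * c for c in delta)), 1e-9)
            pull = dist * dist / k
            for axis in range(3):
                force[u][axis] -= pull * delta[axis] / dist
                force[v][axis] += pull * delta[axis] / dist
        # Move each node along its force, limited by a shrinking step size.
        temp = start_temp * (1.0 - it / iterations)
        for n in nodes:
            mag = math.sqrt(sum(c * c for c in force[n]))
            if mag > 0.0:
                scale = min(mag, temp) / mag
                for axis in range(3):
                    pos[n][axis] += force[n][axis] * scale
    return pos


# Two densely connected groups joined by a single edge (hypothetical keys).
group_a = ["a1", "a2", "a3", "a4"]
group_b = ["b1", "b2", "b3", "b4"]
edges = [(u, v) for g in (group_a, group_b)
         for i, u in enumerate(g) for v in g[i + 1:]]
edges.append(("a1", "b1"))
pos = spring_layout_3d(group_a + group_b, edges)
```

All pairs of nodes are examined in every iteration, so the running time per iteration grows quadratically with the number of nodes.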

In its original form, the spring embedder algorithm gives good results only for small-sized graphs. With large graphs, heuristic methods must be employed in order to achieve an acceptable run-time behavior. One achieves good results with the GEM3D algorithm [BF95], which uses an additional "virtual temperature" for the adjustment of nodes.

The visualization system should also be able to cooperate easily with BibManage in order to combine the classical retrieval methods offered by BibManage with the visual study of the information space.

Unfortunately, it seems that no system exists at present that completely meets our requirements. Therefore, we decided to use the LEDA library for implementing a large part of the functionality described above. LEDA provides basic data structures for graphs, numerous geometrical algorithms and components for the creation of a user interface. In addition, LEDA is extendable. Figure 4 has been produced with LEDA; it shows part of the citation net that is contained in geombib at present. Represented are those documents that contain the word "Voronoi" in their titles. Only documents that themselves cite other documents are included. In the figure, one can easily detect "cd-vdbcd-85" as a central work.

Figure 4:  Representation of a citation net with LEDA

4.3 Future Work

As of today, LEDA offers some support for three-dimensional graph representation. A three-dimensional spring embedder is available; however, it is rather slow for our purposes. Only a reduced set of navigation methods is offered. Therefore, in our next step we will start to extend the representation of diagrams. For example, we will incorporate a faster layout algorithm, e.g., GEM3D. To obtain better means of navigation, we will test the following approach: convert the layout produced by LEDA into a VRML file and then use a VRML Viewer such as VRwave. Since we want to combine the visualization with classical retrieval methods, we must check whether this method provides a sufficiently dynamic layout.
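
The conversion step could look roughly like the following Python sketch, which writes a three-dimensional layout as a VRML 2.0 scene with one sphere per document and one line segment per relation. It does not use LEDA; the function name, the input format (a dictionary of coordinates plus a list of edges), and the sphere radius are assumptions.

```python
def layout_to_vrml(positions, edges, radius=0.05):
    """Return a VRML 2.0 scene for a layout.

    `positions` maps a node key to (x, y, z); `edges` is a list of key pairs.
    """
    keys = list(positions)
    index = {key: i for i, key in enumerate(keys)}
    lines = ["#VRML V2.0 utf8"]
    # One translated sphere per node.
    for key in keys:
        x, y, z = positions[key]
        lines.append(
            "Transform { translation %.3f %.3f %.3f children [ "
            "Shape { appearance Appearance { material Material { } } "
            "geometry Sphere { radius %.3f } } ] }" % (x, y, z, radius)
        )
    # All edges in a single IndexedLineSet; -1 terminates each polyline.
    points = ", ".join("%.3f %.3f %.3f" % tuple(positions[k]) for k in keys)
    coord_index = ", ".join("%d, %d, -1" % (index[u], index[v]) for u, v in edges)
    lines.append(
        "Shape { geometry IndexedLineSet { coord Coordinate { point [ %s ] } "
        "coordIndex [ %s ] } }" % (points, coord_index)
    )
    return "\n".join(lines) + "\n"


# Hypothetical three-node layout.
positions = {"a": (0, 0, 0), "b": (1, 0, 0), "c": (1, 1, 0)}
vrml = layout_to_vrml(positions, [("a", "b"), ("b", "c")])
```

A file generated this way is static: every change of the displayed subgraph requires a new export, which is one reason why the dynamic behavior of this approach needs to be examined.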

Parallel to this implementation work, the database must be extended in such a way that a critical mass of information is available, making the use of BibRelEx worthwhile to the geometry community. Due to our previous efforts, there are at present about 13,000 references available in geombib, surely enough to test the system under a real load. To provide additional content-based relations based on expert knowledge, we will sift the available literature in the areas of Voronoi diagrams and on-line algorithms, put these publications in relation to each other, and annotate them.

Another interesting question is how to maintain the database contents over time. We feel that this task should no longer be left to volunteers. Rather, we suggest that the authors themselves submit, along with their papers, a proposal for an annotated geombib entry to a conference or to a journal. The referees could, without additional effort, check whether the proposed entry is correct and complete and then forward it to the database manager. This approach, which has also been suggested by Cameron [Cam97], is in the authors' best interest.

References

BF95
I. Bruß and A. Frick. Fast interactive 3-D graph visualization. In Proceedings of the 3rd International Symposium on Graph Drawing (GD'95). Springer Lecture Notes in Computer Science 1027, pages 99-110, 1995.

BLG98
Kurt Bollacker, Steve Lawrence, and C. Lee Giles. CiteSeer: An autonomous Web agent for automatic retrieval and identification of interesting publications. In Katia P. Sycara and Michael Wooldridge, editors, Proceedings of the Second International Conference on Autonomous Agents, pages 116-123, New York, 1998. ACM Press.

CA99
P. Caplan and W.Y. Arms. Reference linking for journal articles. D-Lib Magazine, 5(7/8), July/August 1999.
http://www.dlib.org/dlib/july99/caplan/07caplan.html

Cam97
R.D. Cameron. A universal citation database as a catalyst for reform in scholarly communication. First Monday, 2(4), 1997. http://www.firstmonday.dk/issues/issue2_4/cameron/index.html

DD98
Lloyd A. Davidson and Kimberly Douglas. Promise and problems for scholarly publishing. Journal of Electronic Publishing, 4(2), April 1998.
http://www.press.umich.edu/jep/04-02/davidson.html

DHJ+98
G.S. Davidson, B. Hendrickson, D.K. Johnson, Ch.E. Meyers, and B.N. Wylie. Knowledge mining with VxInsight: Discovery through interaction. Journal of Intelligent Information Systems, Integrating Artificial Intelligence and Database Technologies, 11(3):259-285, 1998.
http://www.cs.sandia.gov/projects/VxInsight/VxPaper.html

EJS98
J. Erickson, B. Jones and O. Schwarzkopf. More information about the database.
http://www.cs.duke.edu/~jeffe/compgeom/geombib/geombib_1.html

HCH+98
Steve Hitchcock, Les Carr, Wendy Hall, Steve Harris, Steve Probets, David Evans, and David Brailsford. Linking electronic journals: Lessons from the open journal project. D-Lib Magazine, December 1998.
http://www.dlib.org/dlib/december98/12hitchcock.html

Hel99
E.S. Hellman. Scholarly link specification framework (s-link-s), 1999.
http://www.openly.com/SLinkS/

ISI
Institute for Scientific Information. Science Citation Index.
http://www.isinet.com/products/citation/citsci.html

Jon95
D.M. Jones. The hypertext bibliography project, 1995.
http://theory.lcs.mit.edu/~dmjones/hbp/info.html

KF94
A. Kumar and R.H. Fowler. A spring modelling algorithm to position nodes of an undirected graph in three dimensions. Technical report, Department of Computer Science, University of Texas, 1994.

Knu73
D.E. Knuth. The Art of Computer Programming, volume 3: Sorting and Searching. Addison-Wesley, Reading, Massachusetts, 1973.

Lan97
B. Landgraf. BibConsist: A program to check BibTeX files for inconsistencies, 1997.
http://wwwpi6.fernuni-hagen.de/Forschung/BibRelEx/BibConsist.html#BIBCONS

Ley97
M. Ley. Die Trierer Informatik-Bibliographie DBLP. Technical report, Universität Trier, FB 4, 1997.

Mau96
H. Maurer, editor. HyperWave: The Next Generation Web Solution. Addison Wesley Longman, Reading, Massachusetts, 1996.

Pas99
Norman Paskin. DOI: Current status and outlook. D-Lib Magazine, May 1999.
http://www.dlib.org/dlib/may99/05paskin.html

Sim96
S. Sim. Automatic graph drawing algorithms, 1996.
http://www.db.toronto.edu/~simsuz/papers/grafdraw.ps.gz

Copyright © 1999 Anne Brüggemann-Klein, Rolf Klein and Britta Landgraf


DOI: 10.1045/november99-landgraf