
Conference Report


D-Lib Magazine
January/February 2008

Volume 14 Number 1/2

ISSN 1082-9873

Next Steps for E-Science, the Textual Humanities and VREs

A Report on Text and Grid: Research Questions for the Humanities, Sciences and Industry, UK e-Science All Hands Meeting 2007

 

Stuart Dunn and Tobias Blanke
Arts and Humanities e-Science Support Centre
Centre for e-Research, King's College London
26-29 Drury Lane
London WC2B 5RL
+44 20 7848 2709

Correspondence: <stuart.dunn@kcl.ac.uk>


Although Bill Gates famously remarked, 'The paper-based book, magazine, or newspaper still has a lot of advantages over its digital counterpart' (Gates 1996: 130), the digital text is becoming increasingly important in the arts and humanities communities. The arts and humanities have long enjoyed a close relationship with the library and information science community, to the great benefit of both. Recent strategic developments in the former, however, particularly the launch in 2005 of the AHRC-JISC-EPSRC Arts and Humanities e-Science Initiative (http://www.ahrc.ac.uk/e-science), necessitate a re-examination of what the arts and humanities need from digital textual technologies. Although the 'digital humanities' traditionally have an extensive overlap with textual analysis and research, the obvious point must be made that the use and research of text is by no means confined to the humanities: the scientific and industrial sectors are just as dependent on text. Engagement between the existing body of expertise and experience in the text-driven digital humanities and the advanced network and grid technologies ushered in by the e-science programme is therefore in the interest of all. To this end, the Arts and Humanities e-Science Support Centre (AHeSSC) at King's College London organized a workshop at the 2007 e-Science All Hands Meeting in Nottingham (http://www.allhands.org.uk/news/textgridws_call.cfm), entitled Text and Grid: Research Questions for the Humanities, Sciences and Industry. This report summarises the main points that emerged from that workshop, and outlines a medium-term research agenda for how that process of engagement can proceed.

Four papers were given in the workshop. The first, presented by Dolores Iorizzo and Brian Fuchs of Imperial College, gave an overview of what the humanities need from a global grid infrastructure (Crane et al 2007). The crucial point they made was that digital libraries are far more than simple digital surrogates of existing conventional libraries. They are, or at least have the potential to be, complex Virtual Research Environments (VREs), but researchers and e-infrastructure providers in the humanities lack the resources to realize this full potential. There is a need for service-oriented architectures that, at the very least, deliver textual resources from digital libraries to the researcher's desktop. Remote access to content, however, is only the most basic of requirements. Within such VREs, the concepts of 'reader' and 'author' need to be rethought. So-called Web 2.0 technologies, allied with service-oriented data delivery services, would enable readers to interact creatively with texts, by (for example) selecting elements from different libraries, or using customization tools remotely, to annotate, aggregate, compare and structure text according to their own research needs. In other words, placing digital libraries within global infrastructures would allow the reader/user to break down the distinction between library A, library B and their own desktop. This would also allow the same user/reader to define particular chunks of text they wished to examine, rather than having to select and download the entire text, thereby saving on computational and other resources, as well as empowering the researcher.
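The idea of requesting only a defined chunk of a text, rather than downloading the whole object, can be pictured as a parameterised web-service call. The sketch below is illustrative only: the endpoint and parameter names are invented, and do not belong to any actual digital library API.

```python
import urllib.parse

def chunk_request_url(base, text_id, start, end):
    """Build a request for a span of a text rather than the whole object.
    The '/chunk' endpoint and parameter names are invented for illustration;
    a real digital library would define its own service interface."""
    query = urllib.parse.urlencode({"id": text_id, "from": start, "to": end})
    return f"{base}/chunk?{query}"

# Request only lines 1-25 of a (hypothetical) text held remotely.
url = chunk_request_url("https://library.example.org", "iliad-bk1", 1, 25)
```

The point of such an interface is that selection happens on the server, so the reader's desktop never has to hold, or pay the transfer cost of, the full object.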

This vision provided a background for the three following papers, which illustrated specific technical angles in the latest approaches. Sander Wubben of Tilburg University presented the Open Boek project (Paijmans and Wubben 2007), part of Tilburg's 'Continuous Access to Cultural Heritage' (CATCH) programme. Open Boek extracts numerical information from archaeological literature – a living example of the kind of creative interaction envisioned in the first paper. The system is trained with a memory-based learning process using k-nearest neighbour (KNN) classification, which allows the user to extract information in the form of (for example) coordinates, dates and dimensions, all key elements of information in the archaeological literature. Although this is based on a framework in which the texts are indexed at document and page level, it is a clear example of the need identified by Iorizzo and Fuchs to identify and extract particular elements of text, and particular types of element, at sub-object level. Paijmans and Wubben's paper showed, in other words, how their approach could allow users or readers to focus on particular kinds of content across texts, rather than on a conventional library framework defined by the texts themselves.
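The memory-based, k-nearest-neighbour idea behind Open Boek can be illustrated with a toy token classifier: stored training instances are compared to a new token by feature overlap, and the nearest instances supply the label. The features, 'memory' and labels below are invented, and far simpler than the project's actual classifier.

```python
from collections import Counter

def features(token, prev):
    # Crude surface features: is the token numeric, and which word precedes it?
    return (token.isdigit(), prev.lower())

# Toy 'memory' of labelled instances, standing in for a corpus of
# annotated archaeological reports. Features and labels are invented.
memory = [
    ((True, "circa"), "DATE"),
    ((True, "depth"), "DIMENSION"),
    ((True, "x"), "COORDINATE"),
    ((False, "the"), "OTHER"),
]

def knn_label(token, prev, k=1):
    """Label a token by feature overlap with the k nearest stored instances."""
    target = features(token, prev)
    def overlap(stored):
        return sum(a == b for a, b in zip(stored, target))
    nearest = sorted(memory, key=lambda m: -overlap(m[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Given enough annotated examples in memory, a token such as '1250' is labelled a date after 'circa' but a dimension after 'depth' – classification by analogy to stored instances rather than by hand-written rules.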

Thinking in terms that reach beyond conventional library frameworks highlights the need to consider the process by which unstructured data becomes structured. This was the primary issue considered by Loretta Auvil of the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, who presented the Software Environment for the Advancement of Scholarly Research (SEASR) project. This API-driven approach enables analyses run by text mining tools, such as NoraVis (http://www.noraproject.org/description.php) and Featurelens (http://www.cs.umd.edu/hcil/textvis/featurelens/), to be published as web services. This is critical: a VRE based on digital library infrastructure will have to include not just texts, but software tools that allow users to analyse, retrieve elements of, and search those texts in ever more sophisticated ways. This requires formal, documented and sharable workflows, and mirrors needs identified in the hard science communities, which are being met by initiatives such as the myExperiment project (http://www.myexperiment.org), a key priority of which is to implement formal yet sharable workflows across different research domains. As different domains have very different protocols for structuring and managing textual archives, the ability to use tools such as NoraVis and Featurelens in a SEASR-type environment will become ever more important in the development of VREs for textual studies. For example, a numerical extraction system like that presented by the Open Boek project has significant utility when applied to archaeological reports, but such utility is clearly not confined to that domain. In the scientific communities, there has been interest in digital versions of lab books in VREs (http://www.vre.ox.ac.uk/ibvre/index.xml.ID=evaluation), and numeric data is likely to be critical to such exercises.
Like Open Boek, the JISC-funded Integrative Biology VRE project was concerned with the textual context of numbers: it found that digital recognition of equations was a significant problem – a clear case of crossover between the domains. Such analyses could, in theory, be delivered to the user by an architecture like that described by Auvil.
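One way to picture such a formal, sharable workflow is as an explicit list of named steps whose execution is recorded, rather than an ad hoc script. This is a minimal sketch of the idea only, not SEASR's or myExperiment's actual format; the step names are invented.

```python
from collections import Counter

def tokenize(text):
    # Step 1: split running text into tokens.
    return text.split()

def count_terms(tokens):
    # Step 2: tally term frequencies.
    return Counter(tokens)

# The workflow is an explicit, shareable list of named steps -- the kind of
# formal description that services like myExperiment are designed to exchange.
workflow = [("tokenize", tokenize), ("count_terms", count_terms)]

def run(workflow, data):
    """Execute each step in order, recording which steps ran (provenance)."""
    steps_run = []
    for name, step in workflow:
        data = step(data)
        steps_run.append(name)
    return data, steps_run

counts, steps = run(workflow, "ditch fill ditch")
```

Because the workflow is data rather than code buried in a script, it can be documented, published and re-run by another research group – the property the workshop identified as essential.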

The final paper dealt with the semantic annotation of texts within textual VREs. Kalina Bontcheva presented the Generic Architecture for Text Engineering (GATE) project's work on Semantic Annotation (SA). SA adds meaningful, machine-readable structure to document resources; it is particularly useful for making computers communicate with each other more effectively, but computers are not the only beneficiaries. A key theme of the workshop was the well-documented need of researchers to annotate the texts upon which they are working: this is crucial to the research process. GATE's Semantic Annotation Factory Environment (SAFE) will help annotators, language engineers and curators with the (often tedious) work of SA, adding information extraction tools and other means to the annotation environment so that at least parts of the annotation process run automatically. It is called a 'factory' because it does not completely replace manual annotation, but rather complements it with robots that assist with information extraction. According to GATE, "SAFE is a software suite and a methodology for the implementation and support of annotation factories. It is intended to provide a framework for commercial annotation services, supplied either as in-house units or as outsourced specialist activities" (http://www.gate.ac.uk).
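The core idea of semantic annotation can be sketched as standoff markup: the text itself is untouched, while the annotation records character offsets, the surface form, and a link to a concept. This illustrates the principle only – it is not GATE's API, and the document, offsets and URI are invented.

```python
def annotate(text, start, end, ann_type, concept_uri):
    """Standoff semantic annotation: the source text is left unmodified;
    the annotation records character offsets, the surface form, a type,
    and a link to a concept. A sketch of the principle, not GATE's API."""
    return {"span": (start, end),
            "surface": text[start:end],
            "type": ann_type,
            "concept": concept_uri}

doc = "Excavations at Nineveh began in 1842."
ann = annotate(doc, 15, 22, "Place", "http://example.org/place/nineveh")
```

Because the annotation points at the text rather than being embedded in it, many annotators (human or robotic) can layer independent annotations over the same document without conflict.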

The workshop focused on four areas that will become critical as the e-science agenda impacts on the established textual humanities.

First, there is a need to maintain an ambitious set of objectives for what the research communities require from a text-based VRE, driven by research needs rather than by technological capacity. Historically, the humanities have had far less funding and (e-)infrastructure than the sciences, and researchers in the humanities dealing with digital texts have understandably felt inhibited by this.

Second, a key challenge identified by the workshop is to recognize that, after generations of scholarship and practice, structuring collections, and even objects, is relatively easy: all that is required are the systematic metadata and repository policies long familiar to information and librarianship professionals. Far less simple is structuring the information contained within objects, and this is central to the next stage of e-science scholarship in the textual humanities.

Third, when tools and resources are linked using systematic web-service-based architectures, the benefits to researchers can far outstrip the initial outlay. Networks of tools linked to digital libraries, through which different kinds of information can be extracted from unstructured text, are essential.

Fourth, scholars must be able to add their own content. Although Web 2.0 has not revolutionized scholarly research in the way originally envisaged, researchers need to be able to annotate the texts on which they are working, and to store, search and structure those annotations. Such a structure might resemble a (user-created) digital library within or across other digital libraries. Detailed semantic documentation of the links between an annotation and the annotated text is necessary, along with documentation of when, why and by whom the annotation was created.
Furthermore, it would be highly desirable for any additional chunks of separate texts that may be relevant to an annotation (e.g., containing the same name, geographic reference or numeric data) to be identified automatically: the workflow management architectures presented by both SEASR and GATE suggest this is possible.
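The kind of annotation record these paragraphs describe – target, body, author, motivation, timestamp, and links to related chunks – might minimally be modelled as follows. The field names are illustrative assumptions, not a published schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Annotation:
    """An annotation carrying the provenance the workshop called for:
    what was annotated, by whom, when, and why. Field names are
    illustrative assumptions, not a published schema."""
    target_text: str   # identifier of the annotated text
    span: tuple        # character offsets within that text
    body: str          # the annotation itself
    author: str        # by whom
    motivation: str    # why
    created: str = field(default_factory=lambda:
                         datetime.now(timezone.utc).isoformat())  # when
    related_chunks: list = field(default_factory=list)  # links to other texts

ann = Annotation("iliad-bk1", (120, 148), "Formulaic epithet",
                 "a.scholar", "commenting")
```

A collection of such records, stored and indexed independently of the texts they target, begins to resemble the user-created digital library within or across other digital libraries envisaged above.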

References

Crane, G., Fuchs, B., Iorizzo, D. 2007: The Humanities in a Global e-Infrastructure: A Web-Services Shopping List. UK e-Science All Hands Meeting 2007, Nottingham, UK, September 2007.

Gates, B. 1996: The Road Ahead. Penguin Books, New York.

Paijmans, H. and Wubben, S. 2007: Open Boek: a system for the extraction of numeric data from archeological reports. UK e-Science All Hands Meeting 2007, Nottingham, UK, September 2007.

Copyright © 2008 Stuart Dunn and Tobias Blanke




doi:10.1045/january2008-dunn