The third workshop of the DELOS working group, on the topic of "Cross-Language Information Retrieval", was hosted by ETH Zurich, 5-7 March 1997. DELOS is a working group funded by the IT Long Term Research programme of the European Commission to study and investigate existing and emerging technologies and issues relevant to digital libraries.
The DELOS Working Group is just one of a series of ERCIM-sponsored initiatives aimed at promoting research and operational activities in the Digital Library field. The DELOS consortium consists mainly of members of ERCIM institutes (ERCIM: European Research Consortium for Informatics and Mathematics).
As many of the workshop presentations bore out, research projects addressing digital information repositories in Europe often must deal with information in several languages, even when multi-lingual or cross-language information retrieval is not a central theme of the project. We distinguish two cases. In "multi-lingual" information retrieval, the collection spans several languages, but a user's search query is always evaluated only against documents in the query language. In "cross-language" information retrieval, a user's query may retrieve documents in languages other than the language of the query.
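The distinction drawn above can be illustrated with a minimal sketch. The collection, the lexicon, and the function names below are invented for illustration and do not come from any system presented at the workshop; the translation step is assumed to be a simple dictionary lookup, which is only one of several possible approaches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    language: str
    terms: frozenset


# Toy collection; all contents are invented for illustration.
COLLECTION = [
    Document("d1", "en", frozenset({"library", "digital"})),
    Document("d2", "de", frozenset({"bibliothek", "digital"})),
    Document("d3", "fr", frozenset({"bibliothèque", "numérique"})),
]

# Hypothetical bilingual lexicon: (query language, term) -> {target language: term}.
LEXICON = {
    ("en", "library"): {"de": "bibliothek", "fr": "bibliothèque"},
}


def multilingual_search(term, query_lang):
    """Match the query only against documents written in the query language."""
    return [
        d.doc_id
        for d in COLLECTION
        if d.language == query_lang and term in d.terms
    ]


def cross_language_search(term, query_lang):
    """Also match documents in other languages, via a translated query term."""
    translations = LEXICON.get((query_lang, term), {})
    hits = []
    for d in COLLECTION:
        if d.language == query_lang:
            wanted = term
        else:
            # No lexicon entry means the document cannot be matched.
            wanted = translations.get(d.language)
        if wanted is not None and wanted in d.terms:
            hits.append(d.doc_id)
    return hits
```

With this toy data, an English query for "library" reaches only the English document in the multi-lingual setting, but reaches the German and French documents as well in the cross-language setting.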
A total of 27 participants attended the workshop, representing 9 different European countries, as well as invited speakers from the United States and Korea, who helped to broaden the discussions beyond the European perspective. Apart from the geographical diversity of the participants, their backgrounds in Information Retrieval, Computational Linguistics, Lexicography, Controlled Vocabulary Thesauri, and Internet Technology also helped to bring many different perspectives to the discussions of the work presented.
To set the scene for the workshop, Doug Oard of the University of Maryland gave a comprehensive overview of Cross-Language Information Retrieval in the USA, including a useful schematic breakdown of the various approaches: corpus-based (parallel, comparable or unaligned corpora) versus knowledge-based (dictionaries or ontologies). He presented a substantial amount of US-based research on cross-language retrieval, and showed that current approaches achieve between 50% and 75% of the performance of the comparable monolingual retrieval task. This presentation was followed by Sung-Huyn Myaeng of the National University Taejon, Korea, who gave an in-depth presentation of the particular problems of working with Asian languages, including the use of different scripts, the problem of word segmentation and the similar problem of compound noun analysis. Martin Duerst, University of Zurich, then recognised the increasing role of the World Wide Web in this area of research by detailing the emerging HTTP and HTML standards for supporting multi-script and multi-language information on the Web.
Other presentations from European researchers focussed on the approaches being adopted for cross-language and multi-language retrieval in various projects such as Twenty-One, MULINEX, Acquarelle, ILIAD and MedExplore, some of which are funded by the European Commission. A common sentiment expressed was that, even in cases where multilinguality was not a core concern of the project consortia, it was a topic that had to be addressed given the European dimension. We therefore saw some novel approaches to cross-language retrieval being taken by these researchers. An important parallel theme was also the identification, conflation and use of multi-word terms for cross-language retrieval, given the observation that these can serve to greatly reduce translation ambiguities.
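The observation about multi-word terms can be made concrete with a small sketch. The English-to-French dictionary entries and all function names below are illustrative assumptions, not material from any of the projects named above. The sketch compares word-by-word lookup with a lookup that prefers a two-word phrase entry when one exists.

```python
# Toy English-to-French dictionaries; entries are illustrative assumptions only.
WORD_DICT = {
    "hard": ["dur", "difficile"],
    "disk": ["disque"],
}
PHRASE_DICT = {
    ("hard", "disk"): ["disque dur"],
}


def translate_word_by_word(tokens):
    """Return the candidate translations of each token, looked up independently."""
    return [WORD_DICT.get(t, [t]) for t in tokens]


def translate_with_phrases(tokens):
    """Scan left to right, preferring a two-word phrase entry over single words."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in PHRASE_DICT:
            out.append(PHRASE_DICT[pair])
            i += 2
        else:
            out.append(WORD_DICT.get(tokens[i], [tokens[i]]))
            i += 1
    return out


def ambiguity(candidates):
    """Count the distinct translated queries (product of candidate counts)."""
    n = 1
    for c in candidates:
        n *= len(c)
    return n
```

For the query tokens `["hard", "disk"]`, word-by-word lookup leaves two possible translated queries because "hard" has two candidate translations, while the phrase entry leaves a single one.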
From the Information Retrieval point of view, David Hull of Rank Xerox Research Centre, Grenoble, presented a weighted Boolean model for cross-language retrieval, and Paraic Sheridan of ETH Zurich presented a method of using a retrieval model to build information structures called "similarity thesauri" for cross-language retrieval. The presentation of similarity thesauri showed how this approach has also been implemented for cross-language retrieval of speech documents, and a demonstration of the EuroSpider retrieval system was given. Approaches from the Computational Linguistics perspective were presented by Carol Peters of CNR Italy, who showed how the use of comparable corpora together with lexical resources could bring to light useful translation equivalences for cross-language retrieval, and by Piek Vossen of the University of Amsterdam, who presented the EuroWordNet project, which is augmenting the Princeton WordNet of English with wordnets in Dutch, Italian and Spanish. The workshop concluded with a discussion of the important issue of evaluating different approaches to cross-language information retrieval. Participants highlighted as highly significant the fact that this year's Text Retrieval Conference (TREC 6) will include a track evaluating cross-language retrieval.
Further information on this workshop, including a list of participants and abstracts of presentations, can be found at:
The next DELOS workshop will address "Multi-Media Indexing and Retrieval", and will take place in Pisa, Italy, on August 29th and 30th, in conjunction with the First European Conference on Research and Advanced Technology for Digital Libraries.
For additional information, please contact:
The DELOS Working Group Coordinator
Istituto di Elaborazione della Informazione
Consiglio Nazionale delle Ricerche
Tel +39 50 593429
The workshop is sponsored by the Federal Large Scale Networking Working Group (LSN) of the National Science and Technology Council's Committee on Computing, Information, and Communications R&D Subcommittee. LSN members include the National Institutes of Health, National Security Agency, Department of Energy, National Aeronautics and Space Administration, Department of Defense, DARPA, National Coordinating Office, National Oceanic and Atmospheric Administration, White House Office of Science and Technology Policy, Federal Networking Council, and National Science Foundation.
|8th Joint European Networking Conference (JENC8), Edinburgh, Scotland, May 12-15, 1997||http://www.terena.nl/conf/JENC8.html|
|American Society for Information Science (ASIS) 1997 Mid-Year Meeting: Information Privacy, Security, and Data Integrity, Scottsdale, Arizona, May 30 - June 3, 1997||http://www.asis.org/midyear97/program.html|
|"Digital Documents in Context: Organization and Creation", Thirty-First Annual Hawaii International Conference on Systems Sciences (HICSS)||http://www.cba.hawaii.edu/hicss|
|Evaluating Web Sites for Educational Uses: Bibliography Checklist, Carolyn M. Kotlas, February 13, 1997||http://www.iat.unc.edu/guides/irg-49.html|
|Gabriel, Gateway to Europe's National Libraries||http://www.konbib.nl/gabriel/|
|Human-Computer Interaction Laboratory, University of Maryland Institute of Advanced Computer Studies, 14th Annual Symposium and Open House, May 30, 1997|
|IEEE ADL '97: International Conference on Advances in Digital Libraries, Washington, DC, May 7-9, 1997|
|International Association for Social Science Information Service and Technology (IASSIST)/International Federation of Data Organizations (IFDO) Annual Conference, Odense, Denmark, May 6-9, 1997||http://www.sa.dk/dda/conf97|
|International Summer School on the Digital Library, Tilburg University, the Netherlands, August 10-22, 1997||http://cwis.kub.nl/~ticer/|
|Networking '97: Exploring the Continued Evolution of Internet Technology for Research and Education, Washington, DC, April 9-10, 1997||http://www.educom.edu/web/nttf/net97.html|
|Oregon State System of Higher Education Historical and Cultural Atlas Resource||http://darkwing.uoregon.edu/~atlas/|
|Twenty-fifth Annual Telecommunications Policy Research Conference||http://www.si.umich.edu/~prie/tprc|