4. Our Approach

So far, most experiments in cross-language querying in digital libraries have employed a multilingual lexicon of some sort. As mentioned in the previous section, general purpose electronic dictionaries are generally inadequate for this purpose as they tend to lack the necessary technical vocabulary. The disadvantages of multilingual thesauri include the fact that they are expensive to construct, they need continual maintenance and updating as new terms enter the vocabulary, and they require the use of a highly controlled vocabulary, which puts a heavy constraint on searching. On the other hand, the problem with most corpus-based cross-language systems is that acquiring a suitable set of relevant documents on which to train the retrieval system is extremely resource-intensive.

At Pisa, we are now working on the design of a cross-language query system for a digital library containing documents in more than one language. We propose to implement a system which will integrate a dictionary/thesaurus-type search with a corpus-based strategy in which the corpus is extracted from the collection of documents contained in the digital library itself. The aim is to be able to match a query formulated in one language against documents stored in other languages, even when the query terms themselves are not included in the multilingual lexicon. With this approach, we hope to overcome some of the problems listed above. The implementation of the system is dependent on a number of the components of an integrated set of tools for mono- and bilingual lexicon and text processing, known as the PiSystem, which has been developed in Pisa.

4.1 Corpus-based Querying

Our corpus-based strategy is based on the concept of comparable corpora. Comparable corpora are sets of texts in pairs (or multiples) of languages with the same communicative function, i.e. generally on the same topic or domain. We first began to analyse corpora of this type from a linguistic perspective; they are sources of natural language lexical equivalences across languages and as such can provide much useful data for contrastive language studies. For this reason, language scholars are now beginning to acquire such collections of texts for particular linguistic or terminological studies. However, a digital library system which contains document archives on the same domain but in different languages is actually a real world implementation of the comparable corpora principle. We thus decided to adapt our comparable text system to meet the requirements of a multilingual digital library.

The corpus query system is based on the assumptions that (i) words acquire sense from their context, and (ii) words used in a similar way throughout a sub-language or special domain corpus will be semantically similar. It follows that, if it is possible to establish equivalences between several items contained in two different contexts, there is a high probability that the two contexts themselves are to some extent similar. We thus use lexical and linguistic knowledge extracted from a domain-specific corpus in one language and project it onto a comparable corpus in the other, i.e. given a particular term or set of terms in the texts in one language (L1), the aim is to be able to identify contexts which contain equivalent or related expressions in the texts of the other (L2). To do this, we attempt to isolate the vocabulary related to that term in the L1 corpus -- hypothesising that lexically equivalent terms will be associated with a similar vocabulary in L2. Below, we give only a brief outline of how the system operates. For a more complete description, see Picchi and Peters (1996)[18].

For any term of interest, T, the system automatically constructs a context window containing T and up to 'n' lexically significant words (nouns, verbs and adjectives can be accepted) to the right and left of T; the value for 'n' can be varied. For each of these co-occurrences of T, morphological procedures identify the source lemma(s). The significance of the correlation between these items and T is then calculated using a statistical procedure. We are currently using Church and Hanks' Mutual Information Index (1990)[19] although we are also testing a different measure based on the likelihood ratio as formulated by Dunning (1993)[20]. The set of most significant collocates derived makes up the vocabulary, V1, that is considered to characterize our term T in this particular subdomain corpus. To exemplify what we mean by this, Figure 1 shows the 20 most significant collocates found in a set of comparable English and Italian documents for two Italian nouns: assistenza (assistance) and accordo (agreement). In the figure, the first column shows the MI value, and the number associated with the collocate gives its frequency value, i.e. the number of times the collocate was found in a context window where n=5.

420 assistenza (assistance)    801 accordo (agreement)
     MI  Freq  Collocate            MI  Freq  Collocate
500.000   420  ASSISTENZA      500.000   801  ACCORDO|ACCORDARE
 10.488    19  MEDICARE         10.408     5  INTERSTATALE
  9.347     3  PRESTARE          8.554    22  CONCLUSO
  9.326    12  LEGALE            8.483     7  STIPULARE
  7.541    32  FORNIRE           7.161     8  FIRMARE
  6.122     3  PROFUGO           6.793    10  ATTO
  5.949     3  CONCEDERE         6.726    12  DERIVARE
  5.784    11  SETTORE           6.573     3  ACCIAIO
  5.439     3  FINANZIARE        6.539    10  BASARE
  5.139     3  RUOLO             6.146    15  NUOVO
  4.934     4  DESTINARE         6.141     3  ENTRARE
Figure 1: Significant collocates for assistenza and accordo

It is important to stress that these lists give words identified as the significant collocates for the two terms in this particular corpus; if the same terms appear in a corpus for a different type of sub-language, we would expect to find different collocates. Looking at this list, it can be seen that there is not a lot of noise; most of the terms given have a strong semantic relationship with the term being examined. For example, with assistenza we have associated adjectives meaning "sanitary", "legal", "financial", "humanitarian", and verbs such as "provide", "take refuge in", "give (help to)", and with accordo we find verbs such as "reach", "stipulate", "sign", "ratify", "conclude" and nouns like "act" or "document" (our test corpus has been extracted from a series of parliamentary debates). When there is more than one source lemma, all are listed.
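The collocate extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the PiSystem implementation: the function name `significant_collocates`, the `min_freq` and `top_k` parameters, and the assumption that the input is already lemmatised and filtered to content words are all ours. The score is the mutual information measure of Church and Hanks, log2(P(x,y) / (P(x)P(y))), with probabilities estimated from raw counts.

```python
import math
from collections import Counter


def significant_collocates(sentences, term, n=5, min_freq=3, top_k=10):
    """Rank the collocates of `term` by a mutual information score.

    `sentences` is a list of lists of lemmas, assumed (hypothetically) to be
    already lemmatised and reduced to nouns, verbs and adjectives.
    Returns up to `top_k` tuples (mi, frequency, lemma), best first.
    """
    word_freq = Counter()   # f(x): occurrences of each lemma in the corpus
    pair_freq = Counter()   # f(T, y): occurrences of y in a window around T
    total = 0               # N: corpus size in lemmas

    for sent in sentences:
        word_freq.update(sent)
        total += len(sent)
        for i, lemma in enumerate(sent):
            if lemma != term:
                continue
            # Window of up to n words to the left and right of T.
            lo, hi = max(0, i - n), min(len(sent), i + n + 1)
            for j in range(lo, hi):
                if j != i and sent[j] != term:
                    pair_freq[sent[j]] += 1

    scored = []
    for lemma, f_xy in pair_freq.items():
        if f_xy < min_freq:
            continue  # rare pairs give unreliable MI estimates
        # MI = log2( (f_xy / N) / ((f_x / N) * (f_y / N)) )
        mi = math.log2((f_xy * total) / (word_freq[term] * word_freq[lemma]))
        scored.append((mi, f_xy, lemma))

    scored.sort(reverse=True)
    return scored[:top_k]
```

The frequency cut-off is a common precaution with mutual information, which tends to overrate very rare pairs; the value 3 is illustrative only.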

Next, using our lexical resources (e.g. English/Italian morphological procedures, a bilingual lexical database), we construct an equivalent L2 vocabulary of translation equivalents (V2). Words or expressions that can be considered as lexically equivalent to our selected term in the L1 texts are then searched for in the L2 corpus: we look for those contexts in L2 in which there is a significant presence of the L2 vocabulary for T. The significance is determined on the basis of a statistical procedure that assesses the probability for different sets of L2 co-occurrences to represent lexically equivalent contexts for T. The L2 contexts retrieved are written to a file and listed in descending order of relevance to our L1 term.
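As a rough illustration of the construction of V2, the sketch below maps each L1 collocate to its dictionary translations. The function name, the dictionary layout, and the rule that a translation inherits the MI score of its L1 source (keeping the highest score when several sources share a translation) are assumptions made for the example; the actual procedure is described in Picchi and Peters (1996)[18].

```python
def build_l2_vocabulary(v1, bilingual_dict):
    """Translate an L1 collocate vocabulary into an L2 vocabulary.

    v1:             dict mapping L1 lemma -> MI score
    bilingual_dict: dict mapping L1 lemma -> list of L2 lemmas
                    (a stand-in for the bilingual lexical database)
    Returns a dict mapping L2 lemma -> score.
    """
    v2 = {}
    for lemma, score in v1.items():
        # All listed translations are taken, regardless of sense.
        for translation in bilingual_dict.get(lemma, []):
            previous = v2.get(translation, float("-inf"))
            v2[translation] = max(score, previous)  # assumption: keep best score
    return v2
```

L1 lemmas with no dictionary entry simply contribute nothing to V2, so the vocabulary degrades gracefully when the lexicon is incomplete.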

Figures 2.1 and 2.2 show examples of comparable contexts that have been found in the L2 corpus (English) for our two terms: assistenza and accordo, as characterized by the L1 corpus. The contexts are listed in descending order of the number of items from the V2 vocabulary (first column) and of the sum of the MI values associated with these items (second column); the third column gives the sum of their frequency values, and the fourth gives the ranking of the context in the list of results. Direct translations of the term being searched are assigned an arbitrarily high MI value and thus, for the same number of V2 items, are listed before contexts which do not contain direct translations of the term. For example, in the set of contexts for accordo in Figure 2.1, contexts 2-5 include translation equivalents of accordo, whereas contexts 6-10 do not, although they each contain the same number of V2 items. It can be seen that they still reflect the concept represented lexically by accordo even though they do not contain direct dictionary-derived translations.

Search for Comparable Contexts for ACCORDO
6 522.726 828 1) Commission *proposal* on transitional *arrangements* in *respect* of the *international* *textile* *agreement* 1. How does the Commission "FXAC93086ENC.0035.01.00".11
5 520.973 827 2) the territory of a Member State illicitly. The *Council* *reached* a *political* *agreement* on these two *proposals* at its meeting on "FXAC93297ENC.0010.01.00".40
5 519.782 825 3) in its proposal for two-year transitional *arrangements* in *respect* of the *international* *textile* *agreement* (uplift or maintenance "FXAC93086ENC.0035.01.00".13
5 518.143 823 4) make this possible, the central European *countries* will *apply* Community competition *rules*. Europe *Agreements*, *signed* but not yet ratified "FXAC93145ENC.0016.01.00".33
5 517.224 839 5) The reform of the *common* agricultural *policy* which was *agreed* by the *Council* of *Ministers* will have a major impact on both the economic "FXAC93099ENC.0019.01.00".13
5 25.304 46 6) Therefore, all efforts which have to be made in *order* to *achieve* the *common* *objective* of an *area* without internal borders have to be intensified. "FXAC93032ENC.0014.01.00".25
5 24.386 25 7) by successive meetings of the General Affairs *Council*. The *understanding* *reached* with the US *Trade* *Representative* on public procurement "FXAC93264ENC.0034.02.00".48
5 22.262 43 8) recalls that the Treaty on European Union *foresees* that asylum *policy* should become a *matter* of *common* *interest* and, in a separate statement "FXAC93101ENC.0039.02.00".26
5 22.188 35 9) Cooperation between sportsmen and women from *different* *countries* does a great *deal* to promote *international* *understanding*. Particularly for ... "FXAC93145ENC.0043.03.00".18
5 21.295 23 10) GATT, making it possible to agree on negotiated *rules* to *clarify* the *issues* arising in the *international* *trade* and environment interface. "FXAC93283ENC.0051.01.00".34
Figure 2.1: DBT (Comparable Corpus) - English Texts
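The ordering used in these listings can be sketched as below: contexts are sorted first by the number of V2 items they contain and then by the sum of the MI values of those items, with direct translations carrying an arbitrarily high MI value (500.000 in Figure 1). The function name, the input layout, and the choice to count each V2 item once per context are assumptions made for the example.

```python
DIRECT_TRANSLATION_MI = 500.0  # arbitrarily high value, as shown in Figure 1


def rank_contexts(contexts, v2):
    """Order L2 contexts by V2 item count, then by summed MI.

    contexts: list of (reference, list of lemmas) pairs
    v2:       dict mapping L2 lemma -> (mi, frequency); direct translations
              of the search term are expected to carry DIRECT_TRANSLATION_MI
    Returns rows (item_count, mi_sum, freq_sum, reference), best first.
    """
    rows = []
    for reference, lemmas in contexts:
        hits = {lemma for lemma in lemmas if lemma in v2}  # distinct V2 items
        if not hits:
            continue  # contexts with no V2 item are not retrieved
        mi_sum = sum(v2[lemma][0] for lemma in hits)
        freq_sum = sum(v2[lemma][1] for lemma in hits)
        rows.append((len(hits), mi_sum, freq_sum, reference))

    # Primary key: number of V2 items; secondary key: summed MI.
    rows.sort(key=lambda row: (row[0], row[1]), reverse=True)
    return rows
```

Because a direct translation adds 500 to the MI sum, contexts containing one are listed ahead of contexts with the same item count but no direct translation, which matches the pattern of contexts 2-5 versus 6-10 in Figure 2.1.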

In order to test the system, when retrieving the second set of contexts, given in Figure 2.2 for assistenza, we eliminated the direct translations given by our bilingual electronic dictionary ("assistance" and "aid") from the L2 vocabulary for assistenza. However, in a number of cases (see contexts number 3 and 58, 59, 60) we still retrieve contexts that do contain these direct translations of assistenza, which suggests that the system is performing well. For reasons of space, we show just the first five results in order of ranking, and then numbers 57-60.

Search for Comparable Contexts for ASSISTENZA
4 25.653 45 1) the planning of return-home programmes, and are *leading* *roles* encouraged for women in *food* distributions in *refugee* camps?) \NOT\(1) Source: "FXAC93333ENC.0013.01.00".18
4 25.615 60 2) Measures (SPS) texts, which specifically *address* measures taken for the *protection* of *health*, the *environment* or the consumer. The Commission ..... "FXAC93333ENC.0003.03.00".23
4 22.390 23 3) Assistance in the form of Community loans and *grants* under the structural *fund* *programmes* operating in the *area*, notably the Regional Operational ... "FXAC93283ENC.0017.01.00".34
4 20.364 16 4) through the Structural *Funds*, *programmes* developed to *protect* the *environment*. .. "FXAC93065ENC.0011.01.00".51
4 20.364 16 5) biological depuration at Lixourion within a *funding* *programme* so as to *protect* the natural *environment* in the Gulf of Argostolion? 3. Will it provide "FXAC93095ENC.0003.01.00".23
3 19.659 31 57) establishing a European food industry training *fund* to facilitate public *health* and consumer confidence in the *food* industry and which would ensure ... "FXAC93065ENC.0022.01.00".13
3 19.369 38 58) assistance for better public administration *planning* and coordination (including the *health* *sector*). In addition, both the nutritional and sanitation .... "FXAC93016ENC.0025.01.00".60
3 19.369 38 59) The Commission is currently implementing major *programmes* on AIDS in the *areas* of public *health*, research and assistance to developing countries. "FXAC93137ENC.0005.02.00".30
3 19.369 38 60) humanitarian aid, welfare-related projects and *programmes* in *areas* such as *health* and education, and projects and programmes for rural development. "FXAC93283ENC.0028.01.00".31
Figure 2.2: DBT (Comparable Corpus) - English Texts


This approach to the problem of identifying cross-language lexical equivalences over homogeneous sets of texts for different languages has several merits: it allows us to disambiguate, to a considerable extent, both the L1 term being searched and the target language terms provided by the dictionary; it permits us to retrieve lexically equivalent cross-language expressions even when the L2 context does not contain a dictionary derived translation of the L1 term; and it provides a ranking of our results.

Query Term Disambiguation: Although the problem of polysemy is greatly reduced in a domain specific corpus, it is still present -- to a varying degree depending on the type of texts being treated. The construction of the L1 vocabulary which characterizes our term T will permit us to obtain a clustering of the most relevant terms connected to T. If the corpus contains a predominant sense for the term then the vocabulary should represent this sense -- secondary senses that appear rarely will not cause a representative vocabulary of collocates to be constructed. If, in the corpus, there is more than one relevant sense for T then we would expect two or more distinct clusterings of significant collocates. For example, the Italian noun accordo has two distinct senses in our bilingual dictionary: the general sense which is translated by "agreement", and the very specific musical sense translated by "chord". Our corpus of parliamentary debates contained no examples of the second sense. Had it done so, we would expect to obtain two distinct clusterings of significant collocates with little or no overlap. Thus, using this method, it is possible to distinguish between common technical terms which are used with different meanings in different scientific areas. Think, for example, of the different usages of "protocol" in the medical and software engineering domains. Very different sets of collocates would be constructed for the different senses of this term and thus searching for the appropriate sense would be facilitated.

Target Term Disambiguation: When constructing the L2 vocabulary of significant collocates for the L1 term being searched, our procedure takes as input all the translation equivalents listed in the bilingual dictionary, regardless of sense distinctions. Spurious or inappropriate translations are eliminated by the fact that we normally do not find them together with a significant number of items from the L2 vocabulary for the term being searched. This makes it possible for us to perform a sense disambiguation on the target terms proposed. For example, if we examine all the occurrences of the Italian noun sicurezza in our parliamentary corpus, we find that the sense is that of "safety", or "security" (one sense of "security" is a synonym of "safety"). This is confirmed by the set of significant collocates for this term; the top ten are the Italian equivalents of toy, hygiene, reactor, health, nuclear, maritime, council, road, provisions, Euratom. The bilingual dictionary gives us four separate senses for sicurezza translated by safety, security, certainty, confidence. On the English side of the corpus, we find 17 occurrences of "confidence" and just one of "certainty". However, the context for "certainty" does not appear in the list of comparable contexts for sicurezza as it contains no other L2 vocabulary items; and the contexts for "confidence" are ranked very low as they never contain more than two L2 significant collocates for sicurezza. Thus, our approach helps us to identify the correct sense of the target terms offered by the bilingual dictionary and to provide a ranking of the best L2 matches for the L1 term searched.
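The filtering effect described for sicurezza can be illustrated with a small sketch. For each candidate translation, we look at the L2 contexts in which it occurs and record the largest number of other V2 items found alongside it; candidates with too little support are set aside. The hard threshold `min_collocates` is a hypothetical simplification introduced here: the text above describes such contexts as being ranked very low or not retrieved, rather than cut off at a fixed value.

```python
def filter_translations(candidates, contexts, v2, min_collocates=3):
    """Keep candidate translations that co-occur with enough V2 collocates.

    candidates:     L2 lemmas offered by the bilingual dictionary
    contexts:       list of L2 contexts, each a list of lemmas
    v2:             set (or dict) of L2 significant collocates for the term
    min_collocates: illustrative cut-off, not taken from the source
    Returns a dict mapping each kept candidate to its best support count.
    """
    kept = {}
    for candidate in candidates:
        best = 0
        for lemmas in contexts:
            if candidate not in lemmas:
                continue
            # Count distinct V2 items in this context, excluding the candidate.
            support = len({w for w in lemmas if w in v2 and w != candidate})
            best = max(best, support)
        if best >= min_collocates:
            kept[candidate] = best
    return kept
```

With a toy input modelled on the sicurezza example, "safety" is kept while "confidence" and "certainty" are set aside:

```python
v2 = {"toy", "health", "nuclear", "road", "reactor"}
contexts = [
    ["safety", "toy", "health", "road"],
    ["confidence", "health", "market"],
    ["certainty", "legal"],
]
kept = filter_translations(
    ["safety", "security", "certainty", "confidence"], contexts, v2
)
```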

4.2 Implementing a Cross-Language DL Query System

As stated, when we began this work our main interest was linguistic. However, we now intend to integrate the corpus-based procedures with a multilingual lexicon in order to operate in a multilingual digital library system, retrieving documents from document bases in different languages rather than contexts from comparable text archives.

Queries will be translated by the multilingual lexicon but will also be expanded by applying the comparable-corpus based strategy in order to associate with each query term not only its direct translations but also a vocabulary which defines its probable immediate context, in L1 and L2. In this way, we search for both pre-identified translation equivalents and cross-language lexical equivalences. When the dictionary or lexicon offers no translation equivalent, the search for cross-language equivalent contexts is still possible. Documents retrieved are ranked with respect to (i) translation equivalents of query terms, and (ii) the statistical value assigned to associated significant collocates.
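A minimal sketch of this two-part ranking is given below. Since the system was still being designed at the time of writing, everything here is hypothetical: the function name, the representation of documents as lists of lemmas, and in particular the choice to rank first by the number of translation equivalents present and only then by the summed collocate scores.

```python
def rank_documents(documents, translations, collocates):
    """Rank L2 documents for one expanded query term.

    documents:    dict mapping document id -> list of lemmas
    translations: set of direct L2 translations from the multilingual
                  lexicon (may be empty if the lexicon has no entry)
    collocates:   dict mapping L2 collocate -> statistical score
    Returns document ids, best first.
    """
    rows = []
    for doc_id, lemmas in documents.items():
        present = set(lemmas)
        n_translations = len(present & translations)
        hit_scores = [s for lemma, s in collocates.items() if lemma in present]
        if n_translations == 0 and not hit_scores:
            continue  # nothing links this document to the query term
        rows.append((n_translations, sum(hit_scores), doc_id))

    # Assumed ordering: (i) translation equivalents, then (ii) collocate score.
    rows.sort(reverse=True)
    return [doc_id for _, _, doc_id in rows]
```

When `translations` is empty, the ranking falls back entirely on the collocate scores, which corresponds to the case where the lexicon offers no translation equivalent.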

One of the limitations of this type of statistically-based querying over domain-specific archives is that it is only feasible when the text collection is sufficiently large and sufficiently homogeneous to be able to derive a statistically meaningful set of collocates for the terms queried. These conditions should be satisfied by the average digital library.

It should be possible to extend the system to cover additional languages, provided the necessary lexical and morphological resources are available. Any language can be adopted as the starting point, much as is currently done in the construction of many multilingual thesauri where one language (usually English) acts as the base. The vocabulary associated with any term in the corpus for this language (V1) will then be translated into all the other languages (constructing Vn vocabularies). Each comparable set of documents will then be searched for contexts with a significant co-occurrence of lexical items from the relevant target language vocabulary for T.

We also intend to test our system as a method for the semi-automatic construction of a thesaurus in a second language on the basis of an existing thesaurus in L1. In this case, given a set of comparable text archives in the appropriate domain, the system would be run for each term in the L1 thesaurus in order to retrieve corresponding L2 equivalent contexts. The terminologist could then select the relevant set of L2 (multiword) terms for each L1 item searched. Both the existing L1 thesaurus and the L2 thesaurus under construction could be enriched by automatically associating with each node of each side of the multilingual thesaurus all the significant collocates characterising that particular term.

Copyright © 1997 Carol Peters, Eugenio Picchi


D-Lib Magazine