Mirrors-Web

A Demonstrator for Automatic Thesaurus Derivation

User's Guide
(c) Helge Dyvik & Paul Meurer


Introduction

Mirrors-Web is a web interface to Semantic Mirrors, a program that derives thesaurus entries from data extracted from parallel corpora (see the Wordnet project). An explanation of the method can be found in the second part of the paper "Translations as a Semantic Knowledge Source" (henceforth "MirrorsPaper") by Helge Dyvik. Most of the input bases, consisting of words with their sets of possible translations, have been taken from The English-Norwegian Parallel Corpus (ENPC).

The web interface was created by Paul Meurer, who has also re-implemented Helge Dyvik's Mirrors program.

Main links

Derive Thesaurus Entries
 

Derive Thesaurus Lists
 

The Wordnet Project

MirrorsPaper


Search

Either single thesaurus entries and lattices, or complete thesaurus listings, may be derived.

Single Thesaurus Entries and Lattices

The user may choose a word base (see below), which consists of word lemmas associated with their sets of translations in the other language. The word base is the input to the Mirrors procedure. 
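
As a minimal sketch of what such a word base might look like as a data structure, the following represents it as a mapping from each lemma to its set of translations, and derives the opposite language direction by inversion. The words, the variable names and the function invert_base are hypothetical illustrations, not part of Mirrors-Web, and the actual bases need not be stored this way.

```python
from collections import defaultdict

# Hypothetical miniature word base: Norwegian lemmas mapped to their sets of
# English translations. Illustrative only, not taken from an actual base.
norwegian_to_english = {
    "god": {"good", "kind"},
    "snill": {"kind", "nice"},
    "pen": {"nice", "pretty"},
}


def invert_base(base):
    """Build the opposite direction: each translation mapped to its source lemmas."""
    inverted = defaultdict(set)
    for lemma, translations in base.items():
        for translation in translations:
            inverted[translation].add(lemma)
    return dict(inverted)


english_to_norwegian = invert_base(norwegian_to_english)
print(sorted(english_to_norwegian["kind"]))  # ['god', 'snill']
```

Looking up a word in either mapping gives the set of translations from which the Mirrors procedure starts.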

The user searches for a word from the base, and its thesaurus entry is generated. The entry is divided into senses and subsenses, with hyperonyms, hyponyms, synonyms and related words, to the extent that sufficient information is available and according to the parameters Extended bases, Synset Limit and Overlap Threshold, explained below. Some of the translations of each sense (not necessarily all of them) are given in parentheses. Furthermore, lattices that show the network of semantic relations among the senses graphically can be inspected (see "Lattices" below). A search can be performed in three alternative ways:

  • Choose the appropriate language. Write a word in the search window (remove the asterisk, if any!) and click "Search" or hit return. Or:
  • Click one of the translations, hyperonyms, synonyms etc. in the entry displayed to see its entry. Or:
  • Choose the appropriate language. Click the option "Go to the list of words in base X". This displays a simple list of all the available words in the chosen language in the current base. The words can be clicked to see their entries. (NB! This option is time-consuming for the larger bases.)

Complete Thesaurus Listings


The user may specify one or more bases to have their derived thesaurus entries listed in one document, with the entries in alphabetical order. For each base a category may be specified, to identify the source of the entries in the merged list. Thus, specifying "a" after AJfull and "v" after Vfull will lead to the addition of "(a)" and "(v)" after the respective entries. The number of entries to be listed may also be specified. It is recommended that large bases be listed a few hundred entries at a time. Furthermore, the same parameters as in the case of single entries may be specified: Extended bases, Language, Synset Limit and Overlap Threshold. Finally, the user may choose to exclude entries and senses where no hyperonyms, hyponyms, synonyms or related words have been found, and also "starred" entries, that is, entries for which the bases contain only limited information. The latter option is relevant for bases with only partial information from the corpus.
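
The merging described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function merged_listing, its parameters and the placeholder entry texts are hypothetical and do not reflect how Mirrors-Web builds its listings internally.

```python
def merged_listing(bases, limit=None):
    """Merge entries from several bases into one alphabetical listing.

    bases: list of (category, entries) pairs, where entries maps a headword
    to its (placeholder) entry text. An empty category adds no label.
    limit: optional maximum number of entries to list.
    """
    rows = []
    for category, entries in bases:
        for headword, text in entries.items():
            label = f"{headword} ({category})" if category else headword
            rows.append((headword.lower(), label, text))
    rows.sort()  # alphabetical order by headword
    if limit is not None:
        rows = rows[:limit]
    return [f"{label}: {text}" for _, label, text in rows]


# Hypothetical placeholder entries standing in for derived thesaurus entries.
adjectives = {"good": "entry for good", "nice": "entry for nice"}
verbs = {"handle": "entry for handle", "take": "entry for take"}

for line in merged_listing([("a", adjectives), ("v", verbs)], limit=3):
    print(line)
# good (a): entry for good
# handle (v): entry for handle
# nice (a): entry for nice
```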



Word Bases

The word bases currently available are listed below. ENPC-Adj, ENPC-N and ENPC-V (adjectives, nouns and verbs, respectively) are based on automatic word alignment of the corpus by means of Sindre Sørensen's word alignment program. The bases base-0102, gkverbs2 and rett are based on manually extracted translational correspondences in the corpus, and god2dict, handle2dict and rett2dict on automatically extracted correspondences from a bilingual dictionary. The bases enggreekmerged and enggreeknohapax contain manually extracted noun correspondences from an English-Greek corpus. (For space reasons, not all the bases are accessible at all times.) The first three bases, being based on automatic alignment, reflect the shortcomings of the morphological tagging of the corpus and the preliminary status of the automatic word alignment, which has an estimated precision of 0.84 and an estimated recall of 0.62. Hence, there is much "noise" in those bases, but there are also encouraging properties.


ENPC-Adj
N: 4308 entries
E: 4003 entries
Automatically extracted adjectives. The corpus has been automatically word-aligned (see above), and then all adjectives with their sets of translations in the corpus, limited to adjectives, have been extracted.

ENPC-N
N: 21153 entries
E: 13344 entries
Automatically extracted nouns. The corpus has been automatically word-aligned (see above), and then all nouns with their sets of translations in the corpus, limited to nouns, have been extracted.

ENPC-V
N: 3043 entries
E: 2983 entries
Automatically extracted verbs. The corpus has been automatically word-aligned (see above), and then all verbs with their sets of translations in the corpus, limited to verbs, have been extracted. 

base-0102
N: 2796 entries
E: 724 entries
Manually extracted translation correspondences, starting from the Norwegian adjective 'god' and the nouns 'tak' and 'selskap' and going back and forth 4 times (extracted by Helge Dyvik and Martha Thunes).

enggreekmerged
G: 775 entries
E: 1882 entries
Manually extracted nouns from a Greek-English parallel corpus, done by Marianna Apidianaki.

enggreeknohapax
G: 581 entries
E: 1435 entries
The same as enggreekmerged above, except that correspondences that occur only once in the corpus have been discarded.

gkverbs2
N: 2698 entries
E: 1036 entries
Manually extracted translation correspondences comprising verbs, starting from the Norwegian verbs 'handle' and 'ta' (extracted by Gunn Inger Lyse and Kjersti Drøsdal Vikøren).

rett
N: 208 entries
E: 1280 entries
Manually extracted translation correspondences (for a master's thesis by Gunn Inger Lyse), starting from the Norwegian noun and adjective 'rett' (entered as "rettN" and "rettA" respectively).

god2dict
N: 7185 entries
E: 2725 entries
This base is not taken from the ENPC corpus: it is based on automatically extracted translation correspondences from Kunnskapsforlaget's big English-Norwegian/Norwegian-English dictionary (c. 217,000 entries), extracted in connection with a master's thesis by Karen Halling, starting with the adjective 'god' and going back and forth 5 times. Sense divisions in the dictionary have been disregarded: for each dictionary entry the union of its translations across senses has been extracted. Thus, the sense divisions in the thesaurus entries are derived by the Mirrors method.

handle2dict
N: 4701 entries
E: 593 entries
This base, in the same way as god2dict, is based on automatically extracted translation correspondences from Kunnskapsforlaget's English-Norwegian/Norwegian-English dictionary, starting with the verb 'handle' and going back and forth 5 times.

rett2dict
N: 7670 entries
E: 1252 entries
This base, in the same way as god2dict, is based on automatically extracted translation correspondences from Kunnskapsforlaget's English-Norwegian/Norwegian-English dictionary, starting with the noun 'rett' and going back and forth 5 times.
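
The extraction of a base by "going back and forth" from a starting word, as mentioned for several of the bases above, might be sketched as follows. This is a hedged illustration only: the functions flatten_senses and collect_base, the miniature dictionaries, and the reading of one "time" as one forth step followed by one back step are all assumptions, not a description of the extraction tools actually used.

```python
def flatten_senses(entry_senses):
    """Disregard sense divisions: the union of the translations across senses."""
    return set().union(*entry_senses)


def collect_base(seed, source_to_target, target_to_source, rounds):
    """Collect the words reached from `seed` by translating back and forth.

    Assumption: one round is one step forth (source to target) followed by
    one step back (target to source).
    """
    source_words = {seed}
    target_words = set()
    for _ in range(rounds):
        # Forth: add all translations of the source words collected so far.
        for word in source_words:
            target_words |= source_to_target.get(word, set())
        # Back: add all translations of the target words collected so far.
        for word in target_words:
            source_words |= target_to_source.get(word, set())
    return source_words, target_words


# Hypothetical miniature dictionaries, for illustration only.
no_en = {"god": {"good"}, "bra": {"good", "fine"}, "fin": {"fine", "nice"}, "pen": {"nice"}}
en_no = {"good": {"god", "bra"}, "fine": {"bra", "fin"}, "nice": {"fin", "pen"}}

norwegian, english = collect_base("god", no_en, en_no, rounds=2)
print(sorted(norwegian), sorted(english))  # ['bra', 'fin', 'god'] ['fine', 'good']
```

Each additional round can pull in more distantly related words, which is why the number of rounds strongly affects the size of the resulting base.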

Do you want to analyse your own translational word base? Contact Helge Dyvik.

Extended bases

The user may choose to operate on automatically 'extended' versions of the bases. This concept is explained here (in Norwegian). Extended bases generally (but not always) give the best results.

Synset Limit

Synset Limit can be set. This concept is explained on p. 16 of MirrorsPaper. It influences the division of related words into 'synonyms', 'related words', 'hyperonyms' and 'hyponyms'. In general, the higher the Synset Limit, the larger the number of synonyms and related words, and the smaller the number of hyperonyms and hyponyms (i.e., a "flatter" hierarchy). If no value is given, Synset Limit is calculated automatically on the basis of the size of the semantic fields (1/4 of the number of senses in the field, but maximum 20 and minimum 5). If a value is given, it should be a whole number, 20 being a useful default. The range 3 to 25 is recommended.
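
The automatic calculation described above can be sketched as a clamped quarter of the field size. The function name default_synset_limit is hypothetical, and rounding down to a whole number is an assumption, since the rounding rule is not stated here.

```python
def default_synset_limit(senses_in_field):
    """A quarter of the number of senses in the field, clamped to 5..20.

    Assumption: the quarter is rounded down to a whole number.
    """
    quarter = senses_in_field // 4
    return max(5, min(20, quarter))


for size in (8, 40, 200):
    print(size, default_synset_limit(size))
# 8 5
# 40 10
# 200 20
```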

Overlap Threshold

Overlap Threshold can be set. This concept is explained on p. 17 of MirrorsPaper. It influences the granularity of the division of each sense into mutually related subsenses. The value should be a number between 0 and 1, with 0.05 as a recommended default. In general, the higher the Overlap Threshold, the more subsenses.

Lattices

Below the headline of each Sense in the entry, the option Lattice can be chosen. This concept is explained on p. 12 ff. of MirrorsPaper. Sets of features have automatically been assigned to the words in the bases based on the network of translational relations, and the feature sets form lattice structures based on inclusion and overlap among the sets. According to the hypothesis this lattice structure expresses semantic relations like hypero- and hyponymy, and semantic closeness is assumed to be reflected in closeness in the lattice. The thesaurus entries have been derived from the information in the lattices.
Clicking on Lattice displays the part of the lattice structure surrounding the word sense: all nodes dominated by the word sense itself, and all its mother and grandmother nodes, with the sublattices dominated by them. In web browsers with the appropriate software it is possible to browse further parts of the lattice by clicking on nodes to see the sublattices similarly associated with them.
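
To illustrate what "inclusion and overlap among the sets" means for feature sets, the following sketch classifies the relation between two sets. The feature names, the sense labels and the function relation are hypothetical, and the sketch deliberately does not say which direction of inclusion corresponds to hyperonymy; that is defined in MirrorsPaper.

```python
def relation(a, b):
    """Classify how feature set `a` relates to feature set `b`."""
    if a == b:
        return "equal"
    if a < b:
        return "included in"
    if a > b:
        return "includes"
    if a & b:
        return "overlaps"
    return "disjoint"


# Hypothetical feature sets assigned to three word senses.
features = {
    "sense1": {"f1"},
    "sense2": {"f1", "f2"},
    "sense3": {"f2", "f3"},
}

print(relation(features["sense1"], features["sense2"]))  # included in
print(relation(features["sense2"], features["sense3"]))  # overlaps
print(relation(features["sense1"], features["sense3"]))  # disjoint
```

Inclusion gives the ordering from which a lattice can be drawn, while overlap without inclusion places two senses near each other without ordering them.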

Enjoy!

Questions and comments to Helge Dyvik