Mirrors-Web
A Demonstrator for
Automatic Thesaurus Derivation
User's Guide
The user may specify one or more bases whose derived thesaurus entries are to be listed in a single document, with the entries in alphabetical order. For each base a category may be specified to identify the source of the entries in the merged list. Thus, specifying "a" after AJfull and "v" after Vfull adds "(a)" and "(v)" after the respective entries. The number of entries to be listed may also be specified; it is recommended that large bases be listed a few hundred entries at a time. The same parameters as for single entries may also be specified: Extended bases, Language, Synsetlimit and Overlapthreshold. Finally, the user may choose to exclude entries and senses for which no hyperonyms, hyponyms, synonyms or related words have been found, and also "starred" entries, that is, entries for which the bases contain only limited information. Excluding starred entries is relevant for bases with only partial information from the corpus.
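The merged listing described above can be pictured with a minimal Python sketch. It is an illustration only, not the code behind Mirrors-Web: the function name `merged_listing`, its arguments and the sample headwords are all hypothetical, and case-insensitive sorting on the headword is an assumption.

```python
def merged_listing(bases, limit=None):
    """Merge headwords from several bases into one alphabetical list.

    bases: list of (headwords, category) pairs; category may be None.
    limit: optional maximum number of entries to return.
    All names here are hypothetical, chosen for illustration.
    """
    rows = []
    for headwords, category in bases:
        # The category tag, e.g. "(a)", marks which base an entry came from.
        suffix = f" ({category})" if category else ""
        rows.extend((word, suffix) for word in headwords)
    # Assumption: alphabetical order is case-insensitive on the headword.
    rows.sort(key=lambda row: (row[0].casefold(), row[1]))
    lines = [word + suffix for word, suffix in rows]
    return lines if limit is None else lines[:limit]


# Invented sample data standing in for an adjective base and a verb base.
adjectives = ["good", "able"]
verbs = ["act"]
listing = merged_listing([(adjectives, "a"), (verbs, "v")])
```

Passing `limit=200` would correspond to listing a large base a few hundred entries at a time.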
The word bases currently available are listed below. ENPC-Adj, ENPC-N
and ENPC-V (adjectives, nouns and verbs, respectively) are based on
automatic word alignment of the corpus by means of Sindre Sørensen's
word alignment program; base-0102, gkverbs2 and rett are based on
manually extracted translational correspondences in the corpus; and
god2dict, handle2dict and rett2dict are based on automatically
extracted correspondences from a bilingual dictionary. The bases
enggreekmerged and enggreeknohapax contain manually extracted noun
correspondences from an English-Greek corpus.
(For space reasons, not all the bases are
accessible at all times.) The
first three bases, being based on automatic alignment, reflect the
shortcomings of the morphological
tagging of the corpus and the preliminary status of the automatic word
alignment, which has an estimated precision of .84 and an estimated
recall of .62. Hence, there is much "noise" in those bases, but there
are also encouraging properties.
ENPC-Adj
N: 4308 entries
E: 4003 entries
Automatically
extracted adjectives. The corpus has been automatically word-aligned
(see
above), and then all adjectives with their sets of translations in the
corpus, limited to adjectives, have been extracted.
ENPC-N
N: 21153 entries
E: 13344 entries
Automatically
extracted nouns. The corpus has been automatically word-aligned (see
above), and then all nouns with their sets of translations in the
corpus, limited to nouns, have been extracted.
ENPC-V
N: 3043 entries
E: 2983 entries
Automatically
extracted verbs. The corpus has been automatically word-aligned (see
above), and then all verbs with their sets of translations in the
corpus, limited to verbs, have been extracted.
base-0102:
N: 2796 entries
E: 724 entries
Manually
extracted translation correspondences starting from the Norw. adjective
'god' and the nouns 'tak' and 'selskap' and going back and forth 4
times (extracted by Helge Dyvik and Martha Thunes).
enggreekmerged
G: 775 entries
E: 1882 entries
Manually
extracted nouns from a Greek-English parallel corpus, done by Marianna
Apidianaki.
enggreeknohapax
G: 581 entries
E: 1435 entries
The same
as enggreekmerged above,
except that correspondences that occur only once in the corpus have
been discarded.
gkverbs2:
N: 2698 entries
E: 1036 entries
Manually
extracted translation correspondences comprising verbs, starting from
the Norw. verbs 'handle' and 'ta' (extracted by Gunn Inger Lyse and
Kjersti Drøsdal Vikøren).
rett:
N: 208 entries
E: 1280 entries
Manually
extracted translation correspondences (for a master's thesis by Gunn
Inger Lyse) starting from the Norw. noun and adjective 'rett' (entered
as "rettN" and "rettA" respectively).
god2dict:
N: 7185 entries
E: 2725 entries
This base
is not taken from the ENPC corpus: it is based on translation
correspondences automatically extracted from Kunnskapsforlaget's big
English-Norwegian/Norwegian-English dictionary (c. 217,000 entries) in
connection with a master's thesis by Karen Halling, starting with the
adjective 'god' and going back and forth 5 times. Sense divisions in
the dictionary have been disregarded: for each dictionary entry the
union of its translations across senses has been extracted. Thus, the
sense divisions in the thesaurus entries are derived by the Mirrors
method.
handle2dict:
N: 4701 entries
E: 593 entries
This
base, in the same way as god2dict, is based on automatically extracted
translation correspondences from Kunnskapsforlaget's
English-Norwegian/Norwegian-English dictionary, starting with the verb
'handle' and going back and forth 5 times.
rett2dict:
N: 7670 entries
E: 1252 entries
This
base, in the same way as god2dict, is based on automatically extracted
translation correspondences from Kunnskapsforlaget's
English-Norwegian/Norwegian-English dictionary, starting with the noun
'rett' and going back and forth 5 times.
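The construction of the dictionary-based bases (god2dict, handle2dict, rett2dict) can be sketched in Python as follows. This is an illustration under assumptions, not the extraction code actually used: the function names and the toy dictionary are invented, and one "back and forth" is taken to mean one Norwegian-to-English step followed by one English-to-Norwegian step.

```python
def union_across_senses(senses):
    """Merge the translations listed under each sense of one dictionary
    entry, disregarding the dictionary's own sense divisions."""
    merged = set()
    for translations in senses:
        merged.update(translations)
    return merged


def expand(seed, nor_to_eng, eng_to_nor, rounds):
    """Collect words reachable from a Norwegian seed word by going back
    and forth between the two halves of the dictionary `rounds` times.
    Assumption: one round = Norwegian -> English, then English -> Norwegian."""
    nor_words, eng_words = {seed}, set()
    for _ in range(rounds):
        for word in list(nor_words):
            eng_words |= nor_to_eng.get(word, set())
        for word in list(eng_words):
            nor_words |= eng_to_nor.get(word, set())
    return nor_words, eng_words


# Toy dictionary, invented for illustration; each entry lists its senses.
nor_entries = {"god": [["good"], ["kind"]], "snill": [["kind", "nice"]]}
eng_entries = {"good": [["god", "bra"]], "kind": [["snill"], ["god"]],
               "nice": [["snill", "pen"]]}
nor_to_eng = {w: union_across_senses(s) for w, s in nor_entries.items()}
eng_to_nor = {w: union_across_senses(s) for w, s in eng_entries.items()}

one_round = expand("god", nor_to_eng, eng_to_nor, rounds=1)
two_rounds = expand("god", nor_to_eng, eng_to_nor, rounds=2)
```

Each additional round can only add words, which is consistent with the bases growing to several thousand entries after 5 rounds from a single seed.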
Do you want to analyse your own translational word
base?
Contact Helge Dyvik
The user may choose to
operate on automatically 'extended' versions of the bases. This concept
is explained here (in Norwegian). Extended
bases generally (but not always) give the best results.
Synset Limit can be set. This concept is
explained on p. 16 of MirrorsPaper. It influences the
division of related words into 'synonyms', 'related words',
'hyperonyms' and 'hyponyms'. In general, the higher the Synset Limit,
the larger the number of synonyms and related words, and the
smaller the number of hyperonyms and hyponyms (i.e., a "flatter"
hierarchy). If no value is given, Synset Limit is calculated
automatically on the basis of the size of the semantic fields (1/4 of
the number of senses in the field, but at most 20 and at least 5). The
value should be a whole number, 20 being a useful default. The range 3
- 25 is recommended.
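The automatic calculation of Synset Limit described above amounts to a clamped quarter of the field size. A minimal Python sketch, assuming that the quarter is rounded down (the rounding rule is not stated here) and using a hypothetical function name:

```python
def default_synset_limit(num_senses):
    """Synset Limit used when the user gives no value: a quarter of the
    number of senses in the semantic field, kept within 5..20.
    Assumption: the quarter is rounded down to a whole number."""
    quarter = num_senses // 4
    return max(5, min(20, quarter))


small_field = default_synset_limit(8)     # quarter is 2, raised to the minimum
medium_field = default_synset_limit(40)   # quarter is 10, used as is
large_field = default_synset_limit(200)   # quarter is 50, capped at the maximum
```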
Overlap Threshold can be set. This concept
is explained on p. 17 of MirrorsPaper. It influences the
granularity of the division of each sense into mutually related
subsenses. The value should be a number between 0 and 1, with 0.05 as a
recommended default. In general, the higher the Overlap Threshold, the
more subsenses.
Below the headline of each Sense in the entry, the option Lattice can be chosen. This
concept is explained on p. 12 ff. of MirrorsPaper. Sets of features have
automatically been assigned to the words in the bases based on the
network of translational relations, and the feature sets form lattice
structures based on inclusion and overlap among the sets. According to
the hypothesis, this lattice structure expresses semantic relations like
hyperonymy and hyponymy, and semantic closeness is assumed to be reflected
in closeness in the lattice. The thesaurus entries have been derived
from the information in the lattices.
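The two set relations on which the lattice is built, inclusion and overlap, can be illustrated with a small Python sketch. The function name and the feature labels are invented for illustration. The sketch only classifies how two feature sets relate; it does not reproduce the way Mirrors assigns features or derives hyperonyms and hyponyms from them.

```python
def compare_feature_sets(a, b):
    """Classify the relation between two feature sets.
    Returns one of: 'equal', 'included', 'includes', 'overlap', 'disjoint'."""
    a, b = frozenset(a), frozenset(b)
    if a == b:
        return "equal"
    if a < b:          # every feature of a is also in b
        return "included"
    if a > b:          # every feature of b is also in a
        return "includes"
    if a & b:          # some shared features, but neither contains the other
        return "overlap"
    return "disjoint"


# Invented feature labels; real features come from the translational network.
relation_1 = compare_feature_sets({"f1"}, {"f1", "f2"})
relation_2 = compare_feature_sets({"f1", "f2"}, {"f2", "f3"})
relation_3 = compare_feature_sets({"f1"}, {"f4"})
```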
Clicking on Lattice displays the part of the lattice structure
surrounding the word sense: all nodes dominated by the word sense itself,
and all its mother and grandmother nodes, with the sublattices dominated by them.
In web browsers with the appropriate software it is possible to browse
further parts of the lattice by clicking on nodes to see the
sublattices similarly associated with them.
Questions
and comments to Helge Dyvik