Chapter 6. Exploring Vector Spaces in One and More Languages
This chapter is much easier than the last one. Having gone
through the mathematics and (in a small example) the process of
building word vectors, this chapter concentrates entirely on the
reward for these achievements: being able to learn a lot about words
themselves by exploring their vector spaces. The tour guide
nature of this book is being given free rein: you've put in (or
adroitly skipped!) the hard mathematical labour to get to this point,
and now you get to be a tourist and enjoy the view.
Not that this chapter is without new ideas. We'll talk about different
ways of making the views more informative than a simple list of
related words, by grouping words into clusters or plotting their
vectors in a two-dimensional 'word-spectrum'. We'll also see how to use
documents from two languages to build a single vector space with words
from both languages, which can (for example) be used to
translate words and queries between languages.
While this hopefully makes a cohesive and readable story, you don't
have to read it in sequence, and a perfectly acceptable way to approach
this chapter is to flick through, see which of the pictures interest
you, and delve into those sections to see how they were made. Or maybe
even better, follow the links that I've highlighted at the beginning
of sections and explore the models online for yourself. They say that
"seeing is believing," and while the examples chosen for this
chapter work especially well at highlighting particular points,
there's no better way to convince yourself that the techniques we've
described really do work in general than to use them for yourself. A
few examples of your own will also give you a much better feel for
what's going on, and with these under your belt, the mathematical
methods described will make a lot more intuitive sense.
Sections
1. Welcome to WORDSPACE
This section gives a short guided tour of the Infomap WORDSPACE
demo. An example is given below:
Term            Similarity
fire            1.000000
firefighters    0.743202
blaze           0.673635
cease           0.647473
fires           0.619991
flames          0.571786
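A natural way to implement the demo's "add to query" and "subtract from query" options is plain vector addition and subtraction, followed by ranking every word by cosine similarity to the query vector. Here is a minimal sketch of that lookup; the four-dimensional vectors are made up purely for illustration, since a real WORDSPACE learns its vectors from a corpus.

    import numpy as np

    vectors = {                      # made-up word vectors for illustration
        "fire":         np.array([0.9, 0.1, 0.3, 0.0]),
        "firefighters": np.array([0.8, 0.2, 0.4, 0.1]),
        "blaze":        np.array([0.7, 0.1, 0.5, 0.2]),
        "water":        np.array([0.1, 0.9, 0.2, 0.3]),
    }

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def neighbours(positive, negative=(), top_n=5):
        # Query = sum of "add" keywords minus the "subtract" keywords.
        query = sum(vectors[w] for w in positive)
        for w in negative:
            query = query - vectors[w]
        ranked = sorted(vectors, key=lambda w: cosine(query, vectors[w]),
                        reverse=True)
        return [(w, round(cosine(query, vectors[w]), 6)) for w in ranked[:top_n]]

    print(neighbours(["fire"]))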
2. Building a WORDSPACE
This section describes precisely how the WORDSPACE is built, using the
freely available Infomap
software. Today, you're better off using
the SemanticVectors
package.
The WORDSPACE is very much like a term-document matrix as described in
Chapter 5, with two main differences:
- Instead of recording which terms occurred in which documents, we
record which terms occurred near to frequent, important
content-bearing terms.
- To compress this information into fewer dimensions, a technique
from linear algebra called singular value decomposition is
used. When applied to word-vectors, this technique is often called
latent semantic indexing or latent semantic
analysis. (Both steps are sketched after this list.)
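Here is a minimal sketch of both steps on a toy corpus. The hand-picked column terms stand in for the frequent content-bearing terms a real system would select automatically, "near" is simplified to "in the same sentence", and the counts are purely illustrative.

    import numpy as np

    corpus = ["the fire crews fought the blaze",
              "the plant grows near the water",
              "water was sprayed on the fire"]
    rows = ["fire", "blaze", "plant", "water", "crews", "grows"]  # terms to model
    cols = ["fire", "water", "plant"]                             # content-bearing terms

    # Step 1: count how often each row term occurs near each column term.
    M = np.zeros((len(rows), len(cols)))
    for sent in corpus:
        words = sent.split()
        for i, r in enumerate(rows):
            for j, c in enumerate(cols):
                if r in words and c in words and r != c:
                    M[i, j] += 1

    # Step 2: compress with singular value decomposition, keeping k dimensions.
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    k = 2
    word_vectors = U[:, :k] * S[:k]          # reduced word vectors
    print(dict(zip(rows, word_vectors.round(3))))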
3. Clustering related words into concept groups
So far we have only measured how close each neighbouring word is to a
target query. In this section we describe how clustering can be used
to group these neighbours into more informative contextual groups.
These groups can often represent different usages and even different
meanings of ambiguous words, such as plant (the industrial and
botanical senses). A sketch of the clustering step is given below.
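As a minimal sketch, suppose we already have vectors for the neighbours of the ambiguous word plant. scikit-learn's KMeans stands in here for whatever clustering method the demo actually uses, and the two-dimensional vectors are invented for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    neighbours = ["factory", "machinery", "workers",    # industrial sense
                  "flowers", "leaves", "species"]       # botanical sense
    X = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.1],  # made-up 2-d vectors
                  [0.1, 0.9], [0.2, 0.8], [0.1, 0.7]])

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    for cluster in range(2):
        print(cluster, [w for w, l in zip(neighbours, labels) if l == cluster])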
4. Plotting words in a plane
As well as clustering, you can often see different word-senses by
plotting the neighbours of a word in a two-dimensional plane that is the
"plane of best fit" for the data. This is just like projecting points
onto the line of best fit for two-dimensional data, only we're
projecting down from many more dimensions onto a plane.
The Word-Spectrum demo that showed this process was written by Scott
Cederberg.
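A minimal sketch of the projection: centring the neighbour vectors and keeping the top two singular directions gives exactly the plane of best fit, in the same way that the top single direction gives the line of best fit in two dimensions. The random data here just stands in for real neighbour vectors.

    import numpy as np

    X = np.random.rand(20, 50)               # 20 neighbour vectors, 50 dimensions
    X_centred = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    plane = Vt[:2]                           # the two best-fit directions
    coords_2d = X_centred @ plane.T          # each word's (x, y) on the plane
    print(coords_2d.shape)                   # (20, 2): ready to plot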
5. Mapping between two different WORDSPACES
By taking the neighbours of a word into account, you can translate its
meaning much more effectively than by just shipping the query word
itself from one model to another. For example, the neighbours of
gas in the New York Times WORDSPACE are much more closely
related to the neighbours of petroleum than to those of
gas in the WORDSPACE built from the British National Corpus.
This section gives a few useful examples of this process. There is no
online demo as yet: the work is still fairly bleeding-edge and so far
runs only from a command-line interface. A sketch of the underlying
idea is given below.
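Here is a minimal sketch of one way to realise the idea; the actual command-line tool may work differently. A word is represented by its profile of similarities to a set of anchor terms shared by both models, and profiles are then compared across the two spaces. All vectors below are random stand-ins, so the printed scores carry no real meaning.

    import numpy as np

    def profile(word, space, shared_terms):
        # Similarity of `word` to each shared anchor term within its own space.
        v = space[word]
        return np.array([v @ space[t] /
                         (np.linalg.norm(v) * np.linalg.norm(space[t]))
                         for t in shared_terms])

    # nyt and bnc stand in for the New York Times and BNC models.
    rng = np.random.default_rng(0)
    shared = ["oil", "fuel", "energy", "price"]
    nyt = {w: rng.random(5) for w in shared + ["gas"]}
    bnc = {w: rng.random(5) for w in shared + ["petroleum", "gas"]}

    p = profile("gas", nyt, shared)
    for candidate in ["petroleum", "gas"]:
        q = profile(candidate, bnc, shared)
        score = p @ q / (np.linalg.norm(p) * np.linalg.norm(q))
        print(candidate, round(float(score), 3))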
6. Bilingual vector models
Suppose you want to translate not just between different models but
also between different languages. In a few special cases, this can be
done because we have a parallel corpus, that is, a document
collection that exists in more than one language, for which we know
exactly which pairs of documents (or even sentences) in each language
are translations of one another. From this, we can build a WORDSPACE
that represents terms and documents in both languages, and can even
represent queries using a combination of terms from each language.
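A minimal sketch of why a parallel corpus makes this possible: aligned document pairs share a single column index, so stacking the two languages' term-document matrices and factorising the result places words from both languages in one space. The tiny count matrices below are invented for illustration.

    import numpy as np

    # Rows: terms; columns: three aligned document pairs.
    en_terms, de_terms = ["bone", "drug"], ["knochen", "medikament"]
    en_counts = np.array([[3, 0, 1],     # "bone" per English document
                          [0, 2, 0]])    # "drug"
    de_counts = np.array([[2, 0, 1],     # "knochen" per aligned German document
                          [0, 3, 0]])    # "medikament"

    M = np.vstack([en_counts, de_counts])          # one matrix, both languages
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    vecs = dict(zip(en_terms + de_terms, U[:, :2] * S[:2]))

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(cos(vecs["bone"], vecs["knochen"]))      # high: translations
    print(cos(vecs["bone"], vecs["medikament"]))   # low: unrelated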
The example below shows the neighbours of the German word
Knochen in both German and English, including its correct
translation, the English word bone. The English word
drug is also used as an example, showing that the bilingual
WORDSPACE can be used to find translations that correspond to the
different meanings of an ambiguous word.
English:

Term             Similarity
bone             0.822905
osteoinductive   0.669857
demineralized    0.600958
formation        0.598052
extracted        0.559023
trabeculae       0.544799
German:

Term                     Similarity
knochen                  1.000000
knochenneubildung        0.630791
bone                     0.630306
allogenen                0.598935
knochens                 0.593084
knochentransplantation   0.587973
7. Using WORDSPACE and class labelling to enrich a taxonomy
Finally, this section shows how WORDSPACE can be used to enrich a
taxonomy. If we want to know where in a taxonomy to place an unknown
word that appears in a corpus, we collect neighbours from WORDSPACE
and then see where in the taxonomy these neighbours are concentrated
using the class-labelling algorithm of Chapter 3. This enables us to
estimate what hypernyms should be attached to the unknown word.
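A minimal sketch of the idea follows. The class-labelling algorithm of Chapter 3 is more refined than this; the toy taxonomy and the weighting scheme (favouring more specific hypernyms) are my own illustrative choices.

    taxonomy = {                 # word -> chain of hypernyms, most specific first
        "apple":  ["fruit", "food"],
        "pear":   ["fruit", "food"],
        "cherry": ["fruit", "food"],
        "bread":  ["food"],
    }

    def label(neighbours, taxonomy):
        # Score each taxonomy node by how many neighbours fall under it,
        # weighting more specific hypernyms more heavily.
        scores = {}
        for w in neighbours:
            for depth, hypernym in enumerate(taxonomy.get(w, [])):
                scores[hypernym] = scores.get(hypernym, 0) + 1.0 / (depth + 1)
        return max(scores, key=scores.get)

    # Suppose "quince" is unknown but its WORDSPACE neighbours are known words:
    print(label(["apple", "pear", "cherry"], taxonomy))   # -> "fruit"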