Chapter 6. Exploring Vector Spaces in One and More Languages
This chapter is much easier than the last one. Having gone
through the mathematics and (in a small example) the process of
building word vectors, this chapter concentrates entirely on the
reward for these achievements: being able to learn a lot about words
themselves by exploring their vector spaces. The tour guide
nature of this book is being given free rein: you've put in (or
adroitly skipped!) the hard mathematical labour to get to this point,
and now you get to be a tourist and enjoy the view.
Not that this chapter is without new ideas. We'll talk about different
ways of making the views more informative than a simple list of
related words, by grouping words into clusters or plotting their
vectors in a two-dimensional 'word-spectrum'. We'll also see how to use
documents from two languages to build a single vector space with words
from both languages, which can (for example) be used to
translate words and queries between languages.
While this hopefully makes a cohesive and readable story, you don't
have to read it in sequence, and a perfectly acceptable way to approach
this chapter is to flick through, see which of the pictures interest
you, and delve into those sections to see how they were made. Or maybe
even better, follow the links that I've highlighted at the beginning
of sections and explore the models online for yourself. They say that
"seeing is believing," and while the examples chosen for this
chapter work especially well at highlighting particular points,
there's no better way to convince yourself that the techniques we've
described really do work in general than to use them for yourself. A
few examples of your own will also give you a much better feel for
what's going on, and with these under your belt, the mathematical
methods described will make a lot more intuitive sense.
Sections
1. Welcome to WORDSPACE
This section gives a short guided tour of the Infomap WORDSPACE
demo. An example is given below:
Term            Similarity
fire            1.000000
firefighters    0.743202
blaze           0.673635
cease           0.647473
fires           0.619991
flames          0.571786
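A natural way to implement the demo's "add to query" and "subtract from query" options is plain vector addition and subtraction, followed by ranking every word by cosine similarity to the query vector. Here is a minimal sketch of that lookup; the four-dimensional vectors are made up purely for illustration, since a real WORDSPACE learns its vectors from a corpus.

    import numpy as np

    vectors = {                      # made-up word vectors for illustration
        "fire":         np.array([0.9, 0.1, 0.3, 0.0]),
        "firefighters": np.array([0.8, 0.2, 0.4, 0.1]),
        "blaze":        np.array([0.7, 0.1, 0.5, 0.2]),
        "water":        np.array([0.1, 0.9, 0.2, 0.3]),
    }

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def neighbours(positive, negative=(), top_n=5):
        # Query = sum of "add" keywords minus the "subtract" keywords.
        query = sum(vectors[w] for w in positive)
        for w in negative:
            query = query - vectors[w]
        ranked = sorted(vectors, key=lambda w: cosine(query, vectors[w]),
                        reverse=True)
        return [(w, round(cosine(query, vectors[w]), 6)) for w in ranked[:top_n]]

    print(neighbours(["fire"]))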
2. Building a WORDSPACE
This section describes precisely how the WORDSPACE is built, using the
freely available Infomap
software. Today, you're better off using
the SemanticVectors
package.
The WORDSPACE is very much like a term-document matrix as described in
Chapter 5, with two main differences:
- Instead of recording which terms occurred in which documents, we
record which terms occurred near to frequent, important
content-bearing terms.
- To compress this information into fewer dimensions, a technique
from linear algebra called singular value decomposition is
used. When applied to word-vectors, this technique is often called
latent semantic indexing or latent semantic
analysis. (Both steps are sketched after this list.)
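Here is a minimal sketch of both steps on a toy corpus. The hand-picked column terms stand in for the frequent content-bearing terms a real system would select automatically, "near" is simplified to "in the same sentence", and the counts are purely illustrative.

    import numpy as np

    corpus = ["the fire crews fought the blaze",
              "the plant grows near the water",
              "water was sprayed on the fire"]
    rows = ["fire", "blaze", "plant", "water", "crews", "grows"]  # terms to model
    cols = ["fire", "water", "plant"]                             # content-bearing terms

    # Step 1: count how often each row term occurs near each column term.
    M = np.zeros((len(rows), len(cols)))
    for sent in corpus:
        words = sent.split()
        for i, r in enumerate(rows):
            for j, c in enumerate(cols):
                if r in words and c in words and r != c:
                    M[i, j] += 1

    # Step 2: compress with singular value decomposition, keeping k dimensions.
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    k = 2
    word_vectors = U[:, :k] * S[:k]          # reduced word vectors
    print(dict(zip(rows, word_vectors.round(3))))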
3. Clustering related words into concept groups
So far we have only measured how close each neighbouring word is to a
target query. In this section we describe how clustering can be used
to group these neighbours into more informative contextual groups.
These groups can often represent different usages and even different
meanings of ambiguous words, such as plant (the industrial and
botanical senses). A sketch of the clustering step is given below.
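As a minimal sketch, suppose we already have vectors for the neighbours of the ambiguous word plant. scikit-learn's KMeans stands in here for whatever clustering method the demo actually uses, and the two-dimensional vectors are invented for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    neighbours = ["factory", "machinery", "workers",    # industrial sense
                  "flowers", "leaves", "species"]       # botanical sense
    X = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.1],  # made-up 2-d vectors
                  [0.1, 0.9], [0.2, 0.8], [0.1, 0.7]])

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    for cluster in range(2):
        print(cluster, [w for w, l in zip(neighbours, labels) if l == cluster])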
4. Plotting words in a plane
As well as clustering, you can often see different word-senses by
plotting the neighbours of a word in a two-dimensional plane that is the
"plane of best fit" for the data. This is just like projecting points
onto the line of best fit for two-dimensional data, only we're
projecting down from many more dimensions onto a plane.
The Word-Spectrum demo that showed this process was written by Scott
Cederberg.
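A minimal sketch of the projection: centring the neighbour vectors and keeping the top two singular directions gives exactly the plane of best fit, in the same way that the top single direction gives the line of best fit in two dimensions. The random data here just stands in for real neighbour vectors.

    import numpy as np

    X = np.random.rand(20, 50)               # 20 neighbour vectors, 50 dimensions
    X_centred = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    plane = Vt[:2]                           # the two best-fit directions
    coords_2d = X_centred @ plane.T          # each word's (x, y) on the plane
    print(coords_2d.shape)                   # (20, 2): ready to plot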
5. Mapping between two different WORDSPACES
By taking the neighbours of a word into account, you can translate its
meaning much more effectively than by just shipping the query word
itself from one model to another. For example, the neighbours of
gas in the New York Times WORDSPACE are much more closely
related to the neighbours of petroleum than to those of
gas in the WORDSPACE built from the British National Corpus.
This section gives a few useful examples of this process. There is no
online demo as yet: the work is still fairly bleeding-edge and so far
runs only from a command-line interface. A sketch of the underlying
idea is given below.
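Here is a minimal sketch of one way to realise the idea; the actual command-line tool may work differently. A word is represented by its profile of similarities to a set of anchor terms shared by both models, and profiles are then compared across the two spaces. All vectors below are random stand-ins, so the printed scores carry no real meaning.

    import numpy as np

    def profile(word, space, shared_terms):
        # Similarity of `word` to each shared anchor term within its own space.
        v = space[word]
        return np.array([v @ space[t] /
                         (np.linalg.norm(v) * np.linalg.norm(space[t]))
                         for t in shared_terms])

    # nyt and bnc stand in for the New York Times and BNC models.
    rng = np.random.default_rng(0)
    shared = ["oil", "fuel", "energy", "price"]
    nyt = {w: rng.random(5) for w in shared + ["gas"]}
    bnc = {w: rng.random(5) for w in shared + ["petroleum", "gas"]}

    p = profile("gas", nyt, shared)
    for candidate in ["petroleum", "gas"]:
        q = profile(candidate, bnc, shared)
        score = p @ q / (np.linalg.norm(p) * np.linalg.norm(q))
        print(candidate, round(float(score), 3))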
6. Bilingual vector models
Suppose you want to translate not just between different models but
also between different languages. In a few special cases, this can be
done because we have a parallel corpus, that is, a document
collection that exists in more than one language, for which we know
exactly which pairs of documents (or even sentences) in each language
are translations of one another. From this, we can build a WORDSPACE
that represents terms and documents in both languages, and can even
represent queries using a combination of terms from each language.
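A minimal sketch of why a parallel corpus makes this possible: aligned document pairs share a single column index, so stacking the two languages' term-document matrices and factorising the result places words from both languages in one space. The tiny count matrices below are invented for illustration.

    import numpy as np

    # Rows: terms; columns: three aligned document pairs.
    en_terms, de_terms = ["bone", "drug"], ["knochen", "medikament"]
    en_counts = np.array([[3, 0, 1],     # "bone" per English document
                          [0, 2, 0]])    # "drug"
    de_counts = np.array([[2, 0, 1],     # "knochen" per aligned German document
                          [0, 3, 0]])    # "medikament"

    M = np.vstack([en_counts, de_counts])          # one matrix, both languages
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    vecs = dict(zip(en_terms + de_terms, U[:, :2] * S[:2]))

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(cos(vecs["bone"], vecs["knochen"]))      # high: translations
    print(cos(vecs["bone"], vecs["medikament"]))   # low: unrelated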
The example below shows the neighbours of the German word
Knochen in both German and English, including its correct
translation, the English word bone. The English word
drug is also used as an example, showing that the bilingual
WORDSPACE can be used to find translations that correspond to the
different meanings of an ambiguous word.
English:

Term             Similarity
bone             0.822905
osteoinductive   0.669857
demineralized    0.600958
formation        0.598052
extracted        0.559023
trabeculae       0.544799
German:

Term                     Similarity
knochen                  1.000000
knochenneubildung        0.630791
bone                     0.630306
allogenen                0.598935
knochens                 0.593084
knochentransplantation   0.587973
7. Using WORDSPACE and class labelling to enrich a taxonomy
Finally, this section shows how WORDSPACE can be used to enrich a
taxonomy. If we want to know where in a taxonomy to place an unknown
word that appears in a corpus, we collect neighbours from WORDSPACE
and then see where in the taxonomy these neighbours are concentrated
using the class-labelling algorithm of Chapter 3. This enables us to
estimate what hypernyms should be attached to the unknown word.
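A minimal sketch of the idea follows. The class-labelling algorithm of Chapter 3 is more refined than this; the toy taxonomy and the weighting scheme (favouring more specific hypernyms) are my own illustrative choices.

    taxonomy = {                 # word -> chain of hypernyms, most specific first
        "apple":  ["fruit", "food"],
        "pear":   ["fruit", "food"],
        "cherry": ["fruit", "food"],
        "bread":  ["food"],
    }

    def label(neighbours, taxonomy):
        # Score each taxonomy node by how many neighbours fall under it,
        # weighting more specific hypernyms more heavily.
        scores = {}
        for w in neighbours:
            for depth, hypernym in enumerate(taxonomy.get(w, [])):
                scores[hypernym] = scores.get(hypernym, 0) + 1.0 / (depth + 1)
        return max(scores, key=scores.get)

    # Suppose "quince" is unknown but its WORDSPACE neighbours are known words:
    print(label(["apple", "pear", "cherry"], taxonomy))   # -> "fruit"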