# Chapter 6. Exploring Vector Spaces in One and More Languages

This chapter is much easier than the last one. Having gone through the mathematics and (in a small example) the process of building word vectors, this chapter concentrates entirely on the reward for these achievements: being able to learn a lot about words themselves by exploring their vector spaces. The tour-guide nature of this book is being given free rein: you've put in (or adroitly skipped!) the hard mathematical labour to get to this point, and now you get to be a tourist and enjoy the view.

Not that this chapter is without new ideas. We'll talk about different ways of making the views more informative than a simple list of related words, by grouping words into clusters or plotting their vectors in a two-dimensional "word-spectrum." We'll also see how to use documents from two languages to build a single vector space with words from both languages, which can (for example) be used to translate words and queries between languages.

While this hopefully makes a cohesive and readable story, you don't have to read it in sequence, and a perfectly acceptable way to approach this chapter is to flick through, see which of the pictures interest you, and delve into those sections to see how they were made. Or maybe even better, follow the links that I've highlighted at the beginning of sections and explore the models online for yourself. They say that "seeing is believing," and while the examples chosen for this chapter work especially well at highlighting particular points, there's no better way to convince yourself that the techniques we've described really do work in general than to use them for yourself. A few examples of your own will also give you a much better feel for what's going on, and with these under your belt, the mathematical methods described will make a lot more intuitive sense.

## Sections

### 1. Welcome to WORDSPACE

This section gives a short guided tour of the Infomap WORDSPACE demo. An example is given below:
| Term | Similarity |
|------|------------|
| fire | 1.000000 |
| firefighters | 0.743202 |
| blaze | 0.673635 |
| cease | 0.647473 |
| fires | 0.619991 |
| flames | 0.571786 |
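Similarity scores like these are typically cosines between word vectors: the closer two vectors point in the same direction, the nearer the score is to 1. A minimal sketch, using invented three-dimensional vectors purely for illustration:

```python
import math

# Toy word vectors (values invented for illustration only).
vectors = {
    "fire":         [0.9, 0.1, 0.2],
    "firefighters": [0.8, 0.2, 0.3],
    "blaze":        [0.7, 0.3, 0.1],
    "piano":        [0.1, 0.9, 0.8],
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def neighbours(word, k=3):
    """Rank all other terms by cosine similarity to `word`."""
    target = vectors[word]
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda pair: -pair[1])[:k]

print(neighbours("fire"))
```

With real vectors the ranking comes out just as in the table above: terms used in similar contexts score highest.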

### 2. Building a WORDSPACE

This section describes precisely how the WORDSPACE is built, using the freely available Infomap software. Today, you're better off using the SemanticVectors package. The WORDSPACE is very much like a term-document matrix as described in Chapter 5, with two main differences:
1. Instead of recording which terms occurred in which documents, we record which terms occurred near to frequent, important content-bearing terms.
2. To compress this information into fewer dimensions, a technique from linear algebra called singular value decomposition is used. When applied to word-vectors, this technique is often called latent semantic indexing or latent semantic analysis.
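The two steps above can be sketched in a few lines of Python. The mini-corpus counts below are invented for illustration; `numpy.linalg.svd` stands in for the SVD routines used by Infomap or SemanticVectors:

```python
import numpy as np

# Hypothetical counts of how often each vocabulary term occurs near a
# small set of content-bearing "column" terms (step 1).
column_terms = ["fire", "water", "music"]
cooccurrence = {
    "blaze":  [8, 1, 0],
    "flames": [7, 2, 0],
    "river":  [0, 9, 1],
    "piano":  [0, 1, 8],
}

terms = list(cooccurrence)
M = np.array([cooccurrence[t] for t in terms], dtype=float)

# Step 2: singular value decomposition.  Keeping only the top-k
# singular values compresses each row into a k-dimensional "latent"
# vector, the trick behind latent semantic indexing / analysis.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
word_vectors = U[:, :k] * s[:k]   # one reduced word vector per term

print(dict(zip(terms, word_vectors.round(2))))
```

Even after compression, terms with similar co-occurrence patterns (here blaze and flames) keep similar vectors.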

### 3. Clustering related words into concept groups

So far we have only shown how close neighbouring words are to a target query. In this section we describe how clustering can be used to group these neighbours into more interesting contextual groups. These can often represent different usages and even different meanings of ambiguous words. An example using the word plant is given below.

| Prototypical example | Cluster members |
|----------------------|-----------------|
| plants | plant plants abundant seeds fertiliser fungi algae aquatic containers poisonous vine temperate habitat arctic herbs ponds grow flora fruits insect medicinal pests |
| radioactive | chernobyl reactor reactors generating radioactive nuclear sellafield plutonium reprocessing uranium mw radiation greenpeace bnfl disposal radioactivity cegb waste environmentalists stations hydro scientist power sizewell |
| chemicals | chemicals toxic chemical organic contamination bacteria pesticides sewage hazards ici dioxide wastes biomass |
| dioxide | nitrogen greenhouse fuels sulphur fuel electricity carbon polluting emissions conserve fossil oxide pollutants |
| foliage | planting flowering flower buds planted foliage flowers evergreen seed shrubs bulbs winter stems seedlings compost spring |
| factory | machinery robots factory factories |
| crops | soil fertilisers nutrients vegetation crop harvested cultivation crops |
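The grouping itself can be done with any standard clustering algorithm. Here is a sketch using plain k-means over invented two-dimensional vectors for a few neighbours of plant (the Infomap demo's own clustering method may differ in detail):

```python
import numpy as np

# Hypothetical 2-d vectors: a botanical group and an industrial group
# among the neighbours of "plant" (values invented for illustration).
words = ["flower", "seed", "foliage", "reactor", "nuclear", "factory"]
X = np.array([[0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
              [0.10, 0.90], [0.20, 0.80], [0.15, 0.85]])

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        # move each centroid to the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

labels = kmeans(X, k=2)
clusters = {j: [w for w, l in zip(words, labels) if l == j]
            for j in set(labels.tolist())}
print(clusters)
```

On real WORDSPACE vectors the clusters separate different usages of the ambiguous word, as the plant table above shows.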

### 4. Plotting Word-Senses in two dimensions

As well as clustering, you can often see different word-senses by plotting the neighbours of a word in a 2-dimensional plane that is the "plane of best fit" for the data. This is just like projecting points onto the line of best fit for 2-dimensional data, only we're projecting many more dimensions down onto a plane. The Word-Spectrum demo demonstrating this process was written by Scott Cederberg.
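Finding the plane of best fit is a standard use of the SVD: centre the neighbour vectors and keep the top two right singular vectors. A sketch with random stand-in data (real input would be the neighbours' WORDSPACE vectors):

```python
import numpy as np

# Stand-in data: 8 "neighbour" vectors in 10 dimensions.
rng = np.random.default_rng(1)
points = rng.normal(size=(8, 10))

# Centre the data, then take the top two right singular vectors:
# these span the plane of best fit, just as the top singular direction
# of centred 2-d data is the line of best fit.
centred = points - points.mean(axis=0)
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
plane = Vt[:2]               # 2 x 10 orthonormal basis for the plane
coords = centred @ plane.T   # 8 x 2 coordinates, ready to plot

print(coords.shape)
```

The resulting 2-d coordinates are what a demo like Word-Spectrum would scatter-plot, with different senses typically falling into different regions of the plane.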

### 5. Mapping between two different WORDSPACES

By taking the neighbours of a word into account, you can translate its meaning much more effectively than by just shipping the query word itself from one model to another. For example, the neighbours of gas in the New York Times WORDSPACE are much more closely related to the neighbours of petroleum than to those of gas in the WORDSPACE built from the British National Corpus. This section gives a few useful examples of this process. There is no online demo as yet: this work is fairly bleeding-edge and currently runs only through a command-line interface.
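One way to sketch this neighbour-based mapping: collect the neighbours of the word in the source space, keep those that also occur in the target vocabulary, and average their target-space vectors to form a query. All vectors and vocabularies below are invented for illustration:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(space, query, exclude=()):
    """Word in `space` whose vector is most similar to `query`."""
    return max((w for w in space if w not in exclude),
               key=lambda w: cosine(space[w], query))

# Two hypothetical WORDSPACEs with invented 2-d vectors.  In space_a,
# "gas" is used in the American sense (fuel); space_b uses British terms.
space_a = {"gas": [0.90, 0.10], "gallon": [0.85, 0.20], "pump": [0.80, 0.15]}
space_b = {"petrol": [0.88, 0.18], "gallon": [0.80, 0.25],
           "pump": [0.82, 0.10], "cooker": [0.10, 0.90]}

# Ship the *neighbourhood*, not just the word: average the space_b
# vectors of those neighbours of "gas" that exist in both vocabularies.
shared = [w for w in space_a if w != "gas" and w in space_b]
query = [sum(space_b[w][i] for w in shared) / len(shared) for i in range(2)]
print(nearest(space_b, query, exclude=shared))
```

In this toy setting the neighbourhood of gas maps to petrol, whereas shipping the bare word would have failed (gas is not even in the target vocabulary here, and in a real British corpus it would mean something different).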

### 6. Bilingual vector models

Suppose you want to translate not just between different models but also between different languages. In a few special cases, this can be done because we have a parallel corpus, that is, a document collection that exists in more than one language, for which we know exactly which pairs of documents (or even sentences) in each language are translations of one another. From this, we can build a WORDSPACE that represents terms and documents in both languages, and can even represent queries using a combination of terms from each language. The example below shows the neighbours of the German word Knochen in both German and English, including its correct translation, the English word bone. The English word drug is also used as an example, showing that the bilingual WORDSPACE can be used to find translations that correspond to the different meanings of an ambiguous word.

English terms:

| Term | Similarity |
|------|------------|
| bone | 0.822905 |
| osteoinductive | 0.669857 |
| demineralized | 0.600958 |
| formation | 0.598052 |
| extracted | 0.559023 |
| trabeculae | 0.544799 |

German terms:

| Term | Similarity |
|------|------------|
| knochen | 1.000000 |
| knochenneubildung | 0.630791 |
| bone | 0.630306 |
| allogenen | 0.598935 |
| knochens | 0.593084 |
| knochentransplantation | 0.587973 |
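One way to realise such a bilingual space, sketched with invented counts: stack the term-document counts for both languages over the same aligned document columns, then take a single SVD. Because a term and its translation occur in aligned documents, their rows (and hence their latent vectors) come out similar:

```python
import numpy as np

# Hypothetical parallel corpus of 4 aligned document pairs.  Rows are
# terms from *both* languages; columns are the aligned documents, so an
# English term and its German translation have similar rows.
terms = ["bone", "fracture", "knochen", "bruch", "music", "musik"]
counts = np.array([
    [3, 2, 0, 0],   # bone
    [2, 3, 0, 0],   # fracture
    [3, 2, 0, 0],   # knochen
    [1, 3, 0, 0],   # bruch
    [0, 0, 3, 2],   # music
    [0, 0, 2, 3],   # musik
], dtype=float)

# One SVD over the stacked matrix places terms of both languages in a
# single latent space, where translations end up close together.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
vecs = dict(zip(terms, U[:, :2] * s[:2]))

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cos(vecs["bone"], vecs["knochen"]), cos(vecs["bone"], vecs["musik"]))
```

In the real bilingual WORDSPACE the same effect is visible in the tables above: knochen's nearest neighbours include its English translation bone.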

### 7. Using WORDSPACE and class labelling to enrich a taxonomy

Finally, this section shows how WORDSPACE can be used to enrich a taxonomy. If we want to know where in a taxonomy to place an unknown word that appears in a corpus, we collect neighbours from WORDSPACE and then see where in the taxonomy these neighbours are concentrated using the class-labelling algorithm of Chapter 3. This enables us to estimate what hypernyms should be attached to the unknown word.
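The class-labelling algorithm itself is described in Chapter 3; the toy sketch below just conveys the idea, with an invented taxonomy and a simple depth-discounted vote standing in for the actual scoring. Each WORDSPACE neighbour votes for its ancestors, and the node where the votes concentrate is proposed as a hypernym:

```python
from collections import Counter

# child -> parent links in a toy taxonomy (invented for illustration)
parent = {"apple": "fruit", "pear": "fruit", "banana": "fruit",
          "fruit": "food", "bread": "food", "food": "entity"}

def ancestors(word):
    """All hypernyms of `word`, nearest first."""
    out = []
    while word in parent:
        word = parent[word]
        out.append(word)
    return out

# Suppose the unknown word's nearest WORDSPACE neighbours are:
neighbours = ["apple", "pear", "banana", "bread"]

# Each neighbour votes for its ancestors; more general hypernyms are
# discounted by depth, so the most *concentrated* node wins rather
# than the taxonomy's root.
votes = Counter()
for n in neighbours:
    for depth, a in enumerate(ancestors(n), start=1):
        votes[a] += 1.0 / depth

best = votes.most_common(1)[0][0]
print(best, dict(votes))
```

Here the votes concentrate on fruit, which would be proposed as the unknown word's hypernym; the depth discount is only one of several plausible ways to keep overly general nodes like entity from winning by default.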