Chapter 5. Word-Vectors and Search Engines

This chapter was used as teaching material for the 2003 Computational Word Learning course at Stanford. You can therefore download this version as a sample chapter.

So far in the book we have discussed symmetric and antisymmetric relationships between particular words in a graph or a hierarchy, described one way to learn symmetric relationships from text, and shown how to use ideas such as similarity measures and transitivity to find the nearest neighbours of a particular word. But ideally we should be able to measure the similarity or distance between any pair of words or concepts. To some extent this is possible in graphs and taxonomies by finding the lengths of paths between concepts, but there are at least three problems with this. First of all, finding shortest paths can be computationally expensive. Secondly, we may not have a reliable taxonomy, and as we've seen already, a short path between two words in a graph doesn't necessarily mean that they're very similar, because the links in this short path may have arisen from very different contexts. Thirdly, the meanings of words we encounter in documents and corpora may be very different from those given by a general taxonomy such as WordNet: for example, WordNet 2.0 gives only the fruit and tree meanings for the word apple, in decided contrast with the top 10 pages currently returned by the Google search engine for the query apple, which are all about Apple Computers.
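To see the second of these problems in miniature, here is a small Python sketch of similarity measured by shortest path length. The word graph is invented purely for illustration: a breadth-first search finds that banana and keyboard are only four links apart, even though the links joining them come from entirely different contexts (fruit on one side, computing on the other).

```python
from collections import deque

# A tiny hand-made word graph (hypothetical links, for illustration only).
# Each edge records only that two words were linked, not in what context.
graph = {
    "apple":    {"fruit", "computer"},
    "fruit":    {"apple", "banana"},
    "banana":   {"fruit"},
    "computer": {"apple", "keyboard"},
    "keyboard": {"computer"},
}

def path_length(graph, start, goal):
    """Breadth-first search: length of the shortest path between two words."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        word, dist = queue.popleft()
        for neighbour in graph[word]:
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # no path at all

# "banana" and "keyboard" are only 4 links apart, even though the links
# that connect them arose from completely different contexts.
print(path_length(graph, "banana", "keyboard"))  # 4
```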

Another limitation of our methods so far is that we have focussed our attention purely on individual concepts, mainly single words. Ideally, we should be able to find the similarity between two collections of words, and quickly. For this, we need some process for semantic composition --- working out how to represent the meanings of a sentence or document based on the meaning of the words it contains. This all sounds like a pretty tall order, but (to some extent) it's actually been possible for years and you will probably have encountered such a system many times --- it's precisely what a search engine does.

The traditional ways in which a search engine accomplishes these tasks are almost entirely mathematical rather than linguistic, and are based upon numbers rather than meanings. The meaning of each word is represented by a characteristic list of numbers derived from its distribution in a corpus, and the numbers representing a whole document are then obtained simply by averaging the numbers for the words it contains. Bizarre but effective!
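To make the idea concrete, here is a minimal Python sketch of one deliberately crude way to build such lists of numbers: each word's list records how often the word appears in each document of a toy corpus, and a document's list is the average of the lists for its words. The corpus, the counting scheme, and the function names are invented for illustration; real search engines use far more refined weightings.

```python
# A toy corpus: each "document" is just a list of words.
docs = [
    "apple banana fruit".split(),
    "apple computer keyboard".split(),
    "banana fruit market".split(),
]

# Word vector: how often the word occurs in each document (one number per document).
def word_vector(word):
    return [d.count(word) for d in docs]

# Document vector: the average of the vectors of the words it contains.
def document_vector(words):
    vectors = [word_vector(w) for w in words]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

print(word_vector("apple"))                    # [1, 1, 0]
print(document_vector("apple fruit".split()))  # [1.0, 0.5, 0.5]
```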

This chapter is all about the lists of numbers used to represent words and documents; these lists are called vectors. The discussion of vectors will finally bring us to talk in detail about different dimensions and what they mean to a mathematician. This in itself might be of interest to many readers, and may be quite different from what you supposed.

Sections

  1. What are Vectors?
  2. Journeys in the Plane
  3. Coordinates, Bases and Dimensions
  4. Search Engines and Vectors
  5. Similarity and Distance Functions for Vectors
  6. Document Vectors and Information Retrieval

Examples, Demo and Software

This chapter introduces the material that drove the Infomap WORDSPACE demo (now deprecated), a search engine which effectively builds its own automatic thesaurus.

The Infomap WORDSPACE software is freely available and can be downloaded from http://infomap-nlp.sourceforge.net/. The software takes a corpus of natural language documents and builds word vectors like those described in this chapter. A more recent, stable and fuller-featured alternative is the SemanticVectors package.
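To give a flavour of what such a package does with the word vectors it builds, here is a minimal Python sketch of the "automatic thesaurus" idea: rank other words by their cosine similarity to a query word. The vectors below are invented toy values, and the code is not taken from the Infomap or SemanticVectors packages; it only illustrates the nearest-neighbour computation discussed in this chapter.

```python
import math

# Hypothetical word vectors (in a real system these would be built from a corpus).
vectors = {
    "apple":    [2.0, 0.1, 1.5],
    "banana":   [1.8, 0.0, 1.4],
    "keyboard": [0.1, 2.2, 0.2],
    "computer": [0.3, 2.0, 0.4],
}

def cosine(u, v):
    """Cosine of the angle between two vectors: 1 means parallel, 0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbours(word, k=2):
    """Rank the other words by cosine similarity to the given word."""
    scores = [(other, cosine(vectors[word], v))
              for other, v in vectors.items() if other != word]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]

print(nearest_neighbours("apple"))  # 'banana' comes out as the nearest neighbour
```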

