Chapter 4. Measuring Similarity and Distance

So far we have considered sets of objects, and relationships between these objects such as the symmetric relationships in the graph model of Chapter 2, and the asymmetric or vertical relationships of the concept hierarchies in Chapter 3. We have started to feel our way around a graph or a hierarchy, following chains of relationships and learning about the shape of the space in the vicinity of a few interesting example words.

We have still said very little about measuring these relationships. In building the graph we considered how much weight should be given to each link, and gave some rules of thumb for ranking links. When using WordNet, we wanted to give greater significance to the relationship between oak and tree than to the relationship between oak and living thing, because oak is closer to tree than it is to living thing.

In order to obtain more general results, we need to be able to compare relationships more systematically. One of the most tried and tested mathematical techniques for doing this is to measure the distance between two points. Conversely, we might try to measure the similarity between two points (small distances corresponding to large similarities and large distances corresponding to small similarities).

This chapter introduces some of the ways that have been used to measure distances and similarities in mathematical spaces such as graphs and hierarchies. The most standard distance measures in mathematics are called metrics, which must satisfy certain conditions or axioms (such as being symmetric).

However, we must also take great care to test whether these mathematical techniques are actually appropriate when we are dealing with language. For example, we shall see that some of the properties of metrics are not always ideal for describing distances between words and concepts, and this chapter presents a few case studies of what can go wrong if we are not careful and sensitive to the goal of our work, which is not to apply mathematical ideas of distance for their own sake but to infer similarity of meaning in natural language.

Measuring similarity between words can enable us to build whole classes of words with similar semantic properties --- sets of words which are all tools or all musical instruments, for example --- and we demonstrate that the graph model of Chapter 2 can be used for this purpose very successfully, provided we take great care in the presence of ambiguous words. Ambiguous words can sometimes behave like semantic wormholes, accidentally transporting us from one area of meaning to another. But we can also use this effect positively, because finding those words which have this strange wormhole effect when measuring distances helps us to recognize which words are ambiguous in the first place.


Sections

1. Distance Functions

This section introduces the idea of a distance function -- a number assigned to a pair of points in a space, indicating how far those points are from one another. A distance function is called a metric if it is always positive (except when measuring the distance from any point to itself, which must be zero), if it is always symmetric (no one-way streets), and if it permits no short-cuts or wormholes.
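The following minimal Python sketch (an illustration, not code from the book) checks these three axioms over a finite set of points with a given distance function:

    # Check the three metric axioms over a finite set of points,
    # given a distance function d(a, b). Purely illustrative.
    from itertools import combinations, permutations

    def is_metric(points, d):
        # Positivity: d(a, a) == 0, and d(a, b) > 0 for distinct points.
        if any(d(a, a) != 0 for a in points):
            return False
        if any(d(a, b) <= 0 for a, b in combinations(points, 2)):
            return False
        # Symmetry: no one-way streets.
        if any(d(a, b) != d(b, a) for a, b in combinations(points, 2)):
            return False
        # Triangle inequality: no short-cuts or wormholes.
        return all(d(a, c) <= d(a, b) + d(b, c)
                   for a, b, c in permutations(points, 3))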

Distance functions depend on the space you're working in -- one may choose different ways of measuring distance that are appropriate for different applications. (For example, people often quote driving distances in minutes rather than miles, because time is the more relevant consideration.)

We consider shortest paths in graphs and the Euclidean distance (given by Pythagoras' theorem) as examples.
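As a small illustration (the toy graph below is invented for this sketch), both example distances can be computed in a few lines of Python:

    import math
    from collections import deque

    def euclidean(p, q):
        # Distance in the plane, given by Pythagoras' theorem.
        return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

    def shortest_path_length(graph, start, end):
        # Breadth-first search gives shortest paths in an unweighted graph.
        queue, seen = deque([(start, 0)]), {start}
        while queue:
            node, dist = queue.popleft()
            if node == end:
                return dist
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, dist + 1))
        return None  # no path between the two nodes

    toy_graph = {"oak": {"tree"}, "tree": {"oak", "plant"}, "plant": {"tree"}}
    print(euclidean((0, 0), (3, 4)))                        # 5.0
    print(shortest_path_length(toy_graph, "oak", "plant"))  # 2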

2. Similarity Measures

A similarity measure is the converse of a distance function: it takes a pair of points and returns a large similarity value for nearby points and a small similarity value for distant points. One way to transform between a distance function and a similarity measure is to take the reciprocal, the standard method for transforming between resistance and conductance in physics and electronics.
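In code, a minimal sketch might look like this (the '+ 1' in the denominator is one common convention, added only to keep the similarity finite when the distance is zero; the pure reciprocal s = 1/d is the exact analogue of conductance = 1/resistance):

    def similarity_from_distance(d):
        # Large distances give small similarities, and vice versa.
        return 1.0 / (1.0 + d)

    def distance_from_similarity(s):
        # Inverse of the transform above.
        return (1.0 / s) - 1.0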

The cosine measure is a very important similarity measure for points in the plane (and in higher dimensions, as we shall see later in the book). The cosine measure assigns a high similarity to points that lie in the same direction from the origin, zero similarity to points that are perpendicular to one another, and negative similarity to points that point in opposing directions.
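As a minimal sketch (illustrative Python, treating each point as a vector from the origin):

    import math

    def cosine(u, v):
        # Cosine of the angle between u and v: dot product over norms.
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    print(cosine((1, 0), (2, 0)))   # 1.0, same direction
    print(cosine((1, 0), (0, 3)))   # 0.0, perpendicular
    print(cosine((1, 0), (-1, 0)))  # -1.0, opposing directions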

This section also describes interesting similarity and distance measures that have been used to compare words in a taxonomy.
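The measures themselves are discussed in the chapter; as one illustrative sketch (a toy taxonomy and a simple path-based score, not necessarily one of the measures the chapter covers), words joined by a short path through a common ancestor can be counted as more similar:

    # Toy hypernym taxonomy: each word maps to its parent.
    hypernym = {"oak": "tree", "willow": "tree", "tree": "plant",
                "plant": "living thing"}

    def ancestors(word):
        chain = [word]
        while chain[-1] in hypernym:
            chain.append(hypernym[chain[-1]])
        return chain

    def path_similarity(w1, w2):
        # 1 / (1 + length of the path through the lowest common ancestor).
        a1, a2 = ancestors(w1), ancestors(w2)
        common = [a for a in a1 if a in a2]
        if not common:
            return 0.0
        lca = common[0]
        return 1.0 / (1.0 + a1.index(lca) + a2.index(lca))

    print(path_similarity("oak", "tree"))          # 0.5
    print(path_similarity("oak", "living thing"))  # 0.25, farther away

This toy score reproduces the intuition from the start of the chapter: oak comes out as more similar to tree than to living thing.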

3. Which Distance Axioms are Appropriate for Word Meanings?

Mathematical laws used to describe behaviour in one domain are not always appropriate in other domains, and there are long-standing psychological objections to the axioms used to define a distance metric. For example, a metric will always give the same distance from a to b as from b to a, but in practice we are more likely to say that a child resembles their parent than to say that a parent resembles their child.

This section explores some of these objections to the mathematical abstraction, and considers ways of measuring distances between words that are more sensitive to these psychological objections. In particular, by factoring out the overall frequency or popularity of a word in a graph, we show that similarities can be measured in the word graph of Chapter 2 in a way that models the psychological observations much more effectively.
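As a sketch in the spirit of this idea (not the book's exact formula; the toy graph is invented), we can measure how strongly a points to b as the fraction of a's links that are shared with b. Dividing by a's own number of links factors out popularity, and the resulting measure is deliberately asymmetric:

    # Toy word-graph, invented for illustration.
    graph = {"parent": {"child", "family", "guardian", "adult"},
             "child": {"parent", "family"},
             "family": {"parent", "child"},
             "guardian": {"parent"},
             "adult": {"parent"}}

    def directed_similarity(graph, a, b):
        # Fraction of a's neighbours that are also neighbours of b.
        return len(graph[a] & graph[b]) / len(graph[a])

    # 'child' shares half of its few links with 'parent', but the
    # better-connected 'parent' shares only a quarter of its links:
    print(directed_similarity(graph, "child", "parent"))  # 0.5
    print(directed_similarity(graph, "parent", "child"))  # 0.25

Like the psychological observation above, the toy model makes the child resemble the parent more than the parent resembles the child.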

4. Transitive Relationships

If A is the same as B and B is the same as C then it follows that A is the same as C, right? But what if A is similar to B and B is similar to C? Then we need to be a lot more careful about drawing any further conclusions.

Relationships where you can make inferences like this are called transitive (not to be confused with transitive verbs). Some relationships in natural language and spatial reasoning are transitive, such as the 'is above' relationship. Others are sometimes transitive, such as 'is next to'.
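A tiny sketch with invented similarity scores makes the contrast concrete: exact sameness chains together indefinitely, but 'similar enough' (similarity above some threshold) need not:

    # Invented pairwise similarity scores.
    sim = {("A", "B"): 0.9, ("B", "C"): 0.9, ("A", "C"): 0.2}

    def similar(x, y, threshold=0.8):
        return sim.get((x, y), sim.get((y, x), 0.0)) >= threshold

    print(similar("A", "B"), similar("B", "C"))  # True True
    print(similar("A", "C"))                     # False: not transitive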

5. Transitivity, Ambiguity and Word-Learning

This section shows why transitivity is important for word-learning, and in particular shows that the transitive assumption is particularly dangerous if used carelessly in the vicinity of ambiguous words. Ambiguous words are often exceptions to the 'no short-cuts' rule, the triangle inequality defined by the metric space axioms. A word-learning algorithm can accidentally be transported from one area of meaning to another by an ambiguous word -- in many ways, ambiguous words behave like semantic wormholes in our word-graph.
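With invented distances, the wormhole effect can be seen directly: an ambiguous word such as rock sits close to granite (the stone sense) and to jazz (the music sense), so the two-step path through rock is far shorter than any direct relation between the two senses, and the triangle inequality fails:

    # Invented distances around the ambiguous word 'rock'.
    dist = {("granite", "rock"): 1, ("rock", "jazz"): 1,
            ("granite", "jazz"): 10}

    def d(x, y):
        return dist.get((x, y), dist.get((y, x)))

    # d(granite, jazz) > d(granite, rock) + d(rock, jazz): the
    # 'no short-cuts' axiom fails, and 'rock' acts as a wormhole.
    print(d("granite", "jazz"), ">", d("granite", "rock") + d("rock", "jazz"))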

6. Splitting the Semantic Atom

We can turn this argument around and use it to detect ambiguous words: since the transitive assumption is not valid for ambiguous words, nodes around which transitivity is a bad assumption are good candidates for ambiguity. One way of finding such nodes is to use the curvature measure of Eckmann and Moses, which measures the extent to which the neighbours of a node in a graph are likely also to be neighbours of one another. Another way is to use clustering in the graph and to find nodes whose neighbours belong to radically different clusters. The goal of "splitting the semantic atom" is certainly ambitious, but by this point considerable progress has been made. Further techniques will be described later in the book.
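As a sketch of the curvature idea in its simplest form (the local clustering coefficient, following the description above; the toy graph is invented), we can compute the fraction of pairs of a node's neighbours that are themselves linked:

    from itertools import combinations

    def curvature(graph, node):
        # Fraction of pairs of neighbours that are themselves linked.
        neighbours = graph[node]
        if len(neighbours) < 2:
            return 1.0
        pairs = list(combinations(neighbours, 2))
        linked = sum(1 for a, b in pairs if b in graph[a])
        return linked / len(pairs)

    # Toy graph: 'rock' bridges the stone sense and the music sense.
    graph = {"rock": {"granite", "stone", "jazz", "pop"},
             "granite": {"rock", "stone"},
             "stone": {"rock", "granite"},
             "jazz": {"rock", "pop"},
             "pop": {"rock", "jazz"}}

    print(curvature(graph, "rock"))     # 2/6: low, candidate ambiguous word
    print(curvature(graph, "granite"))  # 1.0: neighbours all know each other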

References and Further Research

The following paper explains how my colleague Beate Dorow used a technique called Markov Clustering to find clusters of words in a word-graph that signify different senses of those words. We then used the class-labelling technique introduced in Chapter 3 to map these empirically derived senses into WordNet.
Beate Dorow and Dominic Widdows. Discovering Corpus-Specific Word Senses. EACL 2003 Conference Companion (research notes and demos), pages 79-82, Budapest, Hungary, 2003.
More recently, we discovered that treating links as atomic units and aggregating these links to form word senses is often a much more natural approach than trying to split words. This work, along with the application of the curvature measure to semantic graphs, is presented in the following paper:
Beate Dorow, Dominic Widdows, Katerina Ling, Jean-Pierre Eckmann, Danilo Sergi and Elisha Moses. Using Curvature and Markov Clustering in Graphs for Lexical Acquisition and Word Sense Discrimination. MEANING-2005, 2nd Workshop organized by the MEANING Project, Trento, Italy, February 3-4, 2005.
