03 March 2009

Clustering Considered Harmful II: Distance and Scaling

[This is the second in a planned series of posts on clustering (outlined here) that I'll publish over the coming period. When it's finished, I'll collect it all together into a properly printable paper.]

I said in the previous post that pretty much all you have to do to apply cluster analysis to a dataset is to pass in the coordinates of each point to be clustered. And this is the case when the points exist in Euclidean space or something similar—anywhere, in fact, where there is a well-defined, unambiguous and reasonable distance function (metric).

But whereas we can ordinarily agree on the distance between each pair of objects in our solar system (at a given point in time), or indeed, between each pair of people on the earth ("as the crow flies"), defining the distance between a pair of customers in a customer database requires rather more decisions to be taken.

Numeric Variables with Common Units

First, even for those variables or dimensions that are ordinary numbers, there are potential questions of units and scaling. Even if they are all expressed in units of a common currency (say, pounds sterling), there might be a question as to whether, for example, a difference of £1,000 in value of house (say between £248,000 and £249,000) is really equivalent to (say) a difference of £1,000 a month in interest payments (perhaps between £200 and £1,200); or indeed, whether it is useful to treat either of these as equivalent to a difference of £1,000 a month in change in savings balance (say between +£500 and –£500).

Numeric Variables with Different Units

There is even greater scope for argument when the variables measure fundamentally different kinds of things. Consider, for example, the variables number of children, and income. Suppose we measure these for four people as follows:

Customer   Number of children   Income
Ann        1                    £15,000
Bianca     1                    £40,000
Ciara      0                    £15,000
Dee        3                    £20,000

Which of Bianca, Ciara and Dee is closest to Ann? What is the appropriate conversion rate between number of children and income? There is, of course, no correct answer. It depends what we mean; and what we are trying to achieve.
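To make the dependence on the conversion rate concrete, here is a minimal sketch in Python. The function names and the parameter pounds_per_child are invented for illustration, and Euclidean distance is just one possible choice; the only data used is the table above.

```python
import math

# The four customers from the table: (number of children, income in pounds).
customers = {
    "Ann": (1, 15_000),
    "Bianca": (1, 40_000),
    "Ciara": (0, 15_000),
    "Dee": (3, 20_000),
}

def distance(a, b, pounds_per_child):
    """Euclidean distance after converting children to pounds at a chosen rate."""
    children_gap = (a[0] - b[0]) * pounds_per_child
    income_gap = a[1] - b[1]
    return math.hypot(children_gap, income_gap)

def nearest_to_ann(pounds_per_child):
    """Name of the customer closest to Ann under the chosen conversion rate."""
    ann = customers["Ann"]
    others = {name: v for name, v in customers.items() if name != "Ann"}
    return min(others, key=lambda name: distance(ann, others[name], pounds_per_child))

# Two arbitrary, equally defensible-looking rates give different answers.
for rate in (1_000, 100_000):
    print(rate, nearest_to_ann(rate))
```

If one child is taken to be worth £1,000, Ciara is Ann's nearest neighbour; at £100,000 per child it is Bianca. Nothing in the data says which rate to prefer.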

Z-Scores

It is important to emphasize that the nature of the challenge here is not technical: there is no difficulty at all in devising conversion rates, and even rationales for those conversion rates. One particularly common approach is to compute so-called z-scores, by subtracting the population mean from each variable and dividing the result by the standard deviation, i.e.

z(x) = (x – μ) / σ.

While this approach unarguably provides a unique and unambiguous way of determining a scale among numeric variables, it does rather seem to fly in the face of the very object of our exercise, which is to understand the inhomogeneities.
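As a rough sketch of what z-scoring does to the four customers above, the following Python treats them as the entire population (an assumption made purely for illustration, hence the population standard deviation). The helper name z_scores and the variable implied_pounds_per_child are mine, not part of any standard recipe.

```python
from statistics import mean, pstdev

children = [1, 1, 0, 3]                    # Ann, Bianca, Ciara, Dee
income = [15_000, 40_000, 15_000, 20_000]  # pounds

def z_scores(values):
    """z(x) = (x - mu) / sigma, using the population mean and standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

children_z = z_scores(children)
income_z = z_scores(income)

# After scaling, one standard deviation of children counts the same as one
# standard deviation of income, which amounts to a data-dependent exchange rate.
implied_pounds_per_child = pstdev(income) / pstdev(children)
print(round(implied_pounds_per_child))
```

The scaled variables each have mean zero and unit standard deviation, but the exchange rate this implies (a little under £9,500 per child for these four people) is an accident of the sample rather than a statement about customers.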

To illustrate why z-scores are not really a solution, suppose we were trying to cluster stars in our galaxy (space being a domain in which the three spatial dimensions clearly are commensurate). As is well known, the Milky Way is essentially a disc whose thickness is a mere thousand light years but whose diameter is some hundred times greater, i.e. roughly 100,000 light years. Let us suppose we place the origin of a (Euclidean) coordinate system at the centre of the galaxy, with x- and y-axes in the plane of its disc, and a z-axis perpendicular to it. Then, using the z-score approach, we would treat the average distance from the galactic centre perpendicular to the plane of the galaxy (presumably rather less than 250 light years) as being equivalent (in fact, equal) to the average distance from the centre in the plane of the disc, which we may suppose is more like 20,000 light years. Needless to say, this is madness.
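The effect can be seen in a toy simulation. The sketch below scatters points uniformly through a flat cylinder with the dimensions quoted above; the uniform density, the sample size and all the names are assumptions for illustration, not a model of the real galaxy.

```python
import math
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the sketch is repeatable

RADIUS = 50_000        # light years: half the 100,000 light year diameter
HALF_THICKNESS = 500   # light years: half the 1,000 light year thickness

def random_star():
    """One point drawn uniformly from the toy disc (a crude assumption)."""
    r = RADIUS * math.sqrt(random.random())  # sqrt gives uniform density over the disc
    theta = random.uniform(0, 2 * math.pi)
    return (r * math.cos(theta),
            r * math.sin(theta),
            random.uniform(-HALF_THICKNESS, HALF_THICKNESS))

def standardise(column):
    """Replace each value by its z-score within the column."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

stars = [random_star() for _ in range(5_000)]
xs, ys, zs = (list(axis) for axis in zip(*stars))

raw_spread = [pstdev(axis) for axis in (xs, ys, zs)]
scaled_spread = [pstdev(standardise(axis)) for axis in (xs, ys, zs)]

print("raw spread per axis:   ", [round(s) for s in raw_spread])
print("scaled spread per axis:", [round(s, 3) for s in scaled_spread])
```

Before scaling, the spread in the plane of the disc is far larger than the spread perpendicular to it. After scaling, all three axes have the same unit spread, so the disc has been stretched into something much closer to a ball and its most striking structural feature has been erased.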

Categorical Variables

While it is not necessary to run through every type of data that might occur, categorical variables are common enough to merit some discussion. Whether coded as numbers or strings, the defining characteristic of categorical data is that the values are unordered. Common examples in customer data include gender (typical values male; female), house type (flat; terraced; semi-detached; detached) or last purchase (Film: Monty Python's Life of Brian (DVD); Book: The Unbearable Lightness of Being, Milan Kundera; Beverage: 70cl bottle of Ardbeg (10 yr) Single Malt Whisky). In some cases, the different values are all simply different, islands unto themselves, while in others they may fall into some (possibly partially ordered) structure such as a hierarchy; or indeed into several hierarchies.

When considering the distance between two members of our customer base, how then should we handle categorical data? And how should we combine categorical and non-categorical variables into a single, meaningful distance measure? Is the difference between a man and a woman zero, £1,000, 2.5 inches, 15lbs, 30 points of emotional intelligence or is the question too silly for words?

It is again important to be clear that the challenge is not technical. We can certainly use the discrete metric and simply say that if the two genders or house types or most recent purchases are the same, then the distance between them is zero, and if they are different it is one. But as before, it is hard not to feel that this solution is more satisfactory from the perspective of one who seeks to hammer the square peg of customer reality into the round hole of clustering than to either a passionate or dispassionate inquisitor seeking truth and insight into her customers; it is the tidy-but-unhelpful bureaucrat's solution.
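A minimal sketch of the discrete metric in Python shows how little it takes, and how little it says. The function mixed_distance and its parameter pounds_per_mismatch are hypothetical, introduced only to show that combining a categorical variable with a numeric one forces yet another arbitrary conversion rate; the incomes in the example calls are made up.

```python
import math

def discrete(a, b):
    """Discrete metric: 0 for identical values, 1 for any two different values."""
    return 0 if a == b else 1

house_types = ["flat", "terraced", "semi-detached", "detached"]

# Every pair of distinct house types is exactly one unit apart, so the metric
# cannot express that some house types might be more alike than others.
pairwise = {(a, b): discrete(a, b) for a in house_types for b in house_types}

def mixed_distance(income_a, house_a, income_b, house_b, pounds_per_mismatch):
    """Combine an income gap with a house-type mismatch at a chosen rate."""
    return math.hypot(income_a - income_b,
                      pounds_per_mismatch * discrete(house_a, house_b))

# The same two (made-up) customers are near or far depending on the rate chosen.
print(mixed_distance(15_000, "flat", 20_000, "detached", pounds_per_mismatch=1_000))
print(mixed_distance(15_000, "flat", 20_000, "detached", pounds_per_mismatch=50_000))
```

The code is trivial, which is rather the point: the difficulty lies in justifying the numbers fed into it, not in writing it.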

In the next part of this series, we will focus on the question of commensurability.
