03 March 2009

Clustering Considered Harmful: I. Introduction

[This is the first in a planned series of posts on clustering (outlined here) that I'll publish over the coming period. When it's finished, I'll collect it all together into a properly printable paper.]

Lumpiness

Wherever we look we see inhomogeneity. The universe is lumpy at all scales. Stars clump together as galaxies, galaxies form clusters, and clusters themselves cohere into superclusters. More locally, almost all the matter in our solar system is in the Sun; most of that which isn't is in Jupiter; and almost all the rest is tightly packed into other planets. Even towards the bottom, matter is lumpy, each atom being made up of a tiny sun-like nucleus surrounded mostly by empty space and a very diffuse cloud of point-like electrons.

People clump too. Rather than spreading ourselves out uniformly over the planet's landmass, we form dense populations in some areas (Monaco; Singapore; Bangladesh) and inhabit other places more sparsely (Greenland; Australia; Canada). Similarly, most countries have their high-density cities (Mumbai; Karachi; Lagos) and their sparse regions (the Gobi; Wyoming; Rannoch Moor).

Truly, this is one lumpy universe.

Cluster Analysis

Cluster analysis is the name given to a collection of techniques for finding the clumps (or clusters). The idea is simply to make a list of the coordinates of the items of interest (the stars, the people or the atoms), possibly to set some parameters (like a target number of clusters), and then to feed the list into an algorithm that divides it into groups: the "clusters". Typically, each cluster will be characterized by a cluster centre (the coordinates of some centre of gravity of its members), and perhaps by some measure of compactness.
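To make that recipe concrete, here is a minimal sketch using k-means, which is only one of many possible algorithms and is my choice for illustration. The function names (assign, kmeans, compactness), the toy coordinates, the naive "first k points" starting centres and the use of mean distance as the compactness measure are all assumptions made for this example, not anything prescribed above.

```python
from math import dist
from statistics import fmean


def assign(points, centres):
    """Group each point with its nearest centre (Euclidean distance)."""
    groups = [[] for _ in centres]
    for p in points:
        nearest = min(range(len(centres)), key=lambda i: dist(p, centres[i]))
        groups[nearest].append(p)
    return groups


def kmeans(points, k, iterations=20):
    """Illustrative k-means: returns (centres, groups)."""
    centres = list(points[:k])  # assumption: naive start from the first k points
    for _ in range(iterations):
        groups = assign(points, centres)
        # Move each centre to the centre of gravity of its members;
        # an empty group keeps its old centre.
        centres = [
            tuple(fmean(axis) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centres)
        ]
    return centres, assign(points, centres)


def compactness(centre, members):
    """Mean distance from the centre to its members (smaller means tighter)."""
    return fmean(dist(centre, p) for p in members)


# Toy data (hypothetical): two obvious lumps in the plane.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centres, groups = kmeans(points, k=2)  # k is the "target number of clusters"
for centre, members in zip(centres, groups):
    print(centre, members, round(compactness(centre, members), 3))
```

Note that the algorithm returns exactly k groups whether or not the data really contains k lumps; the target number is an input, not a discovery.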

There are two main families of techniques for clustering: "bottom-up" techniques start by assigning each point to its own cluster and then merge clusters in close proximity (so-called agglomerative or accretive methods), while "top-down" approaches begin by creating a single cluster and then look for ways to divide it and its children into ever smaller subclusters (so-called divisive approaches). There are interesting and bizarre variations, particularly so-called self-organizing maps, of which the Kohonen Network is a popular example with appealing properties.
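The bottom-up family can be sketched in a few lines. This is a toy version under stated assumptions: "close proximity" is taken to mean the smallest distance between any two members of two clusters (single linkage), merging stops at a target number of clusters, and the names single_linkage and agglomerate are invented for the example. Real packages offer many other linkage rules and are far more efficient.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of members."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, target):
    """Start with one cluster per point; merge the closest pair until
    only `target` clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target:
        # combinations() yields index pairs with i < j, so removing
        # cluster j does not disturb the position of cluster i.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters.pop(j)
        clusters[i].extend(merged)
    return clusters


# Toy data (hypothetical): two obvious lumps in the plane.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
clusters = agglomerate(points, target=2)
print(clusters)
```

A divisive method would run the other way, starting from the single cluster of all six points and splitting it.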

Numerous algorithms exist, and many software packages, free and commercial, implement one or more. I will mention David Wishart's Clustan as a fine example. David has applied his software and his approach widely, not least to the important problem of classifying single malt whiskies[1] (though I personally favour an approach based more on repeated sampling without replacement . . .).

From Matter and Space to Customers and Marketing

Though I do not know, I imagine that somewhere in the not-too-distant mists of time, a marketer wanting to understand his (or her) customers must have read about cluster analysis and had the idea of applying it to his customer database. "After all," he perhaps reasoned, "we often think of the columns in our database as dimensions. Maybe, just as stars naturally clump together in space, customers naturally cluster in the virtual hyperspace of my database; and if so, perhaps cluster analysis could help me to uncover the natural, intrinsic, underlying structure within my customer base. Separating my customers into their natural groups," he may have continued to reason, "will surely help me to design marketing programmes that are well tuned to their different needs, attitudes, aspirations and behaviours."

Whether it happened quite like this or not, it has certainly become a practice among some marketers to do just this. The rest of this series will critically examine this approach and (ultimately) argue that there are some fairly major challenges and problems with the application of clustering methods to customer databases, and that companies' interests would often be better served by alternative, more direct approaches.

In the next part of this series (now published!), we will focus on the questions of distance and scaling.

Notes

1 Whisky Classified. David Wishart, Pavilion Books (London), 2006.
