11 February 2007

Clustering Considered Harmful: Outline

Marvin. What’s wrong? We’re doomed! It’s the curse of dimensionality. All our segments are meaningless! Oh dear . . . Would it help if we renamed them? It’s worse than that: we’re going to have to look at the data, Jim.

It's pretty obvious that the distribution of matter in space is lumpy. Matter clumps into planets, planets orbit stars, stars clump into galaxies, galaxies group together into clusters, and (wouldn't you know it) clusters form superclusters. Thank gravity for that. There are clumps of matter, and other things, at smaller scales too. Atoms are mostly empty space, with a lot of stuff at the centre (the nucleus), and electrons like to hang around at particular distances away from the nucleus (the shells), though it's hard to pin them down. Similarly, people clump together on the Earth. London, Tokyo and São Paulo are pretty crowded; the Sahara, the Highlands of Scotland and central Australia mostly aren't. (People are quite hard to pin down, too.)

Cluster analysis is a set of techniques for taking the coordinates of a lot of objects (stars, particles, people…) and figuring out something about where the lumps are. There are lots of ways to do it.

Someone, deep in the mists of time, had the bright idea of applying cluster analysis to customers to figure out “where the clumps are”. The idea wasn't to use geographical coordinates (of their houses, say), but to replace coordinates with customer characteristics, like demographics (age, income etc.), behavioural measurements (spend levels, frequencies, balances etc.), and maybe attitudinal things like psychographics. That way, they thought, they might uncover the “natural groupings” of customers, which could be useful for understanding their dynamics and for segmenting them.

While it was far from a stupid idea, it turns out that it was an extremely bad idea, one that at best has wasted countless thousands of hours of analyst time, and at worst has led to baseless conclusions and highly suboptimal marketing.

There are far too many problems to do justice to in a single blog post, so I won't. Instead, I'll list some headlines here, and over the coming weeks I'll do an entry on each one. Then maybe I'll gather them into an absurdly long whole.

Here are some of the headline reasons that clustering customer characteristics isn't useful.

  • There's no real evidence that customers cluster.
  • Different customer characteristics are non-commensurate.
  • Circularity: practitioners think they're just finding “the natural clusters”, but in fact the results are entirely dictated by decisions made up-front (often without realising it) about scaling. Different scaling choices lead to different clusters, so the clusters are unstable.
  • The curse of dimensionality means that clustering doesn't really work in more than a few dimensions.
  • Clustering is undirected.
  • Clusters are hard to interpret. So people give them names. And then the names become the meaning.
  • For (almost) every problem tackled with undirected clustering, there's a directed approach that will (almost always) work better.
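
To make the scaling point a little more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the four customers, their ages and incomes, and the `two_means` helper, which is a deliberately tiny two-cluster k-means and not anything a real clustering package would ship. The only thing that changes between the two runs is the unit in which income is recorded.

```python
from math import dist

def two_means(points, iterations=10):
    """Tiny 2-means sketch: returns a cluster label (0 or 1) for each point."""
    # Deterministic start: the first point, and the point farthest from it.
    centres = [points[0], max(points, key=lambda p: dist(p, points[0]))]
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest centre (plain Euclidean distance).
        labels = [min((0, 1), key=lambda k: dist(p, centres[k])) for p in points]
        # Move each centre to the mean of the points assigned to it.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centres[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

# Hypothetical customers: (age in years, income in pounds).
in_pounds = [(25, 20_000), (27, 60_000), (65, 21_000), (63, 61_000)]

# The same customers, with income recorded in units of 100,000 pounds.
in_hundred_thousands = [(age, income / 100_000) for age, income in in_pounds]

print(two_means(in_pounds))             # [0, 1, 0, 1]: split by income
print(two_means(in_hundred_thousands))  # [0, 0, 1, 1]: split by age
```

With income in pounds, the income axis swamps the age axis and the “natural” clusters are low earners and high earners. With income in hundreds of thousands of pounds, age dominates and the clusters are the young and the old. Same customers, same algorithm; only the up-front choice of units differs.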

For avoidance of doubt (as lawyers say), and notwithstanding the impression the title of this entry may give, the problem isn't cluster analysis per se, which is a perfectly fine collection of techniques. If you want to find the clumps in a low-dimensional space with commensurate dimensions, it's exactly what you need. It's just that that isn't a very good description of a customer base.
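
Why “low-dimensional” matters can be sketched with a small experiment, again in Python. The assumptions here are mine: the points are drawn uniformly at random from a unit hypercube (which is certainly not a customer base), and the `relative_contrast` function is a hypothetical helper measuring how much farther the farthest point is than the nearest one, relative to the nearest. When that number is close to zero, “near” and “far” mean almost the same thing, and distance-based lumps have little to work with.

```python
import random
from math import dist

def relative_contrast(dimensions, n_points=500, seed=0):
    """(farthest - nearest) / nearest, measured from one random query point."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    query = [rng.random() for _ in range(dimensions)]
    distances = [
        dist(query, [rng.random() for _ in range(dimensions)])
        for _ in range(n_points)
    ]
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

# Print the contrast for a few dimensionalities.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 2))
```

In two dimensions the farthest point is many times farther away than the nearest; as the number of dimensions grows, the contrast shrinks and all points end up at roughly the same distance from the query. This is only one facet of the curse of dimensionality, and uniform random data is a worst case of sorts, but it gives a feel for the problem.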

