22 October 2008

Pre-, Post-, Uplift

Question: When I'm measuring uplift, do I need to measure behaviour in both a “pre”-period and a “post”-period?

Answer: No. You only need a post-period.

I've been asked essentially this question twice in the last 10 days, first by a client and then by a blog reader (who knew they existed?). And as every investigator knows, two is a pattern,¹ ² so it seemed worth a blog post.

Discussion / Justification

In the absence of control groups, it is natural to investigate the impact of an action by measuring some behaviour before and after the intervention. For example, it is common for people to measure something like usage over a period of six weeks before and after a mailing, and then to attribute any observed change in behaviour to the mailing. Such an approach is exactly the one we tend to use instinctively in ordinary life. But of course, the fundamental objection to it, and the one that leads us to consider control groups and uplift, is that it does not allow us to separate the effect of our intervention (in this case, the mailing) from everything else that might cause behaviour to change.

I have written in many places that uplift is defined as, in the case of a binary outcome, R, such as purchase,

U = P (R | T) – P (R | C)

where U is the uplift, P is probability (or observed rate), T denotes treatment, C denotes non-treatment (control).
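As a minimal sketch of that definition, using observed rates in place of probabilities: the function name `binary_uplift` and the data below are my own inventions for illustration, and the only assumption is that each customer's post-period outcome is recorded as 0 or 1.

```python
def binary_uplift(treated_outcomes, control_outcomes):
    """Estimate U = P(R | T) - P(R | C) from observed post-period rates.

    Each argument is a sequence of 0/1 outcomes (1 = responded, e.g. purchased).
    """
    if not treated_outcomes or not control_outcomes:
        raise ValueError("both groups must be non-empty")
    treated_rate = sum(treated_outcomes) / len(treated_outcomes)
    control_rate = sum(control_outcomes) / len(control_outcomes)
    return treated_rate - control_rate


# Made-up data: 3 of 10 treated customers purchase, against 2 of 10 controls.
treated = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
control = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(f"uplift = {binary_uplift(treated, control):.2f}")
```

Note that nothing measured before the treatment appears anywhere in the calculation.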

In the case of continuous outcomes, such as spend, S, we similarly have

U = E (S | T) – E (S | C)

where E denotes an expectation value.
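The continuous case can be sketched the same way, with the sample mean standing in for the expectation. Again, `spend_uplift` and the spend figures are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from statistics import mean


def spend_uplift(treated_spend, control_spend):
    """Estimate U = E(S | T) - E(S | C) using post-period sample means."""
    return mean(treated_spend) - mean(control_spend)


# Made-up post-period spend per customer (zeros are customers who spent nothing).
treated_spend = [0.0, 12.5, 0.0, 30.0, 7.5]   # mean 10.0
control_spend = [0.0, 10.0, 0.0, 20.0, 0.0]   # mean 6.0
print(f"spend uplift = {spend_uplift(treated_spend, control_spend):.2f}")
```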

However, there is a temptation for people instead to define it as (in the binary case)

U = [P (R | T; post) – P (R | T; pre)] – [P (R | C; post) – P (R | C; pre)] (FLAWED!)

where “post” and “pre” denote measurements after and before the time at which the treatment is applied.

This is wrong.

In order for the control group to be valid, it must be drawn from the same population as the treated group. In practice this means that we must first form a candidate “eligible” population and then randomly allocate people from it to each of the treatment and control populations (though not necessarily in equal numbers). If this is done, then obviously the probability of the outcome of interest is by definition the same in the two populations before the treatment. That is not, of course, to say that if we measure it we will observe an identical rate, but if our control group is valid, any variation will be due to sampling error. (Indeed, for this reason, it is often a good idea to stratify the population on the basis of previous response behaviour, so that we do get identical values for P (R|T; pre) and P (R|C; pre), up to integer rounding effects.)
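One way the stratified allocation might look in practice is sketched below. This is an illustration under my own assumptions, not a prescribed procedure: the function `stratified_allocation`, its parameters, and the toy population of (customer id, pre-period response) pairs are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_allocation(eligible, stratum_of, treat_fraction=0.5, seed=0):
    """Randomly split an eligible population into treatment and control,
    one stratum at a time, so both groups get the same mix of strata."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in eligible:
        strata[stratum_of(person)].append(person)

    treated, control = [], []
    for members in strata.values():
        rng.shuffle(members)  # random allocation within the stratum
        n_treated = round(len(members) * treat_fraction)
        treated.extend(members[:n_treated])
        control.extend(members[n_treated:])
    return treated, control


# Hypothetical eligible population: (customer_id, responded_in_pre_period),
# with one customer in five having responded in the pre-period.
population = [(i, 1 if i % 5 == 0 else 0) for i in range(1000)]
treated, control = stratified_allocation(population, stratum_of=lambda p: p[1])

pre_rate_t = sum(r for _, r in treated) / len(treated)
pre_rate_c = sum(r for _, r in control) / len(control)
print(pre_rate_t, pre_rate_c)
```

Because each stratum is split in the same proportion, the two pre-period rates agree by construction here; with stratum sizes that do not divide evenly they would differ only by rounding.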

So the first thing to note is that if our control group is valid, the two pre-period terms cancel in expectation and the more complex formula reduces to the simpler one. To the extent that it does not, it is less accurate, by virtue of having added in extra noise through pointless sampling error. This, if you like, is the numerical objection to the more involved formula.
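The numerical objection can be illustrated with a small simulation. This is only a sketch under stated assumptions: the rates (10% before treatment for everyone, 10% afterwards for controls, 12% afterwards for the treated), the group sizes and the function name `simulate` are all invented, and, importantly, each person's pre- and post-period outcomes are drawn independently of one another.

```python
import random
from statistics import mean, pstdev


def simulate(n_trials=1000, n_per_group=400, p_pre=0.10,
             p_control=0.10, p_treated=0.12, seed=1):
    """Repeat a made-up experiment many times and return both estimates.

    Assumption: pre- and post-period outcomes are independent Bernoulli draws.
    """
    rng = random.Random(seed)

    def observed_rate(p):
        # Observed response rate in one group of n_per_group people.
        return sum(rng.random() < p for _ in range(n_per_group)) / n_per_group

    simple, pre_post = [], []
    for _ in range(n_trials):
        t_pre, c_pre = observed_rate(p_pre), observed_rate(p_pre)
        t_post, c_post = observed_rate(p_treated), observed_rate(p_control)
        simple.append(t_post - c_post)                        # post-period only
        pre_post.append((t_post - t_pre) - (c_post - c_pre))  # the flawed form
    return simple, pre_post


simple, pre_post = simulate()
print(f"post-only: mean {mean(simple):.4f}, spread {pstdev(simple):.4f}")
print(f"pre/post : mean {mean(pre_post):.4f}, spread {pstdev(pre_post):.4f}")
```

Under these assumptions both estimates centre on the true uplift of two percentage points, but the pre/post version is more widely spread, because it carries the sampling error of the two pre-period measurements as well.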

But there is an even more important philosophical objection to the more complex formula. For it conflates two quite different ideas. The first is the one we are interested in: the impact of our treatment. The other is the change in behaviour over time. And while it is true that our goal is typically (for example) to increase a purchase rate over time, that is emphatically not what we should be trying to measure here. The whole idea of marketing is to change behaviour relative to what it would have been without the intervention. If sales would have fallen without the intervention, but our intervention reduces that fall, then our intervention has had a positive impact. Of course, “the patient may still die”,³ and that is very much a matter of legitimate concern, but the goal of uplift is to measure the success of the treatment.
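A few lines of arithmetic make the distinction concrete. The purchase rates here are invented purely for illustration: both groups start at 10%, the control group falls to 6%, and the treated group falls only to 8%.

```python
# Hypothetical purchase rates, made up for illustration only.
control_pre, control_post = 0.10, 0.06   # untreated: rate falls by 4 points
treated_pre, treated_post = 0.10, 0.08   # treated: rate falls by only 2 points

change_over_time = treated_post - treated_pre   # what happened to the treated
uplift = treated_post - control_post            # what the treatment achieved

print(f"change over time for treated: {change_over_time:+.2f}")
print(f"uplift:                       {uplift:+.2f}")
```

The treated group's purchase rate went down, yet the uplift is positive: the treatment helped, even though the patient's condition worsened.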

Footnotes

¹ As a theoretical physicist by training, I have to work hard to avoid falling into the trap of following Maier's Law, which states that “when the facts don't conform to the theory, they must be disposed of” (attributed to Maier, N. R. F. (1929), Reasoning in White Rats. Comp. Psy. Mono, 6 29, in Roeckelein, Jon E., Dictionary of Theories, Laws and Concepts in Psychology, Greenwood Publishing Group (Westport) 1998, as “if the data do not fit the theory, the data must be disposed of”.) I remember a colleague, after much work, producing a single data point and proudly showing me a graph containing said observation and the theoretical behaviour. What was striking was that the observation was nowhere near the curve. This didn't perturb my colleague unduly, leading to the observation that it takes a theoretical physicist to fail to fit a curve through even a single data point!

² It was, of course, Oscar Wilde who wrote, “To lose one parent, Mr Worthing, may be regarded as a misfortune; to lose both looks like carelessness.” (The Importance of Being Earnest, Oscar Wilde). While theoreticians are often guilty of playing too fast and loose with the data, it's depressing how many non-scientists (and, for that matter, scientists!) are impressed beyond reason when they manage to fit a straight line through two data points.

³ “The treatment was a complete success; unfortunately, the patient died.” I am distressed to be unable to find a source for this aphorism; if anyone has a reliable source, please do let me know.
