
An illustration of Simpson's paradox for continuous data: while a positive trend is seen for the two separate groups (blue and red), a negative trend (black, dashed) appears when the data is combined.
'Simpson's paradox' (or the 'Yule-Simpson effect') is a
statistical paradox in which the successes of several groups seem to be reversed when the groups are combined. This seemingly impossible result is encountered surprisingly often in social science and medical statistics,
[1] and occurs when a weighting variable which is not relevant to the individual group assessment must be used in the combined assessment. Judea Pearl
[2] has argued that the effect appears paradoxical only because of our tendency to give causal interpretation to changes in proportions.
While most people do not know about this paradox, it is well known to statisticians and is described in several introductory statistics books.
[3] Many statisticians believe that the public should be made more aware of counterintuitive results such as Simpson's paradox,
[4] in particular to caution against the inference of causal relationships based on the association between two variables.
[5]
The phenomenon was described by Edward H. Simpson in 1951,[6] by Karl Pearson et al.,[7] and by Udny Yule in 1903.[8]
The name 'Simpson's paradox' was coined by Colin R. Blyth in 1972.
[9]
Since Simpson was not the discoverer of this paradox, some authors have instead used impersonal names such as reversal paradox or amalgamation paradox to refer to it.
[10]
Examples
Batting averages
A common example of the paradox involves
batting averages in baseball: it is possible for one player to hit for a higher batting average than another player during a given year, and to do so again during the next year, but to have a lower batting average when the two years are combined. This phenomenon is well-known among sports
sabermetricians such as
Bill James, who has called attention to it.
A real-life example is provided by Ken Ross
[11] and involves the batting average of baseball players
Derek Jeter and
David Justice during the years 1995 and 1996:
[12]
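| | 1995 (hits/at-bats) | 1995 average | 1996 (hits/at-bats) | 1996 average | Combined (hits/at-bats) | Combined average |
|---|---|---|---|---|---|---|
| Derek Jeter | 12/48 | .250 | 183/582 | .314 | 195/630 | **.310** |
| David Justice | 104/411 | **.253** | 45/140 | **.321** | 149/551 | .270 |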
In both 1995 and 1996, Justice had a higher batting average (in bold) than Jeter; however, when the two years are combined, Jeter shows a higher batting average than Justice. According to Ross, this phenomenon would be observed about once per year among the interesting baseball players. In this particular case, the paradox can still be observed if the year 1997 is also taken into account:
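| | 1995 (hits/at-bats) | 1995 average | 1996 (hits/at-bats) | 1996 average | 1997 (hits/at-bats) | 1997 average | Combined (hits/at-bats) | Combined average |
|---|---|---|---|---|---|---|---|---|
| Derek Jeter | 12/48 | .250 | 183/582 | .314 | 190/654 | .291 | 385/1284 | **.300** |
| David Justice | 104/411 | **.253** | 45/140 | **.321** | 163/495 | **.329** | 312/1046 | .298 |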
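The reversal in the batting tables can be checked by pooling hits and at-bats directly. The following is a minimal sketch using the figures quoted above; the names `SEASONS`, `batting_average` and `leader` are illustrative and do not come from Ross.

```python
from fractions import Fraction

# (hits, at-bats) per season, copied from the tables above.
SEASONS = {
    "Derek Jeter": {1995: (12, 48), 1996: (183, 582), 1997: (190, 654)},
    "David Justice": {1995: (104, 411), 1996: (45, 140), 1997: (163, 495)},
}

def batting_average(player, years):
    """Pooled average: total hits divided by total at-bats over `years`."""
    hits = sum(SEASONS[player][y][0] for y in years)
    at_bats = sum(SEASONS[player][y][1] for y in years)
    return Fraction(hits, at_bats)

def leader(years):
    """Name of the player with the higher pooled average over `years`."""
    return max(SEASONS, key=lambda p: batting_average(p, years))

for years in ([1995], [1996], [1997], [1995, 1996], [1995, 1996, 1997]):
    row = ", ".join(f"{p}: {float(batting_average(p, years)):.3f}" for p in SEASONS)
    print(years, "->", row, "| higher:", leader(years))
```

Justice leads in every single season, while Jeter leads in both pooled comparisons, because most of Jeter's at-bats fall in his stronger seasons and most of Justice's 1995–1996 at-bats fall in his weaker one.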
Kidney stone treatment
This is a real-life example from a medical study
[13] comparing the success rates of two treatments for
kidney stones.
[14]
The first table shows the overall success rates and numbers of treatments for both treatments:
| Treatment A | Treatment B |
|---|---|
| 78% (273/350) | **83% (289/350)** |
This seems to show treatment B is more effective. If we include data about kidney stone size, however, the same set of treatments reveals a different answer:
| | Treatment A | Treatment B |
|---|---|---|
| Small Stones | *Group 1*: **93% (81/87)** | *Group 2*: 87% (234/270) |
| Large Stones | *Group 3*: **73% (192/263)** | *Group 4*: 69% (55/80) |
| Both | 78% (273/350) | **83% (289/350)** |
The information about stone size has reversed our conclusion about the effectiveness of each treatment. Now treatment A is seen to be more effective in both cases. In this example the lurking variable (or
confounding variable) of stone size was not previously known to be important until its effects were included.
Which treatment is considered better is determined by an inequality between two ratios (successes/total). The reversal of the inequality between the ratios, which creates Simpson's paradox, happens because two effects occur together:
1. The sizes of the groups that are combined when the lurking variable is ignored are very different. Doctors tend to give the severe cases (large stones) the better treatment (A), and the milder cases (small stones) the inferior treatment (B). Therefore, the totals are dominated by groups 3 and 2, and not by the two much smaller groups 1 and 4.
2. The lurking variable has a large effect on the ratios, i.e. the success rate is more strongly influenced by the severity of the case than by the choice of treatment. Therefore, the group of patients with large stones using treatment A (group 3) does worse than the group with small stones, even if the latter used the inferior treatment B (group 2).
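Both effects can be read off the counts in the table. The sketch below recomputes the success rates and the share of large-stone cases per treatment; the helper names `rate` and `share_large` are illustrative only and are not taken from the study.

```python
# Successes and patient counts from the kidney stone table above.
DATA = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

def rate(treatment, size=None):
    """Success rate for a treatment, optionally restricted to one stone size."""
    cells = [v for (t, s), v in DATA.items()
             if t == treatment and (size is None or s == size)]
    successes = sum(c[0] for c in cells)
    patients = sum(c[1] for c in cells)
    return successes / patients

def share_large(treatment):
    """Fraction of a treatment's patients who had large stones."""
    large = DATA[(treatment, "large")][1]
    total = large + DATA[(treatment, "small")][1]
    return large / total

for size in ("small", "large", None):
    label = size or "both"
    print(f"{label:>5}: A = {rate('A', size):.0%}, B = {rate('B', size):.0%}")

# Effect 1: the two treatments were given to very different case mixes.
print(f"large-stone share: A = {share_large('A'):.0%}, B = {share_large('B'):.0%}")
```

About three quarters of the treatment A patients had large stones, against fewer than a quarter of the treatment B patients, which is why the pooled rate for A is pulled down toward its large-stone rate.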
Berkeley sex bias case
One of the best-known real-life examples of Simpson's paradox occurred when the
University of California, Berkeley was sued for bias against women applying to
graduate school. The admission figures for fall 1973 showed that men applying were more likely than women to be admitted, and the difference was so large that it was unlikely to be due to chance.
[15][3]
| | Applicants | % admitted |
|---|---|---|
| Men | 8442 | **44%** |
| Women | 4321 | 35% |
However, when the individual departments were examined, it was found that no department was significantly biased against women; in fact, most departments had a small bias against men.
| Major | Men: applicants | Men: % admitted | Women: applicants | Women: % admitted |
|---|---|---|---|---|
| A | 825 | 62% | 108 | **82%** |
| B | 560 | 63% | 25 | **68%** |
| C | 325 | **37%** | 593 | 34% |
| D | 417 | 33% | 375 | **35%** |
| E | 191 | **28%** | 393 | 24% |
| F | 272 | 6% | 341 | **7%** |
The explanation turned out to be that women tended to apply to departments with low rates of admission, while men tended to apply to departments with high rates of admission. The conditions under which department-specific frequency data constitute a proper defense against charges of discrimination are formulated in Pearl (2000).
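The role of application patterns can be illustrated with the six majors listed in the table. This is only a rough sketch: the percentages are rounded, so the pooled figures are approximate, and they cover the listed majors only, so they differ from the 44% and 35% reported for all applicants. The names `MAJORS`, `pooled_rate` and `share_in` are illustrative.

```python
# (applicants, percent admitted) for men and women in each listed major.
MAJORS = {
    "A": {"men": (825, 62), "women": (108, 82)},
    "B": {"men": (560, 63), "women": (25, 68)},
    "C": {"men": (325, 37), "women": (593, 34)},
    "D": {"men": (417, 33), "women": (375, 35)},
    "E": {"men": (191, 28), "women": (393, 24)},
    "F": {"men": (272, 6), "women": (341, 7)},
}

def pooled_rate(sex):
    """Applicant-weighted admission rate (in percent) over the listed majors."""
    applicants = sum(m[sex][0] for m in MAJORS.values())
    admitted = sum(m[sex][0] * m[sex][1] / 100 for m in MAJORS.values())
    return 100 * admitted / applicants

def share_in(sex, majors):
    """Fraction of a group's listed applications that went to `majors`."""
    total = sum(m[sex][0] for m in MAJORS.values())
    return sum(MAJORS[k][sex][0] for k in majors) / total

women_ahead = [k for k, m in MAJORS.items() if m["women"][1] > m["men"][1]]
print("majors where women's rate is higher:", women_ahead)
print(f"pooled: men {pooled_rate('men'):.1f}%, women {pooled_rate('women'):.1f}%")
print(f"share applying to A or B: men {share_in('men', 'AB'):.0%}, "
      f"women {share_in('women', 'AB'):.0%}")
```

Women have the higher admission rate in four of the six majors, yet the pooled rate favours men, because more than half of the men's applications in this table went to the two majors with the highest admission rates.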
2006 US school study
In July 2006, the
United States Department of Education released a study
[17] documenting student performances in reading and math in different school settings.
[18] It reported that while the math and reading levels for students at grades 4 and 8 were uniformly higher in private/parochial schools than in public schools, repeating the comparisons on demographic subgroups showed much smaller differences which were nearly equally divided in direction.
Low birth weight paradox
Main articles: Low birth weight paradox
The low birth weight paradox is an apparently paradoxical observation relating to the birth weights and mortality of children born to tobacco-smoking mothers. Traditionally, babies weighing less than a certain amount (which varies between countries) have been classified as having *low birth weight*. In a given population, low birth weight babies have a significantly higher mortality rate than others. However, it has been observed that low birth weight children born to smoking mothers have a *lower* infant mortality rate than the low birth weight children of non-smokers.
[19]
Description of the paradox

Illustration of Simpson's Paradox; The first graph represents Lisa's contribution, the second one Bart's. The red bars represent the first week, the blue bars the second week; the triangles indicate the combined percentage of good contributions (weighted average). While Bart's bars both show a higher rate of success than Lisa's, Lisa's combined rate is higher because most of her contributions were good, while most of Bart's are of lower quality.
To illustrate the paradox, suppose two people, Lisa and Bart, each edit
Wikipedia articles for two weeks. In the first week, Lisa improves 60 percent of the articles she edits while Bart improves 90 percent of the articles he edits. In the second week, Lisa improves just 10 percent of the articles she edits, while Bart improves 30 percent.
Both times, Bart improved a much higher percentage of articles than Lisa, yet when the two weeks are combined, Lisa has improved a much higher percentage than Bart!
| | Week 1 | Week 2 | Total |
|---|---|---|---|
| Lisa | 60% | 10% | **55.5%** |
| Bart | **90%** | **30%** | 35.5% |
This result comes about because of the varying number of articles worked on by each person, information that was not given in the initial presentation. In the first week, Lisa edits 100 articles, improving 60 of them, while Bart edits just 10 articles, improving all but one. In the second week, Lisa edits only 10 articles, improving one, while Bart edits 100 articles, improving 30. When the two weeks' worth of work is combined, both have edited the same number of articles, yet Lisa improved about 55% of them (61 in total) while Bart improved only about 35% of them (39 in total).
| | Week 1 | Week 2 | Total |
|---|---|---|---|
| Lisa | 60/100 | 1/10 | **61/110** |
| Bart | **9/10** | **30/100** | 39/110 |
To recap, introducing some notation that will be useful later:
★ In the first week
: ★ $S_A(1) = 60\%$: Lisa improved 60% of the many articles she edited.
: ★ $S_B(1) = 90\%$: Bart had a 90% success rate during that time.
: Success is associated with Bart.
★ In the second week
: ★ $S_A(2) = 10\%$: Lisa managed 10% in her busy life.
: ★ $S_B(2) = 30\%$: Bart achieved a 30% success rate.
: Success is associated with Bart.
On both occasions Bart's edits were more successful than Lisa's. But if we combine the two sets, we see that Lisa and Bart both edited 110 articles, and:
★ $S_A = \frac{61}{110}$: Lisa improved 61 articles.
★ $S_B = \frac{39}{110}$: Bart improved only 39.
★ $S_A > S_B$: Success is now associated with Lisa.
Bart is better for each set but worse overall!
The paradox stems from our healthy intuition that Bart could not possibly be a better editor on each set but worse overall. Pearl (2000) in fact proved that this is impossible when "better editor" is taken in the counterfactual sense: "Were Bart to edit all items in a set he would do better than Lisa would, on those same items." Clearly, frequency data cannot support this sense of "better editor," because it does not tell us how Bart would perform on items edited by Lisa, and vice versa. In the back of our mind we assume that the articles were assigned at random to Bart and Lisa, an assumption which (for a large sample) would support the counterfactual interpretation of "better editor." However, under random assignment conditions, the data given in this example is impossible, which accounts for our surprise when confronting the rate reversal.
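The arithmetical basis of the paradox is uncontroversial. If Bart's success rate exceeds Lisa's in each week, $S_B(1) > S_A(1)$ and $S_B(2) > S_A(2)$, we feel that his overall rate $S_B$ *must be greater* than her overall rate $S_A$. However, if *different* weights are used to form the overall score for each person, then this feeling may be disappointed. Here the first test is weighted $\frac{100}{110}$ for Lisa and $\frac{10}{110}$ for Bart, while the weights are reversed on the second test.
★ $S_A = \frac{100}{110} S_A(1) + \frac{10}{110} S_A(2)$
★ $S_B = \frac{10}{110} S_B(1) + \frac{100}{110} S_B(2)$
By more extreme reweighting, Lisa's overall score $S_A$ can be pushed up towards 60% and Bart's overall score $S_B$ down towards 30%.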
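The effect of the weights can be checked numerically. The following sketch holds the weekly success rates fixed at the values from the example and varies only the share of each editor's work done in the first week; the larger article counts in the loop are hypothetical and are not part of the example.

```python
from fractions import Fraction

# Weekly success rates from the example (fixed); only the weights vary.
LISA = (Fraction(60, 100), Fraction(10, 100))   # week 1, week 2
BART = (Fraction(90, 100), Fraction(30, 100))

def overall(rates, week1_weight):
    """Weighted average of two weekly rates; the two weights sum to 1."""
    w = Fraction(week1_weight)
    return w * rates[0] + (1 - w) * rates[1]

# Weights used in the example: Lisa 100/110 on week 1, Bart 10/110.
lisa_total = overall(LISA, Fraction(100, 110))
bart_total = overall(BART, Fraction(10, 110))
print(lisa_total, bart_total)          # 61/110 39/110

# More extreme (hypothetical) weights push the totals toward 60% and 30%.
for n in (100, 1000, 10000):
    w = Fraction(n, n + 10)            # n articles in the busy week, 10 in the other
    print(n, float(overall(LISA, w)), float(overall(BART, 1 - w)))

# If both editors share the same weights, Bart stays ahead overall.
same_weights_ok = all(
    overall(BART, Fraction(k, 10)) > overall(LISA, Fraction(k, 10))
    for k in range(11)
)
print(same_weights_ok)
```

The last check shows that the reversal depends entirely on the two editors having different weights: with a common weighting, the editor who is ahead in both weeks is also ahead overall.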
Lisa is a better editor on average, as her overall success rate is higher. But it is possible to have told the story in a way which would make it appear obvious that Bart is more diligent.
Simpson's paradox shows us an extreme example of the importance of including data about possible confounding variables when attempting to calculate causal relations. Precise criteria for selecting a set of "confounding variables" (i.e., variables that yield correct causal relationships if included in the analysis) are given in Pearl (2000) using causal graphs.
While Simpson's paradox often refers to the analysis of count tables, as shown in this example, it also occurs with continuous data:[20] for example, if one fits separate regression lines through two sets of data, the two regression lines may show a positive trend, while a regression line fitted through all data together may show a *negative* trend, as shown in the picture above.
Vector interpretation

Simpson's paradox can also be illustrated using the 2-dimensional vector space. A success rate of $p/q$ can be represented by a vector $\vec{A} = (q, p)$, with a slope of $p/q$. If two rates $p_1/q_1$ and $p_2/q_2$ are combined, as in the examples given above, the result can be represented by the sum of the vectors $(q_1, p_1)$ and $(q_2, p_2)$, which, according to the parallelogram rule, is the vector $(q_1 + q_2, p_1 + p_2)$, with slope $\frac{p_1 + p_2}{q_1 + q_2}$.
Simpson's paradox says that even if a vector $\vec{b}_1$ (in blue in the figure) has a smaller slope than another vector $\vec{r}_1$ (in red), and $\vec{b}_2$ has a smaller slope than $\vec{r}_2$, the sum of the two vectors $\vec{b}_1 + \vec{b}_2$ (indicated by "+" in the figure) can still have a larger slope than the sum of the two vectors $\vec{r}_1 + \vec{r}_2$, as shown in the example.
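The vector picture can be tried out on the Lisa and Bart counts from the earlier section. In this sketch each vector is written as (attempts, successes), so its slope is the success rate; assigning Lisa the role of the lower-slope vectors is a choice made for this illustration, and the helper names are hypothetical.

```python
from fractions import Fraction

def slope(v):
    """Slope p/q of a vector v = (q, p), i.e. successes over attempts."""
    q, p = v
    return Fraction(p, q)

def add(u, v):
    """Component-wise vector sum (the parallelogram rule)."""
    return (u[0] + v[0], u[1] + v[1])

# (articles edited, articles improved) per week, from the Lisa and Bart example.
lisa = [(100, 60), (10, 1)]
bart = [(10, 9), (100, 30)]

lisa_sum = add(*lisa)
bart_sum = add(*bart)

for week, (lv, bv) in enumerate(zip(lisa, bart), start=1):
    print(f"week {week}: Lisa slope {slope(lv)}, Bart slope {slope(bv)}")
print(f"sum: Lisa {lisa_sum} slope {slope(lisa_sum)}, "
      f"Bart {bart_sum} slope {slope(bart_sum)}")
```

Lisa's steeper vector is also her longer one, and Bart's longer vector is his flatter one, so the sums end up ordered the opposite way from the individual pairs.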
References
1. Clifford H. Wagner (1982). "Simpson's Paradox in Real Life". *The American Statistician*.
2. Judea Pearl (2000). *Causality: Models, Reasoning, and Inference*. Cambridge University Press. ISBN 0-521-77362-8.
3. David Freedman, Robert Pisani and Roger Purves (1998). *Statistics* (3rd edition). W.W. Norton. ISBN 0-393-97083-3.
4. Robert L. Wardrop (February 1995). "Simpson's Paradox and the Hot Hand in Basketball". *The American Statistician*, **49** (1): pp. 24–28.
5. Alan Agresti (2002). *Categorical Data Analysis* (2nd edition). John Wiley and Sons. ISBN 0-471-36093-7.
6. Edward H. Simpson (1951). "The Interpretation of Interaction in Contingency Tables". *Journal of the Royal Statistical Society, Ser. B*.
7.
8.
9. Colin R. Blyth (1972). "On Simpson's Paradox and the Sure-Thing Principle". *Journal of the American Statistical Association*.
10. I. J. Good and Y. Mittal (1987). "The Amalgamation and Geometry of Two-by-Two Contingency Tables". *The Annals of Statistics*.
11. Ken Ross (2004). *A Mathematician at the Ballpark: Odds and Probabilities for Baseball Fans* (paperback). Pi Press. ISBN 0131479903. pp. 12–13.
12. Statistics available from http://www.baseball-reference.com/ : Data for Derek Jeter, Data for David Justice.
13.
14.
15. P.J. Bickel, E.A. Hammel and J.W. O'Connell (1975). "Sex Bias in Graduate Admissions: Data From Berkeley". *Science*.
16. David Freedman, Robert Pisani and Roger Purves (1998). *Statistics* (3rd edition). W.W. Norton. ISBN 0-393-97083-3.
17. H. Braun, F. Jenkins and W. Grigg (2006). "Comparing Private Schools and Public Schools Using Hierarchical Linear Modeling". U.S. Department of Education, National Center for Education Statistics, Institute of Education Sciences. Washington, DC: United States Government Printing Office.
18. Diana Jean Schemo. "Public Schools Perform Near Private Ones in Study". *The New York Times*, 15 July 2006. Retrieved on 25 July 2007.
19. Wilcox, Allen (2006). "The Perils of Birth Weight — A Lesson from Directed Acyclic Graphs". ''American Journal of Epidemiology''. 164(11):1121–1123.
20. John Fox (1997). *Applied Regression Analysis, Linear Models, and Related Methods*. Sage Publications. ISBN 080394540X. pp. 136–137.
External links
For a brief history of the origins of the paradox see the entries on Simpson's Paradox and Spurious Correlation in
★ Earliest known uses of some of the words of mathematics: S
Other links:
★ "The Art and Science of Cause and Effect": a slide show and tutorial lecture by Judea Pearl
★ Simpson's Paradox: An Anatomy by Judea Pearl
★ Mediant Fractions at cut-the-knot
★ Simpson's Paradox at cut-the-knot
★ Stanford Encyclopedia of Philosophy entry