Anscombe's quartet

{{Short description|Four data sets with the same descriptive statistics, yet very different distributions}}

[[File:Anscombe's quartet 3.svg|thumb|The four datasets composing Anscombe's quartet. All four sets have identical statistical parameters, but the graphs show them to be considerably different]]

Anscombe's quartet comprises four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data when analyzing it, and the effect of outliers and other influential observations on statistical properties. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough".{{cite journal |last=Anscombe |first=F. J. |authorlink=Frank Anscombe |title=Graphs in Statistical Analysis |journal=American Statistician |volume=27 |year=1973 |issue=1 |pages=17–21 |jstor=2682899 |doi=10.1080/00031305.1973.10478966}}

Data

For all four datasets:

{| class="wikitable"
! Property
! Value
! Accuracy
|-
| Mean of x
| 9
| exact
|-
| Sample variance of x: s{{supsub|2|x}}
| 11
| exact
|-
| Mean of y
| 7.50
| to 2 decimal places
|-
| Sample variance of y: s{{supsub|2|y}}
| 4.125
| ±0.003
|-
| Correlation between x and y
| 0.816
| to 3 decimal places
|-
| Linear regression line
| y = 3.00 + 0.500x
| to 2 and 3 decimal places, respectively
|-
| Coefficient of determination of the linear regression: <math>R^2</math>
| 0.67
| to 2 decimal places
|}

* The first scatter plot (top left) appears to show a simple linear relationship, corresponding to two correlated variables, where y could be modelled as Gaussian with mean linearly dependent on x.
* For the second graph (top right), while a relationship between the two variables is obvious, it is not linear, and the Pearson correlation coefficient is not relevant. A more general regression and the corresponding coefficient of determination would be more appropriate.
* In the third graph (bottom left), the modelled relationship is linear, but should have a different regression line (a robust regression would have been called for). The calculated regression is offset by the one outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816.
* Finally, the fourth graph (bottom right) shows an example in which one high-leverage point is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables.
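The effect of the single unusual point in the third and fourth datasets can be checked numerically. The following is a minimal sketch in Python, not taken from Anscombe's paper: the helper names <code>pearson_r</code> and <code>drop_point</code> are illustrative, and the data values are those of datasets III and IV in the table in this section. It recomputes the Pearson correlation after removing the outlier from dataset III and the high-leverage point from dataset IV.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation, or None when either variable has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant variable
    return sxy / sqrt(sxx * syy)

def drop_point(x, y, i):
    """Return copies of x and y without the i-th observation."""
    return x[:i] + x[i + 1:], y[:i] + y[i + 1:]

# Dataset III: a near-exact line plus one outlier at (13, 12.74).
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

# Dataset IV: x is constant except for one high-leverage point at x = 19.
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

r3_full = pearson_r(x3, y3)
r3_trimmed = pearson_r(*drop_point(x3, y3, y3.index(12.74)))
r4_full = pearson_r(x4, y4)
r4_trimmed = pearson_r(*drop_point(x4, y4, x4.index(19)))

print("III, all points:     ", r3_full)
print("III, outlier removed:", r3_trimmed)
print("IV, all points:      ", r4_full)
print("IV, x=19 removed:    ", r4_trimmed)
```

With the outlier removed, the correlation for dataset III is very close to 1. For dataset IV, removing the point at x = 19 leaves x constant, so the correlation is undefined and the sketch reports <code>None</code>.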

The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic datasets.{{cite journal |url=http://physics.info/linear-regression/practice.shtml#4 |title=Linear Regression |journal=The Physics Hypertextbook |last=Elert |first=Glenn |year=2021 |access-date=2017-02-23 |archive-date=2020-10-01 |archive-url=https://web.archive.org/web/20201001193224/http://physics.info/linear-regression/practice.shtml#4 |url-status=live }}{{cite book |last=Janert |first=Philipp K. |title=Data Analysis with Open Source Tools |year=2010 |publisher=O'Reilly Media |pages=[https://archive.org/details/isbn_9780596802356/page/65 65–66] |isbn=978-0-596-80235-6 |url=https://archive.org/details/isbn_9780596802356/page/65 }}{{cite book |last1=Chatterjee |first1=Samprit |last2=Hadi |first2=Ali S. |year=2006 |title=Regression Analysis by Example |publisher=John Wiley and Sons |page=91 |isbn=0-471-74696-7}}{{cite book |last1=Saville |first1=David J. |last2=Wood |first2=Graham R. |year=1991 |title=Statistical Methods: The geometric approach |publisher=Springer |page=418 |isbn=0-387-97517-9}}{{cite book |last=Tufte |first=Edward R. |authorlink=Edward Tufte |year=2001 |title=The Visual Display of Quantitative Information |edition=2nd |location=Cheshire, CT |publisher=Graphics Press |isbn=0-9613921-4-2 |url=https://archive.org/details/visualdisplayofq00tuft }}

The datasets are as follows. The x values are the same for the first three datasets.

{| class="wikitable" style="text-align: center"
|+ Anscombe's quartet
! colspan="2" | Dataset I !! colspan="2" | Dataset II !! colspan="2" | Dataset III !! colspan="2" | Dataset IV
|-
! x !! y !! x !! y !! x !! y !! x !! y
|-
| 10.0 || 8.04 || 10.0 || 9.14 || 10.0 || 7.46 || 8.0 || 6.58
|-
| 8.0 || 6.95 || 8.0 || 8.14 || 8.0 || 6.77 || 8.0 || 5.76
|-
| 13.0 || 7.58 || 13.0 || 8.74 || 13.0 || 12.74 || 8.0 || 7.71
|-
| 9.0 || 8.81 || 9.0 || 8.77 || 9.0 || 7.11 || 8.0 || 8.84
|-
| 11.0 || 8.33 || 11.0 || 9.26 || 11.0 || 7.81 || 8.0 || 8.47
|-
| 14.0 || 9.96 || 14.0 || 8.10 || 14.0 || 8.84 || 8.0 || 7.04
|-
| 6.0 || 7.24 || 6.0 || 6.13 || 6.0 || 6.08 || 8.0 || 5.25
|-
| 4.0 || 4.26 || 4.0 || 3.10 || 4.0 || 5.39 || 19.0 || 12.50
|-
| 12.0 || 10.84 || 12.0 || 9.13 || 12.0 || 8.15 || 8.0 || 5.56
|-
| 7.0 || 4.82 || 7.0 || 7.26 || 7.0 || 6.42 || 8.0 || 7.91
|-
| 5.0 || 5.68 || 5.0 || 4.74 || 5.0 || 5.73 || 8.0 || 6.89
|}
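The shared summary statistics can be reproduced directly from the tabulated values. The sketch below uses only Python's standard library. The names <code>QUARTET</code> and <code>summarize</code> are illustrative choices, not part of any published implementation, and the regression is ordinary least squares computed from the sample covariance.

```python
from statistics import mean, variance

# Data as tabulated above; datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return the summary statistics listed in the property table."""
    n = len(x)
    mx, my = mean(x), mean(y)
    var_x, var_y = variance(x), variance(y)  # sample variances (n - 1 divisor)
    cov_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    slope = cov_xy / var_x                   # least-squares slope
    intercept = my - slope * mx              # least-squares intercept
    r = cov_xy / (var_x * var_y) ** 0.5      # Pearson correlation
    return {
        "mean_x": mx, "var_x": var_x,
        "mean_y": my, "var_y": var_y,
        "r": r, "r2": r * r,
        "slope": slope, "intercept": intercept,
    }

SUMMARY = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, s in SUMMARY.items():
    print(f"{name:>3}: mean_y={s['mean_y']:.2f} var_y={s['var_y']:.3f} "
          f"r={s['r']:.3f} y = {s['intercept']:.2f} + {s['slope']:.3f}x "
          f"R^2={s['r2']:.2f}")
```

The printed values agree across the four datasets to roughly the accuracy stated in the property table, even though the scatter plots differ markedly.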

It is not known how Anscombe created his datasets.{{cite journal |last1=Chatterjee |first1=Sangit |last2=Firat |first2=Aykut |year=2007 |title=Generating Data with Identical Statistics but Dissimilar Graphics: A follow up to the Anscombe dataset |journal=The American Statistician |volume=61 |issue=3 |pages=248–254 |doi=10.1198/000313007X220057| jstor=27643902|s2cid=121163371 }} Since its publication, several methods to generate similar datasets with identical statistics and dissimilar graphics have been developed.{{cite book |last1=Matejka |first1=Justin |last2=Fitzmaurice |first2=George |title=Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems |chapter=Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing |year=2017 |pages=1290–1294 |doi=10.1145/3025453.3025912|isbn=9781450346559 |s2cid=9247543 }}

One of these, the Datasaurus dozen, consists of points tracing out the outline of a dinosaur, plus twelve other datasets that have the same summary statistics.{{Cite web |last1=Matejka |first1=Justin |last2=Fitzmaurice |first2=George |date=2017 |title=Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing |url=https://www.autodesk.com/research/publications/same-stats-different-graphs |url-status=live |access-date=2021-04-20 |website=Autodesk Research |language=en-US |archive-url=https://web.archive.org/web/20201004003855/https://www.autodesk.com/research/publications/same-stats-different-graphs |archive-date=2020-10-04 }}{{Cite journal |last1=Murray |first1=Lori L. |last2=Wilson |first2=John G. |date=April 2021 |title=Generating data sets for teaching the importance of regression analysis |url=https://onlinelibrary.wiley.com/doi/10.1111/dsji.12233 |journal=Decision Sciences Journal of Innovative Education |language=en |volume=19 |issue=2 |pages=157–166 |doi=10.1111/dsji.12233 |s2cid=233609149 |issn=1540-4595 |access-date=2021-04-20 |archive-date=2021-04-23 |archive-url=https://web.archive.org/web/20210423155254/https://onlinelibrary.wiley.com/doi/10.1111/dsji.12233 |url-status=live }}{{Citation |last1=Andrienko |first1=Natalia |author1-link=Natalia Andrienko |title=Visual Analytics for Investigating and Processing Data |date=2020 |url=http://link.springer.com/10.1007/978-3-030-56146-8_5 |work=Visual Analytics for Data Scientists |pages=151–180 |place=Cham |publisher=Springer International Publishing |language=en |doi=10.1007/978-3-030-56146-8_5 |isbn=978-3-030-56145-1 |access-date=2021-04-20 |last2=Andrienko |first2=Gennady |last3=Fuchs |first3=Georg |last4=Slingsby |first4=Aidan |last5=Turkay |first5=Cagatay |last6=Wrobel |first6=Stefan |s2cid=226648414 |postscript=. 
|archive-date=2024-10-03 |archive-url=https://web.archive.org/web/20241003162552/https://link.springer.com/chapter/10.1007/978-3-030-56146-8_5 |url-status=live }}

References

External links

* [http://www.upscale.utoronto.ca/GeneralInterest/Harrison/Visualisation/Visualisation.html Department of Physics, University of Toronto]
* [https://www.geogebra.org/m/tbwXxySn Dynamic Applet] made in GeoGebra showing the data & statistics and also allowing the points to be dragged (Set 5).
* [https://www.autodeskresearch.com/publications/samestats Animated examples from Autodesk] called the "Datasaurus Dozen".
* [https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/anscombe.html Documentation] for the datasets in R.

