Protocol
Author
Alexander Zwart
Overview
The linear (Pearson) correlation coefficient r and the coefficient of determination R2
Definition
The Pearson correlation coefficient, r, is a measure of the strength of a linear (i.e., straight line) relationship between two variables.
The coefficient of determination, R2, is the proportion of the variation in a response variable that is explained by a fitted statistical model. R2 is most often expressed as a percentage. A variant, the ‘adjusted R2’, is relevant for models with multiple predictor (explanatory) variables.
Terminology and equations
The linear (Pearson) correlation coefficient, r
In its most general sense, ‘correlation’ implies the presence of a systematic relationship (association) between two variables. In common usage, however, correlation usually means ‘linear correlation’: a measure of the strength of a straight-line relationship between two variables.
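A commonly used measure of linear correlation is the Pearson correlation coefficient, whose ‘true’ or ‘population’ value is usually denoted by the Greek letter ρ (rho), and whose observed (sample) value is usually denoted r. Given a set of data pairs (xi, yi), i = 1…n, two equivalent formulae for the observed Pearson correlation coefficient are:
$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
…where $\bar{x}$ and $\bar{y}$ are the sample means of the xi and yi respectively, and sx and sy are the respective sample standard deviations (computed with the n − 1 divisor). See Protocol: standard deviations and standard errors.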
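As a minimal sketch of the two equivalent ways of computing r (one from sums of squares and cross-products, one from standardised values), the Python below uses only the standard library. The function names and the small dataset are made up for illustration, and the sample standard deviations are assumed to use the n − 1 divisor.

```python
import math

def pearson_products(x, y):
    """r = sum of cross-products / sqrt(product of the two sums of squares)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_standardised(x, y):
    """r = sum of products of standardised values, divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample standard deviations (assumed n - 1 divisor).
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    total = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return total / (n - 1)

# Hypothetical data lying close to a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r_a = pearson_products(x, y)
r_b = pearson_standardised(x, y)
print(r_a, r_b)  # the two values agree up to floating-point rounding
```

Both functions should return the same value for any dataset in which neither variable is constant.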
The Pearson correlation coefficient is a number between -1 and 1. A value of 0 implies no linear correlation between the two variables. Values of 1 or -1 imply perfect positive or negative linear relationships between the two variables, respectively.
It is important to remember that:
- Although the Pearson correlation measures the strength of the linear relationship between two variables, the presence of a high (positive or negative) correlation does not necessarily imply actual straight-line behaviour of the data – it is always important to check the assumption of linearity visually;
- Similarly, the value of the Pearson correlation is sensitive to the presence of extreme outliers;
- The presence of a correlation between two variables does not imply a causal relationship between the two variables (‘correlation does not imply causation’).
The first two points are best checked by plotting the relationship between the two variables. The dangers of interpreting a correlation without plotting the data are nicely illustrated by Anscombe’s quartet, a set of four (artificial) datasets with almost identical correlations of approximately r = 0.816.
Of the four graphs, only the top-left one represents a situation where the Pearson correlation (and indeed the assumption of a linear relationship between the variables) is ‘sensible’.
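The near-identical correlations in Anscombe’s quartet can be checked numerically. The sketch below uses the commonly published values of the four datasets and a small helper function, pearson, written here for illustration. It computes the correlations only; the datasets still need to be plotted to see how differently they behave.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

correlations = {
    "I": pearson(x123, y1),
    "II": pearson(x123, y2),
    "III": pearson(x123, y3),
    "IV": pearson(x4, y4),
}

for name, r in correlations.items():
    print(f"Dataset {name}: r = {r:.3f}")
```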
Other measures of correlation exist. Two of note are Spearman’s rank correlation coefficient and Kendall’s rank correlation coefficient. These ‘nonparametric’ measures of association do not measure linear correlation, however; rather, they measure monotonicity (the tendency for the relationship to always increase (positive) or always decrease (negative)).
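The difference between linear correlation and monotonicity can be seen in a small sketch. Spearman’s coefficient is computed here as the Pearson correlation of the ranks, with a simple ranking helper that assumes there are no tied values. The data (y = exp(x)) are hypothetical: the relationship always increases but is far from a straight line.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical monotone but strongly curved relationship.
x = list(range(1, 11))
y = [math.exp(v) for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson r = {r_pearson:.3f}, Spearman = {r_spearman:.3f}")
```

Because y always increases with x, the rank correlation is 1, while the Pearson correlation is noticeably lower because the points do not follow a straight line.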
The coefficient of determination, R2
The coefficient of determination or R2, usually expressed as a percentage, arises in the context of statistical linear modeling, particularly in regression, and is usually quoted in the output from such analyses. The coefficient of determination represents the ‘percentage of the variation in the response (the ‘y’ variable) that is explained by the predictor(s) (the ‘x’ variable(s))’. R2 ranges from 0% (no linear relationship between response and predictor(s)) to 100% (a perfect linear relationship between response and predictor(s), with no random error present at all).
In a simple linear regression model, R2 is simply the square of the Pearson correlation coefficient, r, and hence allows a somewhat similar interpretation in terms of ‘the strength of a linear relationship between response y and predictor x’.
(Aside – in using lower case r for correlation and upper case R2 for the coefficient of determination, I am following what seems to be the more common conventions used in practice, despite the minor inconsistency in notation that this implies).
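The link between R2 and r in simple linear regression can be checked numerically. The sketch below fits a straight line by ordinary least squares and computes R2 as 1 − SSres/SStot, a standard definition that is assumed here since no formula is given in the text. The function names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def r_squared(x, y, intercept, slope):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical noisy data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.3, 2.9, 4.8, 4.1, 6.5, 6.2, 8.4, 7.7]

intercept, slope = fit_line(x, y)
r = pearson(x, y)
r2 = r_squared(x, y, intercept, slope)
print(f"r = {r:.4f}, r squared = {r * r:.4f}, R2 = {r2:.4f}")
```

The identity holds only when the model is a straight line with an intercept and a single predictor; with several predictors R2 is no longer the square of any single pairwise correlation between the response and a predictor.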
However, in practice, R2 tends to be interpreted specifically as the ability of the statistical model to ‘explain’ the variation in the response variable. Which values of R2 are large enough to be of interest depends upon the application of the model. In basic statistics courses, it is often stated that an R2 of 80% implies a very strong fit to the data, but in practical applications we are often happy to obtain R2 values much lower than this (especially when dealing with biological data)!
The caveats that apply to interpreting the correlation coefficient also apply to interpreting R2: a high value for R2 does not guarantee that the model is appropriate for the data if the data do not follow the systematic behaviour assumed by the statistical model. The relationship in the following graph has an R2 value (from a simple linear regression fit) of 90%, yet it would be foolish to assume that the data (points) are actually following the fitted linear relationship (line).
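A similar situation can be reproduced with a small sketch. The data below are hypothetical (y = x squared for x = 0, 1, …, 10, with no noise) and are not the data in the graph; they are chosen only because a straight-line fit gives a high R2 while the residuals show an obvious systematic pattern.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical curved relationship: y = x^2.
x = [float(i) for i in range(11)]
y = [xi ** 2 for xi in x]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

y_bar = sum(y) / len(y)
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r2 = 1.0 - ss_res / ss_tot

print(f"R2 = {100 * r2:.1f}%")
# Residuals are positive at both ends and negative in the middle:
# a systematic curve that the straight line cannot capture.
print([round(e, 1) for e in residuals])
```

A plot of the residuals against x (or simply of y against x with the fitted line) reveals the curvature immediately, even though R2 exceeds 90%.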
The adjusted coefficient of determination, R2Adj
In statistical modeling techniques such as multiple linear regression, the statistical model being fitted to the response y includes more than one predictor. Here, the usual R2 measure tends to be an overly optimistic indicator of the strength of fit to the data. This is because the fit of a statistical model to a dataset, as measured by R2, never gets worse and almost always improves as one adds more predictors to the model. Indeed, by adding as many predictors to a multiple regression model as there are data points to be fitted, one automatically achieves a perfect fit of the model to the data! Such ‘overfitted’ models are essentially useless, as they are not separating the systematic behaviour of the data from the random behaviour.
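The perfect fit obtained from an overfitted model can be illustrated with a rough sketch. Rather than running a multiple regression, the code below uses a shortcut that is an assumption of this example: with n data points at distinct x values, a regression on the predictors x, x², …, x^(n−1) plus an intercept has n parameters, and its least squares solution is the polynomial that passes through every point. That polynomial is evaluated directly in Lagrange form. The response values are pure random noise, so there is no systematic behaviour to find.

```python
import random

def interpolating_polynomial(xs, ys):
    """Return a function evaluating the degree n-1 polynomial through all
    n points (Lagrange form). Assumes the x values are distinct."""
    def predict(x):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            basis = 1.0
            for k, xk in enumerate(xs):
                if k != j:
                    basis *= (x - xk) / (xj - xk)
            total += yj * basis
        return total
    return predict

random.seed(1)
xs = [float(i) for i in range(1, 9)]
ys = [random.gauss(0.0, 1.0) for _ in xs]  # pure noise, no real relationship

predict = interpolating_polynomial(xs, ys)

y_bar = sum(ys) / len(ys)
ss_res = sum((yi - predict(xi)) ** 2 for xi, yi in zip(xs, ys))
ss_tot = sum((yi - y_bar) ** 2 for yi in ys)
r2 = 1.0 - ss_res / ss_tot

print(f"R2 = {100 * r2:.1f}%")  # a 'perfect' fit to random noise
```

The model reproduces every data point exactly, yet it says nothing useful about how y behaves between or beyond the observed x values.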
To reduce the possibility of such ‘overfitting’, the formula for the adjusted R2 includes a ‘penalty’ term which increases with the number of predictors in the statistical model. The resulting R2Adj provides a more reasonable estimate of how well the model actually explains the data.
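The text does not state the formula, but one widely used form is R2Adj = 1 − (1 − R2)(n − 1)/(n − p − 1), where n is the number of data points and p is the number of predictors (not counting the intercept). The sketch below assumes that form; the function name and the example values are hypothetical, with R2 given as a proportion rather than a percentage.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R2 for a model with an intercept and p predictors,
    fitted to n data points (assumes n > p + 1)."""
    if n <= p + 1:
        raise ValueError("need more data points than fitted parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Same raw R2 of 90% from 11 data points, with different numbers of predictors.
for p in (1, 3, 5):
    adj = adjusted_r_squared(0.90, n=11, p=p)
    print(f"p = {p}: R2 = 90.0%, adjusted R2 = {100 * adj:.1f}%")
```

For a fixed R2 and sample size, the adjusted value falls as predictors are added, so an extra predictor raises R2Adj only if it improves the fit by more than the penalty.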
In the case of a simple linear regression (hence with one predictor only) the difference between the R2 and R2Adj is usually small – one would generally quote the R2. When fitting models with more than one predictor, one should refer to (and quote) the adjusted R2.