Pearson Correlation Coefficient: Formula and Interpretation
The Pearson correlation coefficient (r) tells you how tightly two continuous variables track each other in a straight line. It runs from −1 to +1. Numbers at the edges mean a strong relationship; numbers near zero mean almost nothing linear is happening. Psychologists, doctors, economists — anyone who works with numerical data ends up needing it.
What is the Pearson correlation coefficient?
The Pearson product-moment correlation coefficient — usually just r — captures the strength and direction of a linear relationship between two variables. Karl Pearson worked it out in the 1890s, building on Francis Galton's earlier results. It's still the default correlation statistic in the social, behavioural, and life sciences.
Correlation isn't causation. A high r just means the two variables move together in a predictable straight-line pattern. It doesn't tell you which one is driving the other. It doesn't rule out a third variable pulling both strings.
The formula
Population value: ρ (rho). Sample estimate: r. Here's the sample formula.
r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]
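xᵢ and yᵢ are individual data points; x̄ and ȳ are the sample means; the sums run across all n observations. The top of that fraction is the sum of cross-products of deviations, which is the sample covariance multiplied by n − 1: it measures how much x and y swing together. The bottom is the product of the two standard deviations multiplied by the same n − 1, so the factors cancel. What's left is the covariance divided by the product of the standard deviations, which is why the result has no units and is locked between −1 and +1.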
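Here is a minimal sketch of the deviation-score formula in plain Python. The function name `pearson_r` and the `hours` / `scores` numbers are made up for illustration; they are not from any real study.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation from the deviation-score formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of cross-products of deviations from the means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator pieces: sums of squared deviations for each variable
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied and exam scores for five students
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]
r = pearson_r(hours, scores)
print(round(r, 3))  # 0.982
```

The zero-variance check matters in practice: if every x (or every y) is identical, the denominator is zero and r has no defined value.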
There's an equivalent raw-score version that can be quicker by hand.
r = [nΣxᵢyᵢ − (Σxᵢ)(Σyᵢ)] / √{[nΣxᵢ² − (Σxᵢ)²][nΣyᵢ² − (Σyᵢ)²]}
The two formulas spit out the same number. Nobody actually does this by hand any more. R, SPSS, SciPy, pandas — pick one and let it crunch.
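If you want to convince yourself the two versions agree, a quick check like the following works. It is only a sketch: the function names and the five data points are invented for illustration.

```python
import math

def pearson_r_deviation(xs, ys):
    """Deviation-score formula: deviations from the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pearson_r_raw(xs, ys):
    """Raw-score formula: only running sums, no means needed."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Hypothetical data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 61, 70, 72]
r_dev = pearson_r_deviation(xs, ys)
r_raw = pearson_r_raw(xs, ys)
print(math.isclose(r_dev, r_raw))  # True
```

One caveat worth knowing: with very large values and floating-point arithmetic, the raw-score version subtracts two big, nearly equal numbers and can lose precision, so the deviation-score version is usually the safer one to code.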
Interpreting the range: −1 to +1
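The denominator is the largest the numerator can ever be in absolute value (this is the Cauchy–Schwarz inequality), so r can't escape its bounds. It reaches exactly +1 or −1 only when every point sits on a single straight line.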
| r value | Interpretation |
|---|---|
| +1.00 | Perfect positive linear relationship |
| +0.70 to +0.99 | Strong positive relationship |
| +0.40 to +0.69 | Moderate positive relationship |
| +0.10 to +0.39 | Weak positive relationship |
| 0.00 | No linear relationship |
| −0.10 to −0.39 | Weak negative relationship |
| −0.40 to −0.69 | Moderate negative relationship |
| −0.70 to −0.99 | Strong negative relationship |
| −1.00 | Perfect negative linear relationship |
These cutoffs are conventions, not rules. Personality psychologists get excited about r = 0.30. Physicists would call that nothing. Anything below 0.95 might look weak in an engineering lab. Read r against your field's norms and what your study is actually trying to show.
Positive, negative, and zero correlation
Positive correlation
Positive r: both variables rise together. One goes up, the other goes up. Height and weight. Hours spent studying and exam scores. Income and spending.
Negative correlation
Negative r: one rises, the other falls. Hours of sleep lost and cognitive test scores. Price and quantity demanded. Stress levels and immune function.
Zero or near-zero correlation
An r near zero just means there's no linear link between the variables. That isn't the same as no link at all. A perfect curve can hide behind r = 0 because Pearson can't see anything but straight lines. Look at a scatter plot before you call two variables unrelated.
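A small sketch makes the point concrete. Below, y is completely determined by x (y = x²), yet r comes out as zero because the relationship is symmetric and curved. The helper `pearson_r` and the data are illustrative, not taken from a real dataset.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation (deviation-score formula)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A perfect U-shape: y depends entirely on x, but not linearly
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)  # 0.0
```

The positive and negative halves of the curve cancel in the numerator, which is exactly why a scatter plot catches what the coefficient misses.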
Scatter plots and visual interpretation
Scatter plots are simple: each observation becomes one dot, with one variable on the x-axis and the other on the y. The cloud of dots shows you direction and strength before you compute a single number.
Patterns worth knowing:
- Tight upward-sloping ellipse: strong positive correlation (r close to +1)
- Tight downward-sloping ellipse: strong negative correlation (r close to −1)
- Circular or diffuse cloud: weak or zero correlation
- U-shape or curved pattern: non-linear relationship that Pearson's r will underestimate
- Single outlier far from the cloud: may inflate or deflate r substantially
Plot the data first. Always. One outlier or a hidden curve can warp r into something meaningless, and the picture will show you what the number can't.
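To see how much pull one point can have, here is a rough sketch with invented numbers: six loosely related points, then the same six plus one extreme observation.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation (deviation-score formula)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a diffuse cloud with only a weak upward drift
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 5, 2, 6, 3, 5]
r_clean = pearson_r(xs, ys)

# Same data plus a single point far from the cloud
r_outlier = pearson_r(xs + [20], ys + [20])

print(round(r_clean, 2))    # about 0.28
print(round(r_outlier, 2))  # about 0.96
```

Nothing about the original six points changed, yet the coefficient moved from "weak" to "strong" on the strength of one observation.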
r² — the coefficient of determination
Square r and you get r², the coefficient of determination. It's the share of variance in one variable that the other variable statistically accounts for.
If r = 0.70, then r² = 0.49: a straight-line relationship with X accounts for 49% of the variance in Y.
In practice, r² is often the more honest number. An r of 0.50 sounds decent. Then you square it and remember only 25% of the variance is explained. The remaining 75% is doing something else entirely, somewhere outside your model.
In simple linear regression, r² equals the R² your software prints. Stack more predictors on and R² becomes the multi-variable version of the same idea.
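The link to regression can be checked numerically. The sketch below fits an ordinary least-squares line by hand, computes R² as 1 − SS_res / SS_tot, and compares it with the squared correlation. The data and variable names are invented for illustration.

```python
import math

# Hypothetical data: hours studied (x) and exam scores (y)
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 61, 70, 72]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Squared Pearson correlation
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

# Least-squares line y = intercept + slope * x
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# R² from the regression: 1 minus (residual variation / total variation)
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = syy
regression_r2 = 1 - ss_res / ss_tot

print(round(r_squared, 4), round(regression_r2, 4))  # 0.9634 0.9634
```

The equality holds only for a single predictor with an intercept; with several predictors, R² is no longer the square of any one pairwise r.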
Assumptions of Pearson's r
Pearson's r works when the data behaves. Specifically:
- Both variables are continuous: interval or ratio level of measurement.
- Linear relationship: the relationship between the variables is approximately linear (check with a scatter plot).
- Bivariate normality: for significance testing, the two variables should be approximately jointly normal, which is a stronger condition than each one being normal on its own (or the sample should be large enough to invoke the central limit theorem).
- No significant outliers: extreme outliers can distort r substantially.
- Independence of observations: each pair of data points comes from a different participant or unit.
- Homoscedasticity: the variance of Y should be roughly constant across all values of X.
Break those assumptions and r starts lying. Ordinal scales, heavily skewed data, an outlier or two with real pull — Spearman's rank correlation handles that mess better.
Pearson vs. Spearman correlation
| Feature | Pearson's r | Spearman's ρ |
|---|---|---|
| Data type | Continuous (interval/ratio) | Ordinal, or continuous with violations |
| Relationship type | Linear only | Any monotonic relationship |
| Normality required? | Yes (for significance testing) | No |
| Outlier sensitivity | High | Low (works on ranks) |
| Computation | Based on raw values | Based on ranked values |
Spearman's ρ ranks both variables first and runs the Pearson formula on the ranks. That trick makes it shrug off outliers and non-normal distributions, and it handles ordinal data like Likert-scale ratings without complaint. When in doubt, plot first. Clean linear cloud, no wild outliers: go with Pearson. Curved but monotonic, or a couple of points throwing things off: use Spearman.
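The "rank first, then run Pearson" idea can be sketched in a few lines. The helper names `average_ranks` and `spearman_rho` are illustrative; tied values get the average of the ranks they span, which is the usual convention. The data are invented.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation (deviation-score formula)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical data: curved but perfectly monotonic (y = x cubed)
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]

print(round(pearson_r(xs, ys), 2))     # about 0.94
print(round(spearman_rho(xs, ys), 2))  # 1.0
```

Pearson penalises the curve; Spearman only cares that y never decreases as x increases, so it reports a perfect monotonic relationship.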
Statistical significance of r
Want to know if an r is really different from zero? Run a t-test with n − 2 degrees of freedom.
t = r√(n − 2) / √(1 − r²)
A significant r (p < .05) means a correlation this large would be unlikely to show up in your sample if the true population correlation were zero. That isn't the same as mattering. Crank the sample size high enough and even r = 0.05 turns statistically significant while telling you basically nothing. Report the coefficient, the p-value, and a confidence interval together so readers can decide for themselves whether it matters.
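The large-sample trap is easy to reproduce. The sketch below computes the t statistic from the formula above for r = 0.05 and n = 10,000, and adds an approximate 95% confidence interval using the Fisher z-transformation. The Fisher method is an assumption on my part: it is one common way to build the interval, not something this article prescribes, and the function names are illustrative.

```python
import math
from statistics import NormalDist

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for r via the Fisher z-transformation."""
    z = math.atanh(r)                # transform r to the z scale
    se = 1 / math.sqrt(n - 3)        # approximate standard error on the z scale
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Build the interval on the z scale, then transform back to the r scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

r, n = 0.05, 10_000
t = t_statistic(r, n)
low, high = fisher_ci(r, n)

print(round(t, 2))                    # 5.01
print(round(low, 3), round(high, 3))  # 0.03 0.07
```

A t of about 5 on 9,998 degrees of freedom is far past any conventional cutoff, yet the interval says the true correlation is probably somewhere between 0.03 and 0.07, which corresponds to well under 1% of shared variance.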
Quick summary
| Concept | Key point |
|---|---|
| Range | −1 to +1 |
| Direction | Positive = same direction; negative = opposite direction |
| Magnitude | Closer to ±1 = stronger; closer to 0 = weaker |
| r² | Proportion of shared variance; multiply r by itself |
| Assumes | Linearity, continuous data, bivariate normality, no extreme outliers, independent observations, homoscedasticity |
| Use Spearman when | Data are ordinal, non-normal, or contain outliers |
| Correlation ≠ causation | A high r does not mean one variable causes the other |