Pearson Correlation Coefficient: Formula and Interpretation

The Pearson correlation coefficient (r) tells you how tightly two continuous variables track each other in a straight line. It runs from −1 to +1. Numbers at the edges mean a strong relationship; numbers near zero mean almost nothing linear is happening. Psychologists, doctors, economists — anyone who works with numerical data ends up needing it.

What is the Pearson correlation coefficient?

The Pearson product-moment correlation coefficient — usually just r — captures the strength and direction of a linear relationship between two variables. Karl Pearson worked it out in the 1890s, building on Francis Galton's earlier results. It's still the default correlation statistic in the social, behavioural, and life sciences.

Correlation isn't causation. A high r just means the two variables move together in a predictable straight-line pattern. It doesn't tell you which one is driving the other. It doesn't rule out a third variable pulling both strings.

The formula

Population value: ρ (rho). Sample estimate: r. Here's the sample formula.

Pearson r formula

r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]

xᵢ and yᵢ are individual data points; x̄ and ȳ are the sample means; the sums run across all n observations. The top of that fraction is the sum of cross-products of deviations, which is the sample covariance times n − 1: it measures how much x and y swing together. The bottom is the product of the two standard deviations times the same n − 1, so the scaling factors cancel, the units drop out, and the result is locked between −1 and +1.
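Here is a minimal Python sketch of the deviation-score formula, written out term by term. The function name `pearson_r` and the five data pairs are made up for illustration; they aren't from any dataset.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson r computed from deviation scores (illustrative helper)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of cross-products of deviations from the means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator pieces: sums of squared deviations for each variable
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied and a quiz score
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
r = pearson_r(hours, scores)
print(round(r, 4))  # 0.7746
```

For this data the cross-product sum is 6 and the two sums of squares are 10 and 6, so r = 6 / √60.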

There's an equivalent raw-score version that can be quicker by hand.

Raw-score formula

r = [nΣxᵢyᵢ − (Σxᵢ)(Σyᵢ)] / √{[nΣxᵢ² − (Σxᵢ)²][nΣyᵢ² − (Σyᵢ)²]}

The two formulas spit out the same number. Nobody actually does this by hand any more. R, SPSS, SciPy, pandas — pick one and let it crunch.
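If you do want to see the raw-score version in action, this sketch follows the formula above. The name `pearson_r_raw` and the data are hypothetical.

```python
import math

def pearson_r_raw(xs, ys):
    """Sample Pearson r from raw sums (illustrative helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Hypothetical data: hours studied and a quiz score
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
print(round(pearson_r_raw(hours, scores), 4))  # 0.7746
```

Here the numerator is 5·66 − 15·20 = 30 and the denominator is √(50·30). One caveat: with very large values the raw-score form subtracts two big, nearly equal numbers and can lose floating-point precision, which is another reason to lean on a library routine.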

Interpreting the range: −1 to +1

Because the denominator is the largest absolute value the numerator can ever reach (a consequence of the Cauchy-Schwarz inequality), r can't escape its bounds.

r value           Interpretation
+1.00             Perfect positive linear relationship
+0.70 to +0.99    Strong positive relationship
+0.40 to +0.69    Moderate positive relationship
+0.10 to +0.39    Weak positive relationship
0.00              No linear relationship
−0.10 to −0.39    Weak negative relationship
−0.40 to −0.69    Moderate negative relationship
−0.70 to −0.99    Strong negative relationship
−1.00             Perfect negative linear relationship

These cutoffs are conventions, not rules. Personality psychologists get excited about r = 0.30. Physicists would call that nothing. Anything below 0.95 might look weak in an engineering lab. Read r against your field's norms and what your study is actually trying to show.
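If you want the table's labels in code, a small lookup like this would do. The function name `describe_r` is made up, and the "negligible" label for values between 0 and ±0.10 is an assumption, since the table doesn't cover that band. Treat the cutoffs as the conventions they are.

```python
def describe_r(r):
    """Map r to the conventional labels in the table (illustrative only)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    if size == 0:
        return "no linear relationship"
    direction = "positive" if r > 0 else "negative"
    if size == 1.0:
        return f"perfect {direction} linear relationship"
    if size >= 0.70:
        return f"strong {direction} relationship"
    if size >= 0.40:
        return f"moderate {direction} relationship"
    if size >= 0.10:
        return f"weak {direction} relationship"
    # Assumption: the table has no band for 0 < |r| < 0.10
    return f"negligible {direction} relationship"

print(describe_r(0.30))   # weak positive relationship
print(describe_r(-0.85))  # strong negative relationship
```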

Positive, negative, and zero correlation

Positive correlation

Positive r: both variables rise together. One goes up, the other goes up. Height and weight. Hours spent studying and exam scores. Income and spending.

Negative correlation

Negative r: one rises, the other falls. Sleep deprivation hurts cognitive performance. Prices rise, demand drops. More stress, weaker immune function.

Zero or near-zero correlation

An r near zero just means there's no linear link between the variables. That isn't the same as no link at all. A perfect curve can hide behind r = 0 because Pearson can't see anything but straight lines. Look at a scatter plot before you call two variables unrelated.

Important: r = 0 can occur even when there is a perfect U-shaped (quadratic) relationship, for instance when the x values sit symmetrically around the bottom of the U. Pearson's r only captures linear association. Always plot your data.

Scatter plots and visual interpretation

Scatter plots are simple: each observation becomes one dot, with one variable on the x-axis and the other on the y. The cloud of dots shows you direction and strength before you compute a single number.

Patterns worth knowing:

  • Tight upward-sloping ellipse: strong positive correlation (r close to +1)
  • Tight downward-sloping ellipse: strong negative correlation (r close to −1)
  • Circular or diffuse cloud: weak or zero correlation
  • U-shape or curved pattern: non-linear relationship that Pearson's r will underestimate
  • Single outlier far from the cloud: may inflate or deflate r substantially

Plot the data first. Always. One outlier or a hidden curve can warp r into something meaningless, and the picture will show you what the number can't.

r² — the coefficient of determination

Square r and you get r², the coefficient of determination. It's the share of variance in one variable that the other variable statistically accounts for.

Example

If r = 0.70, then r² = 0.49 — meaning that 49% of the variance in Y is accounted for by X (or equivalently, by their linear relationship).

r² is the more honest number. An r of 0.50 sounds decent. Then you square it and remember only 25% of the variance is explained. The remaining 75% is doing something else entirely, somewhere outside your model.

In simple linear regression, r² equals the R² your software prints. Stack more predictors on and R² becomes the multi-variable version of the same idea.
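The match between squared r and the regression R² is easy to check numerically. The sketch below fits a least-squares line by hand and computes R² as 1 − SS_residual / SS_total. The data and variable names are hypothetical.

```python
import math

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Squared Pearson correlation
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

# Least-squares line y = intercept + slope * x
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Regression R² = 1 - (residual sum of squares / total sum of squares)
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
big_r_squared = 1 - ss_res / syy

print(round(r_squared, 4), round(big_r_squared, 4))  # 0.6 0.6
```

Here the fitted line is y = 2.2 + 0.6x, the residual sum of squares is 2.4 out of a total of 6, and both routes give 0.6.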

Assumptions of Pearson's r

Pearson's r works when the data behaves. Specifically:

  • Both variables are continuous — interval or ratio level of measurement.
  • Linear relationship — the relationship between the variables is approximately linear (check with a scatter plot).
  • Bivariate normality — for significance testing, both variables should be approximately normally distributed (or the sample should be large enough to invoke the central limit theorem).
  • No significant outliers — extreme outliers can distort r substantially.
  • Independence of observations — each pair of data points comes from a different participant or unit.
  • Homoscedasticity — the variance of Y should be roughly constant across all values of X.

Break those assumptions and r starts lying. Ordinal scales, heavily skewed data, an outlier or two with real pull — Spearman's rank correlation handles that mess better.

Pearson vs. Spearman correlation

Feature                Pearson's r                       Spearman's ρ
Data type              Continuous (interval/ratio)       Ordinal, or continuous with violations
Relationship type      Linear only                       Any monotonic relationship
Normality required?    Yes (for significance testing)    No
Outlier sensitivity    High                              Low (works on ranks)
Computation            Based on raw values               Based on ranked values

Spearman's ρ ranks both variables first and runs the Pearson formula on the ranks. That trick makes it shrug off outliers and non-normal distributions, and it handles ordinal data like Likert-scale ratings without complaint. When in doubt, plot first. Clean linear cloud, no wild outliers: go with Pearson. Curved but monotonic, or a couple of points throwing things off: use Spearman.

Statistical significance of r

Want to know if an r is really different from zero? Run a t-test with n − 2 degrees of freedom.

t-statistic for r

t = r√(n − 2) / √(1 − r²)
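As a worked sketch, here is the t-statistic for two cases: r = .62 with n = 50, and a tiny r = 0.05 with a hypothetical sample of 10,000. The function name `t_for_r` is made up for illustration.

```python
import math

def t_for_r(r, n):
    """Return (t, df) for testing whether r differs from zero."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if abs(r) >= 1:
        raise ValueError("t is undefined for |r| = 1")
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t, df

t_mid, df_mid = t_for_r(0.62, 50)
print(round(t_mid, 2), df_mid)   # 5.47 48

t_big, df_big = t_for_r(0.05, 10_000)
print(round(t_big, 2), df_big)   # 5.01 9998
```

Both t values are large, even though the second correlation is tiny: the sample size is doing the work. Turning t into a p-value needs the t distribution, which the Python standard library doesn't provide, so that step is left to a statistics package.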

A significant r (p < .05) means a correlation this large would be unlikely to turn up in your sample if the true correlation were zero. That isn't the same as mattering. Crank the sample size high enough and even r = 0.05 turns statistically significant while telling you basically nothing. Report the coefficient, the p-value, and a confidence interval together so readers can decide for themselves whether it matters.
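The article doesn't say how to get the confidence interval. One common approach, assumed here, is Fisher's z transformation: convert r with atanh, build a normal-theory interval with standard error 1/√(n − 3), and convert back with tanh. The function name `fisher_ci` and the example values are illustrative.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.95):
    """Approximate CI for a correlation via Fisher's z (assumed method)."""
    if n < 4:
        raise ValueError("need at least 4 observations")
    if abs(r) >= 1:
        raise ValueError("interval is undefined for |r| = 1")
    z = math.atanh(r)              # Fisher transformation of r
    se = 1 / math.sqrt(n - 3)      # approximate standard error of z
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Build the interval on the z scale, then transform back to r
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

low, high = fisher_ci(0.62, 50)
print(round(low, 2), round(high, 2))  # roughly 0.41 and 0.77
```

The interval is not symmetric around r, because the back-transformation compresses values near ±1. It relies on the same approximate bivariate normality as the significance test.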

Reporting in APA style: Report r(df) = value, p = value — for example, r(48) = .62, p < .001. The df equals n − 2.

Quick summary

Concept                    Key point
Range                      −1 to +1
Direction                  Positive = same direction; negative = opposite direction
Magnitude                  Closer to ±1 = stronger; closer to 0 = weaker
r²                         Proportion of shared variance; multiply r by itself
Assumes                    Linearity, continuous data, bivariate normality, no outliers
Use Spearman when          Data are ordinal, non-normal, or contain outliers
Correlation ≠ causation    A high r does not mean one variable causes the other
