
Simple Linear Regression: Definition, Formula, and Examples

Simple linear regression models the relationship between one predictor variable and one outcome variable using a straight line. It is one of the most fundamental tools in statistics — underlying everything from clinical prediction models to economic forecasting. This guide covers the regression equation, how to interpret the slope and intercept, what R² tells you, how to analyze residuals, and what assumptions must hold for the results to be trustworthy.

What is simple linear regression?

Simple linear regression (SLR) is a statistical method for modeling the linear relationship between a single continuous predictor variable (X) and a single continuous outcome variable (Y). It produces the equation of the straight line that best fits the observed data, where "best" means the line with the smallest sum of squared prediction errors.

Unlike correlation, which only quantifies the strength and direction of a relationship, regression produces a predictive model. Once you have the regression equation, you can plug in a new value of X and generate a predicted value of Y.

The regression equation

The population regression model is:

Population model

Y = α + βX + ε

Where α (alpha) is the population intercept, β (beta) is the population slope, and ε (epsilon) is the error term representing all variation in Y not explained by X.

The fitted (estimated) regression line, based on sample data, is written:

Sample regression equation

Ŷ = a + bX

Here Ŷ (Y-hat) is the predicted value of Y, a is the sample estimate of the intercept, and b is the sample estimate of the slope. These estimates are calculated from the data using ordinary least squares.

Interpreting slope and intercept

The slope (b)

The slope represents the expected change in Y for each one-unit increase in X. A slope of 2.5 means that for every additional unit of X, Y is predicted to increase by 2.5 units on average. A negative slope means Y decreases as X increases.

Example

Study time (hours) predicts exam score: Ŷ = 50 + 5X
Slope = 5: each additional hour of study is associated with a 5-point increase in the predicted exam score.
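
The example line can be evaluated directly. The following is a minimal sketch; the function name predict_exam_score is hypothetical, and the coefficients are simply the ones from the example above.

```python
def predict_exam_score(hours_studied: float) -> float:
    """Predicted exam score from the example line Y-hat = 50 + 5X."""
    intercept = 50.0  # a: predicted score at zero hours of study
    slope = 5.0       # b: predicted points gained per additional hour
    return intercept + slope * hours_studied


score_3h = predict_exam_score(3)  # 50 + 5 * 3
score_4h = predict_exam_score(4)  # 50 + 5 * 4

# The difference between two predictions one hour apart equals the slope.
print(score_3h, score_4h, score_4h - score_3h)  # prints: 65.0 70.0 5.0
```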

The intercept (a)

The intercept is the predicted value of Y when X = 0. It anchors the line on the y-axis. The intercept is often not directly interpretable, because X = 0 may lie outside the range of the observed data or may not be a meaningful value at all. For example, if no student in the sample studied zero hours, "predicted exam score when study time is zero" is an extrapolation beyond the data. In such cases, the intercept is a mathematical necessity for positioning the line but should not be over-interpreted.

Units matter: The slope is expressed in units of Y per unit of X. If Y is weight in kilograms and X is height in centimeters, the slope is in kg/cm. Changing units (e.g., to pounds and inches) changes the numerical value of the slope but not the underlying relationship.
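
The unit conversion can be checked with a few lines of arithmetic. This sketch assumes a hypothetical slope of 0.9 kg/cm, chosen only for illustration, and shows that the converted slope gives the same predicted difference once both sides are expressed in the same units.

```python
KG_TO_LB = 2.20462  # pounds per kilogram
CM_PER_IN = 2.54    # centimeters per inch

slope_kg_per_cm = 0.9  # hypothetical slope, for illustration only

# (kg / cm) * (lb / kg) * (cm / in) = lb / in
slope_lb_per_in = slope_kg_per_cm * KG_TO_LB * CM_PER_IN

# The same 10 cm height difference, expressed in both unit systems.
height_diff_cm = 10.0
height_diff_in = height_diff_cm / CM_PER_IN

weight_diff_kg = slope_kg_per_cm * height_diff_cm
weight_diff_lb = slope_lb_per_in * height_diff_in

print(round(slope_lb_per_in, 2))          # slope in lb/in, about 5.04
print(round(weight_diff_kg * KG_TO_LB, 2))  # predicted difference via kg/cm
print(round(weight_diff_lb, 2))             # same difference via lb/in
```

The slope's numerical value changes from 0.9 to about 5.04, but both versions predict the same weight difference for the same height difference.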

Ordinary least squares estimation

Ordinary least squares (OLS) finds the line that minimizes the sum of squared residuals — the squared vertical distances between each observed data point and the fitted line. The OLS formulas for the slope and intercept are:

OLS slope and intercept

b = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / Σ(xᵢ − x̄)²
a = ȳ − b · x̄

The numerator of b is the same as the numerator of Pearson's r; the denominator is the sum of squared deviations in X. This is why the slope and the correlation coefficient are related: b = r · (SD_Y / SD_X).
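
The two formulas translate almost line for line into code. The sketch below uses only the Python standard library; the five data points are made up for illustration, and the helper names ols_fit and pearson_r are hypothetical.

```python
import math
import statistics


def ols_fit(xs, ys):
    """Return (a, b) for the least-squares line Y-hat = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx        # slope
    a = y_bar - b * x_bar  # intercept
    return a, b


def pearson_r(xs, ys):
    """Pearson correlation, sharing its numerator with the slope formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Made-up illustrative data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = ols_fit(xs, ys)

# Cross-check the identity b = r * (SD_Y / SD_X)
r = pearson_r(xs, ys)
b_from_r = r * statistics.stdev(ys) / statistics.stdev(xs)

print(round(a, 3), round(b, 3), round(b_from_r, 3))  # prints: 2.2 0.6 0.6
```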

R² — goodness of fit

R² (R-squared), the coefficient of determination, measures the proportion of the total variance in Y that is explained by the regression model. It ranges from 0 to 1 (or 0% to 100%).

Formula

R² = 1 − (SS_residual / SS_total)
where SS_residual = Σ(yᵢ − ŷᵢ)² and SS_total = Σ(yᵢ − ȳ)²

In simple linear regression, R² equals the square of the Pearson correlation coefficient between X and Y. An R² of 0.64 means the model explains 64% of the variance in Y; the remaining 36% is unexplained.
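
The formula and the link to Pearson's r can both be verified numerically. This is a sketch on made-up data; the function name r_squared_and_r is hypothetical.

```python
import math


def r_squared_and_r(xs, ys):
    """Return (R-squared, Pearson r) for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_total = sum((y - y_bar) ** 2 for y in ys)

    # OLS fit
    b = s_xy / s_xx
    a = y_bar - b * x_bar

    # R-squared = 1 - SS_residual / SS_total
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r2 = 1 - ss_residual / ss_total

    r = s_xy / math.sqrt(s_xx * ss_total)
    return r2, r


# Made-up illustrative data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r2, r = r_squared_and_r(xs, ys)
print(round(r2, 3), round(r ** 2, 3))  # prints: 0.6 0.6
```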

R² always increases (or stays the same) when you add more predictors, even irrelevant ones. This is why adjusted R² is preferred in multiple regression — it penalizes for unnecessary predictors.

Residuals and residual analysis

A residual is the difference between an observed value and the value predicted by the regression line:

Residual formula

eᵢ = yᵢ − ŷᵢ

Residual analysis is the primary tool for checking regression assumptions. Key plots include:

  • Residuals vs. fitted values: should show a random scatter around zero with no pattern. A curved pattern indicates non-linearity; a funnel shape indicates heteroscedasticity.
  • Normal Q-Q plot of residuals: points should fall approximately on a diagonal line, indicating normality of residuals.
  • Scale-location plot: checks whether variance is constant across fitted values (homoscedasticity).
  • Residuals vs. leverage: identifies influential observations (high leverage points and outliers that exert large influence on the slope).
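
The quantities behind these plots can be computed without a plotting library. The sketch below, on made-up data, returns the residuals and the leverage of each observation. It uses the standard simple-regression leverage formula hᵢ = 1/n + (xᵢ − x̄)² / Σ(xᵢ − x̄)², which is not stated in the text above. The function name is hypothetical.

```python
def residuals_and_leverage(xs, ys):
    """Return (residuals, leverages) for the OLS line fitted to xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar

    # e_i = y_i - y-hat_i
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    # Leverage grows with the squared distance of x_i from the mean of X.
    leverages = [1 / n + (x - x_bar) ** 2 / s_xx for x in xs]
    return residuals, leverages


# Made-up illustrative data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

residuals, leverages = residuals_and_leverage(xs, ys)
for x, e, h in zip(xs, residuals, leverages):
    print(x, round(e, 2), round(h, 2))
```

With an intercept in the model, OLS residuals always sum to zero, so a residual plot is centered on zero by construction. What matters is whether any pattern remains around that center.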

Assumptions

Valid inference from a simple linear regression requires the following assumptions (remembered with the acronym LINE):

  • Linearity: The relationship between X and Y is linear. Check with a scatter plot of X vs. Y and a residuals-vs-fitted plot.
  • Independence: Observations are independent of each other. Violated by time series data or clustered samples without appropriate correction.
  • Normality of residuals: The residuals are approximately normally distributed. Check with a Q-Q plot or a histogram of residuals. Less critical with large samples due to the central limit theorem.
  • Equal variance (homoscedasticity): The spread of residuals is constant across all levels of X. A funnel pattern in the residuals-vs-fitted plot signals heteroscedasticity, which leaves the slope estimate unbiased but makes the standard errors unreliable (they can be too small or too large), so p-values and confidence intervals cannot be trusted.

Heteroscedasticity fix: If residual variance increases with fitted values (common in income or count data), log-transforming Y often stabilizes the variance and restores homoscedasticity.
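
The effect of a log transform can be illustrated on deterministic, made-up data in which the error is multiplicative (each value is 25% above or 20% below an exponential trend). This is a sketch under those assumptions, not a general diagnostic: the helper names are hypothetical, and because the trend is exponential, the raw-scale fit is also curved, so part of the growing spread reflects non-linearity.

```python
import math


def fit_residuals(xs, ys):
    """Fit Y-hat = a + bX by OLS and return the residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def spread_ratio(residuals):
    """Mean |residual| in the second half of the list divided by the first half.

    Assumes the data are sorted by X, so the halves are low-X and high-X.
    """
    half = len(residuals) // 2
    lower = sum(abs(e) for e in residuals[:half]) / half
    upper = sum(abs(e) for e in residuals[half:]) / (len(residuals) - half)
    return upper / lower


# Made-up data: exponential trend with alternating multiplicative error
xs = list(range(1, 11))
factors = [1.25 if i % 2 == 0 else 0.8 for i in range(len(xs))]
ys = [10 * math.exp(0.3 * x) * f for x, f in zip(xs, factors)]

ratio_raw = spread_ratio(fit_residuals(xs, ys))
ratio_log = spread_ratio(fit_residuals(xs, [math.log(y) for y in ys]))

# A ratio well above 1 means residual spread grows with X;
# a ratio near 1 means the spread is roughly constant.
print(round(ratio_raw, 2), round(ratio_log, 2))
```

On the raw scale the residual spread is roughly twice as large at high X as at low X. After taking log(Y), the ratio is close to 1.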

Worked example

A researcher studies whether hours of weekly exercise (X) predict resting heart rate in beats per minute (Y) in a sample of 30 adults. After collecting data:

  • Mean exercise = 4.2 hrs/week; Mean heart rate = 72.5 bpm
  • OLS slope: b = −2.3 (each additional hour of exercise is associated with a 2.3 bpm decrease in resting heart rate)
  • OLS intercept: a = 82.2
  • Fitted equation: Ŷ = 82.2 − 2.3X
  • R² = 0.51 — exercise explains 51% of the variance in resting heart rate
  • F(1, 28) = 29.1, p < .001: the overall model is statistically significant

Interpretation

A person exercising 5 hours per week has a predicted resting heart rate of:
Ŷ = 82.2 − 2.3(5) = 82.2 − 11.5 = 70.7 bpm
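
The reported numbers are internally consistent, which can be confirmed from the summary statistics alone. The sketch below uses only values given in the example. The relation F = (R² / (1 − R²)) × (n − 2) is a standard identity for simple linear regression and is not stated in the text above.

```python
# Summary statistics reported in the worked example
mean_x = 4.2  # mean exercise, hours per week
mean_y = 72.5  # mean resting heart rate, bpm
b = -2.3       # OLS slope
n = 30         # sample size
r2 = 0.51      # R-squared

# Intercept from a = y-bar - b * x-bar
a = mean_y - b * mean_x
a_reported = round(a, 1)

# Prediction for 5 hours of exercise per week
y_hat_5 = a_reported + b * 5

# F-statistic implied by R-squared with 1 and n - 2 degrees of freedom
f_stat = (r2 / (1 - r2)) * (n - 2)

print(a_reported, round(y_hat_5, 1), round(f_stat, 1))  # prints: 82.2 70.7 29.1
```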

Interpreting regression output

Standard regression output (from R, SPSS, Python, or Stata) includes:

| Output element | What it tells you |
| --- | --- |
| Coefficient (b) | Estimated slope: change in Y per unit change in X |
| Standard error (SE) | Precision of the slope estimate; smaller = more precise |
| t-statistic | b divided by SE; tests whether the slope differs from zero |
| p-value | Probability of observing a slope at least this far from zero if the true slope is zero |
| 95% CI for b | Range of plausible values for the true population slope |
| R² | Proportion of variance in Y explained by X |
| F-statistic | Overall model significance; in SLR, F = t² |
| Residual standard error | Typical prediction error in units of Y |
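
Several of these quantities can be computed by hand, which makes the relationships between them concrete. The sketch below uses made-up data and standard simple-regression formulas that are not spelled out in the text above: residual standard error = √(SS_residual / (n − 2)) and SE(b) = residual standard error / √Σ(xᵢ − x̄)². The function name slr_summary is hypothetical. The p-value and confidence interval are omitted because they need the t distribution, which the Python standard library does not provide.

```python
import math


def slr_summary(xs, ys):
    """Return key regression output values for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_total = sum((y - y_bar) ** 2 for y in ys)

    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

    df_residual = n - 2  # two estimated parameters: a and b
    rse = math.sqrt(ss_residual / df_residual)  # residual standard error
    se_b = rse / math.sqrt(s_xx)                # standard error of the slope
    t_stat = b / se_b
    # Explained mean square (1 df) over residual mean square
    f_stat = (ss_total - ss_residual) / (ss_residual / df_residual)

    return {
        "slope": b,
        "se_slope": se_b,
        "t": t_stat,
        "r_squared": 1 - ss_residual / ss_total,
        "f": f_stat,
        "residual_se": rse,
    }


# Made-up illustrative data
summary = slr_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in summary.items():
    print(name, round(value, 4))
```

For this data the t-statistic squared and the F-statistic are both 4.5, matching the F = t² relationship in the table.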

Quick summary

| Concept | Key point |
| --- | --- |
| Equation | Ŷ = a + bX |
| Slope (b) | Change in Y for each one-unit increase in X |
| Intercept (a) | Predicted Y when X = 0 |
| OLS | Minimizes sum of squared residuals |
| R² | Proportion of variance in Y explained by the model (0–1) |
| Residuals | Observed minus predicted values; used to check assumptions |
| Assumptions | Linearity, independence, normality of residuals, homoscedasticity |
