Simple Linear Regression: Definition, Formula, and Examples
Simple linear regression models the relationship between one predictor variable and one outcome variable using a straight line. It is one of the most fundamental tools in statistics — underlying everything from clinical prediction models to economic forecasting. This guide covers the regression equation, how to interpret the slope and intercept, what R² tells you, how to analyze residuals, and what assumptions must hold for the results to be trustworthy.
What is simple linear regression?
Simple linear regression (SLR) is a statistical method for modeling the linear relationship between a single continuous predictor variable (X) and a single continuous outcome variable (Y). It produces an equation for a straight line that best fits the observed data, in the sense of minimizing the sum of squared prediction errors.
Unlike correlation, which only quantifies the strength and direction of a relationship, regression produces a predictive model. Once you have the regression equation, you can plug in a new value of X and generate a predicted value of Y.
The regression equation
The population regression model is:
Y = α + βX + ε
where α (alpha) is the population intercept, β (beta) is the population slope, and ε (epsilon) is the error term representing all variation in Y not explained by X.
The fitted (estimated) regression line, based on sample data, is written:
Ŷ = a + bX
Here Ŷ (Y-hat) is the predicted value of Y, a is the sample estimate of the intercept, and b is the sample estimate of the slope. These estimates are calculated from the data using ordinary least squares.
Interpreting slope and intercept
The slope (b)
The slope represents the expected change in Y for each one-unit increase in X. A slope of 2.5 means that for every additional unit of X, Y is predicted to increase by 2.5 units on average. A negative slope means Y decreases as X increases.
Example: study time in hours (X) predicts exam score (Y), with fitted line Ŷ = 50 + 5X
Slope = 5: each additional hour of study is associated with a 5-point increase in the predicted exam score.
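As a minimal sketch of how the fitted line turns a value of X into a prediction, the snippet below hard-codes the study-time equation from the example. The function name `predict_exam_score` is purely illustrative and does not come from any library.

```python
def predict_exam_score(hours, intercept=50.0, slope=5.0):
    """Predicted exam score from the example line Y-hat = 50 + 5X."""
    return intercept + slope * hours

# Predictions for a few study times
for hours in (0, 1, 2, 3):
    print(hours, predict_exam_score(hours))

# Predictions one unit of X apart differ by exactly the slope
print(predict_exam_score(3) - predict_exam_score(2))  # 5.0
```

The last line shows the slope interpretation directly: moving from 2 to 3 hours changes the prediction by 5 points, whatever the starting value.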
The intercept (a)
The intercept is the predicted value of Y when X = 0. It anchors the line on the y-axis. The intercept is often not directly interpretable: for example, "predicted exam score when study time is zero" may fall outside the range of X values actually observed, which makes it an extrapolation. In such cases, the intercept is a mathematical necessity for positioning the line and should not be over-interpreted.
Ordinary least squares estimation
Ordinary least squares (OLS) finds the line that minimizes the sum of squared residuals — the squared vertical distances between each observed data point and the fitted line. The OLS formulas for the slope and intercept are:
b = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / Σ(xᵢ − x̄)²
a = ȳ − b · x̄
The numerator of b is the same as the numerator of Pearson's r; the denominator is the sum of squared deviations in X. This is why the slope and the correlation coefficient are related: b = r · (SD_Y / SD_X).
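The two OLS formulas translate almost line for line into code. The sketch below uses only the Python standard library and a small made-up data set (`hours` and `scores` are invented for illustration, not taken from a real study). It also checks the relationship b = r · (SD_Y / SD_X) numerically.

```python
from math import sqrt
from statistics import stdev

def ols_fit(x, y):
    """Return (a, b): the OLS intercept and slope for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator of b: sum of cross-products of deviations from the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator of b: sum of squared deviations in X
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Made-up illustrative data: hours studied and exam scores
hours = [1, 2, 3, 4, 5]
scores = [52, 61, 64, 72, 76]

a, b = ols_fit(hours, scores)
r = pearson_r(hours, scores)
b_from_r = r * stdev(scores) / stdev(hours)

print(f"a = {a:.2f}, b = {b:.2f}")            # a = 47.30, b = 5.90
print(f"r * SD_Y / SD_X = {b_from_r:.2f}")    # r * SD_Y / SD_X = 5.90
```

Both routes give the same slope, which is the identity stated above. In practice a statistics package would do this fitting; the hand-rolled version is only meant to make the formulas concrete.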
R² — goodness of fit
R² (R-squared), the coefficient of determination, measures what proportion of the total variance in Y is explained by the regression model. It ranges from 0 to 1 (or 0% to 100%).
R² = 1 − (SS_residual / SS_total)
where SS_residual = Σ(yᵢ − ŷᵢ)² and SS_total = Σ(yᵢ − ȳ)²
In simple linear regression, R² equals the square of the Pearson correlation coefficient between X and Y. An R² of 0.64 means the model explains 64% of the variance in Y; the remaining 36% is unexplained.
R² always increases (or stays the same) when you add more predictors, even irrelevant ones. This is why adjusted R² is preferred in multiple regression — it penalizes for unnecessary predictors.
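To make the R² formula concrete, here is a small sketch that computes SS_residual and SS_total for a fitted line and compares the result with the squared Pearson correlation. The data are made up for illustration, and the helper names are not from any library.

```python
from math import sqrt

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [52, 61, 64, 72, 76]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

# Fit the line by OLS
b = sxy / sxx
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

# R-squared from the sums of squares
ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_total = syy
r_squared = 1 - ss_residual / ss_total

# Squared Pearson correlation, computed independently of the fit
r = sxy / sqrt(sxx * syy)

print(f"R^2 = {r_squared:.4f}")   # R^2 = 0.9778
print(f"r^2 = {r ** 2:.4f}")      # r^2 = 0.9778
```

The two numbers agree, as expected for a model with one predictor and an intercept.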
Residuals and residual analysis
A residual is the difference between an observed value and the value predicted by the regression line:
eᵢ = yᵢ − ŷᵢ
Residual analysis is the primary tool for checking regression assumptions. Key plots include:
- Residuals vs. fitted values: should show a random scatter around zero with no pattern. A curved pattern indicates non-linearity; a funnel shape indicates heteroscedasticity.
- Normal Q-Q plot of residuals: points should fall approximately on a diagonal line, indicating normality of residuals.
- Scale-location plot: checks whether variance is constant across fitted values (homoscedasticity).
- Residuals vs. leverage: identifies influential observations (high leverage points and outliers that exert large influence on the slope).
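The idea behind the residuals-vs-fitted plot can be shown without any plotting library. In this sketch a straight line is deliberately fitted to data that follow a curve (y = x², chosen purely for illustration), and the residuals are printed next to the fitted values. A real analysis would normally draw the plot instead.

```python
# Deliberately non-linear illustrative data: y = x squared
x = [1, 2, 3, 4, 5]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)

# OLS straight-line fit
b = sxy / sxx
a = y_bar - b * x_bar

fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Text version of a residuals-vs-fitted plot
for fi, ei in zip(fitted, residuals):
    print(f"fitted = {fi:5.1f}   residual = {ei:+.1f}")

# With an intercept in the model, OLS residuals sum to zero
print("sum of residuals:", sum(residuals))
```

The residuals come out as +2, −1, −2, −1, +2: positive at both ends and negative in the middle. That U shape is the curved pattern that signals non-linearity, even though the residuals still sum to zero.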
Assumptions
Valid inference from a simple linear regression requires the following assumptions (remembered with the acronym LINE):
- Linearity: The relationship between X and Y is linear. Check with a scatter plot of X vs. Y and a residuals-vs-fitted plot.
- Independence: Observations are independent of each other. Violated by time series data or clustered samples without appropriate correction.
- Normality of residuals: The residuals are approximately normally distributed. Check with a Q-Q plot or a histogram of residuals. Less critical with large samples due to the central limit theorem.
- Equal variance (homoscedasticity): The spread of residuals is constant across all levels of X. A funnel pattern in the residuals-vs-fitted plot signals heteroscedasticity, which biases the standard errors (in either direction) and so makes tests and confidence intervals unreliable.
Worked example
A researcher studies whether hours of weekly exercise (X) predict resting heart rate in beats per minute (Y) in a sample of 30 adults. After collecting data:
- Mean exercise = 4.2 hrs/week; Mean heart rate = 72.5 bpm
- OLS slope: b = −2.3 (each additional hour of exercise is associated with a 2.3 bpm decrease in resting heart rate)
- OLS intercept: a = 82.2
- Fitted equation: Ŷ = 82.2 − 2.3X
- R² = 0.51 — exercise explains 51% of the variance in resting heart rate
- F(1, 28) = 29.1, p < .001 — the overall model is statistically significant
A person exercising 5 hours per week has a predicted resting heart rate of:
Ŷ = 82.2 − 2.3(5) = 82.2 − 11.5 = 70.7 bpm
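The reported numbers can be cross-checked against each other with a few lines of arithmetic. The sketch below uses only the summary values from the example. The link between F and R² that it relies on, F = [R² / (1 − R²)] · (n − 2), is a standard identity for simple linear regression rather than something stated in the example itself.

```python
# Summary values reported in the worked example
mean_x = 4.2       # mean exercise, hrs/week
mean_y = 72.5      # mean resting heart rate, bpm
b = -2.3           # OLS slope
n = 30             # sample size
r_squared = 0.51

# Intercept from a = y_bar - b * x_bar
a = mean_y - b * mean_x
print(f"a = {a:.2f}")                  # a = 82.16 (reported as 82.2)

# Prediction at 5 hours per week, using the rounded intercept as in the text
y_hat_5 = round(a, 1) + b * 5
print(f"Y-hat(5) = {y_hat_5:.1f}")     # Y-hat(5) = 70.7

# F statistic implied by R-squared with 1 and n - 2 degrees of freedom
f_stat = (r_squared / (1 - r_squared)) * (n - 2)
print(f"F(1, {n - 2}) = {f_stat:.1f}") # F(1, 28) = 29.1
```

The intercept, the prediction, and the F statistic all match the values reported above, so the example is internally consistent.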
Interpreting regression output
Standard regression output (from R, SPSS, Python, or Stata) includes:
| Output element | What it tells you |
|---|---|
| Coefficient (b) | Estimated slope — change in Y per unit change in X |
| Standard error (SE) | Precision of the slope estimate; smaller = more precise |
| t-statistic | b divided by SE; tests whether the slope differs from zero |
| p-value | Probability of observing a slope at least this far from zero if the true slope is zero |
| 95% CI for b | Range of plausible values for the true population slope |
| R² | Proportion of variance in Y explained by X |
| F-statistic | Overall model significance; in SLR, F = t² |
| Residual standard error | Typical prediction error in units of Y |
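Several rows of this table can be reproduced by hand. The following sketch computes the residual standard error, the standard error of the slope, the t-statistic, and the F-statistic for a small made-up data set, using only the standard library. The p-value and confidence interval are left out because they need the t distribution, which the standard library does not provide.

```python
from math import sqrt

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [52, 61, 64, 72, 76]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Coefficients
b = sxy / sxx
a = y_bar - b * x_bar

# Sums of squares
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ss_total = sum((yi - y_bar) ** 2 for yi in y)

# Residual standard error: typical prediction error, n - 2 degrees of freedom
rse = sqrt(ss_residual / (n - 2))

# Standard error of the slope and its t-statistic
se_b = rse / sqrt(sxx)
t_stat = b / se_b

# F-statistic: explained mean square over residual mean square
f_stat = (ss_total - ss_residual) / (ss_residual / (n - 2))

print(f"b = {b:.2f}, SE = {se_b:.3f}, t = {t_stat:.2f}")
print(f"F = {f_stat:.2f}, t^2 = {t_stat ** 2:.2f}")
print(f"Residual standard error = {rse:.3f}")
```

The printed F and t² agree, which is the F = t² relationship listed in the table for simple linear regression.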
Quick summary
| Concept | Key point |
|---|---|
| Equation | Ŷ = a + bX |
| Slope (b) | Change in Y for each one-unit increase in X |
| Intercept (a) | Predicted Y when X = 0 |
| OLS | Minimizes sum of squared residuals |
| R² | Proportion of variance in Y explained by the model (0–1) |
| Residuals | Observed minus predicted values; used to check assumptions |
| Assumptions | Linearity, independence, normality of residuals, homoscedasticity |