Simple Linear Regression


Linear regression applies to cases in which two variables are related. The main idea is that, given a set of data points, your task is to find the line that best fits them. There are several ways to do this, and one of the most widely used is the least-squares method.


Examples include the relationship between a child's age and height, between students' high school GPA and college GPA, or between shoe size and hand size.

Simple Linear Regression

Simple linear regression is modeled as y = mx + b. Here x is our input or predictor, y is our output or response, m is the slope, and b is the y-intercept.

One way of finding the line of best fit is the least-squares method, which chooses the slope and intercept that minimize the sum of the squared vertical distances between the data points and the line. It can be implemented with the function below:

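from statistics import mean

def line_of_best_fit(x, y):
    # Means of the inputs and the outputs
    xm = mean(x)
    ym = mean(y)

    a = 0  # running sum of (x - xm) * (y - ym)
    b = 0  # running sum of (x - xm)**2

    for i in range(len(x)):
        a += (x[i] - xm) * (y[i] - ym)
        b += (x[i] - xm)**2

    # Slope of the line
    m = a / b

    # y-intercept: the fitted line passes through the point (xm, ym)
    y_intercept = ym - (m * xm)

    return (m, y_intercept)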

This function returns the slope and the y-intercept, in that order. Linear regression can help us interpret the relationship in our data, predict the output for a given input, or estimate how likely a given output is.
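As a quick check of the least-squares calculation, here is a minimal, self-contained sketch that fits a line to a small made-up dataset. The name fit_line and the sample values are assumptions chosen for illustration; they do not come from a real dataset.

```python
from statistics import mean

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    xm, ym = mean(x), mean(y)
    # Numerator: sum of products of deviations from the means
    num = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    # Denominator: sum of squared deviations of x from its mean
    den = sum((xi - xm) ** 2 for xi in x)
    m = num / den
    b = ym - m * xm
    return m, b

# Made-up sample data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

m, b = fit_line(x, y)
print(f"slope = {m:.2f}, intercept = {b:.2f}")  # slope = 0.60, intercept = 2.20
```

For these five points the means are 3 and 4, the numerator is 6 and the denominator is 10, so the slope is 0.6 and the intercept is 4 - 0.6 * 3 = 2.2.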


As an example, let's look at the following linear model:

y = 0.46x - 3.63

This model describes the relationship between the brain weight and the body weight of a mammal species, both in kilograms. In this formula, x represents the weight of the body and y represents the weight of the brain. From this model, we can interpret that, on average, the weight of the brain increases by 0.46 kilograms for every additional kilogram of body weight. We can also predict the weight of a mammal's brain from the weight of its body. Finally, given an assumption about how the errors around the line are distributed, we can estimate the probability that a mammal has a brain weighing close to some value y when its body weighs some value x.
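To make the prediction step concrete, here is a small sketch that plugs body weights into the example model y = 0.46x - 3.63. The function name predict_brain_weight and the body weights used are hypothetical, chosen only for illustration.

```python
def predict_brain_weight(body_weight):
    """Apply the example model y = 0.46x - 3.63 to a body weight."""
    slope = 0.46
    intercept = -3.63
    return slope * body_weight + intercept

# Hypothetical body weights, for illustration only
for body_weight in (10, 50, 100):
    prediction = predict_brain_weight(body_weight)
    print(f"body = {body_weight}, predicted brain = {prediction:.2f}")
```

Because the intercept is negative, this model gives a negative brain weight for any body weight below about 7.9, so its predictions are only meaningful within the range of the data the line was fitted to.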

Residual variance

The residual variance describes the variability of the data around the model. In other words, it tells us how far the observations are spread from the line of best fit. With a higher residual variance, the data are more spread out; with a smaller residual variance, the points lie closer to the line of best fit.

To calculate the residual variance, we first have to calculate the error sum of squares (SSE). The calculation is shown below:

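# data is a list of (y, x) pairs; m and b are the slope and y-intercept of the fitted line
n = len(data)

sse = 0
for i in range(n):
    y = data[i][0]
    x = data[i][1]
    sse += (y - (m * x + b))**2

# To estimate the residual variance (n - 2 because two parameters, m and b, were estimated):
residual_variance = sse / (n - 2)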
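The same calculation can be wrapped in a function and tried on a small dataset. The sketch below is one way to do it, under some assumptions: the function name residual_variance, the separate x and y lists, and the sample values are all made up for illustration, and the slope and intercept are the least-squares values for those particular points.

```python
def residual_variance(x, y, m, b):
    """Estimate the residual variance as SSE / (n - 2)."""
    n = len(x)
    if n <= 2:
        # With two or fewer points, n - 2 is zero or negative
        raise ValueError("need more than two data points")
    # Error sum of squares: squared vertical distances from the line
    sse = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    return sse / (n - 2)

# Made-up sample data and its least-squares line, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
m, b = 0.6, 2.2

print(round(residual_variance(x, y, m, b), 2))  # 0.8
```

Here the residuals are -0.8, 0.6, 1.0, -0.6 and -0.2, so the error sum of squares is 2.4 and the estimate is 2.4 / 3 = 0.8.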

Coefficient of Determination

The coefficient of determination describes how much of the variation in the output y is explained by the line of best fit. We can calculate it by first calculating the total sum of squares (SST) and the error sum of squares (SSE).

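from statistics import mean

# Y is the list of observed outputs; sse is the error sum of squares
sst = 0
mean_of_y = mean(Y)

for y in Y:
    sst += (y - mean_of_y)**2

# To calculate the coefficient of determination
r_squared = 1 - (sse / sst)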

A coefficient of determination that is close to 1 means that the points all lie close to the regression line.
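Putting the two sums together, the following sketch computes the coefficient of determination for a small dataset. The function name r_squared and the sample values are assumptions made for illustration, and the slope and intercept are the least-squares values for those particular points.

```python
from statistics import mean

def r_squared(x, y, m, b):
    """Return 1 - SSE / SST for the line y = m * x + b."""
    y_mean = mean(y)
    # Error sum of squares: variation left over after fitting the line
    sse = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares: variation of y around its mean
    # (this is zero, and the ratio undefined, if every y is identical)
    sst = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - sse / sst

# Made-up sample data and its least-squares line, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
m, b = 0.6, 2.2

print(round(r_squared(x, y, m, b), 2))  # 0.6
```

For these points the error sum of squares is 2.4 and the total sum of squares is 6, so the line accounts for 1 - 2.4 / 6 = 0.6, or 60 percent, of the variation in y.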


These are a few of the applications of linear regression. You can practice finding the line of best fit using some of the datasets from this website here