## Linear Regression

Linear regression applies to cases in which two variables are related: given a set of data points, the task is to find the line that best fits them. There are many ways to do this, and the most widely used is the least-squares method.

Some example applications include finding the relationship between the age and height of a child, between the high school GPA and college GPA of students, or between shoe size and hand size.

Simple linear regression is modeled as `y = mx + b`, where `x` is our input (predictor), `y` is our output (response), `m` is our slope, and `b` is our y-intercept.

The least-squares estimates of the slope and intercept can be computed with the function below:

```
from statistics import mean

def line_of_best_fit(x, y):
    xm = mean(x)
    ym = mean(y)
    num = 0  # sum of (x - xm) * (y - ym)
    den = 0  # sum of (x - xm) squared
    for i in range(len(x)):
        num += (x[i] - xm) * (y[i] - ym)
        den += (x[i] - xm) ** 2
    m = num / den
    b = ym - (m * xm)
    return (m, b)
```

This function returns the `slope` and the `y-intercept`, respectively. Linear regression can help us interpret the relationship in our data, predict the output for a given input, or estimate how likely a given output is.
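As a quick check, here is a self-contained version of the least-squares fit run on a few hand-made points that lie exactly on a known line (the data is invented for illustration):

```
from statistics import mean

def line_of_best_fit(x, y):
    xm = mean(x)
    ym = mean(y)
    num = 0
    den = 0
    for i in range(len(x)):
        num += (x[i] - xm) * (y[i] - ym)
        den += (x[i] - xm) ** 2
    m = num / den
    b = ym - (m * xm)
    return (m, b)

# Points generated from y = 2x + 1, so the fit should recover m = 2, b = 1
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
m, b = line_of_best_fit(x, y)
print(m, b)  # 2.0 1.0
```

Because these points are perfectly linear, the least-squares line passes through every one of them.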

As an example, let's look at the following linear model:

**y = 0.46x - 3.63**

This model describes the relationship between the brain weight and body weight of mammal species, both in kilograms. In this formula, `x` represents the body weight and `y` represents the brain weight. From this model, we can interpret that, on average, brain weight increases by 0.46 kg for every additional kilogram of body weight. We can also use the model to predict a mammal's brain weight from its body weight, or to estimate how likely it is that a mammal's brain weighs some value `y` given that its body weighs some value `x`.
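For instance, a prediction from this model can be computed directly (the 60 kg body weight below is just an illustrative input):

```
# Coefficients from the model y = 0.46x - 3.63
m, b = 0.46, -3.63

def predict_brain_weight(body_kg):
    # Predicted brain weight in kilograms for a given body weight
    return m * body_kg + b

print(predict_brain_weight(60))  # roughly 23.97 kg
```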

The residual variance describes the variability in the model; in other words, it tells us how spread out the observations are around the line of best fit. A higher residual variance means the data is more spread out, while a smaller residual variance means the points lie closer to the line of best fit.

To calculate the residual variance, we first have to calculate the `error sum of squares` (SSE). This is modeled in the snippet below:

```
sse = 0
n = len(data)
for i in range(n):
    y = data[i][0]
    x = data[i][1]
    sse += (y - (m * x + b)) ** 2

# To estimate the residual variance:
residual_variance = sse / (n - 2)
```

The coefficient of determination describes how well the line of best fit explains the variation in the data. We can calculate it from the `total sum of squares` (SST) and the `error sum of squares` (SSE).

```
sst = 0
mean_of_y = mean(Y)
for y in Y:
    sst += (y - mean_of_y) ** 2

# To calculate the coefficient of determination:
r_squared = 1 - (sse / sst)
```

**A coefficient of determination close to 1 means that the points all lie close to the regression line.**
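Putting the pieces together, here is a minimal end-to-end sketch that fits a line and then computes SSE, SST, and the coefficient of determination on a small invented dataset:

```
from statistics import mean

# Invented (x, y) points with a roughly linear trend
X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Least-squares fit
xm, ym = mean(X), mean(Y)
m = sum((x - xm) * (y - ym) for x, y in zip(X, Y)) / sum((x - xm) ** 2 for x in X)
b = ym - m * xm

# Error sum of squares (SSE) and total sum of squares (SST)
sse = sum((y - (m * x + b)) ** 2 for x, y in zip(X, Y))
sst = sum((y - ym) ** 2 for y in Y)

r_squared = 1 - sse / sst
print(round(r_squared, 4))  # close to 1 for a near-linear trend
```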

These are a few of the many applications of linear regression. You can practice finding the line of best fit using some of the datasets from this website.