
Multiple linear regression is just like simple linear regression, but with two or more predictors. For example, a college student's high school GPA may be used to predict the student's success in college. However, we know there are other factors that, taken together, can better predict a student's college success. Multiple linear regression allows us to make such predictions.

The general additive multiple regression model equation is as follows:

`y = B0 + B1x1 + B2x2 + ... + Bkxk + e`

where `e` is the random error term (usually denoted by epsilon).

In multiple linear regression, parameter estimates are determined by the principle of least squares.
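To make the least-squares principle concrete, here is a minimal NumPy sketch (my own illustration, with made-up data, not part of the original walkthrough) that recovers the coefficients by minimizing the sum of squared errors:

```python
import numpy as np

# Made-up, exactly linear data: y = 1 + 2*x1 + 3*x2
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

# Least squares: prepend an intercept column, then minimize ||Xb - y||^2
X_design = np.column_stack([np.ones(len(X)), X])
b, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(b)  # recovers [1., 2., 3.]
```

Because the toy data is exactly linear, least squares recovers the true intercept and slopes.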

**Sum of Squared Error:**
`SSE += (y[i] - y_hat)**2`

where `y[i]` is the observed value and `y_hat` is the predicted value.

**Sum of Squared Regression:**
`SSR += (y_hat - mean_y)**2`

where `y_hat` is the predicted value and `mean_y` is the mean of the observations.

**Sum of Squared Totals:**
`SST += (y[i] - mean_y)**2`

where `y[i]` is the observed value and `mean_y` is the mean of the observed values.

**Mean Squared Regression:**
`MSR = SSR / k`

where `k` is the number of predictors.

**Mean Squared Error:**
`MSE = SSE / (n - (k + 1))`

where `k` is the number of predictors and `n` is the number of observations.

**Coefficient of Determination:** `R**2 = 1 - (SSE / SST)`

The coefficient of determination describes the proportion of the variation in the response that is explained by the regression model. A value close to 1 means that the points all lie close to the regression line.

Before we proceed to interpret the model, we first have to perform the global F-test, to help us verify that the response would indeed depend on at least one of the predictors. The global F-test has the following hypotheses:

```
Ho : B1 = B2 = ... = Bk = 0
Ha : At least one of Bj != 0
```

The **test statistic** (or **F-statistic**) for a multiple linear regression is as follows:
`F = MSR / MSE`

where the degrees of freedom are `k` and `n - (k + 1)`. If the p-value of the F statistic is smaller than the significance level (usually 0.05 if not specified), we reject the null hypothesis, meaning that the response depends on at least one of the predictors. Therefore, we can proceed to interpret the multiple regression model.

There can be instances in which one of the variables in a multiple linear regression model is categorical. We can break a categorical variable with `n` categories into `n` dummy variables (in practice, one category is often dropped as a baseline to avoid collinearity with the intercept). For example, suppose we are trying to predict the success of a college student based on GPA and major (Economics, Math, and Computer Science). GPA will be denoted as `x1`, and a student majoring in Economics, Math, or Computer Science will be denoted by `x2`, `x3`, and `x4` respectively.

```
x2 = 1 if major is economics
x2 = 0 otherwise
x3 = 1 if major is math
x3 = 0 otherwise
x4 = 1 if major is computer science
x4 = 0 otherwise
```

Our model would look something like this:
`y = B0 + B1x1 + B2x2 + B3x3 + B4x4`
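As a sketch of this encoding (the student data below is made up for illustration), the dummy columns can be built with NumPy like so:

```python
import numpy as np

# Made-up student data: GPA (x1) and major for four students
gpas = [3.2, 3.8, 3.5, 3.0]
majors = ["Economics", "Math", "Computer Science", "Math"]

# Dummy columns x2 (Economics), x3 (Math), x4 (Computer Science)
levels = ["Economics", "Math", "Computer Science"]
dummies = np.array([[1 if major == level else 0 for level in levels]
                    for major in majors])

# Design matrix [x1, x2, x3, x4]
X_students = np.column_stack([gpas, dummies])
print(X_students)
```

Each row has exactly one dummy set to 1, marking that student's major.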

SKLearn (scikit-learn) is a machine learning library for Python. It's known for being friendly to beginners without compromising efficiency or accuracy.

Before we begin, make sure you have SKLearn installed on your machine. You can learn how to install SKLearn here.

You will also need Python installed on your device. You can learn how to install Python here.

Lastly, you'll need a good text editor or IDE. I personally use Jupyter Notebook or Sublime Text 3, but you are free to use whatever works for you. If you want to try out Jupyter Notebook, the installation guide is here; for Sublime Text 3, it's here.
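If you use pip, installing the libraries used in this tutorial typically looks like this:

```shell
pip install numpy scikit-learn
```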

For a sample of `n = 20` individuals, we have measurements of `y = body fat`, `x1 = triceps skinfold thickness`, `x2 = thigh circumference`, and `x3 = midarm circumference`.

First, we import our Python libraries:

```
from sklearn.linear_model import LinearRegression
import numpy as np
```

The data we will be using is listed below. Feel free to copy the data or download the text file here. Each row in `X` is an independent observation (predictors) ordered by

```
[triceps skinfold thickness, thigh circumference, midarm circumference]
```

and `y` is our body fat (response).

```
X = np.array([[19.5, 43.1, 29.1],
              [24.7, 49.8, 28.2],
              [30.7, 51.9, 37.0],
              [29.8, 54.3, 31.1],
              [19.1, 42.2, 30.9],
              [25.6, 53.9, 23.7],
              [31.4, 58.5, 27.6],
              [27.9, 52.1, 30.6],
              [22.1, 49.9, 23.2],
              [25.5, 53.5, 24.8],
              [31.1, 56.6, 30.0],
              [30.4, 56.7, 28.3],
              [18.7, 46.5, 23.0],
              [19.7, 44.2, 28.6],
              [14.6, 42.7, 21.3],
              [29.5, 54.4, 30.1],
              [27.7, 55.3, 25.7],
              [30.2, 58.6, 24.6],
              [22.7, 48.2, 27.1],
              [25.2, 51.0, 27.5]])
y = np.array([[11.9, 22.8, 18.7, 20.1, 12.9, 21.7, 27.1, 25.4, 21.3, 19.3,
               25.4, 27.2, 11.7, 17.8, 12.8, 23.9, 22.6, 25.4, 14.8, 21.1]]).T
```

Using SKLearn, we use the `LinearRegression()` class to fit the model:

`regression_model = LinearRegression().fit(X, y)`

From our `regression_model`, we can print out the intercept and coefficients to complete our multiple linear regression model.

```
B0 = regression_model.intercept_
B = regression_model.coef_
print(B0, B)
```

After running your code, your intercept and coefficients should look something like this

`[117.08469478] [[ 4.33409201 -2.85684794 -2.18606025]]`

We can also use the `predict()` function to predict an individual's body fat by inputting values for triceps skinfold thickness, thigh circumference, and midarm circumference. For example, using the second observation's predictors:

```
prediction = regression_model.predict([[24.7, 49.8, 28.2]])
print(prediction)
```

`[[20.21884106]]`
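To see that `predict()` simply evaluates the fitted equation, we can reproduce this value by hand from the intercept and coefficients printed above (`[[20.21884106]]` corresponds to the predictors of the second observation in `X`):

```python
# Intercept and coefficients printed by the fitted model above
b0 = 117.08469478
b = [4.33409201, -2.85684794, -2.18606025]

# Predictor values of the second observation in X
x = [24.7, 49.8, 28.2]

# y_hat = b0 + b1*x1 + b2*x2 + b3*x3
y_hat = b0 + sum(bj * xj for bj, xj in zip(b, x))
print(y_hat)  # ~20.2188, matching predict()
```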

Before we interpret the model, let's perform the global F-test to verify that predicting body fat would indeed depend on at least one of the predictors. Recall the **Parameter Estimations** above. We need to calculate the sums of squares before calculating the F value. Create a function that will calculate the sums of squares.

```
def get_sum_of_squares():
    ssr = 0
    sse = 0
    sst = 0
    mean_y = np.mean(y)
    for i in range(20):
        y_hat = regression_model.predict([X[i]])
        ssr += (y_hat - mean_y)**2
        sse += (y[i][0] - y_hat)**2
        sst += (y[i][0] - mean_y)**2
    return (ssr, sse, sst)

sum_of_squares = get_sum_of_squares()
```

If you want to check your function to see if you get the right numbers, this is my output.

`ssr = 396.98461183, sse = 98.40488817, sst = 495.3894999999999 # Don't worry about format`
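As a quick sanity check on these numbers, the sums of squares should satisfy the identity `SST = SSR + SSE` (explained variation plus residual variation equals total variation):

```python
# Values computed by get_sum_of_squares() above
ssr = 396.98461183
sse = 98.40488817
sst = 495.3895

# Explained variation plus residual variation equals total variation
print(ssr + sse, sst)
```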

Create a function to calculate the global F-test.

```
def global_f_test():
    n = len(y)
    k = 3
    ssr = sum_of_squares[0]
    sse = sum_of_squares[1]
    msr = ssr / k
    mse = sse / (n - (k + 1))
    return msr / mse

print(global_f_test())
```

Your global F-test should be `[[21.5157123]]`, which is much larger than the critical F value at the 0.05 significance level (about 3.24 for 3 and 16 degrees of freedom), so we reject the null hypothesis. Therefore, we can proceed to interpret the model.
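To make the comparison explicit, the critical F value can be computed directly (this assumes SciPy is installed; it is not used elsewhere in this tutorial):

```python
from scipy.stats import f

# Critical F value at the 0.05 level with k = 3 and
# n - (k + 1) = 16 degrees of freedom
critical_value = f.ppf(0.95, dfn=3, dfd=16)
print(critical_value)  # ~3.24
```

Since 21.52 is far above this critical value, the rejection of the null hypothesis is clear-cut.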

From **Step 3**, we know the y-intercept is `117.08469478`, and our `b1`, `b2`, and `b3` are `4.33409201`, `-2.85684794`, and `-2.18606025` respectively. Our fitted model looks like this:

`y = 117.08469478 + 4.33409201(x1) - 2.85684794(x2) - 2.18606025(x3)`

Here we can interpret the coefficients as follows (holding the other predictors fixed):

• For every unit increase in an individual's triceps skinfold thickness, we expect body fat to increase by about 4.33 percentage points.

• For every unit increase in an individual's thigh circumference, we expect body fat to decrease by about 2.86 percentage points.

• For every unit increase in an individual's midarm circumference, we expect body fat to decrease by about 2.19 percentage points.

Just for practice, we can calculate the coefficient of determination. Recall the coefficient of determination formula under **Parameter Estimation**.

```
def coefficient_of_determination():
    sse = sum_of_squares[1]
    sst = sum_of_squares[2]
    return 1 - (sse / sst)

print(coefficient_of_determination())
```

`coefficient of determination = [[0.80135855]]`

The value is close to 1; therefore, we can say that most of the points lie close to the regression line.
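scikit-learn can also report the coefficient of determination directly via the model's `score()` method. Here is a minimal self-contained sketch on made-up, exactly linear data (so the score comes out as 1.0):

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Made-up, exactly linear data: y = 2 + 3*x1 - x2, so R^2 should be 1.0
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = 2 + 3 * X[:, 0] - X[:, 1]

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)
print(r2)  # 1.0 (up to floating point)
```

On the body fat data, `regression_model.score(X, y)` would return the same R² we computed by hand.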

**Sources:**

• Sum of Squares formula from 365datascience.com

• F-test for Linear Regression from Steve D. Jost, DePaul University

• Problem and data from PennState
Problem from PennState