# Multiple Linear Regression with SKLearn

## Introduction

Multiple Linear Regression Analysis is just like Simple Linear Regression but with one or more predictors. For example, high school GPA of a college student may be used to predict the students success in college. However, we know that there are other factors that can better predict students college success. Multiple linear regression allows us to make such predictions.

The general additive multiple regression model equation is described as followed:

``y = Bo + B1x1 + B2x2 + ... + BnXn + error``

Where the standard error is denoted as epsilon

## Parameter Estimation

In multiple linear regression, parameter estimates are determined by the principle of least squares.

Sum of Squared Error: `SSE += (y[i] - y_hat)**2` where `y[i]` is the observed value while `y_hat` is the predicted value.

Sum of Squared Regression: `SSR += (y_hat - mean_y)**2` where `y_hat` is the predicted value and `mean_y` is the mean of the observations.

Sum of Squared Totals: `SST += (y[i] - mean_y)**2` where `y[i]` is the observed value and the `mean_y` is the mean of the observed values.

Mean Squared Regression: `MSR = SSR / k` where k is the number of predictors

Mean Squared Error: `MSE = SSE / (n - (k + 1))` where `k` is the number of predictors and `n` is the number of observations.

Coefficient of Determination: `1 - (sse / sst)`. The coefficient of determination describes how much of the data influences the line of best fit. A number that is close to 1 means that the points all lie close to the regression line.

## Global F-Test

Before we proceed to interpret the model, we first have to perform the Global F-test; to help us verify that the output would indeed depend on at least one of the variables. The significance of the global f-test has the following hypothesis:

``````Ho : B1 = B2 = ... = Bk = 0
Ha : At least one of Bj != 0
``````

Test Statistic or F-test: for a multiple linear regresion is as followed: `F = MSR / MSE` where the degrees of freedom is `k` and `n - (k + 1)`. If the F value is larger than the significance level (usually 0.05 if not specified), we reject the null hypothesis; meaning that the response is dependent on at least one of the predictors. Therefor, we can proceed to interpret the multiple regresion model.

## Categorical Interval

There can be instances in which one of the variables in a multiple linear regression model will be categorical. Ideally we would break up the `n` categories into `n` dummy variables. For example, suppose we are trying to predict the success of a college student based on GPA and major (Economics, Math, and Computer Science). GPA will be denoted as `x1` and a student majoring in Economics, Math, and Computer Science will be denoted as `x2, x3, and x4` respectively.

``````x2 = 1 if major is economics
x2 = 0 otherwise

x3 = 1 if major is math
x3 = 0 otherwise

x4 = 1 if major is computer science
x4 = 0 otherwise
``````

Our model would look something like this `y = B0 + B1x1 + B2x2 + B3x3 + B4x4`

## SKLearn

SKLearn is a machine learning library for Python and R. It's known for its friendliness to beginners without the compromise for efficiency and accuracy.

#### Installation

Before we begin, make sure you have SKLearn installed in your machine. You can learn how to install SKLearn here

You will also need to install Python in your device. You can learn how to install Python here

Lastly, you'll need a good text editor or IDE. I personally use Jupyter Notebooks or Sublime Text 3; but you are free to use whatever works for you. If you want to tryout Jupyter notebooks, the installation guide is here or Sublime Text 3 is here

## Problem

For a sample of `n = 20` individuals, we have measurements of `y = body fat`, `x1 = triceps skinfold thickness`, `x2 = thigh circumference`, and `x3 = midarm circumference`.

#### Step 0)

First we import our python libraries

``````from sklearn.linear_model import LinearRegression
import numpy as np``````

#### Step 1)

The data will be using is listed below. Feel free to copy the data or download the textfile here. Each row in `X` is an independent observation (predictors) ordered by ```[triceps skinfold thickness, thigh circumference, midarm circumference]``` and `y` is our body fat (response).

``````X = np.array([[19.5, 43.1, 29.1],
[24.7, 49.8, 28.2],
[30.7, 51.9, 37.0],
[29.8, 54.3, 31.1],
[19.1, 42.2, 30.9],
[25.6, 53.9, 23.7],
[31.4, 58.5, 27.6],
[27.9, 52.1, 30.6],
[22.1, 49.9, 23.2],
[25.5, 53.5, 24.8],
[31.1, 56.6, 30.0],
[30.4, 56.7, 28.3],
[18.7, 46.5, 23.0],
[19.7, 44.2, 28.6],
[14.6, 42.7, 21.3],
[29.5, 54.4, 30.1],
[27.7, 55.3, 25.7],
[30.2, 58.6, 24.6],
[22.7, 48.2, 27.1],
[25.2, 51.0, 27.5]])
y = np.array([[11.9,22.8,18.7,20.1,12.9,21.7,27.1,25.4,21.3,19.3,25.4,27.2,11.7,17.8,12.8,23.9,22.6,25.4,14.8,21.1]]).T``````

#### Step 2)

Using SKLearn, we want to use the `LinearRegression()` function and fit the model.

``regression_model = LinearRegression().fit(X,y)``

#### Step 3)

From our `regression_model`, we can print out our intercepts and coefficients to complete our multiple linear regression model.

``````B0 = regression_model.intercept_
B = regression_model.coef_
print(B0, B)``````

After running your code, your intercept and coefficients should look something like this

##### Output
``[117.08469478] [[ 4.33409201 -2.85684794 -2.18606025]]``

We can also use the `predict()` function to predict the individuals body fat by inputing values for triceps skinfold thickness, thigh circumference, midarm circumference.

``````prediction = reg.predict([[19.5, 43.1, 29.1]])
print(prediction)``````
##### Output
``[[20.21884106]]``

#### Step 4)

Before we interpret the model, lets perform the Global F-test to verify that predicting the body fat would indeed depend on at least one of the predictors. Recall the Parameter Estimations above. We need to calculate the sum of squares before calculating the f value. Create a function that will calculate the sum of squares.

``````def get_sum_of_squares():
ssr = 0
sse = 0
sst = 0

mean_y = np.mean(y)

for i in range(20):
y_hat = regression_model.predict([X[i]])

ssr += (y_hat - mean_y)**2
sse += (y[i][0] - y_hat)**2
sst += (y[i][0] - mean_y)**2

return (ssr, sse, sst)

sum_of_squares = get_sum_of_squares()``````

If you want to check your function to see if you get the right numbers, this is my output.

##### Output
``ssr = 396.98461183, sse = 98.40488817, sst = 495.3894999999999 # Don't worry about format``

#### Step 5)

Create a function to calculate the global F-test

``````def global_f_test():
n = len(yy)
k = 3
ssr = ss[0]
sse = ss[1]

msr = ssr / k
mse = sse / (n - (k + 1))

return msr / mse``````

Your global F-test should be: `[[21.5157123]]` which is significantly higher than the significant value (it was not specified so by default, our significant value is 0.05). Therefor, we can proceed to interpret the model.

#### Step 6)

From Step 3, we know the y-intercept is `117.08469478` and our `b1`,`b2`, and `b3` are `4.33409201`, `-2.85684794`, and `-2.18606025` respectively. Our fitted model would look like this:

``y = 4.33409201(x1) + -2.85684794(x2) + -2.18606025(x3) + 117.08469478``

Here we can interpret that
• For every unit increase of the thickness of an individuals triceps, we expect an increase in body fat by 4.33%
• For every unit increase of the thickness of an individuals thighs, we expect a decrease in body fat by -2.85%
• For every unit increase of the thickness of an individuals midarm, we expect a decrease in body fat by -2.19%

#### Optional Step 7)

Just for practice, we can calculate the coefficient of determination. Recall the coefficient of determination formula under Parameter Estimation.

``````def coefficient_of_determination():
sse = ss[1]
sst = ss[2]
return 1 - (sse / sst)``````
##### Output
``coefficient of determination = [[0.80135855]]``

The value is close to 1, therefor we can say that most of the points lie close to the regression line.

## References

Sum of Squares Formula from 365datascience.com

F-test for Linear Regression from DePaul University Steve D. Jost

Problem from PennState