Multiple Linear Regression with SKLearn

Introduction

Multiple Linear Regression Analysis is just like Simple Linear Regression, but with two or more predictors. For example, the high school GPA of a college student may be used to predict the student's success in college. However, we know that there are other factors that can help predict a student's college success. Multiple linear regression allows us to use several predictors at once to make such predictions.

The general additive multiple regression model equation is as follows:

y = B0 + B1x1 + B2x2 + ... + Bkxk + ε

where ε (epsilon) denotes the random error term and k is the number of predictors.
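
To make the additive structure concrete, the fitted part of the model is just the intercept plus a weighted sum of the predictor values. A minimal NumPy sketch with made-up coefficients:

import numpy as np

B0 = 1.5                          # intercept (made up for illustration)
B = np.array([0.4, -0.2, 0.1])    # B1, B2, B3 (made up for illustration)
x = np.array([2.0, 3.0, 5.0])     # one observation's predictor values

y_hat = B0 + x @ B   # B0 + B1*x1 + B2*x2 + B3*x3
print(y_hat)         # 1.5 + 0.8 - 0.6 + 0.5 = 2.2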


Parameter Estimation

In multiple linear regression, parameter estimates are determined by the principle of least squares.

Sum of Squared Error: SSE += (y[i] - y_hat[i])**2, where y[i] is the i-th observed value and y_hat[i] is the corresponding predicted value.

Sum of Squared Regression: SSR += (y_hat[i] - mean_y)**2, where y_hat[i] is the predicted value and mean_y is the mean of the observed values.

Sum of Squared Totals: SST += (y[i] - mean_y)**2, where y[i] is the observed value and mean_y is the mean of the observed values.

Mean Squared Regression: MSR = SSR / k where k is the number of predictors

Mean Squared Error: MSE = SSE / (n - (k + 1)) where k is the number of predictors and n is the number of observations.

Coefficient of Determination: 1 - (SSE / SST). The coefficient of determination describes the proportion of the variation in the response that is explained by the regression model. A value close to 1 means the points all lie close to the regression line.
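
As a quick illustration, these quantities can be computed in a few lines of NumPy. This is a minimal sketch where y_obs and y_hat are hypothetical observed and predicted values (not a real fit):

import numpy as np

y_obs = np.array([11.9, 22.8, 18.7, 20.1, 12.9, 21.7])   # hypothetical observations
y_hat = np.array([12.5, 21.9, 19.2, 19.9, 13.4, 21.0])   # hypothetical predictions
k = 3                   # number of predictors (assumed)
n = len(y_obs)          # number of observations

mean_y = np.mean(y_obs)
sse = np.sum((y_obs - y_hat)**2)    # Sum of Squared Error
ssr = np.sum((y_hat - mean_y)**2)   # Sum of Squared Regression
sst = np.sum((y_obs - mean_y)**2)   # Sum of Squared Totals

msr = ssr / k                 # Mean Squared Regression
mse = sse / (n - (k + 1))     # Mean Squared Error
r_squared = 1 - (sse / sst)   # Coefficient of Determination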


Global F-Test

Before we proceed to interpret the model, we first have to perform the Global F-test, which verifies that the response indeed depends on at least one of the predictors. The global F-test has the following hypotheses:

H0 : B1 = B2 = ... = Bk = 0
Ha : at least one Bj != 0

Test Statistic or F-test: for a multiple linear regression it is as follows: F = MSR / MSE, with k and n - (k + 1) degrees of freedom. If the p-value of the F statistic is smaller than the significance level (usually 0.05 if not specified), or equivalently if F exceeds the critical F value for those degrees of freedom, we reject the null hypothesis, meaning that the response depends on at least one of the predictors. Therefore, we can proceed to interpret the multiple regression model.
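
If you want the rejection threshold itself, scipy can compute the critical F value for given degrees of freedom. A minimal sketch, assuming scipy is installed (the k and n values are just examples):

from scipy.stats import f

k = 3        # number of predictors (example)
n = 20       # number of observations (example)
alpha = 0.05

# Reject H0 when the F statistic exceeds this critical value
f_critical = f.ppf(1 - alpha, k, n - (k + 1))
print(f_critical)   # about 3.24 for (3, 16) degrees of freedom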


Categorical Variables

There can be instances in which one of the variables in a multiple linear regression model is categorical. Standard practice is to encode n categories as n - 1 dummy variables, with the remaining category serving as the baseline (using all n would make the dummies redundant with the intercept). For example, suppose we are trying to predict the success of a college student based on GPA and major (Economics, Math, and Computer Science). GPA will be denoted as x1, and students majoring in Economics and Math will be indicated by x2 and x3 respectively, with Computer Science as the baseline.

x2 = 1 if major is Economics
x2 = 0 otherwise

x3 = 1 if major is Math
x3 = 0 otherwise

A Computer Science major is the baseline case, with x2 = x3 = 0.

Our model would then look like this: y = B0 + B1x1 + B2x2 + B3x3
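
In practice you rarely build these columns by hand; pandas' get_dummies can generate them. A minimal sketch with hypothetical student data (drop_first=True drops one category so it serves as the baseline):

import pandas as pd

# Hypothetical student data
df = pd.DataFrame({
    "gpa":   [3.2, 3.8, 2.9, 3.5],
    "major": ["Economics", "Math", "Computer Science", "Math"],
})

# One dummy column per major; drop_first=True removes the first category
# (alphabetically, Computer Science) so it acts as the baseline
dummies = pd.get_dummies(df["major"], drop_first=True)
X = pd.concat([df[["gpa"]], dummies], axis=1)
print(X)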


SKLearn

SKLearn (scikit-learn) is a machine learning library for Python. It's known for being beginner friendly without compromising efficiency or accuracy.

Installation

Before we begin, make sure you have SKLearn installed on your machine. You can learn how to install SKLearn here.

You will also need to install Python on your device. You can learn how to install Python here.

Lastly, you'll need a good text editor or IDE. I personally use Jupyter Notebooks or Sublime Text 3, but you are free to use whatever works for you. If you want to try out Jupyter Notebooks, the installation guide is here, and Sublime Text 3's is here.
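
If you use pip, the packages used in this post can typically be installed from the command line (scipy is only needed for the optional checks later on):

pip install scikit-learn numpy scipy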


Problem

For a sample of n = 20 individuals, we have measurements of y = body fat, x1 = triceps skinfold thickness, x2 = thigh circumference, and x3 = midarm circumference.

Step 0)

First, we import our Python libraries.

from sklearn.linear_model import LinearRegression   
import numpy as np

Step 1)

The data we will be using is listed below. Feel free to copy the data or download the text file here. Each row in X is an independent observation (predictors) ordered as [triceps skinfold thickness, thigh circumference, midarm circumference], and y is our body fat (response).

X = np.array([[19.5, 43.1, 29.1],
              [24.7, 49.8, 28.2],
              [30.7, 51.9, 37.0],
              [29.8, 54.3, 31.1],
              [19.1, 42.2, 30.9],
              [25.6, 53.9, 23.7],
              [31.4, 58.5, 27.6],
              [27.9, 52.1, 30.6],
              [22.1, 49.9, 23.2],
              [25.5, 53.5, 24.8],
              [31.1, 56.6, 30.0],
              [30.4, 56.7, 28.3],
              [18.7, 46.5, 23.0],
              [19.7, 44.2, 28.6],
              [14.6, 42.7, 21.3],
              [29.5, 54.4, 30.1],
              [27.7, 55.3, 25.7],
              [30.2, 58.6, 24.6],
              [22.7, 48.2, 27.1],
              [25.2, 51.0, 27.5]])
y = np.array([[11.9, 22.8, 18.7, 20.1, 12.9, 21.7, 27.1, 25.4, 21.3, 19.3,
               25.4, 27.2, 11.7, 17.8, 12.8, 23.9, 22.6, 25.4, 14.8, 21.1]]).T
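
As a quick sanity check, you can confirm the shapes line up: 20 observations of 3 predictors, and 20 responses.

print(X.shape)   # (20, 3)
print(y.shape)   # (20, 1)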

Step 2)

Using SKLearn, we create a LinearRegression() model and fit it to the data.

regression_model = LinearRegression().fit(X,y)

Step 3)

From our regression_model, we can print out our intercept and coefficients to complete our multiple linear regression model.

B0 = regression_model.intercept_
B = regression_model.coef_
print(B0, B)

After running your code, your intercept and coefficients should look something like this

Output
[117.08469478] [[ 4.33409201 -2.85684794 -2.18606025]]

We can also use the predict() function to predict an individual's body fat by inputting values for triceps skinfold thickness, thigh circumference, and midarm circumference.

prediction = regression_model.predict([[19.5, 43.1, 29.1]])
print(prediction)
Output
[[20.21884106]]

Step 4)

Before we interpret the model, let's perform the Global F-test to verify that predicting the body fat would indeed depend on at least one of the predictors. Recall the Parameter Estimation section above: we need to calculate the sums of squares before calculating the F value. Create a function that calculates the sums of squares.

def get_sum_of_squares():
    ssr = 0
    sse = 0
    sst = 0

    mean_y = np.mean(y)  # mean of the observed responses

    for i in range(len(X)):
        # Predicted value for the i-th observation
        y_hat = regression_model.predict([X[i]])

        ssr += (y_hat - mean_y)**2     # regression sum of squares
        sse += (y[i][0] - y_hat)**2    # error sum of squares
        sst += (y[i][0] - mean_y)**2   # total sum of squares

    return (ssr, sse, sst)

ss = get_sum_of_squares()

If you want to check your function to see if you get the right numbers, this is my output.

Output
ssr = 396.98461183, sse = 98.40488817, sst = 495.3894999999999 # Don't worry about format

Step 5)

Create a function to calculate the global F-test

def global_f_test():
    n = len(y)   # number of observations
    k = 3        # number of predictors
    ssr = ss[0]
    sse = ss[1]

    msr = ssr / k              # mean squared regression
    mse = sse / (n - (k + 1))  # mean squared error

    return msr / mse

Your global F statistic should be [[21.5157123]], which is well above the critical F value of about 3.24 for (3, 16) degrees of freedom at the 0.05 significance level (no level was specified, so by default we use 0.05). We therefore reject the null hypothesis and can proceed to interpret the model.
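
Equivalently, you can turn the F statistic into a p-value with scipy's survival function; a minimal sketch, assuming scipy is installed:

from scipy.stats import f

f_value = global_f_test()
# Survival function (1 - CDF) gives the upper-tail p-value
p_value = f.sf(f_value.item(), 3, 20 - (3 + 1))
print(p_value)   # well below 0.05, so we reject the null hypothesis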

Step 6)

From Step 3, we know the y-intercept is 117.08469478 and our b1, b2, and b3 are 4.33409201, -2.85684794, and -2.18606025 respectively. Our fitted model looks like this:

y = 117.08469478 + 4.33409201(x1) - 2.85684794(x2) - 2.18606025(x3)

Here we can interpret that:
• For every unit increase in an individual's triceps skinfold thickness, holding the other predictors fixed, we expect body fat to increase by about 4.33 percentage points
• For every unit increase in an individual's thigh circumference, holding the other predictors fixed, we expect body fat to decrease by about 2.86 percentage points
• For every unit increase in an individual's midarm circumference, holding the other predictors fixed, we expect body fat to decrease by about 2.19 percentage points

Optional Step 7)

Just for practice, we can calculate the coefficient of determination. Recall the coefficient of determination formula under Parameter Estimation.

def coefficient_of_determination():
    sse = ss[1]   # error sum of squares from Step 4
    sst = ss[2]   # total sum of squares from Step 4
    return 1 - (sse / sst)
Output
coefficient of determination = [[0.80135855]]

The value is close to 1; therefore, we can say that the model explains about 80% of the variation in body fat and that most of the points lie close to the regression line.
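
Note that scikit-learn can also compute this value directly: calling score() on a fitted LinearRegression returns the coefficient of determination, so you can cross-check your hand-rolled function.

r_squared = regression_model.score(X, y)
print(r_squared)   # about 0.8014, matching the result above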


References

Sum of Squares Formula from 365datascience.com

F-test for Linear Regression from Steve D. Jost, DePaul University

Problem from PennState