Multiple linear regression analysis is just like simple linear regression, but with two or more predictors. For example, the high school GPA of a college student may be used to predict the student's success in college. However, we know that there are other factors that can help predict a student's college success. Multiple linear regression allows us to bring those factors into the model.
The general additive multiple regression model equation is as follows:
y = B0 + B1x1 + B2x2 + ... + Bkxk + e
where k is the number of predictors and e is the random error term (often denoted epsilon).
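To see what the additive model means concretely, here is a minimal sketch in numpy. The coefficients and the observation below are made-up numbers for illustration, not fitted values:

```python
import numpy as np

# Hypothetical coefficients for a model with two predictors
B0 = 1.0                    # intercept
B = np.array([0.5, 2.0])    # B1, B2

x = np.array([3.0, 4.0])    # one observation: x1 = 3, x2 = 4

# y_hat = B0 + B1*x1 + B2*x2 (the error term is unobserved)
y_hat = B0 + B @ x
print(y_hat)  # 1.0 + 0.5*3 + 2.0*4 = 10.5
```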
In multiple linear regression, parameter estimates are determined by the principle of least squares.
Sum of Squared Error:
SSE = sum((y[i] - y_hat[i])**2), where y[i] is the observed value and y_hat[i] is the predicted value.
Sum of Squared Regression:
SSR = sum((y_hat[i] - mean_y)**2), where y_hat[i] is the predicted value and mean_y is the mean of the observations.
Sum of Squared Total:
SST = sum((y[i] - mean_y)**2), where y[i] is the observed value and mean_y is the mean of the observations.
Mean Squared Regression:
MSR = SSR / k, where k is the number of predictors.
Mean Squared Error:
MSE = SSE / (n - (k + 1)), where k is the number of predictors and n is the number of observations.
Coefficient of Determination:
R^2 = 1 - (SSE / SST). The coefficient of determination describes the proportion of the variation in the response that is explained by the regression model. A value close to 1 means that the points all lie close to the regression line.
Before we proceed to interpret the model, we first have to perform the global F-test to verify that the response indeed depends on at least one of the predictors. The global F-test has the following hypotheses:
H0: B1 = B2 = ... = Bk = 0
Ha: at least one Bj != 0
Test Statistic (F-test): for a multiple linear regression, the test statistic is
F = MSR / MSE, with k and n - (k + 1) degrees of freedom. If the p-value of the F statistic is smaller than the significance level (usually 0.05 if not specified), we reject the null hypothesis, meaning that the response depends on at least one of the predictors. Therefore, we can proceed to interpret the multiple regression model.
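As an aside, the p-value for an observed F statistic can be computed from the F distribution's survival function. A sketch using scipy.stats (assuming scipy is installed; the F value and sample sizes below are placeholders):

```python
from scipy.stats import f

# Hypothetical values for illustration
F_value = 21.5   # observed F statistic
k = 3            # number of predictors (numerator df)
n = 20           # number of observations; denominator df = n - (k + 1)

p_value = f.sf(F_value, k, n - (k + 1))
if p_value < 0.05:
    print("Reject H0: the response depends on at least one predictor")
```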
There can be instances in which one of the variables in a multiple linear regression model is categorical. In that case, we break the n categories up into indicator (dummy) variables. For example, suppose we are trying to predict the success of a college student based on GPA and major (Economics, Math, and Computer Science). GPA will be denoted as x1, and a student majoring in Economics, Math, or Computer Science will be indicated by x2, x3, and x4 respectively:
x2 = 1 if major is Economics, x2 = 0 otherwise
x3 = 1 if major is Math, x3 = 0 otherwise
x4 = 1 if major is Computer Science, x4 = 0 otherwise
Our model would look something like this:
y = B0 + B1x1 + B2x2 + B3x3 + B4x4
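One way to build these indicator variables by hand is a sketch like the following (the major labels come from the example above; the variable names are my own):

```python
import numpy as np

majors = ["Economics", "Math", "Economics", "Computer Science"]
levels = ["Economics", "Math", "Computer Science"]

# One column per major: 1 if the student has that major, 0 otherwise
dummies = np.array([[1 if m == level else 0 for level in levels]
                    for m in majors])
print(dummies)
# [[1 0 0]
#  [0 1 0]
#  [1 0 0]
#  [0 0 1]]
```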
scikit-learn (sklearn) is a machine learning library for Python. It's known for being friendly to beginners without compromising efficiency and accuracy.
Before we begin, make sure you have sklearn installed on your machine. You can learn how to install sklearn here.
You will also need to install Python on your device. You can learn how to install Python here.
Lastly, you'll need a good text editor or IDE. I personally use Jupyter Notebook or Sublime Text 3, but you are free to use whatever works for you. If you want to try out Jupyter Notebook, the installation guide is here; for Sublime Text 3, it is here.
For a sample of n = 20 individuals, we have y = body fat, x1 = triceps skinfold thickness, x2 = thigh circumference, and x3 = midarm circumference.
First we import our Python libraries:
from sklearn.linear_model import LinearRegression
import numpy as np
The data we will be using is listed below. Feel free to copy the data or download the text file here. Each row of X is an independent observation of the predictors [triceps skinfold thickness, thigh circumference, midarm circumference], and y is our body fat (response).
X = np.array([[19.5, 43.1, 29.1], [24.7, 49.8, 28.2], [30.7, 51.9, 37.0], [29.8, 54.3, 31.1],
              [19.1, 42.2, 30.9], [25.6, 53.9, 23.7], [31.4, 58.5, 27.6], [27.9, 52.1, 30.6],
              [22.1, 49.9, 23.2], [25.5, 53.5, 24.8], [31.1, 56.6, 30.0], [30.4, 56.7, 28.3],
              [18.7, 46.5, 23.0], [19.7, 44.2, 28.6], [14.6, 42.7, 21.3], [29.5, 54.4, 30.1],
              [27.7, 55.3, 25.7], [30.2, 58.6, 24.6], [22.7, 48.2, 27.1], [25.2, 51.0, 27.5]])
y = np.array([[11.9, 22.8, 18.7, 20.1, 12.9, 21.7, 27.1, 25.4, 21.3, 19.3,
               25.4, 27.2, 11.7, 17.8, 12.8, 23.9, 22.6, 25.4, 14.8, 21.1]]).T
Using sklearn, we call the LinearRegression() function and fit the model:
regression_model = LinearRegression().fit(X,y)
Using regression_model, we can print out our intercept and coefficients to complete our multiple linear regression model:
B0 = regression_model.intercept_
B = regression_model.coef_
print(B0, B)
After running your code, your intercept and coefficients should look something like this
[117.08469478] [[ 4.33409201 -2.85684794 -2.18606025]]
We can also use the predict() function to predict an individual's body fat by inputting values for triceps skinfold thickness, thigh circumference, and midarm circumference:
prediction = regression_model.predict([[19.5, 43.1, 29.1]])
print(prediction)
Before we interpret the model, let's perform the global F-test to verify that predicting body fat indeed depends on at least one of the predictors. Recall the parameter estimations above. We need to calculate the sums of squares before calculating the F value. Create a function that will calculate the sums of squares:
def get_sum_of_squares():
    ssr = 0
    sse = 0
    sst = 0
    mean_y = np.mean(y)
    for i in range(20):
        y_hat = regression_model.predict([X[i]])
        ssr += (y_hat - mean_y)**2
        sse += (y[i] - y_hat)**2
        sst += (y[i] - mean_y)**2
    return (ssr, sse, sst)

sum_of_squares = get_sum_of_squares()
If you want to check your function to see if you get the right numbers, this is my output.
ssr = 396.98461183, sse = 98.40488817, sst = 495.3894999999999 # Don't worry about format
Create a function to calculate the global F-test:
def global_f_test():
    n = len(y)
    k = 3
    ssr = sum_of_squares[0]
    sse = sum_of_squares[1]
    msr = ssr / k
    mse = sse / (n - (k + 1))
    return msr / mse

print(global_f_test())
Your global F-test value should be [[21.5157123]], which corresponds to a p-value far below our significance level (it was not specified, so by default we use 0.05). Therefore, we can proceed to interpret the model.
From Step 3, we know the y-intercept is 117.08469478 and our coefficients are 4.33409201, -2.85684794, and -2.18606025 respectively. Our fitted model would look like:
y = 117.08469478 + 4.33409201(x1) - 2.85684794(x2) - 2.18606025(x3)
Here we can interpret that:
• For every unit increase in an individual's triceps skinfold thickness, holding the other predictors fixed, we expect an increase in body fat of 4.33%.
• For every unit increase in an individual's thigh circumference, holding the other predictors fixed, we expect a decrease in body fat of 2.86%.
• For every unit increase in an individual's midarm circumference, holding the other predictors fixed, we expect a decrease in body fat of 2.19%.
Just for practice, we can calculate the coefficient of determination. Recall the coefficient of determination formula under Parameter Estimation:
def coefficient_of_determination():
    sse = sum_of_squares[1]
    sst = sum_of_squares[2]
    return 1 - (sse / sst)
coefficient of determination = [[0.80135855]]
The value is close to 1; therefore we can say that most of the points lie close to the regression line (the model explains about 80% of the variation in body fat).
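As a sanity check, scikit-learn can compute R^2 directly: a fitted LinearRegression has a score(X, y) method that returns the coefficient of determination, so calling regression_model.score(X, y) above should match our hand-computed value. Here is a sketch on a tiny made-up dataset where y is an exact linear function of the predictors, so R^2 comes out to 1.0:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Tiny made-up dataset: y depends exactly on the two predictors,
# so the fit is perfect and the coefficient of determination is 1.0
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 5.0

model = LinearRegression().fit(X, y)
print(model.score(X, y))  # R^2; 1.0 for this exact linear relationship
```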
Sum of Squares Formula from 365datascience.com
F-test for Linear Regression from Steve D. Jost, DePaul University
Problem from PennState