Build Multiple Linear Regression using sklearn (Python)

Krishna K
3 min read · Oct 30, 2020


Multiple linear regression is used to predict a dependent variable from multiple independent variables. In this article, I will cover how you can predict CO2 emissions using sklearn (a Python library), along with the mathematical notation.

Why do we use Multiple Linear Regression?

  1. To identify the strength of the effect that the independent variables have on the dependent variable, e.g.: Do lecture attendance and gender have any effect on students' exam performance?
  2. To predict the impact of changes: to identify how the dependent variable changes when we change the independent variables, e.g.: How much does a patient's blood pressure increase or decrease for every unit increase or decrease in BMI (holding other factors constant)?

Mathematical Notation:

In multiple linear regression, the dependent variable (y) is a linear combination of the independent variables (x)

theta is the vector of parameters / coefficients, one for each independent variable plus an intercept
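Written out, the relationship takes the standard form below. The number of independent variables n and the convention x_0 = 1 (so that theta_0 acts as the intercept) are notation assumed here for illustration:

```latex
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta^{T} x,
\qquad x = (1, x_1, x_2, \dots, x_n)^{T}
```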

Unlike simple linear regression, multiple linear regression no longer has a line of best fit; instead we fit a plane (with two independent variables) or a hyperplane (with more than two).

“Our goal is to find the best-fit hyperplane for the data”

Finding optimized parameters for the hyperplane:

The most common method is to minimize the Mean Squared Error (MSE)

  • MSE is the average of the squared residual errors over all data points, so it measures how far the model's predictions are from the data
  • A residual error is the difference between the actual value and the predicted value
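As a minimal sketch of the two bullets above, MSE can be computed with NumPy as follows. The function name and the sample values are made up for illustration and are not taken from the CO2 dataset:

```python
import numpy as np

def mean_squared_error(y_actual, y_predicted):
    """Average of the squared residuals (actual value minus predicted value)."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    residuals = y_actual - y_predicted
    return float(np.mean(residuals ** 2))

# Made-up actual and predicted values, purely for illustration
y_actual = [200.0, 250.0, 300.0]
y_predicted = [210.0, 240.0, 330.0]

mse = mean_squared_error(y_actual, y_predicted)
print("MSE:", mse)
```

Squaring the residuals keeps positive and negative errors from cancelling out and penalizes large errors more heavily than small ones.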

Estimate theta (parameter):

  1. Ordinary Least Squares:

OLS estimates the coefficients by minimizing the mean squared error, solving for them directly with matrix operations. The downside of OLS is that these matrix operations can take a very long time on large datasets.
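A minimal sketch of OLS with NumPy, assuming synthetic noise-free data and made-up "true" coefficients. The closed-form solution is theta = (X^T X)^-1 X^T y; `np.linalg.lstsq` solves the same least squares problem without forming the inverse explicitly:

```python
import numpy as np

# Synthetic data: 200 rows, 3 independent variables (all values are assumptions)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
true_theta = np.array([5.0, 2.0, -1.0, 0.5])  # intercept followed by 3 coefficients
y = true_theta[0] + X @ true_theta[1:]        # noise-free linear relationship

# Prepend a column of ones so that the first parameter acts as the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares solution of X_design @ theta = y
theta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("estimated theta:", theta_hat)
```

Because the data here has no noise, the estimate recovers the assumed coefficients up to floating point error; with real data it would only approximate the underlying relationship.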

  2. Optimization Approach / Gradient Descent:

In this process the error is minimized using an iterative process.

Gradient descent starts with a random value for each coefficient, calculates the error, and iteratively changes the coefficient values to reduce that error.

This is a suitable approach when dealing with large datasets.
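The iterative process described above can be sketched as batch gradient descent on the MSE. The function name, learning rate, iteration count, and synthetic data below are all assumptions chosen for illustration; real data usually needs feature scaling and some tuning of the learning rate:

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_iterations=1000, seed=0):
    """Minimize MSE for a linear model with batch gradient descent."""
    rng = np.random.default_rng(seed)
    # Column of ones lets the first parameter act as the intercept
    X_design = np.column_stack([np.ones(len(X)), X])
    # Start from a random value for each coefficient
    theta = rng.normal(size=X_design.shape[1])
    n = len(y)
    for _ in range(n_iterations):
        residuals = X_design @ theta - y               # predicted minus actual
        gradient = (2.0 / n) * X_design.T @ residuals  # gradient of MSE w.r.t. theta
        theta = theta - learning_rate * gradient       # step against the gradient
    return theta

# Synthetic, already-scaled data with made-up coefficients
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

theta = gradient_descent(X, y)
print("estimated theta:", theta)
```

Each iteration only needs matrix-vector products, which is why this approach scales better to large datasets than solving for the coefficients in one step.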

Cautions:

  1. Adding too many independent variables can result in “overfitting”
  2. Independent variables should be numeric; categorical variables have to be converted to numeric form (for example, dummy/indicator variables) before they can be used
  3. Visually check the linearity between the dependent variable and each independent variable using a scatter plot before starting multiple linear regression. If there is no linear relationship, you should use non-linear regression
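A scatter plot remains the primary check for the third caution. As a rough numeric companion, and only as a sketch on made-up data, the Pearson correlation coefficient from `np.corrcoef` gives a quick indication of how linear a relationship is:

```python
import numpy as np

# Made-up data: one linear and one clearly non-linear relationship
x = np.linspace(-3, 3, 101)
y_linear = 2.0 * x + 1.0
y_quadratic = x ** 2

# Pearson correlation: close to +/-1 for a linear relationship
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print("correlation, linear case:   ", r_linear)
print("correlation, quadratic case:", r_quadratic)
```

The quadratic case has a correlation near zero even though y depends entirely on x, which is the kind of pattern a scatter plot would reveal and where non-linear regression is the better choice.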

Load data and Check Linearity using scatter plot:

Linearity Exists

Once the data is split into training and testing sets, we can plot again to check the distribution of the training data

coefficients:  [[11.01246952  7.62661829  9.56427884]]
Residual Square Error : 0.88

By finding the coefficients, we have determined the relationship between the independent variables and the dependent variable. Now we have all the parameters needed to make predictions.
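As a minimal sketch of the whole workflow in sklearn, assuming synthetic data in place of the CO2 emission dataset: the three features, the made-up coefficients, and the choice of R² as the evaluation metric are all assumptions for illustration, so the printed numbers will not match the output shown above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset: 3 independent variables
rng = np.random.default_rng(42)
X = rng.uniform(1, 10, size=(500, 3))
y = 10.0 + X @ np.array([3.0, 1.5, 2.0]) + rng.normal(0, 1, size=500)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the model on the training data only
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on unseen data and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("R^2 on test data:", r2)
```

Once fitted, `model.predict` accepts any new array with the same three columns, which is what "all the parameters needed to predict" amounts to in practice.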

Happy Coding !!


Written by Krishna K

Data Scientist, The World Bank. I blog about data science, machine learning, and building web apps & APIs.
