Multiple linear regression is used to predict a dependent variable from multiple independent variables. In this article, I will cover how you can predict CO2 emissions using sklearn (a Python library), along with the mathematical notation.
Why do we use Multiple Linear Regression?
- To identify the strength of the effect that the independent variables have on the dependent variable, e.g.: Do lecture attendance and gender have any effect on students' exam performance?
- To predict the impact of changes, i.e. to identify how the dependent variable changes when we change the independent variables, e.g.: How much does a patient's blood pressure increase or decrease for every unit increase or decrease in BMI (holding other factors constant)?
Mathematical Notation:
In multiple linear regression, the dependent variable (y) is a linear combination of the independent variables (x).
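Written out, the model looks like the expression below. The symbol θ for the coefficients follows the "Estimate theta" section of this article; the feature count n and the convention of adding a constant 1 to the feature vector (so that θ_0 acts as the intercept) are notational assumptions on my part.

```latex
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta^{T} x,
\qquad x = [1, x_1, x_2, \dots, x_n]^{T}
```

Here \hat{y} is the predicted value and x_1, ..., x_n are the input features.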
Unlike simple linear regression, multiple linear regression no longer has a line of best fit; instead, we fit a plane (with two independent variables) or a hyperplane (with more than two).
“Our goal is to find the best-fit hyperplane for the data”
Finding optimized parameters for the hyperplane:
The most common method is to minimize the Mean Squared Error (MSE)
- MSE is the average of the squared residual errors over all data points
- The residual error is the difference between the actual value and the predicted value
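As a quick illustration of the two bullets above, here is a minimal sketch that computes the residuals and the MSE with NumPy. The three actual and predicted values are made up purely for this example.

```python
import numpy as np

# Made-up values, purely for illustration
actual = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.0, 5.0, 9.0])

# Residual error: actual value minus predicted value
residuals = actual - predicted      # [ 1.  0. -2.]

# MSE: mean of the squared residuals, here (1 + 0 + 4) / 3
mse = np.mean(residuals ** 2)

print(residuals)
print(round(mse, 3))                # 1.667
```

Squaring keeps positive and negative residuals from cancelling each other out and penalizes large errors more heavily.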
Estimate theta (parameter):
1. Ordinary Least Squares:
OLS estimates the values of the coefficients by minimizing the mean squared error directly, using matrix operations. The downside of OLS is that these matrix operations can take a very long time on large datasets.
2. Optimization Approach / Gradient Descent:
In this approach the error is minimized iteratively.
Gradient descent starts with a random value for each coefficient, calculates the error, and then repeatedly adjusts the coefficient values to reduce that error.
This is a suitable approach when dealing with large datasets.
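To make the two approaches concrete, here is a rough sketch of both on the same data, using plain NumPy. The data is synthetic, and the true coefficients (4, 3, -2), the learning rate and the number of iterations are all assumptions chosen for this illustration, not values from the CO2 dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption for illustration): y = 4 + 3*x1 - 2*x2 + noise
X = rng.normal(size=(200, 2))
y = 4 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so that theta[0] acts as the intercept
Xb = np.column_stack([np.ones(len(X)), X])

# 1. Ordinary Least Squares: solve the least-squares problem in one step
theta_ols, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# 2. Gradient descent: start from random coefficients and repeatedly
#    step against the gradient of the MSE
theta_gd = rng.normal(size=Xb.shape[1])
learning_rate = 0.1
for _ in range(1000):
    residuals = Xb @ theta_gd - y
    gradient = 2 / len(y) * Xb.T @ residuals
    theta_gd -= learning_rate * gradient

print("OLS:             ", theta_ols)
print("Gradient descent:", theta_gd)
```

Both routes should land on nearly the same coefficients here. The difference is in how they get there: OLS solves for them in one shot, while gradient descent takes many small steps and needs a sensible learning rate.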
Cautions:
- Adding too many independent variables can result in “overfitting”
- Independent variables should be numeric; categorical variables (such as gender) need to be converted to numeric form, e.g. with dummy variables, before they can be used
- Visually check the linearity between each independent variable and the dependent variable using a scatter plot before starting multiple linear regression. If the relationship is not linear, you should use non-linear regression
Load data and Check Linearity using scatter plot:
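Here is a minimal sketch of this step. Since the dataset itself is not included in this article, the sketch builds a small synthetic DataFrame as a stand-in; the column names (engine_size, cylinders, fuel_consumption, co2_emission) and the numbers used to generate the data are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-in data; column names and values are hypothetical
df = pd.DataFrame({
    "engine_size": rng.uniform(1.0, 6.0, n),
    "cylinders": rng.integers(3, 13, n).astype(float),
    "fuel_consumption": rng.uniform(5.0, 20.0, n),
})
df["co2_emission"] = (
    10 * df["engine_size"]
    + 7 * df["cylinders"]
    + 9 * df["fuel_consumption"]
    + rng.normal(scale=10.0, size=n)
)

features = ["engine_size", "cylinders", "fuel_consumption"]

# One scatter plot per feature against the target to eyeball linearity
fig, axes = plt.subplots(1, len(features), figsize=(12, 3.5), sharey=True)
for ax, name in zip(axes, features):
    ax.scatter(df[name], df["co2_emission"], s=8)
    ax.set_xlabel(name)
axes[0].set_ylabel("co2_emission")
fig.tight_layout()
```

With a real dataset you would replace the synthetic frame with something like `pd.read_csv` on your own file, and call `plt.show()` to display the figure.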
Once the data is split into training and testing sets, we can plot again to look at the distribution of the training data.
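A sketch of the split and the fit with sklearn could look like this. It runs on synthetic data with three features, and the coefficients used to generate it, the 80/20 split and the random seed are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 3 features, target built from made-up coefficients
X = rng.uniform(1.0, 10.0, size=(300, 3))
y = X @ np.array([10.0, 7.0, 9.0]) + 50.0 + rng.normal(scale=5.0, size=300)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the model on the training data only
model = LinearRegression()
model.fit(X_train, y_train)

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

Note that `coef_` is a one-dimensional array here because `y` is one-dimensional; fitting with a two-dimensional target gives a two-dimensional `coef_` instead.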
coefficients: [[11.01246952 7.62661829 9.56427884]]
Residual Square Error : 0.88
By finding the coefficients, we have determined the relationship between the independent variables and the dependent variable. Now we have all the parameters needed to make predictions.
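As a rough sketch of that last step, the snippet below fits a model, predicts on held-out data and scores the predictions. It again uses synthetic data, and the feature values of the new sample are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in data, same assumptions as any toy example
X = rng.uniform(1.0, 10.0, size=(300, 3))
y = X @ np.array([10.0, 7.0, 9.0]) + 50.0 + rng.normal(scale=5.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Predict on data the model has not seen during training
y_pred = model.predict(X_test)

print("MSE on test data:", mean_squared_error(y_test, y_pred))
print("R^2 on test data:", r2_score(y_test, y_pred))

# Predict for one new, made-up observation (2D: one row, three features)
new_sample = np.array([[2.0, 4.0, 8.5]])
print("prediction:", model.predict(new_sample))
```

Under the hood, `predict` simply computes the intercept plus the sum of each coefficient times its feature value.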
Happy Coding !!