What is Multiple Regression?
Consider a situation where a number of different factors (called predictor or independent variables) interact to determine an outcome (called the criterion or dependent variable). Depending on the values these factors take, different outcomes result. For example, the price at which a house sells (the criterion) may be determined by a number of factors (the predictors) such as the location of the house, the year it was built, the state of the local housing market, the condition of the house, and so on.
Multiple regression is used to build a model of this interplay. A multiple regression model uses the data to build a function that predicts the outcome from the independent variables. The model is built from, for example, a set of real-world observations listing the outcome and the predictors in various cases. The model can then be used to predict the outcome for a new set of predictor values, to find out how well the existing data fits the model, and to flag any outliers.
Why do we need it?
Multiple regression can be used in a wide variety of fields. For example, Human Resource professionals may gather data on the salary an employee earns as a function of factors like experience, field of work, competence, and so on. They can then build a model from this data and use it in their own company to set salaries, or to check where their own employees fit into the model. Are certain employees or groups paid more than normal? Less than normal?
Similarly, researchers might use regression to identify the best predictors of a particular outcome, that is, which independent variables are needed to best fit the outcomes that are observed. What factors are responsible for how well a school does on its test scores? What factors affect the productivity of a supply chain?
How is Multiple Regression done?
1. Two Variable Case: Let us start with the simple case of one independent variable X which predicts the outcome Y. For example, X could be the years of experience an employee has and Y his or her salary. If we plot the X and Y values on a graph, we get a scatter of data points.
The purpose of regression is to find the line that best fits this distribution of points. The best-fit line models the relationship between X and Y such that, given a value for X, we can ‘predict’ the most likely value for Y. This line can be represented by the equation: Y = a + bX
This equation is called the “Regression Equation.” Our problem is reduced to finding the best values for ‘a’ and ‘b’.
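To make this concrete, here is a minimal sketch in Python (using NumPy) that computes ‘a’ and ‘b’ by least squares for a small, made-up set of experience and salary figures; the numbers are purely illustrative.

```python
import numpy as np

# Hypothetical data: years of experience (X) and salary in $1000s (Y).
X = np.array([1, 2, 3, 5, 7, 10], dtype=float)
Y = np.array([45, 50, 54, 62, 71, 85], dtype=float)

# Least-squares estimates for the line Y = a + b*X.
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

print(f"Regression equation: Y = {a:.2f} + {b:.2f} * X")
print("Predicted salary at 4 years of experience:", round(a + b * 4, 1))
```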
2. Multiple Variable Case: We can extend the case above to multiple independent variables X1, X2, X3, … Xn which predict the outcome Y. Just as above, we can fit a best-fit “line” that predicts Y based on the values of X1, X2, X3, … Xn. This “line” will take the form of the regression equation: Y = a0 + a1X1 + a2X2 + … + anXn
Our problem is thus reduced to finding the best possible values for the coefficients a0, a1, …, an given a set of observed values for Y and the Xi.
Calculating Coefficients
To calculate the coefficients of a Multiple Regression model from a set of given data points, we use the Least Squares method. In this method,
“We compute the coefficients such that the sum of the squared deviations of the points from the line is minimized.”
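As an illustration, the following Python sketch applies the Least Squares method to two hypothetical predictors (years of experience and a competence rating) using NumPy’s least-squares solver; the data are invented for the example.

```python
import numpy as np

# Hypothetical data: each row is one observation of the predictors
# X1 (years of experience) and X2 (competence rating); y is the salary.
X = np.array([[1, 3],
              [2, 4],
              [3, 3],
              [5, 5],
              [7, 4],
              [10, 5]], dtype=float)
y = np.array([45, 52, 55, 68, 74, 90], dtype=float)

# Add a column of ones so the first coefficient is the intercept a0.
A = np.column_stack([np.ones(len(y)), X])

# lstsq finds the coefficients that minimize the sum of squared
# deviations between the observed y and the fitted values.
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
a0, a1, a2 = coeffs
print(f"Y = {a0:.2f} + {a1:.2f}*X1 + {a2:.2f}*X2")
```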
How good is our model?
After finding the coefficients for our equation we need to know how good our model is. For this we need a measure of how well the data points fit the model equation. This is commonly measured by the coefficient of determination, R² (the square of the multiple correlation coefficient), which represents how well the independent variables Xi predict the outcome Y.
This is calculated by first finding the deviation of each point from the line. Let us call the deviation (residual) of point ‘i’ from the line ri. R² then compares the sum of the squared deviations, Σri², with the total variation of the Y values about their mean; the closer R² is to 1, the better the fit.
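The sketch below, continuing the same made-up salary example, computes the deviations ri and the resulting R² value.

```python
import numpy as np

# Hypothetical salary example: fit the model, then assess the fit.
X = np.array([[1, 3], [2, 4], [3, 3], [5, 5], [7, 4], [10, 5]], dtype=float)
y = np.array([45, 52, 55, 68, 74, 90], dtype=float)
A = np.column_stack([np.ones(len(y)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

# Residuals r_i: deviation of each observed point from the fitted "line".
fitted = A @ coeffs
r = y - fitted

# R^2 compares the squared residuals with the total variation in y;
# values near 1 mean the predictors explain most of the variation.
r_squared = 1 - np.sum(r ** 2) / np.sum((y - y.mean()) ** 2)
print("Residuals:", np.round(r, 2))
print("R^2:", round(r_squared, 3))
```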
Assumptions and Limitations
Multiple Regression works well only under certain conditions. Underlying the technique are the following assumptions which must be true for the model to work well.
1. As our equation above is linear, the relationship between the variables must be linear. Nonlinear relationships need other forms of regression.
2. The deviations (residuals) from the line must follow a “Normal” distribution.
3. A good model describes a relationship, not a cause. The existence of a good model does not mean that the independent variables cause the outcome; it only means that they are correlated with it.
4. “Independence” of variables. The predictor variables are assumed to be independent of one another. If they strongly depend on each other, the model will not be very good; a quick check is sketched below.
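As a rough illustration of checking these assumptions, the sketch below looks at the correlation between the two hypothetical predictors from the earlier example and at the residuals of the fitted model; with real data, a histogram or a formal normality test of the residuals would typically be used as well.

```python
import numpy as np

# Hypothetical predictor data (same layout as before: one row per observation).
X = np.array([[1, 3], [2, 4], [3, 3], [5, 5], [7, 4], [10, 5]], dtype=float)
y = np.array([45, 52, 55, 68, 74, 90], dtype=float)

# Assumption 4: predictors should not be strongly correlated with each other.
# Off-diagonal values near +/-1 in this matrix are a warning sign.
print("Predictor correlation matrix:\n", np.corrcoef(X, rowvar=False))

# Assumption 2: residuals should look roughly Normal around zero.
A = np.column_stack([np.ones(len(y)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coeffs
print("Residual mean (should be ~0):", round(residuals.mean(), 4))
print("Residuals:", np.round(residuals, 2))
```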
Conclusion: Within its limitations, Multiple Regression is a sound technique that applies to a large number of real-world situations and is widely used to build simple, easy-to-use models. These can be used to analyze data in wide-ranging fields like business, medicine, engineering, and others.