2013-06-24

Join our Exclusive FREE Webinar on Multivariate Linear Regression on 24 June at 8 PM EDT: CLICK HERE

Doing Regression Analysis has always been a tough task for me as there are many complicated terms involved in it like “Heteroskedacity”, “Multicollinearity”, “Adjusted R2” and many more. Since I was involved in the Risk Analytics I had a clear understanding that playing with data is not an easy task and you have to be very smart in dealing with it.

The lack of practical knowledge on Regression Analysis or reluctance on taking training made my job really tough and I had to face difficulties in dealing with the task. Owing to this I used to waste several hours for doing a task which was merely of one-two hours.

I was assigned the task of doing Variable selection and creating customized bands. The process was tough for me as I had to first understand and then come to a conclusion. I want to help you guys so that you don’t end up landing in the same situation like me.

Task 1 of Regression Analysis: Selection of Variables

In Regression there involves “Selecting Variables” which will prove relation with the dependent variable. With the Insurance data assigned to me previously I had the age, age group, Vehicle age, Vehicle age group, years of driving experience and so on.



In Regression Analysis, selecting right variables is an important point as the final decision is taken depending on this relation. The first task I had with these Independent variables was to check for losses (dependent variable) due to these variables. “Multicollinearity” means the correlation between independent variables, the lower this correlation the better it is for us.

Also Banding of variables is also an important criterion we have to look in Regression. This involves selecting the Age or Age band or Vehicle age or Vehicle age band. Once we are done with checking for Multicollinearity and banding of variables, a statistical significance of variables is tested.

Task 2 of Regression Analysis: Checking the Correlation

With the data in hand, I first started taking out the correlation between age and years of driving experience. Getting Correlation with so much data in hand wasn’t an easy task, but it got simplified with the help of MS Excel. I used the function CORREL and selected the column of age and Years of driving experience and got the result within a span of few seconds.

 



Now the result I got showed a high correlation between the two variables, thus failing the multicollineary test. As I knew I can use either of the variables but the question was which variable I should choose. I searched on the internet and looked in some books and came out with an answer that variable with a better fit should be used.

How to select a variable that best fits the model can be done on the basis of running a regression between dependent and independent variables, that is capped losses and age after that capped losses and years of driving experience. This is by fitting two models independently. I used excel and came to know the result. Let me show you how.

Step1: Go to Data -> Data Analysis from the ribbon menu.



Step2: Select the Regression function.

 

Step 3: Put the respective inputs, in this case capped losses and age.

 

Step 4: In the new sheet we have the result of the Regression

 

Similarly with the same steps, capped losses and Years of driving experience Regression were run.

The final decision was based on R2 which shows that how much an independent variable explains the dependent variable. Since age had a R2, it was selected over years of driving experience.

You can also attend our Free Webinar to know more about Regression and R2

Task 3 of Regression Analysis: Customization of Bands

Now once I checked for Multicollinearity and which variables to be used, I had to check whether to use the age band or age. To do this again we have to run a Regression using age and age band. The problem arises as running the regression will be in Excel and Excel doesn’t give results for non-numerical values.

For this the age band has to be converted to average age so that it becomes numerical variable for a categorical variable.

 

 For each Age band Average Age is provided.

 

Since for each policy there should be Average Age, we use V-look up function

Now with the Age and Average age and which of the variable is to be selected out of it, I ran a Regression in the similar way as I did for Years of driving experience and Average age.

 

Now that I had the result and a comparison after running the regression, I took the average age or to be clearer the age band over age for the simple reason being the R-square. Thus in selecting the variables I would use age band over age for the age factor.

In the same way I performed the regression for Vehicle age and age band. I got the similar result of Average Vehicle age or Age band over Vehicle age.

Task 4 of Regression Analysis: Dummy Variables

As I discussed before, excel doesn’t take any other values apart from Numerical it was necessary to convert Binary variables present in the data to Numerical one.

The Binary variables such as Gender, fuel type etc needs to be converted. This can be done by giving values to the variables like in the case of Married status.

Task 5 of Regression Analysis: Regression Inputs

Finally after selecting the variables based on R2, taking out the band and then creating dummy variables we have the data on which we can run the Multivariate Regression.

 

Snapshot of the final data on which we will run the Multivariate Regression

In this way after hours of search and learning on different Excel functions I was able to create the data needed to run the Regression Analysis.

This task for you can be made easy with the help of our Free Webinar on Regression Analysis.

For indepth study of Analytics , have a look at our Business Analytics Program.

Show more