Home
Welcome to the Regression-Classification wiki!
# SUMMARY AND INTERPRETATIONS
We study the effects of eight input variables (predictors, X1 to X8) on two output variables (responses): the heating load (Y1) and the cooling load (Y2) of residential buildings. The goal is to investigate the association of each input variable with each response variable using statistical methods, in order to identify which input variables are relevant.
Tools Used:
- Multiple Linear Regression
- K-means Clustering
- Multinomial Regression
We can view this problem from two angles: regression and classification. For regression, multiple linear regression (MLR), a classical linear approach, is used as a baseline model to predict each of the response variables; for better results, non-linear, non-parametric approaches should be applied. On the other hand, if we round off the given response values, we obtain a multi-class classification problem. We can then apply K-means clustering, which groups the unlabeled observations into a pre-specified number of clusters, fit a multinomial logistic regression on the data, and investigate how the number of classes influences the accuracy of the model.
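As a minimal sketch of the reframing described above, the snippet below rounds a few continuous load values to the nearest integer and counts how many observations land in each resulting class. The values and the helper name `to_classes` are hypothetical and only illustrate the idea; they are not taken from the dataset.

```python
from collections import Counter

# Hypothetical load values for illustration only (not from the real dataset).
loads = [15.55, 15.55, 20.84, 21.46, 28.28]


def to_classes(values):
    """Round each continuous value to the nearest integer class label.

    Note: Python's round() sends exact .5 ties to the nearest even integer.
    """
    return [round(v) for v in values]


labels = to_classes(loads)
class_counts = Counter(labels)  # number of observations per class
print(labels)
print(len(class_counts), "distinct classes")
```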
- There are 768 observations in total.
- All the variables are continuous.
- There are no missing values.
- The relation between the variables is analysed using Pearson correlation coefficients. The coefficient lies in the range -1 ≤ r ≤ 1, where a negative sign indicates an inverse relation and a positive sign indicates a direct relation.
- Scatter plots are used to visualise the correlations.
- We obtain intercept and slope parameters for each of the two response variables.
- ANOVA TABLE (overall test of adequacy): H0: βi = 0 (Xi can be deleted from the model) versus H1: βi ≠ 0. A p-value > 0.05 means we fail to reject H0 and remove Xi from the model. In our case we remove X4 and X6 for both Y1 and Y2.
- FINAL MODEL (SUMMARY TABLE): H0: β = 0 versus H1: β ≠ 0 (the overall model is significant). A p-value < 0.05 means the model is significant.
- ADJUSTED R²: based on R², the coefficient of determination, which measures the goodness of fit of the model. Y1: 0.9158 and Y2: 0.8872.
- Correlation plots, histograms, residual plots and various tests of the assumptions indicate that linear techniques are not suitable for the data. **A non-linear model is more suitable for the available data.**
- To perform multinomial logistic regression, the first step is to round off Y1 and Y2 and apply feature scaling to the data. We then run k-means clustering, which minimizes the within-cluster sum of squares for a chosen number of clusters; the number of clusters must be chosen for each response variable. An elbow plot helps decide the number of clusters for each response variable.
- After deciding on the number of clusters (two in our case), we are left with two categories each for Y1 and Y2. 384 observations (buildings) fall into each of the two categories, for both Y1 and Y2.
- Next we divide the dataset into training (60%) and test (40%) sets. The model is fitted on the training set and evaluated on the test set.
- We perform a z-test to find the significant X variables. A p-value > 0.05 means the X variable is removed from the model. In our case all variables are significant.
- Construct the final model after removing the insignificant variables.
- CONFUSION MATRIX: Diagonal values show the number of correct classifications.
- ACCURACY: the proportion of observations correctly classified by the model. With only two classes for each response variable, the accuracy of our model is 1 (100%). **Fewer classes mean we may lose some of the influence of the variables on the model.** With five clusters, i.e. five classes for each response, accuracy is nearly 70% for both heating and cooling loads.
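Several of the quantities in the list above are simple to compute by hand. The sketch below, using only the Python standard library, shows one way to compute the Pearson coefficient, adjusted R², the within-cluster sum of squares (WCSS) behind an elbow plot, and accuracy from a confusion matrix. The 1-D k-means with a deterministic initialisation is a simplification chosen here for reproducibility. All function names and the toy values are assumptions for illustration; the original analysis may have used different tools and settings.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient; always lies in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def kmeans_wcss(values, k, iters=100):
    """Basic 1-D Lloyd's k-means; returns (labels, within-cluster sum of squares)."""
    ordered = sorted(values)
    # Deterministic initialisation (an assumption): spread centres over sorted data.
    centres = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = []
    for _ in range(iters):
        # Assignment step: each value goes to its nearest centre.
        labels = [min(range(k), key=lambda j: (v - centres[j]) ** 2) for v in values]
        # Update step: each centre becomes the mean of its members.
        new_centres = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centres.append(sum(members) / len(members) if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    wcss = sum((v - centres[lab]) ** 2 for v, lab in zip(values, labels))
    return labels, wcss


def confusion_matrix(actual, predicted, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy(matrix):
    """Share of observations on the diagonal (correct classifications)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical, hand-made values for illustration only.
toy_loads = [10, 11, 12, 13, 30, 31, 32, 33]
elbow = {k: kmeans_wcss(toy_loads, k)[1] for k in (1, 2, 3)}
print(elbow)  # WCSS per candidate number of clusters

cm = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
print(cm, accuracy(cm))
```

On this toy data the WCSS drops sharply from one cluster to two and only slightly afterwards, which is the kind of "elbow" used to pick the number of clusters.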
Number of distinct values initially taken by each variable: X1 = 12, X2 = 12, X3 = 7, X4 = 4, X5 = 2, X6 = 4, X7 = 4, X8 = 6, Y1 = 586, Y2 = 636.
- To predict the number of efficient and non-efficient buildings, we have to study each of the independent variables.
- We found that X5 has two possible values: 3.5 and 7.
- In the correlation matrix we saw earlier that X5 is strongly correlated with Y1 and Y2.
- The number of buildings corresponding to each of the values 3.5 and 7 is 384.
- Earlier, in classification, we saw that with two clusters/categories each for Y1 and Y2, the number of buildings falling into each category was 384.
- Hence we can say that **category 1: Efficient Buildings** and **category 2: Non-Efficient Buildings**.
- **Therefore, X5 is suitable to differentiate between efficient and non-efficient buildings with respect to Y1 and Y2.**
RESULTS:
- Category 1, Efficient Buildings (384): those whose height (X5) = 3.5 (both heating and cooling load values < 24).
- Category 2, Non-Efficient Buildings (384): those whose height (X5) = 7.
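The rule stated in the results can be checked with a short sketch like the one below. It assigns a category from the height X5 and verifies that every category 1 building has both loads below 24. The rows, the dictionary layout and the helper names are hypothetical placeholders, not records from the dataset.

```python
from collections import defaultdict

# Hypothetical rows for illustration only; not taken from the real dataset.
rows = [
    {"X5": 3.5, "Y1": 15.6, "Y2": 21.3},
    {"X5": 3.5, "Y1": 12.0, "Y2": 15.2},
    {"X5": 7.0, "Y1": 28.3, "Y2": 30.1},
    {"X5": 7.0, "Y1": 36.0, "Y2": 37.7},
]


def category_from_height(row):
    """Category 1 (efficient) if height X5 is 3.5, otherwise category 2."""
    return 1 if row["X5"] == 3.5 else 2


def loads_below(row, threshold=24.0):
    """True when both heating (Y1) and cooling (Y2) loads are under the threshold."""
    return row["Y1"] < threshold and row["Y2"] < threshold


# Count buildings per category.
counts = defaultdict(int)
for row in rows:
    counts[category_from_height(row)] += 1

# Check that the height rule agrees with the load threshold for category 1.
consistent = all(loads_below(r) for r in rows if category_from_height(r) == 1)
print(dict(counts), consistent)
```

Run against the full dataset, the same check would be expected to report 384 buildings per category if the results above hold.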