Results

Linear Regression

Overall, the model is a significant fit to the data, F(3, 227) = 17.07, p < .001 (see the ANOVA table). The adjusted R² (0.173) suggests that 17.3% of the variance in salary can be explained by the model when adjusting for the number of predictors.
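As a sanity check, the adjusted R² can be recomputed from the unadjusted R² (.184), the sample size (n = 231, since the total df is 230), and the number of predictors (k = 3). A minimal sketch in Python:

```python
# Recompute adjusted R² from the Model Summary values:
# adj R² = 1 - (1 - R²) * (n - 1) / (n - k - 1)
r_squared = 0.184   # R² for M₁ (from the Model Summary table)
n = 231             # sample size (total df = 230, so n = 231)
k = 3               # predictors: age, years, status

adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(round(adj_r_squared, 3))  # matches the reported 0.173
```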

Model Summary - salary
Model  R      R²     Adjusted R²  RMSE
M₀     0.000  0.000  0.000        16.026
M₁     0.429  0.184  0.173        14.572
Note.  M₁ includes age, years, status
ANOVA
Model   Sum of Squares df Mean Square F p
M₁ Regression 10871.964 3 3623.988 17.066 < .001
  Residual 48202.790 227 212.347  
  Total 59074.754 230  
Note.  M₁ includes age, years, status
Note.  The intercept model is omitted, as no meaningful information can be shown.
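The F-ratio in the ANOVA table can be reconstructed from the sums of squares and their degrees of freedom; a quick check using the tabled values:

```python
# Reconstruct the F-ratio from the ANOVA table
ss_regression, df_regression = 10871.964, 3
ss_residual, df_residual = 48202.790, 227

ms_regression = ss_regression / df_regression  # mean square = SS / df
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual          # F = MS_regression / MS_residual
print(round(f_ratio, 3))  # matches the reported 17.066
```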
Coefficients
Model   Unstandardized  Standard Error  Standardized  t  p  Tolerance  VIF
M₀ (Intercept) 11.338 1.054 10.753 < .001  
M₁ (Intercept) -60.890 16.497 -3.691 < .001  
  age 6.234 1.411 0.942 4.418 < .001 0.079 12.653
  years -5.561 2.122 -0.548 -2.621 0.009 0.082 12.157
  status -0.196 0.152 -0.083 -1.289 0.199 0.867 1.153

Based on the Coefficients table, salaries are significantly predicted by the age of the model (b = 6.234, p < .001). This is a positive relationship (look at the sign of the b-value), indicating that as age increases, salary increases too. The number of years spent as a model also significantly predicts salaries (b = -5.561, p = .009), but this is a negative relationship, indicating that the more years you've spent as a model, the lower your salary. This finding seems very counter-intuitive, but we'll come back to it later. Finally, the status of the model does not significantly predict salaries (b = -0.196, p = .199).


The next part of the question asks whether this model is valid (we will examine the assumptions).

1. Multicollinearity: For the age and years variables, VIF values are above 10 (or alternatively, tolerance values are all well below 0.2), indicating multicollinearity in the data (see above). This indicates these variables may measure very similar things. Of course, this makes perfect sense because the older a model is, the more years she would’ve spent modelling! So, it was fairly stupid to measure both of these things! This also explains the weird result that the number of years spent modelling negatively predicted salary (i.e. more experience = less salary!): in fact if you do a simple regression with years as the only predictor of salary you’ll find it has the expected positive relationship. This hopefully demonstrates why multicollinearity can bias the regression model.
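VIF is simply the reciprocal of tolerance, so the two columns tell the same story; a small sketch applying the usual cut-offs to the tabled tolerances:

```python
# VIF = 1 / tolerance; VIF above 10 (tolerance below 0.1)
# is the usual red flag for multicollinearity
tolerances = {"age": 0.079, "years": 0.082, "status": 0.867}

for name, tol in tolerances.items():
    vif = 1 / tol
    flag = "multicollinearity!" if vif > 10 else "ok"
    print(f"{name}: VIF = {vif:.2f} ({flag})")
```

(The VIFs computed this way differ from the tabled 12.653 and 12.157 only because the tolerances are rounded to three decimals.)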

Collinearity Diagnostics
Variance Proportions
Model Dimension Eigenvalue Condition Index (Intercept) age years status
M₁ 1 3.925 1.000 0.000 0.000 0.001 0.000
  2 0.070 7.479 0.009 0.000 0.080 0.016
  3 0.004 30.758 0.299 0.017 0.013 0.944
  4 9.781×10⁻⁴ 63.344 0.692 0.983 0.906 0.040
Note.  The intercept model is omitted, as no meaningful information can be shown.
Influential Cases
Case Number Std. Residual salary Predicted Value Residual Cook's Distance
5 4.697 95.338 28.265 67.073 0.227
116 3.440 64.791 14.926 49.865 0.031
135 4.717 89.980 21.895 68.085 0.108
155 3.319 74.861 27.403 47.458 0.106
191 3.178 50.656 4.716 45.939 0.041
198 3.531 71.321 20.173 51.148 0.038

2. Residuals: The Influential Cases table shows six cases with a standardized residual greater than 3, and two of these are fairly substantial (cases 5 and 135). We have 5.19% of cases with standardized residuals above 2, which is about what we'd expect, but 3% of cases with residuals above 2.5 (we'd expect only 1%), which indicates possible outliers.
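The expected percentages come straight from the standard normal distribution: roughly 5% of standardized residuals should fall beyond |2| and roughly 1% beyond |2.5|. A sketch of those benchmarks using only the standard library:

```python
from statistics import NormalDist

n_cases = 231  # total sample size (total df = 230)
for cutoff in (2.0, 2.5, 3.0):
    # two-tailed proportion expected beyond ±cutoff under normality
    expected = 2 * (1 - NormalDist().cdf(cutoff))
    print(f"|z| > {cutoff}: expect {expected:.1%} "
          f"(about {expected * n_cases:.1f} of {n_cases} cases)")
```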

3. Homoscedasticity: The scatterplot of residuals against predicted values (below) does not show a random pattern: there is visible funnelling, indicating heteroscedasticity and thereby a violation of the homoscedasticity assumption.

Residuals vs. Predicted

Q-Q Plot Standardized Residuals

4. Normality of errors: In the normal Q–Q plot above, the points deviate considerably from the diagonal reference line (which represents what you'd get from normally distributed errors). This indicates that the normality-of-errors assumption has been broken.
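For reference, a Q–Q plot is built by pairing the sorted standardized residuals with theoretical normal quantiles; a minimal sketch with made-up residuals (the values below are illustrative, not the actual data):

```python
from statistics import NormalDist

# Hypothetical standardized residuals -- illustrative only
residuals = sorted([-1.9, -0.8, -0.3, 0.1, 0.4, 0.9, 4.7])
n = len(residuals)

# Theoretical quantiles at plotting positions (i + 0.5) / n
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# If the errors were normal, these pairs would fall on a straight line;
# a big gap at the tails (like the 4.7 here) signals non-normal errors
for t, r in zip(theoretical, residuals):
    print(f"theoretical {t:6.2f}  observed {r:6.2f}")
```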

Conclusion about model fit:

Due to the multicollinearity, the large residuals for certain cases, and the violations of homoscedasticity and normality of errors, it can be concluded that several assumptions have not been met, and so this model is probably fairly unreliable.