McKay, Steve (2019) When 4 ≈ 10,000: The power of social science knowledge in predictive performance. Socius, 4. ISSN 2378-0231
Full content URL: https://doi.org/10.1177/2378023118811774
Documents
socius_submission_v3b_clean.docx (Microsoft Word, 119kB) - Whole Document. Available under License Creative Commons Attribution-NonCommercial 4.0 International.
Item Type: | Article |
---|---|
Item Status: | Live Archive |
Abstract
Computer science has devised leading methods for predicting variables; can social science compete? The author sets
out a social scientific approach to the Fragile Families Challenge. Key insights included new variables constructed
according to theory (e.g., a measure of shame relating to hardship), lagged values of the target variables, using predicted
values of certain outcomes to inform others, and validated scales rather than individual variables. The models were
competitive: a four-variable logistic regression model was placed second for predicting layoffs, narrowly beaten by a
model using all the available variables (>10,000) and an ensemble of algorithms. Similarly, a relatively small random
forest model (25 variables) was ranked seventh in predicting material hardship. However, a similar approach overfitted
the prediction of grit. Machine learning approaches proved superior to linear regression for modeling the continuous
outcomes. Overall, social scientists can contribute to predictive performance while benefiting from learning more
about data science methods.
Keywords: | fragile families, logistic regression, data science, random forests |
---|---|
Subjects: | L Social studies > L400 Social Policy |
Divisions: | College of Social Science > School of Social & Political Sciences |
ID Code: | 34204 |
Deposited On: | 28 Nov 2018 11:39 |