Fast, intuitive feature selection via regression on Shapley values
Feature selection is an essential step in machine learning, especially for high-dimensional datasets. It reduces model complexity, mitigates overfitting, cuts computation time, and can improve performance. This talk presents shap-select, a novel open-source feature selection framework.
Shap-select is noteworthy for its simplicity: it requires only a single fit of the model whose features are being selected, yet performs comparably to much heavier methods. It runs a linear or logistic regression of the target on the Shapley values of the features, computed on the validation set, and uses the signs and significance levels of the regression coefficients as an efficient heuristic for feature selection in tabular regression and classification tasks.
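The regression-on-Shapley-values idea can be illustrated with a minimal sketch. This is not the shap-select implementation; it rests on several assumptions made only for illustration. The function names linear_shap_values and select_features are hypothetical. The fitted model is a plain linear regression, so its Shapley values have the closed form coef_j * (x_j - mean(x_j)) under feature independence, which stands in for a general SHAP explainer. The p-values use a normal approximation to the t distribution. The keep rule (positive and significant coefficient) is one reading of "signs and significance levels".

```python
import math
import numpy as np


def linear_shap_values(coef, X, X_background):
    """Closed-form Shapley values of a linear model, assuming independent
    features: phi_j = coef_j * (x_j - mean of x_j over the background)."""
    return coef * (X - X_background.mean(axis=0))


def select_features(shap_values, y, alpha=0.05):
    """Regress y on the Shapley values (OLS with intercept) and keep the
    features whose coefficient is positive and significant at level alpha.

    Assumption: this keep rule and the normal approximation for p-values
    are illustrative choices, not taken from the shap-select source.
    """
    n, k = shap_values.shape
    A = np.column_stack([np.ones(n), shap_values])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - k - 1)  # residual variance estimate
    # Standard errors from the diagonal of sigma2 * (A^T A)^-1;
    # pinv tolerates an all-zero Shapley column.
    se = np.sqrt(sigma2 * np.diag(np.linalg.pinv(A.T @ A)))
    t_stats = beta / se
    # Two-sided p-values under a normal approximation (reasonable for large n).
    p = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])
    keep = (beta[1:] > 0) & (p[1:] < alpha)
    return keep, beta[1:], p[1:]


def demo():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4000, 4))
    # Features 0 and 1 are informative; features 2 and 3 are pure noise.
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=4000)
    X_train, X_val = X[:2000], X[2000:]
    y_train, y_val = y[:2000], y[2000:]

    # The single model fit: ordinary least squares on the training set.
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    coef = np.linalg.lstsq(A_train, y_train, rcond=None)[0][1:]

    # Shapley values on the validation set, then the selection regression.
    shap_val = linear_shap_values(coef, X_val, X_train)
    keep, beta, p = select_features(shap_val, y_val)
    for j in range(X.shape[1]):
        print(f"feature {j}: coef={beta[j]:+.3f} p={p[j]:.3g} keep={bool(keep[j])}")


if __name__ == "__main__":
    demo()
```

In this sketch an informative feature gets a regression coefficient near 1 on its Shapley values, because the model's attribution already matches the feature's true contribution. A feature the model has overfitted tends to get an insignificant or negative coefficient. A pure-noise feature can still pass by chance at roughly the rate implied by alpha.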
We compare shap-select to several other methods and show that it combines interpretability, computational efficiency, and performance, making it a robust choice for feature selection, especially in real-world cases where model fitting is computationally expensive.