Fast, intuitive feature selection via regression on Shapley values PyData Global 2024

Fast, intuitive feature selection via regression on Shapley values

12-04, 14:30–15:00 (UTC), AI/ML Track

Feature selection is an essential step in machine learning, especially when dealing with high-dimensional datasets. It reduces model complexity, improves performance, mitigates overfitting, and cuts computation time. This talk presents a novel open-source feature selection framework, shap-select.
Shap-select is notable for its simplicity: it requires only a single fit of the model whose features are being selected, yet it performs comparably to much heavier methods. It runs a linear or logistic regression of the target on the Shapley values of the features, computed on the validation set, and uses the signs and significance levels of the regression coefficients as an efficient heuristic for feature selection in tabular regression and classification tasks.
We compare shap-select with several other methods and show that it combines interpretability, computational efficiency, and performance, making it a robust option for feature selection, especially in real-world cases where model fitting is computationally expensive.

Feature selection is an essential process in machine learning, especially when dealing with high-dimensional datasets. It helps reduce the complexity of machine learning models, improve performance, mitigate overfitting, and decrease computation time. This paper presents a novel feature selection framework, shap-select. The framework conducts a linear or logistic regression of the target on the Shapley values of the features, on the validation set, and uses the signs and significance levels of the regression coefficients to implement an efficient heuristic for feature selection in tabular regression and classification tasks. We evaluate shap-select on the Kaggle credit card fraud dataset, demonstrating its effectiveness compared to established methods such as Recursive Feature Elimination (RFE), HISEL (a mutual information-based feature selection method), Boruta, and a simpler Shapley value-based method. Our findings show that shap-select combines interpretability, computational efficiency, and performance, offering a robust solution for feature selection.
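
The core idea described above (fit once, regress the target on validation-set Shapley values, keep features by coefficient sign and significance) can be illustrated with a small standard-library sketch. This is not the shap-select implementation or its API. Everything here is an assumption made for illustration: the synthetic data, the helper names `ols`, `make_data`, and `select_features`, the linear stand-in model (for which the Shapley value of feature j is w_j * (x_j - mean_j) when features are treated as independent), the normal approximation for p-values, and the 0.05 threshold with a "positive and significant" keep rule.

```python
import math
import random


def ols(X, y):
    """Least squares with an intercept; returns (coefs, std_errors), intercept first."""
    rows = [[1.0] + list(r) for r in X]
    n, k = len(rows), len(rows[0])
    # Normal equations (A'A) beta = A'y, inverted by Gauss-Jordan on [A'A | I].
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    aug = [ata[i] + [float(i == j) for j in range(k)] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))  # partial pivoting
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    inv = [row[k:] for row in aug]
    beta = [sum(inv[i][j] * aty[j] for j in range(k)) for i in range(k)]
    resid = [t - sum(b * v for b, v in zip(beta, r)) for r, t in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)
    se = [math.sqrt(sigma2 * inv[i][i]) for i in range(k)]
    return beta, se


def make_data(n, rng):
    # Hypothetical data: x0 and x1 drive the target, x2..x4 are pure noise.
    X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(n)]
    y = [2.0 * r[0] - 1.5 * r[1] + rng.gauss(0, 1) for r in X]
    return X, y


def select_features(alpha=0.05, seed=0):
    rng = random.Random(seed)
    X_train, y_train = make_data(300, rng)
    X_val, y_val = make_data(300, rng)
    names = [f"x{j}" for j in range(5)]

    # 1. Fit the model once (a linear model stands in for the real one).
    w, _ = ols(X_train, y_train)
    means = [sum(r[j] for r in X_train) / len(X_train) for j in range(5)]

    # 2. Shapley values on the validation set (closed form for a linear model).
    phi = [[w[j + 1] * (r[j] - means[j]) for j in range(5)] for r in X_val]

    # 3. Regress the validation target on the Shapley values.
    coef, se = ols(phi, y_val)

    # 4. Keep features whose coefficient is positive and significant
    #    (assumed rule; two-sided p-value from a normal approximation).
    selected = []
    for j, name in enumerate(names):
        t = coef[j + 1] / se[j + 1]
        p = math.erfc(abs(t) / math.sqrt(2))
        if coef[j + 1] > 0 and p < alpha:
            selected.append(name)
    return selected


print(select_features())
```

In this sketch an informative feature gets a coefficient near 1 on its own Shapley values, because the model's attribution already tracks the target, while a noise feature's coefficient is unstable in sign and rarely significant. With a tree-based model one would compute the Shapley values with a dedicated library and use a regression package that reports exact p-values; for classification, step 3 would be a logistic regression instead.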

Prior Knowledge Expected: No previous knowledge expected

Egor Kraev

Dr. Egor Kraev has been applying machine learning to real-world problems since the last century, including economic and human development data analysis for nonprofits in the US, the UK, and Ghana, and 10 years as a quant, solutions architect, and occasional trader at UBS and then Deutsche Bank.
Following last decade's explosion in AI techniques, Egor became Head of AI at Mosaic Smart Data Ltd, and for the last four years he has been bringing the power of AI to bear at Wise in a variety of domains, from fraud detection to trading algorithms and causal inference for A/B testing and marketing, and now in multiple GenAI projects across the company.
In addition to having taken the Data Science team at Wise from an idea to a well-structured team of over 30 people, Egor is the founder of a startup, motleycrew.ai, which aims to take multi-agent AI systems to the next level of usability and power.

Baran Koseoglu

Senior Data Scientist @ Wise
