TY - JOUR
T1 - Improving the performance and interpretability on medical datasets using graphical ensemble feature selection
AU - Battistella, Enzo
AU - Ghiassian, Dina
AU - Barabási, Albert-László
N1 - © The Author(s) 2024. Published by Oxford University Press.
PY - 2024/6/5
Y1 - 2024/6/5
AB - Motivation: A major hindrance towards using Machine Learning (ML) on medical datasets is the discrepancy between a large number of variables and small sample sizes. While multiple feature selection techniques have been proposed to avoid the resulting overfitting, overall ensemble techniques offer the best selection robustness. Yet, current methods designed to combine different algorithms generally fail to leverage the dependencies identified by their components. Here, we propose Graphical Ensembling (GE), a graph-theory-based ensemble feature selection technique designed to improve the stability and relevance of the selected features. Results: Relying on four datasets, we show that GE increases classification performance with fewer selected features. For example, on rheumatoid arthritis patient stratification, GE outperforms the baseline methods by 9% Balanced Accuracy while relying on fewer features. We use data on sub-cellular networks to show that the selected features (proteins) are closer to the known disease genes, and the uncovered biological mechanisms are more diversified. By successfully tackling the complex correlations between biological variables, we anticipate that GE will improve the medical applications of ML.
KW - Algorithms
KW - Arthritis, Rheumatoid
KW - Computational Biology/methods
KW - Humans
KW - Machine Learning
UR - http://www.scopus.com/inward/record.url?scp=85196688503&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btae341
DO - 10.1093/bioinformatics/btae341
M3 - Article
C2 - 38837347
AN - SCOPUS:85196688503
SN - 1367-4803
VL - 40
JO - Bioinformatics
JF - Bioinformatics
IS - 6
M1 - btae341
ER -