Benchmarking end-to-end genotype-to-phenotype prediction workflows across 80 openSNP phenotypes

arXiv:2603.06768v2 Announce Type: replace
Abstract: Genotype-to-phenotype prediction is a central goal of statistical genetics, yet practical comparisons of prediction workflows remain limited in small, heterogeneous, participant-shared genomic datasets. Here, we benchmarked end-to-end case-control prediction across 80 curated binary phenotypes from openSNP using machine learning, deep learning, and polygenic score workflows. We evaluated 29 machine-learning algorithms, 80 deep-learning model variants, and 3 polygenic score tools across 675 clumping and pruning configurations. No workflow family dominated universally. Polygenic score workflows achieved the highest observed discrimination for 53 phenotypes, whereas machine-learning or deep-learning workflows achieved the highest for 27. However, many apparent phenotype-level wins were modest, with 41.2% of comparisons representing practical ties within five discrimination points. Performance was strongly phenotype-dependent and sensitive to modeling and preprocessing choices. Distinct workflow-specific failure modes were also observed, including unstable behaviour in PRSice and non-informative collapse in lassosum for 13 phenotypes. Higher peak performance was concentrated in phenotypes with smaller sample sizes, reinforcing the need for cautious interpretation in limited-data settings. The cohort was predominantly of European ancestry, restricting generalisability. Together, these results position openSNP as a useful stress-test environment for genomic prediction and support benchmark-guided workflow selection under realistic conditions of data scarcity, phenotype heterogeneity, and ancestry imbalance.
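
As an illustration of how phenotype-level wins and practical ties of the kind reported in the abstract could be tallied, the following is a minimal sketch and not the authors' code. It assumes that "five discrimination points" means a gap of 0.05 on a 0 to 1 AUC scale and that each workflow family is summarised by its best observed AUC per phenotype. The function name `summarize_wins`, the constant `TIE_MARGIN`, the family labels, and the toy values are all hypothetical.

```python
from typing import Dict, Tuple

# Assumption: "five discrimination points" is read as 0.05 on a 0-1 AUC scale.
TIE_MARGIN = 0.05


def summarize_wins(
    best_auc: Dict[str, Dict[str, float]],
    margin: float = TIE_MARGIN,
) -> Tuple[Dict[str, int], float]:
    """Count wins per workflow family and the fraction of practical ties.

    best_auc maps phenotype -> {workflow family -> best observed AUC}.
    A phenotype is a practical tie when the winner leads the runner-up
    by no more than `margin`.
    """
    wins: Dict[str, int] = {}
    ties = 0
    for by_family in best_auc.values():
        # Rank families from highest to lowest AUC for this phenotype.
        ranked = sorted(by_family.items(), key=lambda kv: kv[1], reverse=True)
        (top_family, top_auc), (_, runner_up_auc) = ranked[0], ranked[1]
        wins[top_family] = wins.get(top_family, 0) + 1
        if top_auc - runner_up_auc <= margin:
            ties += 1
    return wins, ties / len(best_auc)


# Toy values for illustration only; these are not results from the paper.
toy = {
    "pheno_a": {"PRS": 0.71, "ML": 0.69, "DL": 0.60},
    "pheno_b": {"PRS": 0.55, "ML": 0.70, "DL": 0.58},
    "pheno_c": {"PRS": 0.80, "ML": 0.62, "DL": 0.66},
    "pheno_d": {"PRS": 0.61, "ML": 0.60, "DL": 0.64},
}
wins, tie_fraction = summarize_wins(toy)
print(wins, tie_fraction)
```

Reporting the tie fraction alongside the win counts keeps a narrow lead from being read as a meaningful advantage, which is the caution the abstract raises about modest phenotype-level wins.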

