Conjuring Semantic Similarity

arXiv:2410.16431v4

Abstract: The semantic similarity between sample expressions measures the distance between their latent ‘meaning’. These meanings are themselves typically represented by

The field of ancient metagenomics provides insights into past microbiomes, but with a growing dataset size, methods that rely on reference databases have limited scope. Here, we introduce DIANA, a multi-task neural network that predicts key metadata categories from unitig abundances. Trained on 2,597 run accessions (1.72 Tbp of assembled unitig sequences), DIANA accurately identifies sample host (94.6%), community type (90.0%), and material (88.9%) on held-out test data and demonstrates robust generalisation on an independent validation set. A key innovation is DIANA’s ability to perform semantic generalisation, correctly classifying samples with labels unseen during training, such as novel subspecies, to their appropriate parent categories. By leveraging both known and uncharacterized genomic sequences, DIANA provides a rapid, data-driven system for metadata validation and quality control, accelerating discovery in ancient metagenomics research.
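To make the "multi-task" idea in the abstract above concrete, here is a minimal sketch of a shared trunk feeding one classification head per metadata category (host, community type, material). It is not DIANA's actual architecture: the layer sizes, the label sets in `TASKS`, the helper names, and the toy abundance vector are all assumptions for illustration, and the weights are random and untrained, so the predicted labels carry no meaning.

```python
import math
import random

# Hypothetical label sets; the real categories are not listed in the abstract.
TASKS = {
    "host": ["Homo sapiens", "Canis lupus", "environmental"],
    "community_type": ["oral", "gut", "sediment"],
    "material": ["dental calculus", "palaeofaeces", "soil"],
}


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_model(n_features, n_hidden, seed=0):
    """One shared trunk plus one output head per task in TASKS."""
    rng = random.Random(seed)
    return {
        "trunk": init_layer(rng, n_hidden, n_features),
        "heads": {
            task: init_layer(rng, len(labels), n_hidden)
            for task, labels in TASKS.items()
        },
    }


def predict(model, abundances):
    """Shared ReLU trunk followed by one softmax head per metadata task."""
    hidden = [max(0.0, h) for h in linear(abundances, *model["trunk"])]
    out = {}
    for task, (weights, bias) in model["heads"].items():
        probs = softmax(linear(hidden, weights, bias))
        best = max(range(len(probs)), key=probs.__getitem__)
        out[task] = (TASKS[task][best], probs)
    return out


model = build_model(n_features=8, n_hidden=4)
sample = [0.0, 0.3, 0.0, 0.1, 0.5, 0.0, 0.1, 0.0]  # toy unitig abundance vector
predictions = predict(model, sample)
```

The design point is that all three heads read the same hidden representation, so a single forward pass over the abundance vector yields a prediction for every metadata category. In a trained model the per-task losses would typically be summed so that the trunk learns features useful for all tasks.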

