Wavelet analysis of human recombination rates demonstrates divergence on fine scales

Background: Recombination rates can be estimated across the genome, underpinning genetic analyses such as identification of regions under selection. Accurate recombination mapping requires observing a

Cortical geometry constrains the unimodal anchors of sensory integration

The human cerebral cortex is organized along a unimodal-to-transmodal hierarchy, which provides a putative substrate for the integration of sensory signals from primary cortical fields

Post-translational modifications in the brain are critical contributors to Alzheimer's disease neuropathology and cognitive decline

Post-translational modifications (PTMs) in APP and MAPT contribute to plaques and tangles in Alzheimer's disease (AD). Yet broader proteome-wide PTMs in the AD brain are

The Amygdalostriatal Transition Area Exhibits Lateral Amygdala-Like Spiking Activity and Tone-Shock Pairing-Induced Plasticity

During Pavlovian fear conditioning, presentation of a conditioned stimulus, such as a tone, together with an unconditioned stimulus, such as an electrical shock, excites neurons

The Logic of Thalamic Inputs onto the Molecular Taxonomy of Cortical Neurons Reveals a Visual Hierarchy

The hierarchical organization of sensory cortices and the rich molecular taxonomy of their cell types are defining features of the mammalian cortex. Cortical areas along

Large language model inference of macromolecular complex composition via model consensus and experimental data integration

May 23, 2026

Large language models (LLMs) are poised to reshape how biologists retrieve specialized knowledge at scale. Yet their performance on deep, domain-specific queries is poorly defined because much biological information resides in structured databases or large experimental datasets rather than in free text. One such gap in cellular biology lies in identifying major macromolecular complexes, conserved biological units essential to many cellular processes. Cataloging large complexes, such as the ribosome or RNA polymerase, along with their constituent genes, presents a significant challenge for LLMs because of their tendency to hallucinate and to produce incomplete or inconsistent lists of components. Here, we systematically evaluate six state-of-the-art LLMs on the task of retrieving the gene components of 91 protein complexes and develop an integrative framework that combines LLM output consensus with experimental multi-omics data to reconcile and filter model responses. We found that two extensions of a basic single-LLM baseline, (i) aggregating LLM outputs into a consensus and (ii) integrating LLM predictions with the experimental data, each improved retrieval accuracy. Furthermore, a consensus of LLM outputs integrated with the incomplete experimental data using a graph-theoretic approach achieved the highest accuracy (F1 score of 82.5%), compared with the best stand-alone single LLM (F1 score of 76.4%). These results show that optimized integration of predictions from multiple LLMs and high-throughput experimental data can support scalable, semi-automated curation of specialized biological resources, providing a general template for benchmarking and deploying LLMs for structured knowledge retrieval tasks in molecular biology.
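
To make the consensus and F1 ideas above concrete, here is a minimal sketch in Python. It assumes a simple vote-threshold consensus, which may differ from the aggregation the authors actually used, and it does not attempt the graph-theoretic integration with experimental data. The function names (`consensus`, `f1_score`), the `min_votes` parameter, and the placeholder gene labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def consensus(predictions, min_votes):
    """Return the genes named by at least `min_votes` of the models.

    `predictions` is a list of gene sets, one per model (hypothetical input
    format; the abstract does not specify how outputs were aggregated).
    """
    votes = Counter(gene for genes in predictions for gene in set(genes))
    return {gene for gene, n in votes.items() if n >= min_votes}


def f1_score(predicted, reference):
    """F1 of a predicted gene set against a reference component list."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy data with placeholder labels, not real complex members or model outputs.
reference = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
predictions = [
    {"GENE_A", "GENE_B", "GENE_X"},            # model 1: one spurious gene
    {"GENE_A", "GENE_B", "GENE_C"},            # model 2: one gene missing
    {"GENE_A", "GENE_C", "GENE_D", "GENE_Y"},  # model 3: one spurious gene
]

merged = consensus(predictions, min_votes=2)
single_scores = [f1_score(p, reference) for p in predictions]
merged_score = f1_score(merged, reference)

print(sorted(merged))
print([round(s, 3) for s in single_scores], round(merged_score, 3))
```

In this toy case the vote threshold drops the genes that only one model proposed, which is the intuition behind filtering hallucinated components. Whether a consensus beats the best single model depends on the data, as the reported scores indicate.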

