OrthoFinder’s all-vs-all DIAMOND step systematically misses single-copy orthogroups (SC OGs) at deep taxonomic divergence. A marker recovered cleanly within a tightly defined cohort is dropped when the same marker is searched against phylum-broad metagenome-assembled genome (MAG) sets, because pairwise sequence similarity falls below DIAMOND’s detection threshold even when the underlying ortholog is present. The result is biased dropout: supermatrices retain genomes near the cohort but lose genomes from the deeper, more diverged corners of the same phylum. We describe a two-stage cohort-HMM recruitment pipeline, in which per-OG profile HMMs are built from cohort alignments and then searched with hmmsearch against the broader proteome set. Recruitment is followed by an independent per-OG gene-tree QC step that classifies each recruited hit relative to the cohort’s most recent common ancestor (MRCA) descendant set, and a per-MAG paralog-rate filter is applied before supermatrix concatenation. We characterize the pipeline across three taxonomic ranks. At phylum scale (Omnitrophota, 97 cohort OGs, 714 NCBI MAGs), recruitment recovers MAGs that the OrthoFinder-only supermatrix would otherwise drop, and the QC step identifies 2 deep-peripheral MAGs (divergent genomes whose per-OG tips repeatedly place outside the cohort MRCA descendant set despite being orthologs), which the per-MAG filter removes. At family scale (Pelagibacteraceae, 146 cohort OGs, 366 NCBI MAGs) and at genus scale (Actinomarina, 289 cohort OGs, 23 NCBI MAGs), the per-tip paralog-candidate rate drops to 0.0%. The pipeline addresses two independent failure modes. Cohort paralog density breaks strict-SC OG discovery at the cohort step. This is the family-rank case, where every candidate marker has at least one cohort species carrying multiple copies; here the relaxed cohort criterion supplies the marker set and HMM recruitment disambiguates which copy each NCBI MAG contributes.
DIAMOND-reach attrition breaks OG assignment for the most divergent NCBI MAGs. This is the phylum-rank case, where pairwise similarities fall below DIAMOND’s detection threshold; here HMM recruitment recovers the dropouts and the per-OG QC step filters residual paralog candidates. At genus rank both modes are inactive and OrthoFinder suffices directly; HMM recruitment still runs but finds no new orthologs. Code and per-case data products are released as a community resource at Zenodo (DOI 10.5281/zenodo.20422348).
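
The per-OG QC and per-MAG filter described above can be illustrated with a minimal Python sketch. It is not the released pipeline code. It assumes rooted gene trees held as nested tuples, tip labels of the hypothetical form `GENOME|protein`, and an illustrative rate cutoff of 0.5; the function names, the label strings, and the cutoff are all assumptions made for this example, and the toy trees are invented.

```python
from collections import defaultdict


def leaves(node):
    """Return all tip labels under a node (a tip is a plain string)."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node:
        out.extend(leaves(child))
    return out


def mrca_descendants(tree, targets):
    """Tip set of the smallest clade of a rooted tree containing all targets."""
    targets = set(targets)
    node = tree
    while not isinstance(node, str):
        # Descend while a single child still contains every target tip.
        holding = [c for c in node if targets <= set(leaves(c))]
        if not holding:
            break
        node = holding[0]
    return set(leaves(node))


def classify_hits(tree, cohort_tips, recruited_tips):
    """Label each recruited tip by its position relative to the cohort MRCA."""
    inside = mrca_descendants(tree, cohort_tips)
    return {
        tip: "inside" if tip in inside else "paralog_candidate"
        for tip in recruited_tips
    }


def per_mag_candidate_rate(calls_by_og):
    """Fraction of each genome's recruited tips flagged as paralog candidates."""
    counts = defaultdict(lambda: [0, 0])  # genome -> [flagged, total]
    for calls in calls_by_og.values():
        for tip, label in calls.items():
            genome = tip.split("|")[0]
            counts[genome][0] += label == "paralog_candidate"
            counts[genome][1] += 1
    return {g: flagged / total for g, (flagged, total) in counts.items()}


def keep_mags(rates, max_rate=0.5):
    """Genomes retained for concatenation (max_rate is an assumed cutoff)."""
    return {g for g, r in rates.items() if r <= max_rate}


# Toy example: two OG trees, cohort genomes C1-C3, recruited MAGs M1 and M2.
og1 = (((("C1|a", "C2|a"), "M1|x"), "C3|a"), "M2|y")
og2 = ((("C1|b", "M1|z"), ("C2|b", "C3|b")), "M2|w")
calls = {
    "OG1": classify_hits(og1, ["C1|a", "C2|a", "C3|a"], ["M1|x", "M2|y"]),
    "OG2": classify_hits(og2, ["C1|b", "C2|b", "C3|b"], ["M1|z", "M2|w"]),
}
rates = per_mag_candidate_rate(calls)
kept = keep_mags(rates)
```

In this toy input, M2 falls outside the cohort MRCA descendant set in both trees and is removed, while M1 is retained. A genome that is flagged repeatedly across OGs is treated as a whole-genome problem rather than trimmed tip by tip, which mirrors the handling of the deep-peripheral MAGs described above.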

