Systematic benchmarking of small variant calling pipelines for long-read RNA sequencing data

Background: Long-read RNA sequencing (lrRNA-seq) enables transcript-resolved variant detection, but systematic and neutral evaluations of small-variant calling pipelines remain limited. How existing tools perform across sequencing technologies, alignment strategies, variant callers, genomic contexts and downstream haplotype phasing is not fully understood.

Results: Here, we systematically benchmark four lrRNA-seq variant callers (Clair3-RNA, DeepVariant, longcallR, and longcallR-nn), along with a widely used short-read RNA-seq variant caller (GATK HaplotypeCaller) as a baseline, using Genome in a Bottle (GIAB) datasets comprising three cell lines sequenced with four Oxford Nanopore Technologies (ONT) and two PacBio library preparation protocols. We further evaluate the impact of upstream alignment strategies, including aligner choice and alignment transformation, on variant-calling performance. Accuracy is assessed across sequencing depths and genomic contexts. Additionally, we compare haplotype phasing tools (WhatsHap, LongPhase, HapCUT2, HiPhase and longcallR) using variant calls generated by different callers to identify optimal pipeline combinations. Finally, we extend our evaluation of variant-calling performance to the more recent LongBench datasets.

Conclusions: Our benchmark shows that sequencing quality is the primary determinant of lrRNA-seq variant-calling performance, followed by variant caller and alignment strategy, with additional effects from genomic context. In the GIAB datasets, all lrRNA-seq-specific callers performed reasonably well, with Clair3-RNA (across both ONT and PacBio) and DeepVariant (for PacBio) ranking among the top-performing methods. In the more recent LongBench datasets of cancer cell lines, DeepVariant and longcallR showed higher sensitivity, whereas Clair3-RNA and longcallR-nn were more conservative, yielding fewer variant calls. For downstream haplotype phasing, we recommend WhatsHap or HapCUT2 for most libraries, owing to their high phasing coverage and accuracy, respectively, whereas longcallR performs better on both metrics for ONT dRNA004 datasets.
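The abstract does not state which accuracy metrics were used. As an illustration only, the sketch below assumes the precision, recall and F1 measures common in truth-set benchmarks, and scores a call set against a truth set by exact match on (chromosome, position, ref, alt). The names `Variant`, `Accuracy` and `score_calls` are hypothetical, and exact matching is a simplification: real benchmarks typically rely on dedicated comparison tools that reconcile different representations of the same variant and restrict scoring to confident regions.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Accuracy(NamedTuple):
    tp: int
    fp: int
    fn: int
    precision: float
    recall: float
    f1: float


def score_calls(calls: Set[Variant], truth: Set[Variant]) -> Accuracy:
    """Score a call set against a truth set by exact variant match."""
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but not called
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Accuracy(tp, fp, fn, precision, recall, f1)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
             ("chr2", 40, "G", "A"), ("chr2", 90, "T", "TA")}
    calls = {("chr1", 100, "A", "G"), ("chr2", 40, "G", "A"),
             ("chr3", 7, "C", "G")}
    print(score_calls(calls, truth))
```

In this framing, a more sensitive caller would show higher recall and a more conservative one would tend toward fewer calls and fewer false positives, which is one way to read the contrast between callers described in the conclusions.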

