arXiv:2603.17205v1 Announce Type: cross
Abstract: Domain-specific finetuning is essential for dense retrievers, yet not all training pairs contribute equally to the learning process. We introduce OPERA, a data pruning framework that exploits this heterogeneity to improve both the effectiveness and efficiency of retrieval model adaptation. We first investigate static pruning (SP), which retains only high-similarity query-document pairs, revealing an intrinsic quality-coverage tradeoff: ranking (NDCG) improves while retrieval (Recall) can degrade due to reduced query diversity. To resolve this tradeoff, we propose a two-stage dynamic pruning (DP) strategy that adaptively modulates sampling probabilities at both query and document levels throughout training, prioritizing high-quality examples while maintaining access to the full training set. Evaluations across eight datasets spanning six domains demonstrate the effectiveness of both approaches: SP improves ranking over standard finetuning (NDCG@10 +0.5%), while DP achieves the strongest performance on both ranking (NDCG@10 +1.9%) and retrieval (Recall@20 +0.7%), with an average rank of 1.38 across all methods. These findings scale to Qwen3-Embedding, an LLM-based dense retriever, confirming architecture-agnostic benefits. Notably, DP reaches comparable performance in less than 50% of the training time required by standard finetuning.
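The abstract describes the two pruning strategies only at a high level. The following is a minimal, illustrative sketch of the two ideas, not the paper's actual method: it assumes cosine similarity between query and document embeddings as the pair-quality score, a hypothetical fixed `threshold` for static pruning, and a temperature-scaled softmax over scores for dynamic sampling. The paper's two-stage DP modulates probabilities at both the query and document level; this sketch collapses that into a single pair-level distribution for brevity, and all names and parameters are assumptions.

```python
import numpy as np

def static_prune(similarities, threshold=0.7):
    """Static pruning (SP): keep only query-document pairs whose
    quality score exceeds a fixed threshold. Per the abstract, this
    can raise ranking quality (NDCG) but shrink query coverage,
    hurting retrieval (Recall)."""
    return np.flatnonzero(similarities >= threshold)

def dynamic_sampling_probs(similarities, temperature=1.0):
    """Dynamic pruning (DP), simplified: instead of discarding pairs,
    bias the sampling distribution toward high-quality examples while
    keeping every pair reachable. A temperature-scaled softmax is one
    simple way to realize adaptively modulated probabilities."""
    logits = similarities / temperature
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sims = rng.uniform(0.0, 1.0, size=10)   # stand-in quality scores

    kept = static_prune(sims, threshold=0.7)
    print("SP keeps pair indices:", kept)

    probs = dynamic_sampling_probs(sims, temperature=0.5)
    batch = rng.choice(len(sims), size=4, replace=False, p=probs)
    print("DP samples a batch:", batch)
```

Sampling without replacement from `probs` keeps the full training set reachable while biasing batches toward high-quality pairs, which is the tradeoff resolution the abstract attributes to DP.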