• Home
  • Uncategorized
  • Who Gets Cited Most? Benchmarking Long-Context Reasoning on Scientific Articles

arXiv:2509.21028v2 Announce Type: replace
Abstract: We introduce SciTrek, a novel question-answering benchmark designed to evaluate long-context reasoning capabilities of large language models (LLMs) using scientific articles. Current long-context benchmarks often focus on simple information retrieval tasks, or employ artificial contexts. SciTrek addresses these limitations by creating benchmark questions that require information aggregation and synthesis across multiple full-text scientific articles. The questions and their ground-truth answers are automatically generated by formulating them as SQL queries over a database constructed from article metadata (i.e., titles, authors, and references). These SQL queries provide explicit, verifiable reasoning processes that enable fine-grained error analysis on model answers, and the data construction scales to contexts of up to 1M tokens with minimal supervision. Experiments on open-weight and proprietary LLMs show that SciTrek poses significant challenges as the context length increases, with supervised fine-tuning and reinforcement learning offering only limited gains. Our analysis reveals systematic shortcomings of frontier LLMs’ ability to effectively perform numerical operations and accurately locate information in long contexts.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registeration number 16808844