arXiv:2603.15164v1 Announce Type: cross
Abstract: Evaluating AI-generated research ideas typically relies on LLM judges or human panels — both subjective and disconnected from actual research impact. We introduce hs, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring them by citation impact and venue acceptance. Using a temporal cutoff~$T$, we restrict an idea generation system to pre-$T$ literature, then evaluate its outputs against papers published in the subsequent 30 months. Experiments across 10 AI/ML research topics reveal a striking disconnect: LLM-as-Judge finds no significant difference between retrieval-augmented and vanilla idea generation ($p=0.584$), while hs shows the retrieval-augmented system produces 2.5$\times$ higher-scoring ideas ($p<0.001$). Moreover, hs scores are \emph{negatively} correlated with LLM-judged novelty ($\rho=-0.29$, $p<0.01$), suggesting that LLMs systematically overvalue novel-sounding ideas that never materialize in real research.
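For concreteness, here is a minimal Python sketch of the time-split scoring loop the abstract describes. The `Paper` fields, the pluggable `match_fn` matcher, and the citations-plus-venue-bonus formula are all illustrative assumptions; the paper's actual matching and scoring procedure is not specified beyond the abstract.

```python
# Sketch of time-split idea evaluation: restrict generation to pre-T
# literature elsewhere, then score each idea against papers published
# in the (T, T + 30 months] window. All names and the scoring formula
# below are assumptions for illustration, not the paper's implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Paper:
    title: str
    published: str      # ISO date, e.g. "2024-12-01"
    citations: int
    accepted_venue: bool

def score_idea(idea: str,
               corpus: list[Paper],
               cutoff: str,
               horizon: str,
               match_fn: Callable[[str, str], bool]) -> float:
    """Return the impact score of the best post-cutoff paper that
    matches the idea, or 0.0 if no future paper realizes it."""
    future = [p for p in corpus if cutoff < p.published <= horizon]
    best = 0.0
    for paper in future:
        if match_fn(idea, paper.title):   # e.g. embedding similarity
            # Hypothetical impact score: raw citations plus a fixed
            # bonus for acceptance at a peer-reviewed venue.
            impact = paper.citations + (10.0 if paper.accepted_venue else 0.0)
            best = max(best, impact)
    return best

# Usage sketch: cutoff T = 2022-06-30, horizon = T + 30 months.
# score = score_idea(idea_text, papers, "2022-06-30", "2024-12-31", matcher)
```

Under this scheme an idea only scores if some real post-cutoff publication realizes it, which is what decouples the metric from judge subjectivity.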
