AI Wrapped: The 14 AI terms you couldn’t avoid in 2025

If the past 12 months have taught us anything, it’s that the AI hype train is showing no signs of slowing. It’s hard to believe

Uncovering Competency Gaps in Large Language Models and Their Benchmarks

arXiv:2512.20638v1 Announce Type: cross Abstract: The evaluation of large language models (LLMs) relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics for a

AI-Driven Green Cognitive Radio Networks for Sustainable 6G Communication

arXiv:2512.20739v1 Announce Type: cross Abstract: 6G wireless aims at Tb/s peak data rates, sub-millisecond latency, massive Internet of Things/vehicle connectivity,

Stabilizing Multimodal Autoencoders: A Theoretical and Empirical Analysis of Fusion Strategies

arXiv:2512.20749v1 Announce Type: cross Abstract: In recent years, the development of multimodal autoencoders has gained significant attention due to their potential to handle multimodal complex

Generalization of RLVR Using Causal Reasoning as a Testbed

arXiv:2512.20760v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for post-training large language models (LLMs) on complex

SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

December 22, 2025

arXiv:2512.17419v1 Announce Type: cross
Abstract: Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on Python-based bug fixes. We introduce SWE-Bench++, an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike synthetic approaches, our pipeline harvests live pull requests to cover both bug fixes and feature requests across 11 languages. SWE-Bench++ turns GitHub pull requests (PRs) into reproducible, execution-based tasks via four stages: programmatic sourcing, environment synthesis, test oracle extraction, and quality assurance. A final hint-guided trajectory synthesis step converts instances that strong models fail on into training trajectories. Our initial benchmark consists of 11,133 instances from 3,971 repositories across 11 languages. On a subset of 1,782 instances of this benchmark, today’s strongest models perform as follows: claude-sonnet-4.5 achieves 36.20% pass@10, gpt-5-2025-08-07 34.57%, gemini/gemini-2.5-pro 24.92%, and gpt-4o 16.89%. We further demonstrate the utility of our dataset by showing that fine-tuning on SWE-Bench++ instances yields measurable improvements on the SWE-bench Multilingual benchmark. SWE-Bench++ provides a scalable, multilingual benchmark for evaluating and improving repository-level code generation.
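The abstract reports results as pass@10 but does not say how the metric is computed. As a rough illustration only, the sketch below uses the unbiased pass@k estimator that is common in code-generation evaluation: for an instance with n sampled attempts, c of which pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names and the (n, c) input format are assumptions made for this example, not part of SWE-Bench++.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one instance.

    n: number of attempts sampled for the instance
    c: number of those attempts that passed the tests
    k: attempt budget being scored (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k attempts failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-instance pass@k over (n, c) pairs (hypothetical format)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

When exactly k attempts are sampled per instance (n = k = 10), the estimator reduces to the fraction of instances solved by at least one of the ten attempts. Whether the reported numbers were obtained this way is not stated in the abstract.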

