Java Refactoring Agents Hit 15.3% Pass Rate on IBM ScarfBench
IBM Research released ScarfBench to evaluate cross-framework Java migrations, showing current AI agents peak at a 15.3% pass rate on refactoring tasks.
IBM Research released ScarfBench on June 30, 2026, to measure how well AI agents handle cross-framework enterprise Java migrations. The open-source dataset shifts the evaluation target from isolated bug fixing to full architectural modernization. If you build systems for automated code refactoring, this benchmark establishes a new baseline for what production agents can actually accomplish.
Framework Migration Constraints
ScarfBench, or the Self-Contained Application Refactoring Benchmark, targets migrations across the Spring, Jakarta EE, and Quarkus ecosystems. Application modernization requires preserving dependency injection patterns, transaction management, and security configurations during code translation. Most existing datasets fail to test these interconnected behaviors.
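To make the interconnection concrete, here is a small sketch of nearest-counterpart annotations between Spring and Jakarta EE. The annotation names are real, but the pairing table and the `AnnotationHints` class are illustrative assumptions, not part of ScarfBench. The pairs are not drop-in equivalents: for example, `@PreAuthorize` takes an expression while `@RolesAllowed` takes role names, which is exactly the kind of gap a migration has to bridge by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class AnnotationHints {
    // Hypothetical lookup table: Spring annotation -> closest Jakarta EE counterpart.
    // Illustrative only; semantics differ between the two sides of each pair.
    static final Map<String, String> SPRING_TO_JAKARTA = new LinkedHashMap<>();
    static {
        // Dependency injection
        SPRING_TO_JAKARTA.put("org.springframework.beans.factory.annotation.Autowired",
                "jakarta.inject.Inject");
        SPRING_TO_JAKARTA.put("org.springframework.stereotype.Service",
                "jakarta.enterprise.context.ApplicationScoped");
        // Transaction management
        SPRING_TO_JAKARTA.put("org.springframework.transaction.annotation.Transactional",
                "jakarta.transaction.Transactional");
        // Security configuration
        SPRING_TO_JAKARTA.put("org.springframework.security.access.prepost.PreAuthorize",
                "jakarta.annotation.security.RolesAllowed");
    }

    // Returns the suggested counterpart, or empty when the table has no entry.
    static Optional<String> hint(String springAnnotation) {
        return Optional.ofNullable(SPRING_TO_JAKARTA.get(springAnnotation));
    }

    public static void main(String[] args) {
        SPRING_TO_JAKARTA.forEach((from, to) -> System.out.println(from + " -> " + to));
    }
}
```

A table like this only covers the syntactic layer. Whether the migrated bean has the same scope, rollback rules, and access checks still has to be verified against running behavior.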
The dataset contains 34 applications yielding 102 variants. It encompasses 151,000 lines of Java code distributed across 1,946 source and test files. The suite defines 204 directed refactoring tasks where an agent receives a source application and must synthesize a target implementation in a different framework.
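The counts above are consistent with a simple enumeration, sketched below under one assumption the article does not state outright: each of the 34 applications exists in all three frameworks, and every ordered pair of distinct frameworks is one directed task. The `TaskMatrix` class and its names are hypothetical, and the `record` syntax needs Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;

public class TaskMatrix {
    // The three ecosystems named in the article.
    enum Framework { SPRING, JAKARTA_EE, QUARKUS }

    // One directed task: migrate an application from source to a different target.
    record Task(int appId, Framework source, Framework target) {}

    // Enumerates every (application, source, target) triple with source != target.
    static List<Task> enumerate(int applications) {
        List<Task> tasks = new ArrayList<>();
        for (int app = 0; app < applications; app++) {
            for (Framework source : Framework.values()) {
                for (Framework target : Framework.values()) {
                    if (source != target) {
                        tasks.add(new Task(app, source, target));
                    }
                }
            }
        }
        return tasks;
    }

    public static void main(String[] args) {
        int applications = 34;
        int variants = applications * Framework.values().length;
        // prints: 102 variants, 204 directed tasks
        System.out.println(variants + " variants, " + enumerate(applications).size() + " directed tasks");
    }
}
```

Each application contributes 3 × 2 = 6 ordered framework pairs, so 34 applications give 204 tasks.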
ScarfBench relies on an executable oracle rather than static string matching. Target applications must compile successfully with Maven and deploy in a Docker containerized runtime. They must also pass behavioral tests on their observable interfaces. This mirrors the strict operational requirements of enterprise IT environments.
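The three gates can be sketched as a short-circuiting pipeline. This is a minimal illustration, not the ScarfBench harness: the `ExecutableOracle` class, the `Verdict` names, and the `run` helper are assumptions, and the article does not say which exact commands the benchmark invokes.

```java
import java.io.File;
import java.io.IOException;
import java.util.function.BooleanSupplier;

public class ExecutableOracle {
    enum Verdict { COMPILE_FAILED, DEPLOY_FAILED, TESTS_FAILED, PASSED }

    // Evaluates the gates in order and stops at the first one that fails.
    static Verdict evaluate(BooleanSupplier compiles, BooleanSupplier deploys, BooleanSupplier testsPass) {
        if (!compiles.getAsBoolean()) return Verdict.COMPILE_FAILED;
        if (!deploys.getAsBoolean()) return Verdict.DEPLOY_FAILED;
        if (!testsPass.getAsBoolean()) return Verdict.TESTS_FAILED;
        return Verdict.PASSED;
    }

    // Hypothetical helper: runs a command in dir and reports whether it exited with 0.
    static boolean run(File dir, String... command) {
        try {
            Process process = new ProcessBuilder(command).directory(dir).inheritIO().start();
            return process.waitFor() == 0;
        } catch (IOException e) {
            return false; // command missing or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Stubbed gates: the build and deployment succeed, one behavioral test fails.
        Verdict verdict = evaluate(() -> true, () -> true, () -> false);
        System.out.println(verdict); // prints TESTS_FAILED
    }
}
```

In a real setup, the first gate might be wired as `() -> run(appDir, "mvn", "-q", "package")`, with the container start and the behavioral test suite supplied the same way. Ordering the gates from cheapest to most expensive keeps failed candidates from ever reaching the container stage.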
Benchmark Results
State-of-the-art coding agents fail most cross-framework refactoring tasks. IBM evaluated five current models and found that autonomous migration remains largely unsolved. The best result was a 15.3% pass rate on focused layer migration, far below what reliable production deployment would require.
| Evaluation Metric | Highest Agent Success Rate |
|---|---|
| Focused Layer Migration | 15.3% |
| Whole Application Migration | 12.2% |
| Full Behavioral Parity | 0.49% (1 of 204 tasks) |
Only a single task out of 204 resulted in a target application that was fully behaviorally equivalent to the source application. The benchmark data highlights a wide variance in difficulty depending on the target ecosystem. Migrations between Spring and Quarkus proved the most tractable for current models. Tasks targeting Jakarta EE presented a significantly higher failure rate.
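As a quick check on the parity figure, the percentage follows directly from the task counts. The `PassRate` helper below is a hypothetical sketch; only the inputs 1 and 204 come from the article.

```java
import java.util.Locale;

public class PassRate {
    // Formats passed/total as a percentage with two decimal places.
    static String percent(int passed, int total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return String.format(Locale.ROOT, "%.2f%%", 100.0 * passed / total);
    }

    public static void main(String[] args) {
        System.out.println(percent(1, 204)); // prints 0.49%
    }
}
```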
These low pass rates on multi-file refactoring are consistent with a broader pattern: AI agents fail at complex tasks that require retaining context across an entire codebase. When you evaluate and test AI agents on large codebases, isolated functional tests often mask deeper architectural regressions.
Strategic Implications
The push toward agentic AI in portfolios like IBM watsonx requires models that can execute long-horizon tasks across dozens of files. Cross-framework refactoring represents a multibillion-dollar operational bottleneck for enterprise software engineering. Existing benchmarks such as SWE-bench adequately test localized code generation, but they do not measure the system-wide comprehension that framework migration requires.
Plan your modernization initiatives around human-in-the-loop workflows rather than autonomous execution. If you deploy coding agents for Java refactoring, restrict their scope to isolated layers and enforce strict manual validation of dependency injection and transaction boundaries.
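One way to enforce that manual validation is to flag every line of agent-produced code that touches an injection point or a transaction boundary. The sketch below is a hypothetical `ReviewGate`, not a feature of any named tool, and its annotation list is illustrative rather than exhaustive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ReviewGate {
    // Simple names of annotations that mark DI, transaction, or security boundaries.
    // Illustrative list; extend it for the frameworks in your codebase.
    private static final Pattern SENSITIVE =
            Pattern.compile("@(Autowired|Inject|Transactional|ApplicationScoped|RolesAllowed)\\b");

    // Returns the 1-based line numbers that a human reviewer must sign off on.
    static List<Integer> linesNeedingReview(String source) {
        List<Integer> flagged = new ArrayList<>();
        String[] lines = source.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            if (SENSITIVE.matcher(lines[i]).find()) {
                flagged.add(i + 1);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        String migrated = String.join("\n",
                "@ApplicationScoped",
                "public class OrderService {",
                "    @Inject OrderRepository repo;",
                "    @Transactional",
                "    public void place(Order o) { repo.save(o); }",
                "}");
        System.out.println(linesNeedingReview(migrated)); // prints [1, 3, 4]
    }
}
```

A text scan like this is deliberately crude. It cannot tell whether a boundary is correct, only where a reviewer needs to look before the change is merged.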