Java Refactoring Agents Hit 15.3% Pass Rate on IBM ScarfBench

IBM Research released ScarfBench on June 30, 2026, to measure how well AI agents handle cross-framework enterprise Java migrations. The open-source dataset shifts the evaluation target from isolated bug fixing to full architectural modernization. If you build systems for automated code refactoring, this benchmark establishes a new baseline for what production agents can actually accomplish.

Framework Migration Constraints

ScarfBench, or the Self-Contained Application Refactoring Benchmark, targets migrations across the Spring, Jakarta EE, and Quarkus ecosystems. Application modernization requires preserving dependency injection patterns, transaction management, and security configurations during code translation. Most existing datasets fail to test these interconnected behaviors.

The dataset contains 34 applications yielding 102 variants. It encompasses 151,000 lines of Java code distributed across 1,946 source and test files. The suite defines 204 directed refactoring tasks where an agent receives a source application and must synthesize a target implementation in a different framework.
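The counts above are consistent with each application existing in all three frameworks and each variant being migrated to each of the other two. The article does not spell this out, so treat the following as a sketch of that assumed arithmetic; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ScarfTaskCount {
    // Frameworks named in the article.
    static final List<String> FRAMEWORKS = List.of("Spring", "Jakarta EE", "Quarkus");

    /** Enumerates every ordered (source, target) pair of distinct frameworks. */
    static List<String[]> directedPairs() {
        List<String[]> pairs = new ArrayList<>();
        for (String source : FRAMEWORKS) {
            for (String target : FRAMEWORKS) {
                if (!source.equals(target)) {
                    pairs.add(new String[] {source, target});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        int applications = 34; // figure from the article
        // Assumption: one variant per framework for every application.
        int variants = applications * FRAMEWORKS.size();
        // Assumption: one task per ordered framework pair for every application.
        int tasks = applications * directedPairs().size();
        System.out.println(variants + " variants, " + tasks + " directed tasks");
    }
}
```

Under these assumptions, three frameworks give six ordered pairs, so 34 applications produce 102 variants and 204 directed tasks, matching the published figures.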

ScarfBench relies on an executable oracle rather than static string matching. Each target application must compile with Maven, deploy in a containerized Docker runtime, and pass behavioral tests against its observable interfaces. This mirrors the strict operational requirements of enterprise IT environments.
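One way to picture an executable oracle is as a sequence of gates where the first failure decides the verdict. The sketch below is not ScarfBench's harness; the class, method, and stage names are hypothetical, and the stages are passed in as plain boolean checks so the example runs without Maven or Docker installed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

final class ExecutableOracleSketch {
    /** Runs stages in insertion order; returns the first failing stage name, or "PASS". */
    static String evaluate(Map<String, BooleanSupplier> stages) {
        for (Map.Entry<String, BooleanSupplier> stage : stages.entrySet()) {
            if (!stage.getValue().getAsBoolean()) {
                return stage.getKey(); // later stages are skipped
            }
        }
        return "PASS";
    }

    /** Hypothetical helper: true if the external command exits with status 0. */
    static boolean exitsCleanly(String... command) {
        try {
            Process process = new ProcessBuilder(command).inheritIO().start();
            return process.waitFor() == 0;
        } catch (java.io.IOException e) {
            return false; // command missing or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Stubbed stages standing in for real compile, deploy, and test steps.
        Map<String, BooleanSupplier> stages = new LinkedHashMap<>();
        stages.put("compile", () -> true);
        stages.put("deploy", () -> true);
        stages.put("behavior", () -> false);
        System.out.println(evaluate(stages));
    }
}
```

In a real setup the compile stage might be something like `() -> exitsCleanly("mvn", "-q", "package")`, with the deploy and behavioral stages wired to whatever container and test commands the project uses. Those commands are assumptions here, not details taken from the benchmark.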

Benchmark Results

State-of-the-art coding agents struggle with cross-framework refactoring. IBM evaluated five current models and found that autonomous migration remains largely unsolved. Even the best result, 15.3% on focused layer migration, falls far short of what reliable production deployment would require.

Evaluation Metric	Highest Agent Success Rate
Focused Layer Migration	15.3%
Whole Application Migration	12.2%
Full Behavioral Parity	0.49% (1 of 204 tasks)
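The full-parity figure follows directly from the task count. As a minimal sketch (the class and method names are hypothetical), the percentage can be reproduced like this:

```java
import java.util.Locale;

final class PassRate {
    /** Formats passed/total as a percentage with two decimal places. */
    static String percent(int passed, int total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return String.format(Locale.ROOT, "%.2f%%", 100.0 * passed / total);
    }

    public static void main(String[] args) {
        System.out.println(percent(1, 204)); // prints 0.49%
    }
}
```

The article gives no task counts behind the 15.3% and 12.2% rows, so only the last row can be checked this way.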

Only a single task out of 204 resulted in a target application that was fully behaviorally equivalent to the source application. The benchmark data highlights a wide variance in difficulty depending on the target ecosystem. Migrations between Spring and Quarkus proved the most tractable for current models. Tasks targeting Jakarta EE presented a significantly higher failure rate.

These low success rates on multi-file refactoring are consistent with a broader pattern: AI agents fail at complex tasks that require retaining global context. When you evaluate agents on large codebases, isolated functional tests can mask deeper architectural regressions.

Strategic Implications

The push toward agentic AI in portfolios like IBM watsonx requires models that can execute long-horizon tasks across dozens of files. Cross-framework refactoring represents a multibillion-dollar operational bottleneck for enterprise software engineering. Existing benchmarks such as SWE-bench adequately test localized code generation, but they do not measure the system-wide comprehension that framework migration requires.

Plan your modernization initiatives around human-in-the-loop workflows rather than autonomous execution. If you deploy coding agents for Java refactoring, restrict their scope to isolated layers and enforce strict manual validation of dependency injection and transaction boundaries.
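As a rough illustration of enforcing manual validation, a review gate could flag any agent-modified file that touches dependency injection, transaction, or security annotations. This is a minimal sketch under stated assumptions: the class name is hypothetical, and the annotation list is illustrative and far from exhaustive.

```java
import java.util.regex.Pattern;

final class ReviewGate {
    // Illustrative list chosen for this sketch: common Spring and Jakarta
    // annotations for transactions, injection, and role-based security.
    private static final Pattern SENSITIVE = Pattern.compile(
            "@(Transactional|Inject|Autowired|RolesAllowed)\\b");

    /** Returns true if the source text mentions any annotation on the list. */
    static boolean needsManualReview(String javaSource) {
        return SENSITIVE.matcher(javaSource).find();
    }

    public static void main(String[] args) {
        String service = "@Transactional\npublic void transfer() {}";
        String helper = "public int add(int a, int b) { return a + b; }";
        System.out.println(needsManualReview(service)); // prints true
        System.out.println(needsManualReview(helper));  // prints false
    }
}
```

This is a text-level scan, so it will also match annotation names inside comments or string literals and will miss configuration expressed in XML or properties files. A parser-based check would be more precise, but the conservative version errs toward sending more files to a human.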
