vLLM V1 Migration: Fix Logprobs Before RL Corrections
ServiceNow's vLLM V1 migration shows why RL pipelines need backend logprob parity before objective-level corrections.
ServiceNow AI published a vLLM V0-to-V1 migration report showing a practical failure mode in reinforcement learning pipelines: the serving backend can look compatible while quietly changing the logprobs that drive policy updates. The team was migrating PipelineRL rollout generation from vLLM 0.8.5 to vLLM 0.18.1, and the initial V1 run diverged from the V0 reference before objective-level RL corrections were even the right question.
The useful takeaway is narrow and important. If your trainer computes policy ratios, KL, clip rate, and entropy from rollout-side token logprobs, backend parity is part of the learning algorithm. Treating a migration as an inference-only upgrade can corrupt the signals your RLHF or GRPO loop depends on, and the damage eventually surfaces in reward.
Processed Logprobs
The first issue was semantic. vLLM V1 returned logprobs from the raw model outputs by default, before sampling transformations such as temperature scaling, penalties, and top-k or top-p filtering. PipelineRL expected logprobs from the processed distribution used by the sampler.
That difference changes the meaning of the numbers flowing into policy ratios. Setting logprobs-mode=processed_logprobs fixed the obvious mean offset, but it did not fully restore the V0 training trajectory. The remaining divergence showed up in clip rate, KL, entropy, and reward, which meant the team had to keep looking below the objective layer.
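To see why the two logprob definitions are not interchangeable, here is a minimal pure-Python sketch. It is not vLLM code: the logits, the temperature of 0.7, and the top-k of 3 are made-up illustration values, and `processed_logprobs` is a hypothetical helper that mimics temperature scaling followed by top-k filtering.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def processed_logprobs(logits, temperature=1.0, top_k=None):
    """Logprobs of the distribution the sampler actually draws from (toy version)."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        # Tokens outside the top-k get zero probability.
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    return log_softmax(scaled)

logits = [2.0, 1.0, 0.5, -1.0]   # illustrative values only
sampled_token = 0

raw = log_softmax(logits)                                   # "raw" semantics
proc = processed_logprobs(logits, temperature=0.7, top_k=3) # "processed" semantics

# Same weights on both sides, so a correct policy ratio would be exactly 1.0.
# Mixing the two semantics produces a ratio that is off before any learning happens.
spurious_ratio = math.exp(proc[sampled_token] - raw[sampled_token])
print(f"raw={raw[sampled_token]:.4f} processed={proc[sampled_token]:.4f} "
      f"ratio={spurious_ratio:.4f}")
```

In this toy case the sampled token is more likely under the processed distribution than under the raw one, so a trainer expecting processed logprobs would see a ratio above 1 even with identical weights on both sides.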
For anyone building AI inference infrastructure around post-training, this is the main trap. A rollout server can produce valid tokens and still produce the wrong training signal for the objective consuming those tokens.
Runtime Defaults
The next problem was configuration parity. The early V1 run inherited V1-specific defaults for prefix caching and async scheduling. Those defaults may be useful in normal serving, but ServiceNow was trying to answer a narrower migration question: can the V1 backend reproduce the V0 reference behavior for the same online RL workload?
For the parity run, the team made those choices explicit:
| Setting | Parity choice |
|---|---|
| use_v1 | true |
| logprobs-mode | processed_logprobs |
| enable-prefix-caching | false |
| async-scheduling | false |
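One way to keep such choices from drifting back to defaults is to assert them before a parity run starts. The sketch below is a hypothetical guard, not part of PipelineRL or vLLM: the setting names come from the table, while the dict representation, the `parity_violations` helper, and the `early_run` values are assumptions made for illustration.

```python
# Parity choices taken from the table; the dict form is an assumption of this sketch.
PARITY_SETTINGS = {
    "use_v1": True,
    "logprobs-mode": "processed_logprobs",
    "enable-prefix-caching": False,
    "async-scheduling": False,
}

_MISSING = object()

def parity_violations(effective):
    """Return {setting: (expected, actual)} for settings that are unset or differ.

    An unset setting counts as a violation, because an implicit default is
    exactly what a parity run is trying to rule out. Unset shows up as None.
    """
    violations = {}
    for name, expected in PARITY_SETTINGS.items():
        actual = effective.get(name, _MISSING)
        if actual is _MISSING:
            violations[name] = (expected, None)
        elif actual != expected:
            violations[name] = (expected, actual)
    return violations

# Hypothetical launch config that left logprobs-mode implicit and kept caching on.
early_run = {
    "use_v1": True,
    "enable-prefix-caching": True,
    "async-scheduling": True,
}

for name, (expected, actual) in parity_violations(early_run).items():
    print(f"{name}: expected {expected!r}, got {actual!r}")
```

Treating a missing key as a failure is the design choice that matters here: the check passes only when every parity-relevant setting has been written down explicitly.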
Prefix caching was especially relevant because the actor handled repeated prefixes, concurrent requests, async scheduling, and inflight weight updates. A cache hit can reuse state computed before a weight update if the cache policy does not respect that boundary. Disabling prefix caching removed one V1-only difference from the comparison.
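The stale-state risk can be shown with a toy cache. This is a deliberately simplified model, not vLLM's prefix cache: `ToyPrefixCache`, `fake_prefill`, and the integer weight versions are all hypothetical names introduced for this sketch.

```python
class ToyPrefixCache:
    """Caches per-prefix state, optionally keyed by the weight version too."""

    def __init__(self, version_aware):
        self.version_aware = version_aware
        self.store = {}

    def _key(self, prefix, weight_version):
        return (prefix, weight_version) if self.version_aware else (prefix,)

    def get_or_compute(self, prefix, weight_version, compute):
        key = self._key(prefix, weight_version)
        if key not in self.store:
            self.store[key] = compute(prefix, weight_version)
        return self.store[key]

def fake_prefill(prefix, weight_version):
    # Stand-in for cached KV state; it only records which weights produced it.
    return {"prefix": prefix, "computed_with": weight_version}

prompt_prefix = (101, 2023, 2003)  # arbitrary token ids

# Cache that ignores weight updates: the second lookup is a hit on old state.
naive = ToyPrefixCache(version_aware=False)
naive.get_or_compute(prompt_prefix, 1, fake_prefill)
stale = naive.get_or_compute(prompt_prefix, 2, fake_prefill)

# Cache that respects the update boundary: the second lookup recomputes.
aware = ToyPrefixCache(version_aware=True)
aware.get_or_compute(prompt_prefix, 1, fake_prefill)
fresh = aware.get_or_compute(prompt_prefix, 2, fake_prefill)

print("naive cache served state from weights v", stale["computed_with"])
print("version-aware cache served state from weights v", fresh["computed_with"])
```

Disabling the cache for the parity run sidesteps the question entirely, which is why it is the simpler choice when the goal is comparison rather than throughput.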
Inflight Weight Updates
Online RL also makes weight synchronization part of the serving contract. ServiceNow did not initially make V1 stricter by draining requests and clearing caches on every update, because that would have tested a different behavior. The first target was to match the old V0 wrapper pattern: pause at an engine boundary, load new weights, and resume without explicit cached-state invalidation.
The V1 analogue used pause_generation(mode="keep", clear_cache=False), a collective RPC weight update, and then resume_generation(). The key details were mode="keep" and clear_cache=False, because those matched the old inflight update model more closely than aborting or draining all work.
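The ordering of that sequence can be sketched against a stub. `StubEngine` below is a hypothetical stand-in that only records calls; it is not the vLLM engine. The `pause_generation` and `resume_generation` names and their arguments follow the text above, while the `collective_rpc` signature and the `"update_weights"` worker method name are assumptions of this sketch.

```python
import asyncio

class StubEngine:
    """Records calls so the pause / update / resume ordering can be checked."""

    def __init__(self):
        self.calls = []
        self.paused = False
        self.weight_version = 0

    async def pause_generation(self, mode, clear_cache):
        self.paused = True
        self.calls.append(("pause_generation", mode, clear_cache))

    async def collective_rpc(self, method, args=()):
        if not self.paused:
            raise RuntimeError("weights updated while generation was running")
        self.weight_version = args[0]
        self.calls.append(("collective_rpc", method, args))

    async def resume_generation(self):
        self.paused = False
        self.calls.append(("resume_generation",))

async def inflight_weight_update(engine, new_version):
    # Keep in-flight requests and cached state, matching the old V0 pattern.
    await engine.pause_generation(mode="keep", clear_cache=False)
    try:
        await engine.collective_rpc("update_weights", args=(new_version,))
    finally:
        # Resume even if the update fails, so the server is never left paused.
        await engine.resume_generation()

engine = StubEngine()
asyncio.run(inflight_weight_update(engine, new_version=1))
print([call[0] for call in engine.calls])
```

The `try`/`finally` is a choice made in this sketch rather than something the report describes; it keeps a failed update from leaving the rollout server paused.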
Lag became a runtime diagnostic. The corrected V1 path carried less persistent rollout-server lag than the initial V1 attempt, which helped explain why the training curves moved back toward the V0 reference.
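The report's exact lag metric is not reproduced here. One common definition, assumed for this sketch, is the number of weight versions between the policy that generated a sample and the policy the trainer currently holds. The `lag_summary` helper and the version numbers are hypothetical.

```python
from statistics import mean

def lag_summary(trainer_version, sample_versions):
    """Summarize how far behind the trainer the sampled rollouts are.

    trainer_version: weight version the trainer is about to update from.
    sample_versions: weight version that generated each sample in the batch.
    """
    if not sample_versions:
        raise ValueError("need at least one sample")
    lags = [trainer_version - v for v in sample_versions]
    if min(lags) < 0:
        raise ValueError("a sample claims weights newer than the trainer's")
    return {
        "mean_lag": mean(lags),
        "max_lag": max(lags),
        "stale_fraction": sum(1 for lag in lags if lag > 0) / len(lags),
    }

# Trainer is at version 10; the batch mixes fresh and older samples.
summary = lag_summary(10, [10, 9, 9, 7])
print(summary)
```

Logging a summary like this per batch makes persistent lag visible alongside clip rate and KL, so a runtime regression is not mistaken for an objective problem.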
Final Projection Precision
After fixing logprob semantics and runtime behavior, the remaining gap came from numerical parity. The trainer used an fp32 lm_head for the final projection, so the rollout backend had to match that path. Small differences in logits can become visible in policy ratios, KL, clipping, and reward when the RL update consumes token probabilities directly.
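A small numeric sketch shows how head precision alone can move a policy ratio. It uses only the standard library and emulates bfloat16 by rounding float32 bit patterns. The sizes, the random inputs, and the choice to round inputs and outputs to bf16 are assumptions of this sketch, and plain Python floats stand in for the higher-precision head.

```python
import math
import random
import struct

def to_bf16(x):
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000  # keep the top 16 bits: sign, exponent, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def head_logits(hidden, weight, low_precision):
    """Final projection; optionally round inputs and outputs to bf16."""
    cast = to_bf16 if low_precision else (lambda v: v)
    logits = []
    for row in weight:
        acc = sum(cast(h) * cast(w) for h, w in zip(hidden, row))
        logits.append(cast(acc))
    return logits

rng = random.Random(0)
hidden_dim, vocab = 64, 8  # toy sizes
hidden = [rng.gauss(0.0, 1.0) for _ in range(hidden_dim)]
weight = [[rng.gauss(0.0, hidden_dim ** -0.5) for _ in range(hidden_dim)]
          for _ in range(vocab)]

ref_logits = head_logits(hidden, weight, low_precision=False)
low_logits = head_logits(hidden, weight, low_precision=True)
max_logit_gap = max(abs(a - b) for a, b in zip(ref_logits, low_logits))

token = 0
ratio = math.exp(log_softmax(ref_logits)[token] - log_softmax(low_logits)[token])
print(f"max logit gap: {max_logit_gap:.6f}, ratio for identical weights: {ratio:.6f}")
```

The weights are identical in both paths, so any deviation of the ratio from 1 comes purely from precision. The deviation is small per token, but it enters every ratio the objective consumes.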
ServiceNow points to the same class of issue in the MiniMax-M1 technical report, where a training/inference token-probability mismatch was traced to the LM output head and fixed by computing the head in fp32. The broader lesson is not that every workload needs the same exact settings. It is that rollout probability semantics, runtime defaults, weight-update behavior, and final projection precision all belong in the migration checklist.
Fix backend correctness first. Once the rollout backend matches the trainer’s assumptions, evaluate objective-side corrections for real async or off-policy mismatch.