Ai2's 4B MolmoMotion Maps Text Instructions to 3D Trajectories
Ai2 released MolmoMotion, an open-source 4B parameter model that predicts precise 3D physical trajectories from RGB video and natural language.
On June 17, 2026, the Allen Institute for AI (Ai2) released MolmoMotion, an open-source vision-language model designed for language-guided 3D motion forecasting. The 4-billion parameter model bridges the gap between high-level text instructions and physical movement. Instead of predicting future video pixels, MolmoMotion calculates the physical trajectory of specific objects in 3D space.
3D Point Trajectory Architecture
MolmoMotion builds on the Molmo 2 (4B) architecture. It processes a short history of RGB video frames, user-specified 2D query points, and a natural language action description. The output is a trajectory of object-attached 3D points in world space, measured in meters relative to the initial camera frame.
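The inputs and output described above can be pictured as two small data containers. This is only an illustrative sketch: the class names `MotionQuery` and `MotionForecast`, their fields, and the `validate` helper are hypothetical and are not taken from the MolmoMotion codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point2D = Tuple[float, float]         # pixel coordinates (u, v) of a query point
Point3D = Tuple[float, float, float]  # meters, relative to the initial camera frame


@dataclass
class MotionQuery:
    """Hypothetical container for the three inputs the article lists."""
    num_history_frames: int      # H: number of RGB history frames supplied
    query_points: List[Point2D]  # user-specified 2D points on the object
    instruction: str             # natural language action description


@dataclass
class MotionForecast:
    """Hypothetical container for a predicted trajectory.

    points[t][k] is the 3D position of query point k at future step t.
    """
    points: List[List[Point3D]]

    def validate(self, query: MotionQuery, num_future_frames: int) -> None:
        # One entry per future frame.
        if len(self.points) != num_future_frames:
            raise ValueError("expected one entry per future frame")
        # Each future frame carries one 3D point per 2D query point.
        for step in self.points:
            if len(step) != len(query.query_points):
                raise ValueError("expected one 3D point per query point")
```

Under these assumptions, a forecast for 30 future frames and two query points holds 30 steps of two 3D points each.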
Ai2 released two primary autoregressive model variants tuned for different observation constraints.
| Model Variant | History Frames (H) | Future Frames (F) | Primary Use Case |
|---|---|---|---|
| MolmoMotion-4B-H3-F30 | 3 | 30 | Typical video input (approx. 2 seconds at 15 fps) |
| MolmoMotion-4B-H1-F32 | 1 | 32 | Single query keyframe available |
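The H and F values in the table are also encoded in the variant names, and the forecast horizon follows from the frame count and frame rate. The sketch below assumes the `-H<n>-F<n>` suffix convention holds for these two names; `parse_variant` and `horizon_seconds` are hypothetical helpers, and the 15 fps default is the rate the table gives for the H3-F30 variant, not a confirmed rate for both.

```python
import re
from typing import NamedTuple


class Variant(NamedTuple):
    history_frames: int  # H in the table
    future_frames: int   # F in the table


def parse_variant(name: str) -> Variant:
    """Read H and F from a name ending in '-H<digits>-F<digits>'."""
    match = re.search(r"-H(\d+)-F(\d+)$", name)
    if match is None:
        raise ValueError(f"unrecognized variant name: {name!r}")
    return Variant(int(match.group(1)), int(match.group(2)))


def horizon_seconds(future_frames: int, fps: float = 15.0) -> float:
    """Forecast duration in seconds for a given frame count and frame rate."""
    return future_frames / fps


variant = parse_variant("MolmoMotion-4B-H3-F30")
print(variant, horizon_seconds(variant.future_frames))
```

For the H3-F30 variant this gives 30 / 15 = 2 seconds, matching the "approx. 2 seconds" noted in the table.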
Training proceeds in two stages. Stage 1 pretrains the model on short-horizon motion for 40,000 steps. Stage 2 finetunes it on long-horizon motion for 10,000 steps to reach the final two-second prediction horizon.
Evaluation Ecosystem
Alongside the model weights, Ai2 published the Molmo-Motion-1M dataset, which provides one million samples pairing motion with language. The corpus targets the scarcity of training data for embodied AI agents.
To validate performance, Ai2 introduced PointMotionBench. This evaluation suite measures trajectory prediction accuracy using Average Displacement Error (ADE), Final Displacement Error (FDE), and Point-Wise Trajectory (PWT) error. The model achieves state-of-the-art results among open-weight models for 3D spatial grounding.
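ADE and FDE are standard trajectory metrics, and a minimal sketch of their usual definitions for a single tracked point is shown here. It is not PointMotionBench's evaluation code: how the benchmark aggregates across points and samples is not described in this article, and PWT error is left out because its definition is not given here. The function names are illustrative.

```python
import math
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]  # meters


def _check(pred: Sequence[Point3D], truth: Sequence[Point3D]) -> None:
    # Both trajectories must cover the same, non-empty set of timesteps.
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and equal in length")


def average_displacement_error(pred: Sequence[Point3D],
                               truth: Sequence[Point3D]) -> float:
    """Mean Euclidean distance between prediction and truth over all steps."""
    _check(pred, truth)
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


def final_displacement_error(pred: Sequence[Point3D],
                             truth: Sequence[Point3D]) -> float:
    """Euclidean distance between prediction and truth at the last step."""
    _check(pred, truth)
    return math.dist(pred[-1], truth[-1])


predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
observed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 1.0)]
print(average_displacement_error(predicted, observed))
print(final_displacement_error(predicted, observed))
```

In this toy example the prediction is exact for the first two steps and off by 1 m at the last one, so FDE is 1 m while ADE averages that error over three steps.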
Hardware and Video Integration
The primary utility of MolmoMotion is its “motion prior,” which enables it to generalize physical movement across different embodiments. When a motion is learned from human video, the resulting 3D trajectory stays consistent after transfer to robot hardware, for example a robot executing a pick-and-place task.
The model output also serves as a control signal for video generation models. By supplying physical coordinate paths, developers can constrain generated objects to follow physically plausible trajectories that match the text prompt.
If you are building downstream robotics planning systems, the weights (released under an Apache 2.0-style license) and integration code are available in the allenai/molmo-motion GitHub repository, where you can begin testing the model's spatial grounding capabilities.