Ai2's 4B MolmoMotion Maps Text Instructions to 3D Trajectories

On June 17, 2026, the Allen Institute for AI (Ai2) released MolmoMotion, an open-source vision-language model for language-guided 3D motion forecasting. The 4-billion-parameter model turns high-level text instructions into physical movement: rather than predicting future video pixels, it predicts the trajectories of specific objects in 3D space.

3D Point Trajectory Architecture

MolmoMotion builds on the Molmo 2 (4B) architecture. It processes a short history of RGB video frames, user-specified 2D query points, and a natural language action description. The output is a trajectory of object-attached 3D points in world space, measured in meters relative to the initial camera frame.
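The inputs and output described above can be pictured as two small containers. The following is only a sketch for orientation, not the repository's API: the class names, field names, and the `validate` helper are hypothetical, and it assumes the model returns one 3D point per query point for each future frame.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point2D = Tuple[float, float]         # pixel coordinates (u, v) of a query point
Point3D = Tuple[float, float, float]  # meters, relative to the initial camera frame


@dataclass
class MotionQuery:
    """Hypothetical container for the three inputs listed in the article."""
    history_frames: List[object]  # H RGB frames; the image type is left opaque here
    query_points: List[Point2D]   # N user-specified 2D points
    instruction: str              # natural language action description


@dataclass
class MotionForecast:
    """Hypothetical container for the output: trajectory[t][n] is point n at future step t."""
    trajectory: List[List[Point3D]]

    def validate(self, query: MotionQuery, future_frames: int) -> None:
        """Check the assumed layout: F time steps, each with one 3D point per query point."""
        if len(self.trajectory) != future_frames:
            raise ValueError(
                f"expected {future_frames} future steps, got {len(self.trajectory)}"
            )
        for t, step in enumerate(self.trajectory):
            if len(step) != len(query.query_points):
                raise ValueError(
                    f"step {t} has {len(step)} points, "
                    f"expected {len(query.query_points)}"
                )


# Usage: 3 history frames, 2 query points, 30 predicted steps (the H3-F30 setting).
query = MotionQuery(
    history_frames=[None, None, None],
    query_points=[(120.0, 240.0), (300.0, 180.0)],
    instruction="push the mug to the left",
)
forecast = MotionForecast(
    trajectory=[[(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)] for _ in range(30)]
)
forecast.validate(query, future_frames=30)  # passes silently
```

Keeping the trajectory in meters in the initial camera frame, as the article describes, means downstream code can compare predictions across clips without rescaling by image resolution.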

Ai2 released two primary autoregressive variants, which differ in how many history frames they observe and how many future frames they predict.

Model Variant	History Frames (H)	Future Frames (F)	Primary Use Case
MolmoMotion-4B-H3-F30	3	30	Typical video input (approx. 2 seconds at 15 fps)
MolmoMotion-4B-H1-F32	1	32	Single query keyframe available
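The variant names encode the frame counts from the table, and the prediction horizon in seconds follows from the frame rate. The sketch below works through that arithmetic. The helper names are hypothetical, and applying 15 fps to the H1-F32 variant is an assumption, since the table states the frame rate only for H3-F30.

```python
import re
from typing import Tuple

ASSUMED_FPS = 15  # taken from the table's "2 seconds at 15 fps" note


def parse_variant(name: str) -> Tuple[int, int]:
    """Extract (history_frames, future_frames) from a name like 'MolmoMotion-4B-H3-F30'."""
    match = re.fullmatch(r"MolmoMotion-4B-H(\d+)-F(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized variant name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def horizon_seconds(future_frames: int, fps: float = ASSUMED_FPS) -> float:
    """Convert a number of predicted frames into a time horizon in seconds."""
    return future_frames / fps


for variant in ("MolmoMotion-4B-H3-F30", "MolmoMotion-4B-H1-F32"):
    history, future = parse_variant(variant)
    print(variant, history, future, round(horizon_seconds(future), 2))
# MolmoMotion-4B-H3-F30 3 30 2.0
# MolmoMotion-4B-H1-F32 1 32 2.13
```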

Training runs in two stages. Stage 1 pretrains on short-horizon motion for 40,000 steps. Stage 2 finetunes on long-horizon motion for 10,000 steps to reach the final two-second prediction horizon.

Evaluation Ecosystem

Alongside the model weights, Ai2 published the Molmo-Motion-1M dataset, which provides one million samples pairing motion with language. The corpus targets the scarcity of training data for embodied AI agents.

To validate performance, Ai2 introduced PointMotionBench. This evaluation suite measures trajectory prediction accuracy using Average Displacement Error (ADE), Final Displacement Error (FDE), and Point-Wise Trajectory (PWT) error. The model achieves state-of-the-art results among open-weight models for 3D spatial grounding.
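For readers unfamiliar with the first two metrics, the sketch below uses their common textbook definitions: ADE averages the Euclidean distance between predicted and ground-truth points over every time step, and FDE does the same at the final time step only. PointMotionBench's exact formulas, including any normalization and the PWT metric, are not given here, so treat the function names and the averaging over points as assumptions.

```python
from math import dist
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]
# trajectory[t][n] is the (x, y, z) position in meters of point n at time step t
Trajectory = Sequence[Sequence[Point3D]]


def _check_shapes(pred: Trajectory, truth: Trajectory) -> None:
    """Require matching, non-empty time and point dimensions."""
    if len(pred) == 0 or len(pred) != len(truth):
        raise ValueError("trajectories must be non-empty and have the same number of steps")
    for p_step, t_step in zip(pred, truth):
        if len(p_step) == 0 or len(p_step) != len(t_step):
            raise ValueError("each step must have the same non-zero number of points")


def average_displacement_error(pred: Trajectory, truth: Trajectory) -> float:
    """Mean Euclidean error over all time steps and all points."""
    _check_shapes(pred, truth)
    errors = [
        dist(p, t)
        for p_step, t_step in zip(pred, truth)
        for p, t in zip(p_step, t_step)
    ]
    return sum(errors) / len(errors)


def final_displacement_error(pred: Trajectory, truth: Trajectory) -> float:
    """Mean Euclidean error over all points at the last time step."""
    _check_shapes(pred, truth)
    errors = [dist(p, t) for p, t in zip(pred[-1], truth[-1])]
    return sum(errors) / len(errors)


# One tracked point moving along x; the prediction drifts sideways in y over time.
truth = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0)], [(1.0, 0.5, 0.0)], [(2.0, 1.0, 0.0)]]
print(average_displacement_error(pred, truth))  # 0.5
print(final_displacement_error(pred, truth))    # 1.0
```

In the example the per-step errors are 0, 0.5, and 1.0 meters, so ADE is their mean and FDE is the last value. The gap between the two shows how FDE penalizes drift that accumulates toward the end of the horizon.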

Hardware and Video Integration

MolmoMotion's main value is its “motion prior,” which lets it generalize physical movement across different embodiments. A 3D trajectory learned from human video remains consistent when transferred to robot hardware executing a pick-and-place task.

The model output also serves as a control signal for video generation models. By supplying physical coordinate paths, developers can constrain generated objects to follow physically plausible trajectories that match the text prompt.

If you are building downstream robotics planning systems, you can get the weights, released under an Apache 2.0-style license, and the integration code from the allenai/molmo-motion GitHub repository to begin testing spatial grounding capabilities.
