
Ai2's 4B MolmoMotion Maps Text Instructions to 3D Trajectories

Ai2 released MolmoMotion, an open-source 4B-parameter model that predicts precise 3D physical trajectories from RGB video and natural language.

On June 17, 2026, the Allen Institute for AI (Ai2) released MolmoMotion, an open-source vision-language model designed for language-guided 3D motion forecasting. The 4-billion-parameter model maps high-level text instructions to physical movement: instead of predicting future video pixels, it predicts the trajectories of specific objects in 3D space.

3D Point Trajectory Architecture

MolmoMotion builds on the Molmo 2 (4B) architecture. It processes a short history of RGB video frames, user-specified 2D query points, and a natural language action description. The output is a trajectory of object-attached 3D points in world space, measured in meters relative to the initial camera frame.
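The inputs and output described above can be pictured as a pair of simple containers. This is a minimal sketch, not the repository's actual interface: the class names, field names, and the `check_forecast` helper are all hypothetical, and the layout of one 3D track per query point is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point2D = Tuple[float, float]         # pixel coordinates (x, y) of a query point
Point3D = Tuple[float, float, float]  # meters, relative to the initial camera frame


@dataclass
class MotionQuery:
    """Hypothetical container for the three inputs the article lists."""
    history_frames: List[object]  # H RGB frames; the image type is left open
    query_points: List[Point2D]   # user-specified 2D points on the object
    instruction: str              # natural language action description


@dataclass
class MotionForecast:
    """Hypothetical container for the predicted trajectory."""
    # trajectory[t][k] is the 3D position of query point k at future step t
    trajectory: List[List[Point3D]]


def check_forecast(query: MotionQuery, forecast: MotionForecast, future_frames: int) -> bool:
    """Verify the forecast has F steps and one 3D point per query point."""
    if len(forecast.trajectory) != future_frames:
        raise ValueError("forecast does not cover the expected number of future frames")
    for step in forecast.trajectory:
        if len(step) != len(query.query_points):
            raise ValueError("each step needs one 3D point per 2D query point")
    return True
```

A shape check like this is cheap to run before handing a trajectory to a downstream planner.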

Ai2 released two primary autoregressive model variants tuned for different observation constraints.

| Model Variant | History Frames (H) | Future Frames (F) | Primary Use Case |
| --- | --- | --- | --- |
| MolmoMotion-4B-H3-F30 | 3 | 30 | Typical video input (approx. 2 seconds at 15 fps) |
| MolmoMotion-4B-H1-F32 | 1 | 32 | Single query keyframe available |
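The variant names encode the history and future frame counts, so the prediction horizon follows from simple arithmetic. The sketch below parses a name and converts future frames to seconds. Both helper functions are hypothetical, and the 15 fps default is taken from the H3-F30 use case; whether the H1-F32 variant uses the same frame rate is an assumption.

```python
import re
from typing import Tuple


def parse_variant(name: str) -> Tuple[int, int]:
    """Extract (history_frames, future_frames) from a variant name."""
    match = re.fullmatch(r"MolmoMotion-4B-H(\d+)-F(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized variant name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def horizon_seconds(future_frames: int, fps: float = 15.0) -> float:
    """Convert a number of future frames to seconds (fps is an assumption)."""
    return future_frames / fps


history, future = parse_variant("MolmoMotion-4B-H3-F30")
print(history, future, horizon_seconds(future))  # 3 30 2.0
```

At the same assumed frame rate, the 32-frame variant would cover slightly more than two seconds.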

Training proceeds in two stages. Stage 1 pretrains the model on short-horizon motion for 40,000 steps. Stage 2 finetunes it on long-horizon motion for 10,000 steps to reach the final two-second prediction horizon.

Evaluation Ecosystem

Alongside the model weights, Ai2 published the Molmo-Motion-1M dataset, which contains one million samples aligning motion with language. The corpus targets the scarcity of training data for embodied AI agents.

To validate performance, Ai2 introduced PointMotionBench. This evaluation suite measures trajectory prediction accuracy using Average Displacement Error (ADE), Final Displacement Error (FDE), and Point-Wise Trajectory (PWT) error. The model achieves state-of-the-art results among open-weight models for 3D spatial grounding.
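For readers unfamiliar with the first two metrics, here is a minimal sketch of ADE and FDE for a single 3D point track, using their standard definitions: the mean Euclidean error over all timesteps, and the Euclidean error at the last timestep. How PointMotionBench aggregates over points and clips is not described here, so that is left out, as is PWT. The function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def displacement_errors(
    pred: List[Sequence[float]], target: List[Sequence[float]]
) -> Tuple[float, float]:
    """Return (ADE, FDE) in the same units as the inputs (meters here).

    pred and target each hold one (x, y, z) position per future timestep.
    """
    if not pred or len(pred) != len(target):
        raise ValueError("trajectories must be non-empty and of equal length")
    # Per-timestep Euclidean distance between prediction and ground truth
    dists = [math.dist(p, t) for p, t in zip(pred, target)]
    ade = sum(dists) / len(dists)  # average over the whole horizon
    fde = dists[-1]                # error at the final timestep only
    return ade, fde


pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (2.0, 0.0, 2.0)]
print(displacement_errors(pred, target))  # (1.0, 2.0)
```

ADE rewards staying close to the true path throughout, while FDE isolates how far off the endpoint is, so a prediction can score well on one and poorly on the other.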

Hardware and Video Integration

The primary utility of MolmoMotion is its “motion prior,” which lets it generalize physical movement across different embodiments. When the model learns a motion from human video, the resulting 3D trajectory remains consistent when transferred to robot hardware, for example a robot executing a pick-and-place task.
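Because the predicted points are expressed relative to the initial camera frame, using them on a robot would typically require re-expressing them in the robot's own base frame. The sketch below applies a rigid transform for that purpose. It is an illustration under assumptions, not part of the MolmoMotion code: the function name is hypothetical, and the rotation and translation would come from a separate camera-to-robot calibration.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def camera_to_robot(
    points: List[Point3D],
    rotation: Sequence[Sequence[float]],  # 3x3 rotation matrix, camera -> robot base
    translation: Sequence[float],         # camera origin in the robot base frame (meters)
) -> List[Point3D]:
    """Apply p_robot = R * p_camera + t to every point of a trajectory."""
    transformed = []
    for x, y, z in points:
        transformed.append(tuple(
            rotation[i][0] * x + rotation[i][1] * y + rotation[i][2] * z + translation[i]
            for i in range(3)
        ))
    return transformed


# Example calibration (made up): 90-degree rotation about z plus a small offset
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.5, 0.0, 0.25]
print(camera_to_robot([(1.0, 0.0, 0.0)], R, t))  # [(0.5, 1.0, 0.25)]
```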

The model output also serves as a control mechanism for video generation models. By supplying physical coordinate paths, developers can ensure generated objects follow physically plausible trajectories mandated by the text prompt.

If you are building downstream robotics planning systems, the weights (released under an Apache 2.0-style license) and integration code are available in the allenai/molmo-motion GitHub repository, where you can begin testing the model's spatial grounding capabilities.
