How Will Robots Think? Meta Answers with Its New V-JEPA 2 Model

On Wednesday, Meta announced the launch of its new artificial intelligence model, V-JEPA 2.

As a "world model," it's designed to help AI agents understand and logically interact with the world around them.

With this innovation, Meta aims to develop agents capable of thinking before they act—a significant step toward achieving Advanced Machine Intelligence (AMI) and its applications in robotics.

Alongside the model, the company introduced three new benchmarks to evaluate how well current models reason about the physical world based on video clips.

What is the V-JEPA 2 Model?

V-JEPA 2 builds upon the V-JEPA model Meta released last year.

The new model was trained on over a million hours of video.

This massive training dataset is intended to empower robots and other AI agents to operate effectively in the physical world, grasping concepts like gravity and predicting their effects on a sequence of events. Such associative ability is akin to the intuitive sense that develops in young children and animals.

For instance, when you throw a ball for a dog, the dog anticipates that it will bounce off the ground and come back up, so it runs toward the ball's expected landing spot rather than its current position.

Meta provided illustrative examples: if a robot holding a plate and a spatula approaches a stove with cooked eggs, the system predicts the most likely next step is using the spatula to transfer the eggs to the plate.

An explanatory graphic of Meta's V-JEPA 2 AI model

According to Meta, V-JEPA 2 is about 30 times faster than NVIDIA's Cosmos model, which also aims to enhance physical-world intelligence, though it's worth noting that Meta may be using different evaluation criteria than NVIDIA.

The Architecture and Training of V-JEPA 2

The V-JEPA 2 model, with 1.2 billion parameters, is built on Meta's Joint-Embedding Predictive Architecture (JEPA).

It consists of two main components: an encoder that processes raw video to produce embeddings capturing useful semantic information about the observed state of the world, and a predictor that takes a video embedding and additional context about what to predict, then outputs the predicted embeddings.
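
To make this division of labor concrete, here is a minimal PyTorch sketch of what the two components' interfaces might look like. The module names, layer sizes, and the pooled stand-in backbone are illustrative assumptions, not Meta's released implementation, whose encoder is a much larger video vision transformer.

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Maps a raw video clip to an embedding of the observed state."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Stand-in backbone: average-pool the clip and project it.
        self.pool = nn.AdaptiveAvgPool3d((4, 8, 8))
        self.proj = nn.Linear(3 * 4 * 8 * 8, embed_dim)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels=3, frames, height, width)
        return self.proj(self.pool(video).flatten(start_dim=1))


class Predictor(nn.Module):
    """Takes a state embedding plus extra context (what to predict, or an
    action in the fine-tuning stage) and outputs a predicted embedding."""
    def __init__(self, embed_dim: int = 256, context_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + context_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, state_emb: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state_emb, context], dim=-1))
```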

Meta trained V-JEPA 2 using self-supervised learning, which allowed the model to learn from raw video without requiring human annotations. The training process involves two phases: an action-free pre-training phase, followed by an action-conditioned fine-tuning stage.

In the first phase, over a million hours of video and one million images from diverse sources were used. This rich visual data helps the model learn a great deal about how the world works, including how people interact with objects, how objects move in the physical world, and how they interact with each other.
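
The sketch below shows what a single pre-training step could look like under this scheme, reusing the VideoEncoder and Predictor classes from the earlier sketch. The specific choices here, splitting each clip in time, stopping gradients through the target embedding, and the L1 loss, are assumptions made for illustration rather than details confirmed by Meta.

```python
import torch
import torch.nn.functional as F

# Reuses VideoEncoder and Predictor from the sketch above.
encoder, predictor = VideoEncoder(), Predictor()
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

def pretraining_step(clips: torch.Tensor) -> float:
    """One self-supervised step on a batch of clips (batch, 3, frames, H, W).
    No labels and no action information are involved at this stage."""
    context_clip, target_clip = clips[:, :, :8], clips[:, :, 8:]  # split in time
    what_to_predict = torch.zeros(clips.shape[0], 32)  # placeholder context token

    context_emb = encoder(context_clip)
    with torch.no_grad():                  # targets are not back-propagated; real
        target_emb = encoder(target_clip)  # systems add safeguards against
                                           # representation collapse, omitted here
    loss = F.l1_loss(predictor(context_emb, what_to_predict), target_emb)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random tensors standing in for a real video batch:
loss = pretraining_step(torch.randn(2, 3, 16, 64, 64))
```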

After this pre-training phase, Meta found that the model demonstrated key understanding and prediction capabilities.

For example, when a lightweight attention-based readout system is trained on top of the frozen encoder features, V-JEPA 2 achieves exceptional performance on the Something-Something v2 action recognition task, which relies on understanding motion.

Similarly, when a readout system is trained on the frozen encoder and predictor features, V-JEPA 2 sets a new state of the art on the Epic-Kitchens-100 action anticipation task, which involves predicting the action (composed of a verb and a noun) that will occur one second in the future from first-person video.

Finally, aligning V-JEPA 2 with a language model results in cutting-edge performance on video question-answering benchmarks like the Perception Test and TempCompass.
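
The common pattern across these evaluations is that the backbone stays frozen and only a small readout is trained on top of its features. Below is a minimal sketch of that setup, assuming a plain linear readout and a generic video-classification task; the readouts Meta describes are attention-based, and the class count is only an example.

```python
import torch
import torch.nn as nn

# Reuses VideoEncoder from the earlier sketch; pretrained weights would
# normally be loaded here.
encoder = VideoEncoder()
encoder.eval()
for p in encoder.parameters():           # the backbone stays frozen
    p.requires_grad_(False)

num_classes = 174                        # e.g. Something-Something v2 classes
readout = nn.Linear(256, num_classes)    # only this small head is trained
optimizer = torch.optim.AdamW(readout.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def readout_step(clips: torch.Tensor, labels: torch.Tensor) -> float:
    with torch.no_grad():                # no gradients flow into the encoder
        features = encoder(clips)
    loss = loss_fn(readout(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random stand-in data:
loss = readout_step(torch.randn(2, 3, 16, 64, 64),
                    torch.randint(0, num_classes, (2,)))
```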

The second training phase focuses on making the model more useful for planning by using robot data, which includes visual observations (video) and the control actions the robot was taking. Meta incorporates this data into the JEPA training procedure by providing the predictor with action information.

After training on this additional data, the predictor learns to account for specific actions when making predictions and can then be used for control.
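
Conceptually, the only change from the first phase is that the predictor's extra context is now the robot's action, so the model learns a latent dynamics mapping from a current-state embedding plus an action to a next-state embedding. A hedged sketch of such a fine-tuning step is shown below, again reusing the earlier classes; the 7-dimensional action vector, the frozen encoder, and the L1 objective are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

action_dim = 7  # e.g. end-effector pose deltas plus gripper; an assumption
encoder = VideoEncoder()                        # kept frozen in this sketch
dynamics = Predictor(context_dim=action_dim)    # action-conditioned predictor
optimizer = torch.optim.AdamW(dynamics.parameters(), lr=1e-4)

def finetune_step(obs_now: torch.Tensor, action: torch.Tensor,
                  obs_next: torch.Tensor) -> float:
    """Predict the next state's embedding from the current state's embedding
    and the action the robot actually took in the logged data."""
    with torch.no_grad():
        z_now, z_next = encoder(obs_now), encoder(obs_next)
    loss = F.l1_loss(dynamics(z_now, action), z_next)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random stand-ins for camera observations and logged actions:
loss = finetune_step(torch.randn(2, 3, 16, 64, 64),
                     torch.randn(2, action_dim),
                     torch.randn(2, 3, 16, 64, 64))
```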

A large amount of robot data isn't required for this second phase. In its technical report, Meta explained that training with just 62 hours of robot data already produces a model suitable for planning and control.

How Can the Model's Capabilities Be Used in Different Tasks?

Thanks to these abilities, the V-JEPA 2 model can help robots interact with unfamiliar objects and environments, a concept known as zero-shot robot planning.

Meta demonstrated that robots can use V-JEPA 2 to perform tasks like reaching for an object, picking it up, or placing it in a new location.

For short-horizon tasks, such as picking or placing an object, the goal is defined as an image.

The robot uses the V-JEPA 2 encoder to obtain embeddings of the current state and the goal state.

From its current observed state, the robot then plans by using the predictor to simulate the consequences of a set of candidate actions and evaluating them by how close they bring it to the desired goal.

At each time step, the robot replans and executes the next highest-rated action toward that goal via model-predictive control.
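
Put together, the short-horizon loop looks roughly like this: encode the goal image once, sample candidate action sequences, roll each one forward in embedding space with the action-conditioned predictor, score the rollouts by their distance to the goal embedding, and execute only the first action of the best candidate before replanning. The sketch below uses simple random-shooting sampling and an L1 distance, which are illustrative choices rather than Meta's exact planning optimizer.

```python
import torch

def plan_next_action(encoder, dynamics, current_obs, goal_image,
                     horizon: int = 5, num_candidates: int = 64,
                     action_dim: int = 7) -> torch.Tensor:
    """One model-predictive-control step: return the first action of the
    candidate sequence whose predicted outcome lands closest to the goal."""
    with torch.no_grad():
        z = encoder(current_obs).expand(num_candidates, -1)  # current state embedding
        z_goal = encoder(goal_image)                         # goal state embedding

        # Sample candidate action sequences (simple random shooting).
        candidates = torch.randn(num_candidates, horizon, action_dim)

        # Roll each sequence forward in embedding space with the predictor.
        for t in range(horizon):
            z = dynamics(z, candidates[:, t])

        # Score rollouts by how close they end up to the goal embedding.
        best = (z - z_goal).abs().mean(dim=-1).argmin()

    # Execute only the first action of the best sequence, then replan.
    return candidates[best, 0]

# Example usage with the stand-in modules from the earlier sketches:
action = plan_next_action(VideoEncoder(), Predictor(context_dim=7),
                          torch.randn(1, 3, 16, 64, 64),   # current camera clip
                          torch.randn(1, 3, 16, 64, 64))   # goal image (as a clip here)
```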

For longer-horizon tasks, like picking up an object and placing it in the correct spot, a series of visual sub-goals is defined for the robot to achieve sequentially, similar to visual imitation learning observed in humans.

With these visual sub-goals, V-JEPA 2 achieves success rates between 65% and 80% for tasks involving picking and placing new objects in new, previously unseen environments.
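
In code terms, this amounts to running the same single-step planner against each sub-goal in turn. The outer loop below reuses the plan_next_action sketch above; get_observation and execute_action stand for hypothetical robot-interface hooks.

```python
def follow_subgoals(encoder, dynamics, get_observation, execute_action,
                    subgoal_images, steps_per_subgoal: int = 20):
    """Chain the short-horizon planner over a sequence of visual sub-goals."""
    for goal_image in subgoal_images:
        for _ in range(steps_per_subgoal):
            obs = get_observation()                    # current camera clip
            action = plan_next_action(encoder, dynamics, obs, goal_image)
            execute_action(action)                     # send the command to the robot
        # A real controller would verify that the sub-goal was actually reached
        # (for example, by embedding distance) before advancing; omitted here.
```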

Meta is making the source code and model checkpoints available for commercial and research applications, hoping to build a broad community around this research.

Three New Benchmarks for Evaluating AI Reasoning

In a related announcement, Meta launched three new benchmarks to assess how well current models understand and reason about the physical world through video.

Although humans perform very well on these benchmarks (85%–95%), current models, including V-JEPA 2, still show a significant gap.

The first benchmark is IntPhys 2, designed specifically to measure a model's ability to distinguish between physically plausible and implausible scenarios, building upon and extending the previous IntPhys benchmark.

The second, Minimal Video Pairs (MVPBench), measures the physical-world understanding of video-language models through multiple-choice questions designed to mitigate common shortcut solutions.

Finally, the CausalVQA benchmark measures the ability of video-language models to answer questions about cause and effect in the physical world. This includes questions about counterfactuals ("what would have happened if..."), predictions ("what might happen next"), and planning ("what action should be taken next to achieve a goal").

Meta noted that a leaderboard on the Hugging Face platform will track the progress of models on these new benchmarks.

Meta quoted Yann LeCun, the company's Chief AI Scientist, as saying, "We believe world models will unlock a new era for robotics and enable real-world AI agents to help with household chores and physical tasks without needing astronomical amounts of robot training data." The ultimate goal of these efforts is to achieve Advanced Machine Intelligence (AMI).

Looking ahead, Meta plans to continue exploring several areas in its work on world models.

Currently, V-JEPA 2 learns and makes predictions at a single timescale. However, many tasks require planning across multiple timescales.

Therefore, Meta wants to focus on training hierarchical JEPA models capable of learning, reasoning, and planning across multiple temporal and spatial scales.

Another important direction is multi-modal JEPA models that can make predictions using a variety of senses, including sight, sound, and touch.

Check out Meta's official announcement for more on these updates.

Khaled B.

An AI expert with extensive experience in developing and implementing advanced solutions using artificial intelligence technologies. Specializing in AI applications to enhance business processes and achieve profitability through smart technology. Passionate about creating innovative strategies and solutions that help businesses and individuals achieve their goals with AI.
