The Rise of World Models
Meta’s chief AI scientist, Yann LeCun, is stepping away from the company with the kind of finality that only comes after years of pushing against its chosen path. He had long warned that LLMs would eventually hit a wall: they cannot truly reason about a world they have never perceived.
While Meta stayed focused on ever-bigger text models, LeCun kept charting his own course towards world models. His work centred on JEPA, latent prediction and structured simulation. Meta chased quick release cycles under chief AI officer Alexandr Wang; LeCun stayed committed to the long ascent.
And now, LeCun’s departure comes at a time when world models are beginning to break out across the field.
Everyone’s Building the World
Fei-Fei Li’s World Labs released Marble. Google DeepMind rolled out SIMA 2 and Genie 3. Tencent pushed HunyuanWorld and FlashWorld into open source. MBZUAI launched PAN as a full generative latent prediction stack.
OpenAI is circling its own foundational world model as it prepares Sora 3. Even NVIDIA’s Cosmos team has been zeroing in on the same territory. A wave of startups, including Decart and Odyssey, is building world models and releasing demos for free.
Fei-Fei Li’s Promising Bet
World Labs’ Marble is the most complete world-model release to date. It builds full 3D worlds from text, images, video or even rough sketches. It supports object edits, style shifts and structural changes, and exports to splats, meshes or video. It also ships with a sculpting tool called Chisel, which separates structure from surface.
World Labs said Marble is an early step towards broader spatial intelligence. “Future world models will let humans and agents alike interact with generated worlds in new ways,” the company suggests.
Now open to the public after a two-month beta, Marble feels like a creation suite rather than a lab demo. World Labs also introduced Marble Labs, a workspace for creators to explore workflows, case studies and documentation: “It is where artists, engineers and designers push the boundaries of world models.”
Li calls it a step towards broader spatial intelligence, and it does feel like the early layer of a real simulation stack.
Then Comes DeepMind—and China
DeepMind pushed the second major leap with SIMA 2. The agent sits inside 3D environments and learns a bit like a partner learns in a game. According to the team, interactions now “feel less like giving commands and more like collaborating with a companion who can reason about the task at hand.”
SIMA 2 uses Gemini to plan, describes its intent before acting, handles new games, transfers skills from one world to another and learns from its own failures after the first phase of training.
Notably, DeepMind also tested SIMA 2 inside Genie 3’s generated spaces. The agent entered these worlds, found its bearings, and followed instructions. With self-improvement at its core, “the agent can improve on previously failed tasks entirely independently of human-generated demonstrations.”
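The shape of that loop is easy to sketch. Below is a minimal, hypothetical Python rendering of the plan, act, self-critique cycle described above; DeepMind has not published SIMA 2’s internals, so every name here is an invented stand-in.

```python
# Hypothetical sketch of the plan, act, self-critique loop described above.
# `plan`, `execute` and `critique` are invented stand-ins, not SIMA 2's API.

def plan(task, feedback):
    """Stand-in for the Gemini planning step: turn the task, plus any
    feedback from a failed attempt, into a short action sequence."""
    actions = ["survey the area", f"move towards: {task}", "interact"]
    if feedback:
        actions.insert(1, f"adjust for: {feedback}")
    return actions

def execute(actions):
    """Stand-in environment: here, the task only succeeds once the plan
    incorporates feedback from an earlier failure."""
    return any(a.startswith("adjust for:") for a in actions)

def critique(task):
    """Stand-in self-evaluation: the agent generates its own feedback,
    with no human demonstration in the loop."""
    return f"first approach to '{task}' failed near the target"

task, feedback = "fetch the blue cube", None
for attempt in range(1, 4):
    actions = plan(task, feedback)
    print(f"attempt {attempt}, stated intent: {actions}")  # describe intent before acting
    if execute(actions):
        print("task complete")
        break
    feedback = critique(task)  # fold the failure back into the next plan
```

The point of the sketch is the wiring, not the stubs: the language model sits in the planning seat, and the reward for a retry comes from the agent’s own critique rather than a human demonstration.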
Genie 3 sits at the heart of this shift. It generates real-time 3D environments at 24 frames per second, and its scenes stay stable for minutes where Genie 2 could barely hold seconds. Access is still limited, but the signal is clear: stable, navigable, controllable worlds are getting close to useful form.
Then China stepped in. Tencent dropped a trio of open-source world models: HunyuanWorld 1.0, HunyuanWorld Voyager and FlashWorld. The models build explorable 3D scenes from a single prompt or image, and they jumped to the top of the open-source charts within hours.
The company claims FlashWorld can create high-quality 3D worlds in five seconds on a single GPU.
Tencent calls world models the next frontier, framing spatial intelligence as the training layer for robots and agents. It sees these models as a way to teach geometry, interactions and tasks in safe simulation.
OpenAI Stays Quieter
OpenAI has said surprisingly little about this, but the breadcrumbs point in the same direction. Community forums are picking up signs that OpenAI is working on a model similar to Genie 3.
Work on Sora shows interest in action-conditioned prediction, and internally, the push towards a foundational world model is no longer a secret. Sora struggled with physics, and its failures in extreme tests forced the team to rethink training. The next version cannot just render a world; it will need to operate inside one.
OpenAI cannot sit this shift out. It hinted at this direction as early as the GPT-4.5 launch in February: “Unsupervised learning increases world model accuracy and intuition. Models like GPT‑3.5, GPT‑4, and GPT‑4.5 advance this paradigm,” the company said in its blog post.
This is the line LeCun pushed while everyone else kept stacking bigger text models.
The Debate Is On!
Another model is making the rounds, one LeCun himself has spoken about. MBZUAI’s PAN is the most technical jump in this entire wave. PAN uses generative latent prediction to integrate perception, state, action and causality into a single chain.
The model evolves a latent state based on natural language actions. It decodes that state into small video clips. This keeps the narrative stable. The model remembers what happened, what is happening and what should come next. PAN uses a causal Swin diffusion decoder to reduce drift during long runs.
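As a rough illustration of how generative latent prediction differs from frame-by-frame generation, here is a toy Python sketch of that rollout loop. It assumes nothing about PAN’s real architecture: the weights, dimensions and function names are invented, and a toy render stands in for the causal Swin diffusion decoder.

```python
import hashlib
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, ACTION_DIM, H, W = 64, 16, 4, 4

# Toy parameters standing in for learned weights.
W_state = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM))
W_action = rng.normal(scale=0.1, size=(LATENT_DIM, ACTION_DIM))

def embed_action(text):
    """Stand-in language encoder: hash the action text into a fixed vector."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).normal(size=ACTION_DIM)

def step(z, a):
    """Evolve the latent world state, z_next = f(z, a). Prediction happens
    here, in latent space, so what happened earlier persists in z."""
    return np.tanh(W_state @ z + W_action @ a)

def decode_clip(z, frames=8):
    """Decode the latent state into a short video clip. PAN uses a causal
    Swin diffusion decoder for this step; a toy render keeps the sketch
    self-contained."""
    base = z[: H * W].reshape(H, W)
    return np.stack([np.tanh(base + 0.05 * t) for t in range(frames)])

z = rng.normal(size=LATENT_DIM)  # initial world state
for action in ["open the door", "walk through", "turn left"]:
    z = step(z, embed_action(action))  # state evolves per natural-language action
    clip = decode_clip(z)              # the clip is a view of the state, not the state itself
    print(f"{action!r} -> clip of shape {clip.shape}")
```

Because each clip is decoded from the state rather than fed back into it, rendering errors do not accumulate in the world itself, which is the property that keeps long rollouts coherent.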
It is one of the first systems to show long rollouts that do not fall apart. “There is no question that world models should perform predictions in latent space,” LeCun said.
That’s where the divide in the research community now sits. LeCun wants decoder-free joint embedding approaches such as JEPA, arguing that latent prediction with collapse prevention builds stronger internal structure. Eric Xing of MBZUAI and others push full generative decoders, wanting the model to reconstruct the world frame by frame to keep its internal physics accurate.
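Stripped to toy form, the disagreement is about where the training loss lives. Here is a sketch of the two objectives, with every weight and dimension invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
D, PIX = 32, 8 * 8  # latent size, flattened frame size

# Toy weights standing in for learned networks.
W_enc = rng.normal(scale=0.1, size=(D, PIX))     # encoder
W_pred = rng.normal(scale=0.1, size=(D, D))      # latent predictor
W_dec = rng.normal(scale=0.1, size=(PIX, D))     # pixel decoder (generative camp only)

def encode(frame):
    """Toy encoder: flatten the frame and project it into latent space."""
    return W_enc @ frame.ravel()

frame_t = rng.normal(size=(8, 8))     # toy current frame
frame_next = rng.normal(size=(8, 8))  # toy next frame

z_pred = W_pred @ encode(frame_t)     # predicted next latent, shared by both camps

# JEPA-style objective: match the *embedding* of the next frame. There is no
# decoder, so unpredictable pixel detail carries no loss. (Real JEPA training
# adds collapse prevention, e.g. an EMA target encoder, omitted here.)
jepa_loss = np.mean((z_pred - encode(frame_next)) ** 2)

# Generative objective: decode the predicted latent back to pixels and score
# the reconstruction, so the model must account for every pixel of the world.
recon = (W_dec @ z_pred).reshape(8, 8)
gen_loss = np.mean((recon - frame_next) ** 2)

print(f"latent-space loss: {jepa_loss:.4f}, pixel-space loss: {gen_loss:.4f}")
```

The difference is small in code and large in consequence: the latent objective lets the model discard detail it cannot predict, while the generative objective forces it to commit to a full rendering of the world.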
The field now has all the signals of a shift. Language models are not enough. The gap between research demos and commercial products is closing fast. It is the start of a new phase in AI.