The latest long article by AI guru Fei-Fei Li: Spatial intelligence is the next frontier of AI: "LLMs are too limited and lack a foundation in reality"

Wallstreetcn
2025.11.11 00:45

AI godmother Fei-Fei Li pointed out in her latest article that spatial intelligence is the next frontier of AI. She believes that while current LLM technology has changed the world, it still lacks a foundation in reality. She proposed a framework for building world models, emphasizing that spatial intelligence will empower AI and enhance its capabilities in storytelling, creativity, and scientific reasoning. Fei-Fei Li's research aims to break through the limitations of language and achieve interaction between AI and the physical world.

Just now, AI pioneer and Stanford University professor Fei-Fei Li published a new article titled "From Language to the World: Spatial Intelligence is the Next Frontier for AI," which is an in-depth reflection on her 25-year career in AI.

Fei-Fei Li believes that although AI technologies represented by LLMs have profoundly changed the world, they are essentially still "wordsmiths in the dark"—articulate but inexperienced, knowledgeable but lacking real-world grounding.

To enable AI to truly understand and interact with the physical world, it must break through the limitations of language and move towards Spatial Intelligence.

Fei-Fei Li believes that spatial intelligence will be the next frontier for AI, empowering it with the imagination of a storyteller, the action of a first responder, and the spatial reasoning precision of a scientist.

To achieve this goal, she proposed a framework for building world models and elaborated on its three core capabilities, the technical challenges it faces, and its vast application prospects.

In 1950, when computing was merely about automating arithmetic and simple logic, Alan Turing posed a question that still resonates today: Can machines think? The vision he foresaw required extraordinary imagination: intelligence might one day be constructed rather than innate. This insight later sparked an unrelenting scientific exploration known as artificial intelligence (AI). Throughout my 25-year career in AI, I have continued to be inspired by Turing's foresight. But how far are we from this goal? The answer is not simple.

Today, top AI technologies represented by large language models (LLMs) have begun to change the way we acquire and process abstract knowledge. However, they remain wordsmiths in the dark; articulate but inexperienced, knowledgeable but lacking real-world grounding. Spatial intelligence will change the way we create and interact with both real and virtual worlds—it will revolutionize storytelling, creativity, robotics, scientific discovery, and more. This is the next frontier for AI.

The pursuit of visual and spatial intelligence has always been the guiding North Star that led me into this field. For this reason, I spent years building ImageNet, the first large-scale visual learning and benchmarking dataset, which is one of the three key elements that birthed modern AI alongside neural network algorithms and modern computing (such as graphics processing units, GPUs). For this reason, my academic lab at Stanford has been dedicated to combining computer vision with robotic learning for the past decade. It is also for this reason that my co-founders Justin Johnson, Christoph Lassner, Ben Mildenhall, and I established World Labs over a year ago: to fully realize this possibility for the first time.

In this article, I will explain what spatial intelligence is, why it is important, and how we can build world models that unlock it—its impact will reshape creativity, embodied intelligence, and human progress.

Spatial Intelligence: The Scaffolding of Human Cognition

AI has never been more exciting. Generative AI models like LLMs have moved from research labs into everyday life, becoming tools for billions of people to create, produce, and communicate. They demonstrate capabilities once thought impossible, effortlessly generating coherent text, mountains of code, realistic images, and even short video clips. Whether AI will change the world is no longer a question. By any reasonable definition, it already has.

However, there are still too many areas that remain out of reach. The vision of autonomous robots remains captivating but is still in the speculative stage, far from the everyday devices long promised by futurists. The dream of massively accelerating research in fields like disease treatment, new material discovery, and particle physics largely remains unfulfilled. And the promise of truly understanding AI and empowering human creators—whether helping students grasp complex concepts in molecular chemistry, assisting architects in envisioning spaces, aiding filmmakers in building worlds, or supporting anyone seeking a fully immersive virtual experience—also remains unachieved.

To understand why these capabilities are still out of reach, we need to examine how spatial intelligence has evolved and how it shapes our understanding of the world.

Vision has long been the cornerstone of human intelligence, but its power stems from something more fundamental. Long before animals could build nests, care for young, communicate with language, or establish civilizations, simple perceptual behaviors quietly opened a path to the evolution of intelligence.

This seemingly isolated ability to gather information from the external world, whether through a glimmer of light or the touch of texture, has built a bridge between perception and survival, and over generations, this bridge has become increasingly robust and refined. Layer upon layer of neurons have grown from this bridge, forming a nervous system capable of interpreting the world and coordinating interactions between organisms and their environments. Thus, many scientists speculate that perception and action constitute the core loop driving the evolution of intelligence and are the foundation of nature's creation of our species—the ultimate embodiment of perception, learning, thinking, and action.

Spatial intelligence plays a foundational role in defining how we interact with the physical world. Every day, we rely on it to perform the most ordinary actions: parking by imagining the ever-narrowing gap between the bumper and the curb, catching keys thrown from across the room, navigating crowded sidewalks to avoid collisions, or groggily pouring coffee into a cup without looking. In more extreme cases, firefighters navigate through smoke-filled, collapsing buildings, making instantaneous judgments about structural stability and survival chances, communicating through gestures, body language, and a shared professional instinct that cannot be replaced by words. Meanwhile, children learn about the world through playful interactions with their environment for months or years before they can speak. All of this happens intuitively and naturally—at a level of fluency machines have yet to achieve.

Spatial intelligence is also the foundation of our imagination and creativity. Storytellers create exceptionally rich worlds in their minds and present them to others through various visual media, from ancient cave paintings to modern films, and immersive video games. Whether children are building sandcastles on the beach or playing "Minecraft" on the computer, spatial imagination forms the basis of interactive experiences in both real and virtual worlds. In many industrial applications, the simulation of objects, scenes, and dynamic interactive environments powers countless key business use cases, from industrial design to digital twins to robotics training.

History is filled with moments where spatial intelligence played a core role in defining the course of civilization. In ancient Greece, Eratosthenes transformed shadows into geometry—at the moment the sun was directly overhead in Syene, he measured an angle of about 7 degrees in Alexandria—and thus calculated the circumference of the Earth. Hargreaves' spinning jenny revolutionized textile manufacturing through a spatial insight: arranging multiple spindles side by side in a frame, allowing one worker to spin multiple threads simultaneously, increasing productivity eightfold. Watson and Crick discovered the structure of DNA by building 3D molecular models themselves, constantly manipulating metal plates and wires until the spatial arrangement of base pairs clicked perfectly into place. In each case, spatial intelligence propelled the advancement of civilization when scientists and inventors needed to manipulate objects, conceive structures, and reason about physical space—these cannot be captured by words alone.
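Eratosthenes' estimate is simple enough to reproduce in a few lines. The sketch below uses the commonly cited historical figures—a 7.2-degree shadow angle and a 5,000-stadia distance between Syene and Alexandria; those exact values are an assumption beyond the article's rounded "about 7 degrees":

```python
# Toy reconstruction of Eratosthenes' method. The figures are the
# commonly cited historical values, not taken from the article itself.
shadow_angle_deg = 7.2        # shadow angle measured in Alexandria
syene_to_alexandria = 5000    # distance in stadia (an ancient unit)

# The shadow angle tells us what fraction of a full 360-degree circle
# the Syene-to-Alexandria arc covers, so scale the distance accordingly.
circumference_stadia = syene_to_alexandria * 360 / shadow_angle_deg
print(circumference_stadia)   # about 250,000 stadia
```

Depending on which value one takes for the stadion (roughly 157–185 m), that lands remarkably close to the true circumference of about 40,000 km.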

Spatial intelligence is the scaffolding of our cognitive construction. Whether we are passively observing or actively creating, it is at work. It drives our reasoning and planning, even on the most abstract topics. It is crucial to the way we interact—whether verbally or physically, whether with peers or with the environment itself. While most of us are not revealing cosmic truths like Eratosthenes every day, our daily thinking processes are no different from his—perceiving the complex world through our senses and then using an intuitive understanding based on physical and spatial terms to grasp how it works.

Unfortunately, today's AI cannot think in this way.

Significant progress has indeed been made in recent years. Multimodal large language models (MLLMs) trained on vast amounts of multimedia and text data have introduced some basic spatial awareness, allowing today's AI to analyze images, answer related questions, and generate hyper-realistic images and short videos. Through breakthroughs in sensors and haptic technology, our most advanced robots have begun to manipulate objects and tools in highly constrained environments.

However, frankly speaking, AI's spatial capabilities are far from human levels, and its limitations quickly become apparent. In tasks such as estimating distance, direction, and size, or "mentally" rotating objects by regenerating them from new angles, the performance of the most advanced MLLMs rarely exceeds random guessing. They cannot navigate mazes, identify shortcuts, or predict basic physical phenomena. AI-generated videos—impressive as they are at this early stage—often lose coherence after just a few seconds.

Although the most advanced AI currently excels in reading, writing, research, and data pattern recognition, these same models have fundamental limitations when it comes to representing or interacting with the physical world. Our view of the world is holistic—not just what we are looking at, but also how everything is spatially related, what it means, and why it matters. Understanding all of this through imagination, reasoning, creation, and interaction—not just description—is the power of spatial intelligence. Without it, AI becomes disconnected from the physical reality it seeks to understand. It cannot effectively drive our cars, guide robots in our homes and hospitals, create entirely new immersive and interactive experiences for learning and entertainment, or accelerate discoveries in materials science and medicine.

Philosopher Ludwig Wittgenstein once wrote, "The limits of my language mean the limits of my world." I am not a philosopher. But I know that, at least for AI, the world is much more than language. Spatial intelligence represents the frontier beyond language—this capability connects imagination, perception, and action, opening up possibilities for machines to truly enhance human life, from healthcare to creativity, from scientific discovery to everyday assistance.

The Next Decade of AI: Building Machines with True Spatial Intelligence

So, how do we build AI with spatial intelligence? How can we enable models to reason with the insight of Eratosthenes, design with the precision of an industrial designer, create with the imagination of a storyteller, and interact with the environment as fluidly as first responders?

Building AI with spatial intelligence requires a grander goal than LLMs: world models, a new type of generative model able to understand, reason about, generate, and interact with worlds that are semantically, physically, geometrically, and dynamically complex—whether virtual or real—in ways far beyond what today’s LLMs can achieve. The field is still in its infancy, with current approaches ranging from abstract reasoning models to video generation systems. World Labs was founded in early 2024 on the belief that the foundational methods are still taking shape, and that developing them is the decisive challenge of the next decade.

In this emerging field, it is crucial to establish guiding principles for development. For spatial intelligence, I define world models through three core capabilities:

  1. Generative: World models must be able to generate worlds that are consistent in perception, geometry, and physics.

Unlocking spatial understanding and reasoning requires world models to generate their own simulated worlds. They must be capable of generating an endless variety of simulated worlds that follow semantic or perceptual instructions while remaining consistent in geometry, physics, and dynamics—regardless of whether they represent real or virtual spaces. The research community is actively exploring whether these worlds should represent their inherent geometric structures implicitly or explicitly. Furthermore, in addition to powerful latent representations, I believe a universal world model must also be able to generate a clear, observable world state for many different use cases. In particular, its understanding of the current state must coherently connect with its past—that is, the previous state of the world that led to the current state.

  2. Multimodal: World models are inherently multimodal

Just like animals and humans, world models should be able to handle multiple forms of input—referred to as "prompts" in the field of generative AI. Given partial information—whether it be images, videos, depth maps, text instructions, gestures, or actions—the world model should predict or generate as complete a state of the world as possible. This requires the model to process visual inputs with the fidelity of real vision and to interpret semantic instructions with equal capability. This allows both agents and humans to communicate with the model about the world through diverse inputs and, in turn, receive varied outputs.

  3. Interactive: World models can output the next state based on input actions

Finally, if actions and/or goals are part of the world model prompts, its output must include the next state of the world, whether implicitly or explicitly represented. When only an action (with or without a target state) is given as input, the world model should produce an output consistent with the previous state of the world, the expected target state (if any), and its semantic meaning, physical laws, and dynamic behaviors. As world models with spatial intelligence become more powerful and robust in reasoning and generative capabilities, it is conceivable that, given a target, the world model itself can not only predict the next state of the world but also predict the next action based on the new state.
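As a rough illustration only, the three capabilities above can be sketched as a toy interface. Every name here (`WorldState`, `ToyWorldModel`, and so on) is hypothetical—the article defines no concrete API—and a real world model would of course operate on rich visual state, not strings:

```python
from dataclasses import dataclass

@dataclass
class WorldState:
    description: str   # an explicit, observable world state (capability 1)
    history: tuple = ()  # past states the present must stay coherent with

class ToyWorldModel:
    def generate(self, prompt: str) -> WorldState:
        # 1. Generative: produce a world state from a prompt.
        return WorldState(description=f"world: {prompt}")

    # 2. Multimodal: a real model would also accept images, video, depth
    #    maps, gestures, or actions as prompts; this sketch handles text only.

    def step(self, state: WorldState, action: str) -> WorldState:
        # 3. Interactive: given an action, emit the next world state,
        #    carrying the previous state forward so past and present connect.
        return WorldState(
            description=f"{state.description} + {action}",
            history=state.history + (state.description,),
        )

model = ToyWorldModel()
s0 = model.generate("a kitchen with a fridge")
s1 = model.step(s0, "open the fridge")
print(s1.description)
```

The essential point the sketch captures is the signature of the interactive capability: `step` maps a state and an action to a next state that remains consistent with everything that came before.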

The scope of this challenge exceeds anything AI has faced before.

While language is a purely generative phenomenon in human cognition, the rules governing the world are far more complex. For example, on Earth, gravity governs motion, atomic structures determine how light produces color and brightness, and countless physical laws constrain every interaction. Even the most fantastical and creative worlds are composed of spatial objects and agents that follow their own physical laws and dynamic behaviors. Coordinating all of these—semantics, geometry, dynamics, and physics—consistently requires entirely new approaches. Representing a world in all its dimensions is far more complex than representing a one-dimensional sequential signal like language. To achieve a world model capable of providing the kind of general abilities that we humans enjoy, several significant technical hurdles must be overcome. At World Labs, our research team is working towards fundamental progress to achieve this goal.

Here are some examples of our current research topics:

A new, universal training task function: Defining a universal task function as concise and elegant as "next token prediction" in LLMs has always been a core goal of world model research. The complexity of its input and output spaces makes such a function inherently more difficult to formalize. Although there is still much to explore, this objective function and its corresponding representation must reflect geometric and physical laws, respecting the fundamental nature of world models as foundational representations of imagination and reality.
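For contrast, the "next token prediction" objective the article uses as its reference point can be written down in a few lines: the average negative log-likelihood of each token given what precedes it. The toy bigram model below is a deliberate simplification—real LLMs condition on the full prefix with a neural network—but the loss has the same shape:

```python
import math
from collections import Counter, defaultdict

def bigram_nll(corpus_tokens, eval_tokens):
    """Average negative log-likelihood of each token given its predecessor,
    under a bigram model fit by counting. This is the 'next token prediction'
    objective in its simplest possible form."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        counts[prev][nxt] += 1

    nll = 0.0
    for prev, nxt in zip(eval_tokens, eval_tokens[1:]):
        total = sum(counts[prev].values())
        p = counts[prev][nxt] / total if total else 0.0
        nll += -math.log(p) if p > 0 else float("inf")
    return nll / (len(eval_tokens) - 1)

tokens = "the cat sat on the mat".split()
print(bigram_nll(tokens, tokens))
```

The difficulty the article points to is that no comparably concise, universal objective is yet known for world models, whose inputs and outputs are states of a 3D, physical, dynamic world rather than a single stream of tokens.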

Large-scale training data: Training world models requires data that is far more complex than text. The good news is that vast data sources already exist. Internet-scale collections of images and videos represent rich, accessible training material—the challenge lies in developing algorithms that can extract deeper spatial information from these two-dimensional image and video-frame signals (i.e., RGB). Research over the past decade has demonstrated the power of the scaling laws between data volume and model size in language models; the key to unlocking world models lies in constructing architectures that can leverage existing visual data at comparable scale. Furthermore, I do not underestimate the power of high-quality synthetic data and additional modalities such as depth and tactile information. They complement the internet-scale data at critical steps of the training process. However, the path forward relies on better sensor systems, more robust signal extraction algorithms, and more powerful neural simulation methods.
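The "scaling laws" mentioned here describe loss falling as a power law in data or model size, roughly loss ≈ a · N^(−b). The constants below are invented purely for illustration, not measured values from any paper:

```python
# Illustrative power-law scaling curve: loss = a * N**(-b).
# The constants a and b are made up for this sketch; real scaling-law
# papers fit them empirically per model family and data regime.
a, b = 10.0, 0.1

for n in [1e6, 1e9, 1e12]:
    loss = a * n ** (-b)
    print(f"N={n:.0e}  predicted loss={loss:.3f}")
```

The qualitative takeaway is that each thousandfold increase in N shaves a constant factor off the loss, which is why leveraging visual data "at comparable scale" matters so much for world models.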

New model architectures and representation learning: Research on world models will inevitably drive advancements in model architectures and learning algorithms, particularly beyond the current MLLM and video diffusion paradigms. These two paradigms typically tokenize data into one-dimensional or two-dimensional sequences, making simple spatial tasks—such as counting the distinct chairs in a short video or remembering what a room looked like an hour ago—unnecessarily difficult. Alternative architectures may help, such as 3D- or 4D-aware methods for tokenization, context, and memory. For example, at World Labs, our recent work on RTFM, a real-time, frame-based generative model, showcases this shift, using spatially grounded frames as a form of spatial memory to achieve efficient real-time generation while maintaining the persistence of the generated world.

Clearly, we still face daunting challenges before fully unlocking spatial intelligence through world modeling. This research is not merely a theoretical exercise; it is the core engine of a new class of creative and productivity tools. Progress within World Labs is encouraging. We recently shared a glimpse of Marble with a select group of users, which is the first world model capable of generating and maintaining consistent 3D environments through multimodal input prompts, allowing users and storytellers to explore, interact, and further build in their creative workflows. We are working to make it publicly available as soon as possible!

Marble is just the first step in our creation of a truly spatially intelligent world model. As progress accelerates, researchers, engineers, users, and business leaders are beginning to recognize its extraordinary potential. The next generation of world models will enable machines to achieve spatial intelligence at a whole new level—an achievement that will unlock a core capability that is still widely lacking in today's AI systems.

Building a Better World for People Using World Models

The motivation behind developing AI is crucial. As one of the scientists who helped usher in the modern AI era, my motivation has always been clear: AI must enhance human capabilities, not replace them. For years, I have been committed to aligning the development, deployment, and governance of AI with human needs. Today, extreme narratives of technological utopia and apocalypse abound, but I continue to hold a more pragmatic view: AI is developed by people, used by people, and governed by people. It must always respect human agency and dignity. Its magic lies in extending our capabilities: making us more creative, more connected, more efficient, and more fulfilled. Spatial intelligence embodies this vision—AI empowering human creators, caregivers, scientists, and dreamers to achieve what was once thought impossible. This belief drives my commitment to viewing spatial intelligence as the next great frontier of AI.

The applications of spatial intelligence span different timelines. Creative tools are emerging—World Labs' Marble has already put these capabilities into the hands of creators and storytellers. As we refine the loop between perception and action, robotics represents an ambitious mid-term goal. The most transformative scientific applications will take longer, but are expected to have a profound impact on human prosperity.

Across all these timelines, several areas stand out for their potential to reshape human capabilities. This requires a tremendous collective effort, far beyond what a single team or company can achieve. It necessitates the participation of the entire AI ecosystem—researchers, innovators, entrepreneurs, companies, and even policymakers—working together to realize a shared vision. But this vision is worth pursuing. Here’s what this future entails:

Creativity: Infusing superpowers into storytelling and immersive experiences

“Creativity is intelligence having fun.” This is a favorite quote of one of my personal heroes, Albert Einstein. Long before written language emerged, humans were telling stories—painting them on cave walls, passing them down through generations, building entire cultures on shared narratives. Stories are how we understand the world, connect across time and space, explore the meaning of humanity, and most importantly, find meaning in life and discover love within ourselves. Today, spatial intelligence has the potential to change the way we create and experience narratives, respecting their fundamental importance while extending their impact from entertainment to education, from design to architecture.

World Labs' Marble platform will empower filmmakers, game designers, architects, and various storytellers with unprecedented spatial capabilities and editable control, enabling them to quickly create and iterate fully explorable 3D worlds without the overhead of traditional 3D design software. Creative acts remain as vital and human as ever; AI tools merely amplify and accelerate the achievements that creators can reach. This includes:

New dimensions of narrative experiences: Filmmakers and game designers are using Marble to create complete worlds, unrestricted by budget or geography, exploring various scenes and perspectives that are difficult to handle in traditional production processes. As the boundaries between different forms of media and entertainment blur, we are approaching a brand new interactive experience that merges art, simulation, and gaming—a personalized world where anyone, not just studios, can create and inhabit their own stories. With faster ways to elevate concepts and storyboards into complete experiences, narratives will no longer be confined to a single medium, allowing creators to freely build worlds with a common thread across countless interfaces and platforms.

Spatial storytelling through design: Essentially, every object manufactured or space constructed must be designed in virtual 3D before its physical creation. This process is highly iterative and costly in both time and money. With spatially intelligent models, architects can quickly visualize structures before spending months on design, walking through spaces that do not yet exist—essentially telling the story of how we might live, work, and gather. Industrial and fashion designers can instantly translate imagination into form, exploring how objects interact with the human body and space.

A brand new immersive and interactive experience: Experience itself is one of the deepest ways we as a species create meaning. Throughout human history, there has only been a single 3D world: the physical world we all share. It is only in recent decades, through gaming and early virtual reality (VR), that we have begun to glimpse what it means to share alternative worlds of our own creation. Now, spatial intelligence combined with new device forms (such as VR and extended reality (XR) headsets and immersive displays) has enhanced these experiences in unprecedented ways. We are approaching a future where stepping into a fully realized multidimensional world will feel as natural as opening a book. Spatial intelligence makes world-building no longer the exclusive domain of studios with professional production teams, but open to individual creators, educators, and anyone with a vision to share.

Robots: The Practice of Embodied Intelligence

From insects to humans, animals rely on spatial intelligence to understand, navigate, and interact with their world. Robots are no exception. Machines with spatial perception have been a dream since the inception of this field, including in my own work with students and collaborators in my research lab at Stanford. This is why I am so excited about the possibilities of the models being built at World Labs.

Expanding Robot Learning through World Models: The progress of robot learning depends on a scalable source of viable training data. Given the vast state space that robots must learn to understand, reason about, plan in, and interact with, many speculate that a combination of internet data, synthetic simulation, and real-world human demonstration capture will be necessary to create robots that truly generalize. Yet unlike for language models, training data remains scarce in today's robotics research. World models will play a decisive role here. As their perceptual fidelity and computational efficiency improve, the outputs of world models can rapidly narrow the gap between simulation and reality. This, in turn, will help train robots across countless states, interactions, and environments.

Becoming Partners and Collaborators: Robots as collaborators with humans, whether assisting scientists in the lab or helping elderly individuals living alone, can expand parts of the labor market that urgently need more workforce and productivity. But to achieve this, spatial intelligence must be able to perceive, reason, plan, and act, while—most importantly—maintaining empathetic alignment with human goals and behaviors. For example, a lab robot could handle instruments, allowing scientists to focus on tasks that require dexterity or reasoning, while a home assistant could help elderly individuals cook without diminishing their joy or autonomy. A truly spatially intelligent world model that can predict the next state or even the next action in line with this expectation is crucial for achieving this goal.

Expanding Forms of Embodied Intelligence: Humanoid robots play a role in the world we build for ourselves. However, the full benefits of innovation will come from more diverse designs: nanobots delivering medication, soft robots navigating tight spaces, and machines built for the deep sea or outer space. Regardless of their form, future spatially intelligent models must capture both the environments in which these robots operate and the robots' own embodied perception and movement. A key challenge in developing such robots, however, is the lack of training data across these diverse embodiments. World models will play a crucial role in providing simulated data, training environments, and benchmark tasks for this work.

A Longer-Term Future: Science, Healthcare, and Education

In addition to creative and robotic applications, the far-reaching impact of spatial intelligence will extend to areas where AI can enhance human capabilities in ways that save lives and accelerate discoveries. Below, I highlight three application areas that can bring about profound transformations, although it is evident that the use cases for spatial intelligence are very broad across more industries.

In scientific research, systems equipped with spatial intelligence can simulate experiments, test hypotheses in parallel, and explore environments inaccessible to humans—from the deep sea to distant planets. This technology can revolutionize computational modeling in fields such as climate science and materials research. By combining multidimensional simulations with real-world data collection, these tools can lower computational barriers and expand the range of what each laboratory can observe and understand.

In the healthcare sector, spatial intelligence will reshape everything from the laboratory to the bedside. At Stanford, my students and collaborators have been working with hospitals, elder care facilities, and home patients for years. This experience has convinced me of the transformative potential of spatial intelligence in this area. AI can accelerate drug discovery through multidimensional modeling of molecular interactions, enhance diagnostics by helping radiologists identify patterns in medical imaging, and implement environmental monitoring systems that support patients and caregivers without replacing the interpersonal connections necessary for healing. And that is to say nothing of the potential for robots to assist our healthcare professionals and patients across many different settings.

In the education sector, spatial intelligence can enable immersive learning, making abstract or complex concepts tangible and creating iterative experiences that are crucial for how our brains and bodies learn. In the age of AI, the demand for faster and more effective learning and retraining is particularly important for school-aged children and adults alike. Students can explore cellular mechanisms in multidimensional spaces or walk through historical events. Teachers gain tools for personalized instruction through interactive environments. Professionals—from surgeons to engineers—can safely practice complex skills in realistic simulations.

In all these areas, the possibilities are limitless, but the goal remains the same: AI enhances human expertise, accelerates human discovery, and amplifies human care—rather than replacing the judgment, creativity, and empathy that are at the core of being human.

Conclusion

The past decade has witnessed AI becoming a global phenomenon and a turning point in technology, economics, and even geopolitics. But as a researcher, educator, and now entrepreneur, what inspires me most is still the spirit behind the question posed by Turing 75 years ago. I still share his sense of wonder. It is this feeling that energizes me every day for the challenges of spatial intelligence. For the first time in history, we can realistically hope to build machines so well attuned to the physical world that we can rely on them as true partners in addressing the greatest challenges we face. Whether accelerating our understanding of diseases in the laboratory, fundamentally changing the way we tell stories, or supporting us in our most vulnerable moments due to illness, injury, or aging, we are at the brink of a technological breakthrough that will enhance the quality of life we cherish most. This is a vision of a deeper, richer, and more powerful life.

Nearly 500 million years after the first glimmer of spatial intelligence emerged in ancient animals, we are fortunate to be the generation of technologists who may soon bestow the same capability upon machines—and harness it for the benefit of people around the world. Without spatial intelligence, our dream of truly intelligent machines would remain incomplete.

Author of this article: AI Hanwuji, Source: AI Hanwuji, Original title: "AI Godmother Fei-Fei Li's Latest Long Article: Spatial Intelligence is the Next Frontier of AI 'LLMs are too limited and lack a foundation in reality'"

Risk Warning and Disclaimer

The market has risks, and investment requires caution. This article does not constitute personal investment advice and does not take into account the specific investment objectives, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article are suitable for their specific circumstances. Investment based on this is at one's own risk.