
Elon Musk's xAI joins the "World Model" competition. Will "visual models" be the next "large language models"?

The next battleground in AI is now clear: the move from the world of text to the physical world. In the race to build "world models," Elon Musk's xAI has quietly entered the fray with experts hired from NVIDIA, competing with giants like Google and Meta. xAI plans to apply the technology to AI game generation first and to explore its use in robotic systems. Google researchers predict that future video models will become as intelligent as language models.
Author: Long Yue
Source: Hard AI
The AI battleground is expanding from large language models to a more cutting-edge area: "world models," which can understand and simulate the real physical world. xAI has quietly joined this race, competing against tech giants like Google and Meta.
According to a report by the Financial Times on October 12, Musk's startup xAI hired AI experts from chip giant NVIDIA this summer, specifically for the research and development of world models. Unlike large language models that rely on text, world models are trained on vast amounts of video and robotic data, aiming to grasp the physical laws of the real world.
"The video models of the future will be as intelligent as language models," Google researchers stated in a paper. NVIDIA also said last month that the potential market for world models could be nearly as large as the entire current global economy.
Advance Troops: xAI's Game "Surprise Attack" and Robotic Ambitions
To secure a place in this competition, xAI is actively recruiting.
The company has hired two AI researchers from NVIDIA, Zeeshan Patel and Ethan He, both with extensive experience in world models. NVIDIA has been a leader in this technology through its Omniverse platform, which is used to create and run simulations.
Insiders revealed that xAI's first planned commercial application of world models is in gaming, where the aim is to generate interactive 3D environments. The news quickly attracted market attention, as it not only signals a clear commercialization path for xAI but also highlights the immense potential of world models as the next generation of AI technology.
Musk himself confirmed on social media platform X that xAI will "release an outstanding AI-generated game by the end of next year." In the long run, these technologies may eventually be applied to artificial intelligence systems in robotics.

xAI's job postings also confirm its development direction. The company is recruiting technical personnel in the field of image and video generation for its "omni team," with salaries ranging from $180,000 to $440,000, dedicated to "creating magical AI experiences beyond text."
Additionally, the company is hiring "video game mentors" at an hourly rate of $45 to $100 to train its AI model Grok in video game production.
Paradigm Shift: The "GPT Moment" of Visual Models
xAI's high-profile entry coincides with an emerging industry prediction: the video models of the future will be as intelligent as language models. A recent paper from Google reported that its video model Veo 3 is demonstrating "emergent capabilities" similar to those of large language models (LLMs).
Just as LLMs learned additional skills such as mathematics and creative writing through the simple task of "next word prediction," video models are beginning to unlock a range of surprising capabilities through "next frame prediction," such as object segmentation, edge detection, and simulated tool use, all without specialized training.

The Google researchers wrote in the paper: "We believe that just as natural language processing (NLP) has shifted from task-specific models to general models, the field of machine vision may also undergo a similar transformation through video models: a 'GPT-3 moment for the visual domain.'"
They likened the process of generating video frame by frame to the "chain-of-thought" in language models, calling it "chain-of-frames," and argued that this enables video models to reason across time and space.
This discovery is significant, suggesting that by developing smarter video models, we may be able to achieve highly capable robotic "agents."
Prospects and Reality: High Costs and the Lack of "Vision"
Despite the enticing prospects, the road to world models is not smooth. The technology still faces enormous challenges, the most significant being the extremely high cost of finding and processing enough training data to simulate the real world.
At the same time, the games industry is taking a sober look at AI's role. Michael Douse, publishing director of Larian Studios, the developer of the popular game "Baldur's Gate 3," stated this week on X that AI cannot solve the industry's "big problems," namely "leadership and vision."
He added that what the industry needs is not "more game loops produced mathematically and trained by psychology," but a more diverse expression of the world. This represents a common viewpoint: pure technological breakthroughs alone cannot guarantee the creation of commercial products that truly resonate with people.
Despite the numerous challenges, xAI's entry undoubtedly adds fuel to the competition for world models.
The focus of AI is shifting irreversibly from pure digital information processing to simulating and interacting with complex physical realities. Whether visual models can replicate the success of large language models and usher in their own "GPT moment" will not only determine who dominates the next generation of AI but may also reshape our fundamental relationship with the digital and physical worlds.
This article is from WeChat public account "Hard AI".


