
Luo Fuli's first Xiaomi achievement! Open-source embodied large model

Luo Fuli joined Xiaomi less than 10 days ago and has already published her first paper as a core author. The MiMo team proposed and open-sourced MiMo-Embodied, the world's first cross-embodied foundation model integrating autonomous driving and embodied intelligence. The model performed strongly on 29 benchmarks, bridging the domain gap between indoor operations and outdoor driving and achieving state-of-the-art performance across both fields.
Less than 10 days after officially joining Xiaomi, Luo Fuli has released her first paper!

In this research from the MiMo team (which focuses on spatial intelligence), Luo Fuli, as team leader, is a core author, while Chen Long, chief scientist of Xiaomi's autonomous driving team, serves as project leader.
The most striking aspect of this research is the cross-disciplinary integration of embodied intelligence and autonomous driving.
To address the challenge of knowledge transfer between autonomous driving and embodied manipulation scenarios, the MiMo team has proposed and open-sourced MiMo-Embodied, the world's first cross-embodied (X-Embodied) foundation model that bridges these two fields.
In terms of practical performance, MiMo-Embodied has topped the charts in all 29 benchmarks related to autonomous driving and embodied intelligence!

Whether it is environmental perception and planning for driving, or grasping and navigation for robots, the model aims to cover it all.

Embodiment and Intelligent Driving, Xiaomi wants it all!
As mentioned, MiMo-Embodied is the industry's first open-source unified multimodal foundation model that successfully integrates autonomous driving and embodied intelligence (Embodied AI).

It is built on the MiMo-VL architecture, trained on a high-quality dataset spanning general vision, embodied tasks, and driving scenarios, and uses a progressive four-stage training strategy that includes Chain-of-Thought (CoT) fine-tuning and Reinforcement Learning (RL), effectively bridging the domain gap between indoor operations and outdoor driving.
Ultimately, this model has surpassed existing specialized and general models in 29 benchmark tests related to task planning, spatial understanding, environmental perception, and driving planning, achieving state-of-the-art (SOTA) performance across domains.
Next, let's take a closer look.
In the past, VLMs for embodied AI and autonomous driving often faced the following issues:
On one hand, there is no unified embodied VLM. Most existing vision-language models (VLMs) focus on a single domain, either indoor tasks or outdoor driving, and lack a unified model that connects the two fields. This limits their ability to interact effectively with the physical world in dynamic environments.
This also brings a domain gap and transfer difficulties.
Embodied intelligence focuses on indoor manipulation, while autonomous driving emphasizes outdoor roads, resulting in a significant domain gap that hinders cross-domain transfer of capabilities.
On the other hand, there is no comprehensive cross-embodied evaluation system to measure a model's overall performance in both domains.
To address these challenges, MiMo-Embodied attempts to merge the tasks of autonomous driving and embodied intelligence into a unified VLM to integrate the model's cross-embodied capabilities.

As shown in the figure above, the MiMo-Embodied architecture consists of the following three parts:
- Vision Transformer (ViT) for encoding visual inputs: The model uses a ViT to encode various types of visual input, including single images, multiple images, and videos. This enables the model to extract complex patterns and relationships.
- A projector: A multi-layer perceptron (MLP) serves as the projector, mapping visual tokens into a latent space aligned with the large language model.
- An LLM responsible for text understanding and reasoning: The LLM serves as the core component, responsible for understanding text instructions and reasoning in conjunction with visual information to generate coherent and contextually relevant responses.
Thus, by seamlessly integrating the visual and textual domains, MiMo-Embodied enhances the potential for diverse multimodal reasoning tasks and applications.
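To make the three-part layout concrete, below is a minimal, hypothetical sketch of how a ViT encoder, an MLP projector, and an LLM backbone can be wired together; the module choices, dimensions, and class names are illustrative assumptions, not MiMo-Embodied's actual implementation.

```python
# Minimal, illustrative ViT -> MLP projector -> LLM pipeline.
# All modules and shapes are stand-ins chosen for clarity, not MiMo-Embodied's code.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, patch_dim=3 * 14 * 14, vit_dim=256, llm_dim=512, vocab_size=1000):
        super().__init__()
        # Stand-in for a pretrained ViT: one embedding per image patch.
        self.vit = nn.Linear(patch_dim, vit_dim)
        # MLP projector: maps visual tokens into the LLM's embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        # Stand-in for the LLM backbone: a small Transformer over the fused sequence.
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_ids):
        # image_patches: (B, num_patches, patch_dim); text_ids: (B, seq_len)
        visual_tokens = self.projector(self.vit(image_patches))
        text_tokens = self.text_embed(text_ids)
        # Prepend projected visual tokens to the text tokens, then reason jointly.
        fused = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.lm_head(self.llm(fused))
```

The key design choice this illustrates is that, once projected, visual tokens are treated just like text tokens in the LLM's input sequence.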
Next, to achieve unified capabilities across domains, the paper proposes a systematic data construction and phased training strategy:

First, in terms of data, the training set covers three dimensions of multimodal data: general multimodal understanding, embodied AI (affordance prediction, task planning, spatial understanding), and autonomous driving (perception, prediction, planning); a rough sampling sketch follows the list:
- General data: Based on the MiMo-VL corpus, it includes images, videos, long texts, long documents, and synthetic reasoning data, ensuring broad coverage of perception, reasoning, and interaction capabilities.
- Embodied intelligence data: Covers affordance prediction, high-level task planning, and spatial understanding, integrating datasets such as PixMo-Points, RoboAfford, and RoboRefIt.
- Autonomous Driving Data: Covers environmental perception, state prediction, and driving planning, integrating datasets such as CODA-LM, DriveLM, and nuScenes-QA.
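As a rough illustration of how such a three-domain mixture might be organized, the hypothetical snippet below samples training examples from general, embodied, and driving pools with fixed weights; the dataset names echo the list above, but the weights and the sampling interface are assumptions made for illustration.

```python
# Hypothetical weighted sampling across the three data domains.
# Dataset names follow the article; the weights and loader API are assumptions.
import random

DATA_MIXTURE = {
    "general":  {"sources": ["MiMo-VL corpus"], "weight": 0.4},
    "embodied": {"sources": ["PixMo-Points", "RoboAfford", "RoboRefIt"], "weight": 0.3},
    "driving":  {"sources": ["CODA-LM", "DriveLM", "nuScenes-QA"], "weight": 0.3},
}

def sample_domain(rng=random):
    """Pick the domain of the next training example according to mixture weights."""
    domains = list(DATA_MIXTURE)
    weights = [DATA_MIXTURE[d]["weight"] for d in domains]
    return rng.choices(domains, weights=weights, k=1)[0]

# Example: domain labels for one mixed mini-batch of 8 samples.
batch_domains = [sample_domain() for _ in range(8)]
```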
Based on the aforementioned constructed dataset, the research developed a four-stage training strategy.
Based on MiMo-VL, the research introduced specialized supervision in embodied intelligence and autonomous driving, ultimately achieving advanced reasoning capabilities through chain-of-thought fine-tuning and reinforcement learning.
This strategy helps the model build upon previously acquired abilities, thereby achieving robust performance in embodied interaction and autonomous driving.

Stage 1: Embodied AI Supervised Fine-tuning: Combines general data and embodied data to establish core visual language understanding and embodied reasoning capabilities.
Stage 2: Autonomous Driving Supervised Fine-tuning: Based on Stage 1, incorporates a large amount of autonomous driving data. Focuses on training multi-view spatial reasoning, video temporal consistency, and complex traffic scene analysis.
Stage 3: Chain-of-Thought Supervised Fine-tuning: Uses data containing explicit reasoning steps for fine-tuning. This enhances the model's ability to handle complex multi-step problems, such as risk assessment and behavior rationality explanation.
Stage 4: Reinforcement Learning Fine-Tuning: Uses the GRPO (Group Relative Policy Optimization) algorithm. By designing reward signals targeting correctness (such as multiple-choice matching, IoU calculation), it further optimizes the model's accuracy and reliability.
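To illustrate the correctness-based rewards mentioned in Stage 4, here is a minimal sketch of the two reward types named above (multiple-choice matching and IoU) together with the group-relative advantage computation that gives GRPO its name; the function names and details are assumptions rather than the paper's exact implementation.

```python
# Illustrative reward functions for correctness-based RL fine-tuning.
# These mirror the reward types described in the article (multiple-choice matching,
# IoU), not MiMo-Embodied's exact code.

def choice_reward(predicted: str, gold: str) -> float:
    """1.0 if the model's chosen option matches the reference answer, else 0.0."""
    return 1.0 if predicted.strip().upper() == gold.strip().upper() else 0.0

def iou_reward(pred_box, gold_box) -> float:
    """Intersection-over-union between two (x1, y1, x2, y2) boxes, used as a dense reward."""
    x1 = max(pred_box[0], gold_box[0])
    y1 = max(pred_box[1], gold_box[1])
    x2 = min(pred_box[2], gold_box[2])
    y2 = min(pred_box[3], gold_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gold_box[2] - gold_box[0]) * (gold_box[3] - gold_box[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0

def group_relative_advantages(rewards):
    """GRPO scores several sampled responses per prompt; each response's advantage
    is its reward minus the group mean (often further divided by the group's std)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```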
Experimental Testing
To verify the performance of MiMo-Embodied, the research conducted evaluations on both qualitative and quantitative levels. The quantitative comparison involved objective assessments against various established academic and industry benchmarks for embodied intelligence and autonomous driving, allowing for direct empirical comparisons with leading models.
The qualitative assessment demonstrated the practical effectiveness of MiMo-Embodied in real-world tasks, highlighting its deployment in complex robotics and autonomous driving scenarios, and providing concrete evidence of its ability to translate acquired skills into effective performance.
Quantitative Comparison on Benchmark Tests
First, in terms of embodied capabilities, the research conducted a comprehensive evaluation in three core areas: affordance prediction, task planning, and spatial understanding.
The results indicate that MiMo-Embodied achieved competitive outcomes, demonstrating particular advantages in affordance prediction and spatial understanding compared to general multimodal models and specialized embodied models.

Secondly, in terms of autonomous driving capabilities, the study evaluated perception, prediction, and planning across 12 benchmarks covering 4 types of data, assessing the model's ability to understand complex traffic scenarios, predict the behavior of dynamic road agents, and generate safe, efficient driving recommendations.

Experimental results show that MiMo-Embodied performs strongly across all perception, prediction, and planning benchmarks, achieving state-of-the-art results on panoramic semantic understanding tasks while also demonstrating exceptional robustness in challenging local perception scenarios.

Qualitative Assessment of Real-World Tasks
First, to verify the practical utility of MiMo-Embodied in complex interactive environments, the study assessed its performance in two fundamental downstream applications: embodied navigation and manipulation.
In embodied navigation, compared to GPT-4o, Qwen2.5-VL, and RoboBrain-2.0, MiMo-Embodied exhibited enhanced object localization capabilities and consistent performance in diverse household scenarios.

In manipulation tasks, MiMo-Embodied also demonstrated strong affordance and spatial reasoning abilities.

Regarding autonomous driving capabilities, the study first established performance on NAVSIM for standardized comparison, then tested the model's abilities on a large proprietary dataset containing diverse real-world driving scenarios.
Experimental results indicate that MiMo-Embodied can handle a variety of autonomous driving situations and complete challenging tasks, including turning at intersections, U-turns on curves, car following, and lane-change overtaking. In each case, the model must perceive the road context, integrate the vehicle's state and navigation intent, and make coherent decisions.

Moreover, MiMo-Embodied consistently outperforms the baseline across all evaluation categories. Notably, the performance improvement is most significant in complex, interactive maneuvers such as turning, avoiding obstacles, and changing lanes.

Finally, the paper states that it will explore embodied intelligent visual-language-action (VLA) models based on the capabilities of the MiMo-Embodied model to enhance interactions in complex environments, achieving more intuitive task execution through natural language understanding.
One more thing
This paper is the first published after Luo Fuli officially announced her joining Xiaomi as the head of the MiMo team on November 12.
As a highly regarded post-95 (born after 1995) AI talent in the industry, she earned her bachelor's degree from Beijing Normal University and went on to a master's degree at Peking University.

After completing her master's degree, she joined Alibaba DAMO Academy as a researcher in the Machine Intelligence Laboratory, leading the development of the multilingual pre-training model VECO and promoting the open-source implementation of the core project AliceMind.
In 2022, Luo Fuli joined DeepSeek's parent company, High-Flyer Quant (Huanfang), and subsequently served as a deep learning researcher at DeepSeek, where she was deeply involved in developing benchmark models such as DeepSeek-V2.
The Project Leader of this paper, Chen Long, also officially joined Xiaomi this year as the Chief Scientist of Intelligent Driving.

Prior to this, Chen Long worked at the UK AI unicorn company Wayve, leading the development of the next-generation end-to-end autonomous driving VLA model.
Earlier, he joined Lyft as a research engineer, leading the fleet learning project and completing the pre-training of the autonomous vehicle machine learning planner using large-scale crowdsourced fleet data.
Source: Quantum Bit

