A Survey on Robotics with Foundation Models: toward Embodied AI
This survey by Midea Group researchers systematically reviews how foundation models are applied in robotics, with a focus on autonomous manipulation: it categorizes approaches to high-level planning and low-level control and identifies key challenges and future research directions in embodied AI.
Understanding Foundation Models in Robotics
The field of robotics stands at a transformative juncture where powerful artificial intelligence models, originally designed for processing internet-scale text and images, are being adapted to control physical robots in the real world. This comprehensive survey by Xu et al. examines how Foundation Models (FMs), including Large Language Models (LLMs), Large Vision Models (LVMs), and Vision-Language Models (VLMs), are revolutionizing robotics by enabling more intelligent, adaptable, and generalizable robotic systems.
Foundation Models represent a paradigm shift from traditional robotics approaches. Unlike conventional robots that operate based on pre-programmed actions in controlled environments, FMs enable robots to understand natural language commands, reason about complex scenarios, and adapt to new situations. These models, pre-trained on vast datasets from the internet, possess remarkable capabilities in understanding, reasoning, interaction, and generation that can be leveraged for embodied AI applications.
The Dual Framework of Robot Intelligence
The authors organize their analysis around two fundamental aspects of robotic intelligence: high-level planning and low-level control. This framework mirrors how humans approach complex tasks: first deciding what to do (planning), then executing those decisions with precise movements (control).
High-Level Planning with Foundation Models
High-level planning is the strategic decision-making process in which robots interpret commands and develop step-by-step plans for complex tasks. The survey categorizes FM contributions to planning along two dimensions:
Forms of Planning refers to the output format that FMs generate for planning purposes. The authors identify three main approaches:
• Structured Language Planning: Uses formal languages like the Planning Domain Definition Language (PDDL) to generate precise, unambiguous plans
• Policy Code Planning: Generates executable code that serves as the robot's action policy (a sketch of such generated code follows below)
• Natural Language Planning: Produces flexible, human-readable plans in natural language
Each approach presents trade-offs between precision and flexibility. Structured formats offer high precision but limited expressiveness, while natural language provides flexibility at the cost of potential ambiguity.
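To make the policy-code option concrete, here is a minimal sketch of the kind of code an FM might emit for a pick-and-place instruction. The skill primitives (`detect_objects`, `pick`, `place`) are hypothetical stand-ins for a robot's low-level API, not names from the survey.

```python
# Hypothetical low-level skill API the FM is prompted with.
def detect_objects() -> list[str]:
    # A real system would query a vision model here.
    return ["apple", "mug", "banana"]

def pick(obj: str) -> None:
    print(f"picking {obj}")  # stands in for a grasping skill

def place(obj: str, location: str) -> None:
    print(f"placing {obj} in {location}")

# Code an FM might generate for "put all the fruit in the bowl":
def generated_policy() -> None:
    fruit = {"apple", "banana", "orange"}
    for obj in detect_objects():
        if obj in fruit:
            pick(obj)
            place(obj, "bowl")

generated_policy()
```

The appeal of this form is that control flow (loops, conditionals) comes for free from the programming language instead of having to be encoded in a planning formalism.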
Assistants of Planning focuses on augmenting FMs with additional tools and information:
• Visual-Assisted Planning: Integrates vision models to help robots understand their environment and identify objects
• Planning with Extra Knowledge: Incorporates domain-specific knowledge or common-sense reasoning
• Planning with Feedback: Enables dynamic plan updates based on environmental changes or human input (a minimal replanning sketch follows this list)
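As referenced above, here is a minimal sketch of planning with feedback, assuming a hypothetical `llm_plan` call (any chat-completion API could fill that role) and a toy execute/observe robot interface; none of these names come from the survey.

```python
def llm_plan(instruction: str, observation: str) -> list[str]:
    # Placeholder for a foundation-model call that returns plan steps.
    return ["move to table", "grasp cup", "pour water into cup"]

def execute(step: str) -> bool:
    print(f"executing: {step}")
    return True  # a real skill would report failures, e.g. a slipped grasp

def observe() -> str:
    return "cup is on the table"  # placeholder for a scene description

def run(instruction: str, max_replans: int = 3) -> None:
    for _ in range(max_replans):
        plan = llm_plan(instruction, observe())
        # all() short-circuits: execution stops at the first failed step,
        # and the loop replans from the new observation.
        if all(execute(step) for step in plan):
            return  # task completed
    raise RuntimeError("task failed after repeated replanning")

run("fill the cup with water")
```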
Low-Level Control Through Foundation Models
Low-level control translates high-level plans into precise robot actions. The survey categorizes control methods based on three fundamental learning components:
Policy Learning examines how FMs contribute to learning robot behaviors:
• In Reinforcement Learning (RL), FMs assist with reward function design and provide prior knowledge for more efficient learning (see the reward-design sketch after this list)
• In Imitation Learning (IL), FMs enable scaling up learning from massive robotic datasets, as demonstrated by models like RT-1, trained on 130,000 trajectories spanning over 700 tasks
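As referenced in the RL bullet, one way FMs assist reward design is by emitting dense reward code from a task description. The function below illustrates what such generated code might look like for a reaching task; the shaping terms and the 2 cm threshold are assumptions, not taken from any cited system.

```python
import numpy as np

# Reward code an FM might emit for "move the gripper to the target":
def generated_reward(gripper_pos: np.ndarray, target_pos: np.ndarray) -> float:
    dist = float(np.linalg.norm(gripper_pos - target_pos))
    success_bonus = 1.0 if dist < 0.02 else 0.0  # sparse bonus within 2 cm
    return -dist + success_bonus                 # dense shaping toward target

print(generated_reward(np.array([0.10, 0.00, 0.20]),
                       np.array([0.10, 0.00, 0.21])))  # 0.99: close + bonus
```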
Environment Modeling covers how FMs learn to predict and understand the robot's environment:
• Forward dynamics learning predicts future states from current states and actions
• Inverse dynamics learning infers required actions from desired state transitions (both models are sketched after this list)
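A minimal sketch of both dynamics models as small PyTorch MLPs; the state and action dimensions, and the architectures, are illustrative assumptions.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 16, 4  # illustrative dimensions

# Forward dynamics: (s_t, a_t) -> predicted s_{t+1}
forward_model = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(),
    nn.Linear(128, STATE_DIM),
)

# Inverse dynamics: (s_t, s_{t+1}) -> inferred a_t
inverse_model = nn.Sequential(
    nn.Linear(2 * STATE_DIM, 128), nn.ReLU(),
    nn.Linear(128, ACTION_DIM),
)

s_t = torch.randn(1, STATE_DIM)
a_t = torch.randn(1, ACTION_DIM)
s_next = forward_model(torch.cat([s_t, a_t], dim=-1))
a_hat = inverse_model(torch.cat([s_t, s_next], dim=-1))
print(s_next.shape, a_hat.shape)  # torch.Size([1, 16]) torch.Size([1, 4])
```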
Representation Learning focuses on extracting meaningful features from various data types (images, text, sensor data) that can be used for robotic control tasks.
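One common pattern, sketched below: freeze a pretrained vision-language encoder (here CLIP via Hugging Face `transformers`) and use its embeddings as input features for a control policy. The policy head and the 7-DoF action dimension are illustrative assumptions, not the survey's prescription.

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder for a camera frame
inputs = processor(text=["pick up the red block"], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():  # the pretrained encoder stays frozen
    out = model(**inputs)

# Concatenated image and language embeddings become the policy's features.
features = torch.cat([out.image_embeds, out.text_embeds], dim=-1)
policy_head = nn.Linear(features.shape[-1], 7)  # e.g., a 7-DoF action
print(policy_head(features).shape)  # torch.Size([1, 7])
```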
The Emerging Unified Approach
A significant trend identified in the survey is the convergence of high-level planning and low-level control within unified foundation models. Advanced systems like RT-2 and RT-X demonstrate how single models can handle both strategic reasoning and precise execution, similar to how the human brain coordinates between the cerebrum (planning) and cerebellum (motor control).
These unified models exhibit remarkable capabilities, including few-shot and zero-shot learning: the ability to perform new tasks with minimal or no task-specific training examples. This represents a substantial advancement toward truly generalizable robotic systems.
Critical Infrastructure and Resources
The survey emphasizes the importance of supporting infrastructure for FM-based robotics:
• Datasets are the foundation of any machine learning system. While computer vision and natural language processing benefit from internet-scale datasets, robotics faces unique challenges in data collection: physical robots must interact with real environments, making data collection expensive and time-consuming. The RT-X dataset, assembling data from 22 different robots performing 160,000 tasks, represents significant progress but remains orders of magnitude smaller than typical vision or language datasets.
• Simulators provide controlled environments for training and testing robotic systems. Advanced simulators like ManiSkill2 and Nvidia Isaac Sim offer high-fidelity physics simulation, parallel processing capabilities, and features like soft-body material modeling that help bridge the gap between simulation and reality (a generic interaction-loop sketch follows this list).
• Benchmarks enable fair comparison between different approaches and track progress in the field. However, creating universal benchmarks for embodied AI is challenging due to the diversity of robotic platforms, tasks, and environments.
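As referenced in the simulators bullet, most of these simulators expose a Gym-style interface (ManiSkill2, for instance, registers its tasks as Gym environments). The loop below shows the generic pattern; the task ID is illustrative, so substitute one your simulator actually registers.

```python
import gymnasium as gym

env = gym.make("PickCube-v1")  # illustrative task ID
obs, info = env.reset(seed=0)
for _ in range(100):
    action = env.action_space.sample()  # replace with a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```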
Mathematical Foundations
The integration of foundation models in robotics involves several mathematical formulations. For policy learning, the robot's decision-making can be expressed as:

$$a_t = \pi(s_t, g; \mathrm{FM})$$

where $\pi$ represents the policy, $a_t$ is the action at time $t$, $s_t$ is the current state, $g$ is the goal or instruction, and $\mathrm{FM}$ is the foundation model processing these inputs.
For environment modeling with forward dynamics:

$$s_{t+1} = f(s_t, a_t)$$

where $f$ is learned by the foundation model to predict the next state $s_{t+1}$ given the current state $s_t$ and action $a_t$.
Persistent Challenges and Future Directions
Despite remarkable progress, several critical challenges remain:
Hallucination in foundation models poses significant safety risks in robotics. Unlike in text generation, where errors might be inconsequential, hallucinations in robotics can lead to dangerous physical actions. Developing robust verification and error detection mechanisms is crucial.
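One simple verification pattern is to reject any FM-proposed step that does not ground to a known skill and a currently visible object before anything is executed. The sketch below uses illustrative names throughout.

```python
KNOWN_SKILLS = {"pick", "place", "open", "close"}

def validate_plan(plan: list[tuple[str, str]],
                  visible_objects: set[str]) -> list[str]:
    """Return a list of grounding errors; an empty list means safe to run."""
    errors = []
    for skill, obj in plan:
        if skill not in KNOWN_SKILLS:
            errors.append(f"unknown skill: {skill}")
        if obj not in visible_objects:
            errors.append(f"hallucinated object: {obj}")
    return errors

plan = [("pick", "mug"), ("teleport", "mug"), ("place", "unicorn")]
print(validate_plan(plan, {"mug", "table"}))
# ['unknown skill: teleport', 'hallucinated object: unicorn']
```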
Data Efficiency remains a bottleneck. Collecting robotic interaction data is expensive and time-consuming compared to scraping internet text or images. Research into more efficient data collection methods, better simulation-to-reality transfer, and effective data augmentation techniques is essential.
Computational Requirements present practical deployment challenges. Large foundation models require substantial computational resources, making deployment on resource-constrained robotic platforms difficult. Model compression and optimization techniques are necessary for practical applications.
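As a small illustration of the compression workflow, the sketch below applies PyTorch's post-training dynamic quantization to a toy policy network; deploying an actual foundation model typically needs more aggressive techniques (distillation, 4-bit weight quantization), which this does not cover.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 7))

# Convert Linear weights to int8; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    policy, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller weight footprint
```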
Safety and Interpretability are paramount for real-world deployment. Robots operating in human environments must provide safety guarantees and explanations for their actions. Developing interpretable AI systems and robust safety protocols is critical for public acceptance and regulatory approval.
Significance and Future Impact
This survey provides a comprehensive roadmap for the integration of foundation models in robotics, addressing both the tremendous opportunities and significant challenges in this rapidly evolving field. By systematically categorizing current approaches and identifying key research directions, it serves as an essential reference for researchers and practitioners working toward the goal of general-purpose, intelligent robots.
The work's significance extends beyond academia, as evidenced by the authors' affiliation with Midea Group, a major industrial player. This industrial perspective ensures that the survey addresses practical deployment challenges alongside theoretical advances, fostering the development of robotic systems that can operate effectively in real-world environments.
The vision articulated in this survey - of robots that can understand natural language, reason about complex scenarios, and adapt to new situations - represents a significant step toward truly embodied AI systems that can seamlessly integrate into human environments and assist with a wide range of tasks.