Embodied Reasoner

Synergizing Visual Search, Reasoning, and Action
for Embodied Interactive Tasks

1 College of Computer Science and Technology, Zhejiang University
2 Institute of Software, Chinese Academy of Sciences 3 University of Chinese Academy of Sciences
4 Alibaba Group 5 DAMO Academy, Alibaba Group
6 Nanjing Institute of Software Technology 7 Nanjing University of Posts and Telecommunications 8 Hohai University


Abstract

Recent advances in deep-thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored. We present Embodied Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning, which relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. Evaluations show that our model significantly outperforms advanced visual reasoning models: it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9%, +24%, and +13%, respectively. Analysis reveals that our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. Experiments in real-world environments further confirm its superiority, again with fewer repeated searches and logical inconsistencies.
We design an embodied interactive task: searching for objects in an unknown room. We then propose Embodied-Reasoner, which exhibits spontaneous reasoning and interaction abilities. Before each action, it generates diverse thoughts, e.g., self-reflection or spatial reasoning, forming an image-text interleaved trajectory. It shows consistent reasoning and efficient search behavior, whereas OpenAI o3-mini often exhibits repetitive searches and logical inconsistencies, leading to higher failure rates.

Insights

In this paper, we present Embodied-Reasoner, a novel approach that extends deep-thinking capabilities to embodied interactive tasks. Our key insight is that effective embodied reasoning requires not just the ability to process multimodal inputs, but also the ability to generate diverse thinking processes (analysis, planning, reflection) that adapt to different stages of an interaction.
Left: Data engine for <Instruction, Interactive Trajectory> synthesis. First, we synthesize instructions from task templates and build an affiliation graph from the scene's metadata, which lets us derive the key actions needed for the task. We then add exploratory actions and insert thoughts between observations and actions.
Right: Three-stage training recipe. (1) We finetune on the synthesized trajectories to develop interaction skills. (2) We sample multiple trajectories on novel tasks and evaluate their correctness; the successful ones are used to develop exploration abilities. (3) We continue sampling trajectories with the updated model, injecting anomalous states and reflective thoughts into successful cases and correcting errors in failed ones. This self-correction training yields Embodied-Reasoner.
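As a concrete illustration of the Left panel, the sketch below assembles one <Instruction, Interactive Trajectory> pair from scene metadata. All names here (the Step/Trajectory containers, metadata fields such as parentReceptacle, and the scripted actions) are hypothetical stand-ins for the actual data engine, not its released implementation.

```python
# Illustrative sketch of trajectory synthesis; names and metadata layout are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Step:
    observation: str      # identifier of the rendered first-person image
    thoughts: list[str]   # e.g. situational analysis, spatial reasoning, reflection
    action: str           # e.g. "navigate to Cabinet_2", "open Cabinet_2"

@dataclass
class Trajectory:
    instruction: str
    steps: list[Step] = field(default_factory=list)

def synthesize(scene_meta: dict, template: str, target: str, receptacle: str) -> Trajectory:
    # 1. Instantiate an instruction from a task template.
    instruction = template.format(object=target, receptacle=receptacle)

    # 2. Build an affiliation graph (object -> enclosing container) from scene
    #    metadata and read off the key actions that solve the task.
    parent = {o["id"]: o.get("parentReceptacle") for o in scene_meta["objects"]}
    container = parent.get(target)
    key_actions = [f"navigate to {container}", f"open {container}",
                   f"pick up {target}", f"navigate to {receptacle}",
                   f"put {target} in {receptacle}"]

    # 3. Prepend exploratory actions (searching plausible but wrong containers)
    #    and insert thoughts before every action.
    distractors = [o["id"] for o in scene_meta["objects"]
                   if o.get("openable") and o["id"] != container][:2]
    actions = [f"open {d}" for d in distractors] + key_actions

    traj = Trajectory(instruction)
    for i, action in enumerate(actions):
        traj.steps.append(Step(
            observation=f"frame_{i}.png",
            thoughts=[f"analysis / planning / reflection text preceding '{action}'"],
            action=action,
        ))
    return traj
```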
  • Data Engine: To develop this capability, we build a data engine that automatically synthesizes coherent Observation-Thought-Action trajectories enriched with diverse, embodied-specific thinking processes, e.g., situational analysis, spatial reasoning, self-reflection, task planning, and verification.
    • These coherent, image-text interleaved trajectories guide the model to learn how to plan and reason based on its interaction history and spatial layout, thereby boosting its spatial and temporal reasoning capabilities.
  • Iterative Training Pipeline: We further introduce a three-stage iterative training pipeline for embodied models that combines imitation, self-exploration, and self-correction (see the sketch after this list).
    • Begins with imitation learning on synthesized trajectories to develop basic interaction skills
    • Followed by rejection sampling tuning to enhance exploration abilities
    • Concludes with reflection tuning to foster self-correction
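The three stages can be summarized schematically as follows. Every helper in this sketch (finetune, sample_trajectories, is_successful, and the two trajectory-editing functions) is a toy placeholder for the real training and rollout code, so the snippet conveys only the control flow, not the actual implementation.

```python
# Schematic three-stage recipe; all helpers are toy placeholders.
import random

def finetune(model, data):                 # placeholder: supervised finetuning
    return {"base": model, "trained_on": len(data)}

def sample_trajectories(model, task, n):   # placeholder: environment rollouts
    return [{"task": task, "success": random.random() > 0.5} for _ in range(n)]

def is_successful(traj):
    return traj["success"]

def inject_anomaly_and_reflection(traj):   # add an anomalous state plus a reflective thought
    return {**traj, "reflection": True}

def append_correction(traj):               # append a corrected continuation to a failed rollout
    return {**traj, "corrected": True}

def train_embodied_reasoner(base_vlm, synthesized_data, novel_tasks):
    # Stage 1: imitation learning on synthesized Observation-Thought-Action trajectories.
    model = finetune(base_vlm, synthesized_data)

    # Stage 2: rejection sampling tuning -- sample several trajectories per novel
    # task and keep only those that complete the task.
    samples = [t for task in novel_tasks for t in sample_trajectories(model, task, n=4)]
    model = finetune(model, [t for t in samples if is_successful(t)])

    # Stage 3: reflection tuning -- resample with the updated model, enrich
    # successful rollouts with anomalies and reflective thoughts, and append
    # corrections to failed ones, then train on the result.
    reflective = [inject_anomaly_and_reflection(t) if is_successful(t) else append_correction(t)
                  for task in novel_tasks for t in sample_trajectories(model, task, n=4)]
    return finetune(model, reflective)     # -> Embodied-Reasoner
```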
We evaluate our approach on four high-level embodied tasks in the AI2-THOR simulator: Search, Manipulation, Transportation, and Composite Tasks. These tasks require agents to locate hidden objects in unfamiliar environments through reasoning and planning, and then manipulate or transport them to designated areas. Our data engine synthesizes 9.3k task instructions paired with interactive trajectories, containing 64k images and 8M thought tokens, spanning 107 diverse indoor scenes, 2,100 objects, and 2,600 containers. These trajectories are used for our three-stage model training.
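Since the tasks run in AI2-THOR, a bare-bones interaction loop looks roughly like the following. The Controller calls use the public ai2thor Python API, but propose_action is a toy stand-in for Embodied-Reasoner's actual observation-thought-action step, and the scripted actions exist only to keep the snippet self-contained.

```python
# Minimal AI2-THOR interaction loop; propose_action is a hypothetical placeholder.
from ai2thor.controller import Controller

def propose_action(frame, step):
    # Placeholder "policy": the real model reasons over the image and its
    # interaction history before emitting the next action.
    scripted = [("RotateRight", {}), ("MoveAhead", {}), ("RotateLeft", {})]
    name, args = scripted[step % len(scripted)]
    return f"(thought about a frame of shape {frame.shape})", name, args

controller = Controller(scene="FloorPlan1")      # one indoor scene (a kitchen)
event = controller.last_event                    # initial first-person observation

for step in range(10):                           # cap the episode length
    thought, action, args = propose_action(event.frame, step)
    event = controller.step(action=action, **args)
    if not event.metadata["lastActionSuccess"]:
        pass                                     # a reasoner would reflect and replan here
controller.stop()
```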

Interactive Evaluation

We curate 809 test cases across 12 novel scenes that differ from the training scenes. We manually design instructions and annotate the corresponding key actions and final states: <Instruction, Key Actions, Final State>. Notably, our test set contains 25 carefully designed ultra-long-horizon tasks, each involving four sub-tasks and 14-27 key actions.
| Model | Success Rate ↑ | Search Efficiency ↑ | Task Completeness ↑ | Search ↑ | Manipulate ↑ | Transport ↑ | Composite ↑ |
|---|---|---|---|---|---|---|---|
| General-purpose VLMs | | | | | | | |
| Qwen2.5-VL-7B-Instruct | 12.38% | 10.87% | 27.53% | 6.45% | 23.55% | 7.56% | 0.95% |
| Qwen2-VL-7B-Instruct | 14.79% | 11.97% | 38.67% | 23.33% | 25.50% | 2.82% | 0.00% |
| Qwen2.5-VL-72B-Instruct | 31.75% | 22.61% | 50.62% | 52.14% | 38.89% | 21.90% | 0.00% |
| Qwen2-VL-72B-Instruct | 39.00% | 28.88% | 54.56% | 50.00% | 52.36% | 33.19% | 0.00% |
| Claude 3.5-Sonnet | 45.35% | 28.05% | 64.12% | 54.25% | 50.51% | 51.22% | 3.84% |
| Qwen-VL-Max | 49.81% | 36.28% | 68.39% | 63.87% | 63.21% | 45.16% | 1.90% |
| GPT-4o | 66.67% | 41.68% | 79.07% | 69.03% | 79.26% | 71.95% | 14.42% |
| Visual Reasoning Models | | | | | | | |
| QVQ-72B-Preview | 7.54% | 6.39% | 36.33% | 4.35% | 7.50% | 10.53% | 0.00% |
| Kimi-K1.5 | 46.00% | - | - | - | - | - | - |
| GPT-o3-mini | 56.55% | 26.93% | 67.41% | 78.57% | 59.32% | 66.67% | 0.00% |
| Gemini-2.0 Flash Thinking | 56.74% | 43.01% | 71.70% | 71.05% | 75.60% | 40.67% | 8.89% |
| Claude-3.7-Sonnet-thinking | 67.70% | 37.95% | 78.63% | 69.12% | 75.88% | 71.94% | 13.79% |
| GPT-o1 | 71.73% | 43.06% | 82.49% | 78.42% | 79.10% | 67.36% | 13.16% |
| Embodied-Interactor-7B (ours-1st) | 25.46% | 24.75% | 53.67% | 30.97% | 27.09% | 29.20% | 3.81% |
| Embodied-Explorer-7B (ours-2nd) | 65.39% | 46.25% | 77.73% | 60.00% | 75.92% | 72.24% | 26.67% |
| Embodied-Reasoner-2B (ours-3rd) | 59.09% | 40.05% | 72.04% | 64.52% | 68.56% | 63.20% | 14.29% |
| Embodied-Reasoner-3B (ours-3rd) | 73.67% | 50.88% | 83.34% | 65.16% | 86.96% | 77.60% | 39.05% |
| Embodied-Reasoner-7B (ours-3rd) | 80.96% | 55.07% | 86.30% | 65.16% | 93.31% | 87.20% | 54.29% |

We compare the performance of Embodied-Reasoner against advanced VLMs and visual reasoning models. Success Rate (%) measures whether a task is completed successfully. Search Efficiency (%) evaluates task efficiency: more steps indicate lower efficiency. Task Completeness (%) computes the proportion of predicted actions that belong to the set of key actions. The last four columns report success rates on the Search, Manipulation, Transportation, and Composite sub-tasks.
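For reference, the sketch below computes trajectory-level versions of these metrics. Task Completeness follows the definition above; the success check and the Search Efficiency formula (key actions over executed steps) are plausible readings rather than the paper's exact implementation, and the action strings are made up.

```python
# Illustrative metric computation; only Task Completeness is defined verbatim above.

def task_completeness(predicted: list[str], key_actions: list[str]) -> float:
    # Fraction of predicted actions that belong to the set of key actions.
    key = set(key_actions)
    return sum(a in key for a in predicted) / max(len(predicted), 1)

def search_efficiency(predicted: list[str], key_actions: list[str]) -> float:
    # More steps than necessary -> lower efficiency (one plausible formulation:
    # number of key actions divided by number of executed steps).
    return len(key_actions) / max(len(predicted), 1)

def success(final_state: dict, goal_state: dict) -> bool:
    # A task counts as successful if the annotated goal conditions hold, e.g.
    # the target object ends up inside the designated receptacle.
    return all(final_state.get(k) == v for k, v in goal_state.items())

# Example: a trajectory with two exploratory steps followed by five key actions.
pred = ["open Drawer_1", "open Cabinet_2", "navigate to Cabinet_3", "open Cabinet_3",
        "pick up Mug", "navigate to Microwave", "put Mug in Microwave"]
key = ["navigate to Cabinet_3", "open Cabinet_3", "pick up Mug",
       "navigate to Microwave", "put Mug in Microwave"]
print(task_completeness(pred, key), search_efficiency(pred, key))  # both ~0.71 here
```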

Real-World Evaluation

To evaluate the generalization of our reasoning model, we design a real-world object-search experiment covering 30 tasks across three scenes: 6 kitchen tasks, 12 bathroom tasks, and 12 bedroom tasks. During testing, a human operator holds a camera to capture real-time visual input. The model analyzes each image and generates an action command, which the operator then executes.
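A minimal version of this human-in-the-loop protocol is sketched below, assuming opencv-python and a camera on device 0; query_model is a hypothetical placeholder for the multimodal model call, and the operator simply carries out each printed command.

```python
# Human-in-the-loop real-world loop; query_model is a hypothetical placeholder.
import cv2   # assumes a USB/phone camera exposed as device 0

def query_model(image, history):
    # Placeholder for the multimodal model call that returns a thought and an
    # action command such as "open the second cabinet from the left".
    return "thought ...", "rotate left and face the countertop"

camera = cv2.VideoCapture(0)
history = []
for step in range(20):
    ok, frame = camera.read()                 # real-time visual input
    if not ok:
        break
    thought, command = query_model(frame, history)
    history.append(command)
    print(f"[step {step}] {command}")         # the human operator executes this command
    if input("task finished? [y/N] ").lower() == "y":
        break
camera.release()
```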
Our model rules out the countertop and dining table after two explorations (steps 1 and 2), ultimately locating the coffee in the cabinet (step 7) and placing it in the microwave for heating (step 11). In contrast, we observe that OpenAI o3-mini fails to formulate a reasonable plan, heading to the microwave first instead of searching for the coffee. Moreover, it frequently forgets to search and exhibits repetitive searching, consistent with our earlier analysis.

Examples