Synergizing Reasoning and Imagination in End-to-End Generalist Policy

Zhonghan Zhao¹,²* Wenwei Zhang²* Haian Huang² Kuikun Liu² Jianfei Gao² Gaoang Wang¹✉ Kai Chen²✉

¹ Zhejiang University ² Shanghai AI Laboratory
* Equal contribution ✉ Corresponding author

Abstract

Reasoning before acting and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open-world environments. Yet prior work either incorporates only one of these abilities in an end-to-end agent or integrates multiple specialized models into an agent system, which limits the learning efficiency and generalization of the policy. This paper therefore makes the first attempt to synergize Reasoning and Imagination in an end-to-end Generalist policy, termed RIG. To train RIG end to end, we construct a data pipeline that progressively integrates and enriches the imagination and reasoning content in trajectories collected from existing agents. Jointly learning reasoning and next-image generation explicitly models the inherent correlation between reasoning, action, and environment dynamics, and thus yields more than a 17× improvement in sample efficiency, along with better generalization, compared with previous work. During inference, RIG first reasons about the next action, proposes a candidate action, and then predicts that action's outcome, which gives the agent a chance to review and self-correct based on the imagination before taking real actions. Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of the generalist policy but also enables test-time scaling to enhance overall performance.
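The inference procedure described in the abstract (reason, propose an action, imagine its outcome, review, then act) can be pictured as a small control loop. The sketch below is illustrative only and is not taken from the RIG code: `propose`, `imagine`, and `review` are hypothetical stand-ins for capabilities that RIG provides within a single end-to-end policy, the `max_revisions` budget is an assumed parameter, and observations and imagined frames are represented as plain strings.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    thought: str   # textual reasoning behind the action
    action: str    # candidate low-level action
    imagined: str  # stand-in for the predicted next frame


def act_with_review(observation, propose, imagine, review, max_revisions=2):
    """Reason, imagine the outcome, and revise before acting (hypothetical API).

    propose(obs, feedback) -> (thought, action)
    imagine(obs, action)   -> predicted outcome
    review(obs, action, imagined) -> feedback string, or None to accept
    """
    feedback = None
    for _ in range(max_revisions + 1):
        thought, action = propose(observation, feedback)
        imagined = imagine(observation, action)
        feedback = review(observation, action, imagined)
        if feedback is None:  # imagined outcome looks acceptable
            break
    # If the budget runs out, the last proposal is returned as-is.
    return Proposal(thought, action, imagined)


# Toy stand-ins, purely for illustration.
def toy_propose(obs, feedback):
    if feedback is None:
        return "tree ahead, walk forward", "forward"
    return "forward is blocked, turn left", "turn_left"


def toy_imagine(obs, action):
    return "wall" if action == "forward" else "tree"


def toy_review(obs, action, imagined):
    return "imagined frame shows a wall" if imagined == "wall" else None


result = act_with_review("start", toy_propose, toy_imagine, toy_review)
no_review = act_with_review("start", toy_propose, toy_imagine, toy_review,
                            max_revisions=0)
```

In this toy setting the first proposal is rejected because its imagined outcome is a wall, so `result.action` is the revised `"turn_left"`, while `no_review.action` stays `"forward"`. Raising `max_revisions` is one simple way to think about spending more computation at test time.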

Scalability

RIG exhibits strong scalability across training, iteration, and inference dimensions. Increasing training data consistently enhances both performance and behavioral diversity, enabling the agent to generalize to more complex scenarios. More training iterations further improve stability and coverage, with longer trajectories revealing the model’s ability to adapt to diverse subtasks and exploration demands. At inference time, lookahead reasoning significantly improves effectiveness, particularly in manual tasks, by enabling the agent to anticipate future outcomes. Together, these results suggest a predictable and robust way to scale performance by increasing data, compute, and reasoning depth.

Figure: DreamerV3 scaling behavior

Basic Reasoning without Imagination

Basic reasoning predicts actions frame by frame through sequential low-level decision making, without imagining future frames. It is effective at generating immediate responses but lacks the foresight needed for complex embodied tasks: actions emerge directly from local reasoning, which limits long-term planning and strategic behavior.
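As a rough illustration of this reasoning-only mode, the sketch below runs a purely reactive loop in which each step maps the current observation to a thought and an action, with no imagined future frame and no chance to revise. The names `policy` and `step` and the integer "position" observation are assumptions made for this example, not part of RIG.

```python
def run_reactive(policy, step, observation, num_steps):
    """Frame-by-frame control: reason about the current frame, then act."""
    trajectory = []
    for _ in range(num_steps):
        thought, action = policy(observation)  # local reasoning only
        trajectory.append((thought, action))
        observation = step(observation, action)  # real environment step
    return trajectory


# Toy environment: the observation is an integer position, the target is at 3.
def toy_policy(position):
    if position < 3:
        return "target is ahead", "forward"
    return "at target", "stop"


def toy_step(position, action):
    return position + 1 if action == "forward" else position


trajectory = run_reactive(toy_policy, toy_step, observation=0, num_steps=5)
actions = [action for _, action in trajectory]
```

Here `actions` is `["forward", "forward", "forward", "stop", "stop"]`. Every action is committed immediately, so any mistake is only discovered after it has been executed in the environment.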

BibTeX


        @article{zhao2025rig,
          title={RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy},
          author={Zhao, Zhonghan and Zhang, Wenwei and Huang, Haian and Liu, Kuikun and Gao, Jianfei and Wang, Gaoang and Chen, Kai},
          journal={arXiv preprint arXiv:2503.24388},
          year={2025}
        }