This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Large multimodal agents: a survey
Citations: 7
Authors: 4
Year: 2025
Abstract
Large language models (LLMs) have achieved superior performance in powering text-based AI agents, endowing them with decision-making and reasoning abilities analogous to those of humans. Concurrently, an emerging research trend focuses on extending these LLM-powered AI agents into the multimodal domain. This extension enables AI agents to interpret and respond to diverse multimodal user queries, and thereby to handle more intricate and nuanced tasks. In this paper, we conduct a systematic review of LLM-driven multimodal agents, which we refer to as large multimodal agents (LMAs for short). First, we introduce the essential components involved in developing LMAs and categorize the current body of research into four distinct types. Subsequently, we review the collaborative frameworks that integrate multiple LMAs, with the aim of enhancing collective efficacy. One of the critical challenges in this field is the diversity of evaluation methods used across existing studies, which impedes effective comparison among different LMAs. Therefore, we compile these evaluation methodologies and establish a comprehensive framework to bridge the gaps. This framework aims to standardize evaluations, facilitating more meaningful comparisons. Concluding our review, we highlight the extensive applications of LMAs and propose potential future research directions. Our discussion aims to provide valuable insights and guidelines for future research in this rapidly evolving field.