LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and activates the relevant tools based on the user's input to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and compositions of these skills. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction session, which significantly improves tool-use performance and enables new scenarios.
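To make the skill-repository idea concrete, here is a minimal Python sketch of a registry that picks a tool from the user's instruction and keeps the image reference attached to every call. It is only an illustration under stated assumptions: the names `Skill`, `SkillRepository`, `answer`, and `build_demo_repository` are hypothetical, the tools are stubs, and the keyword match stands in for the trained model's learned tool selection, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Skill:
    """A hypothetical tool entry: a name, trigger keywords, and a callable."""
    name: str
    keywords: List[str]
    run: Callable[[str, str], str]  # (image_ref, instruction) -> text result


class SkillRepository:
    """Toy registry of tools; not the LLaVA-Plus implementation."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def select(self, instruction: str) -> Optional[Skill]:
        # Assumption: keyword matching replaces the learned tool-use policy.
        text = instruction.lower()
        for skill in self._skills.values():
            if any(keyword in text for keyword in skill.keywords):
                return skill
        return None


def answer(repo: SkillRepository, image_ref: str, instruction: str) -> str:
    """Route the request to a tool, always passing the image along."""
    skill = repo.select(instruction)
    if skill is None:
        return f"[no tool] answering directly about {image_ref}"
    return f"[{skill.name}] {skill.run(image_ref, instruction)}"


def build_demo_repository() -> SkillRepository:
    """Register two stub tools standing in for pre-trained vision models."""
    repo = SkillRepository()
    repo.register(Skill("detection", ["detect", "locate"],
                        lambda image, _: f"stub boxes for {image}"))
    repo.register(Skill("captioning", ["caption", "describe"],
                        lambda image, _: f"stub caption for {image}"))
    return repo


if __name__ == "__main__":
    demo = build_demo_repository()
    print(answer(demo, "cat.png", "Please detect the objects"))
    print(answer(demo, "cat.png", "What is the mood here?"))
```

In this sketch the image reference is passed to every tool call rather than being consumed once at the start, which loosely mirrors the abstract's point that the image stays engaged throughout the interaction.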

Publication Info

DOI: 10.48550/arxiv.2311.05437
Year: 2023
Published: —
Language: en

Preprint Details

Link Of The Paper: http://arxiv.org/abs/2311.05437

Timeline

Created: June 19, 2026

Related publications

Chapter in a book, 2024

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Re-evaluating the Need for Multimodal Signals in Unsupervised Grammar Induction

Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks