Microsoft has recently introduced ‘Visual ChatGPT,’ which combines ChatGPT with various types of Visual Foundation Models (VFMs), including Transformers, ControlNet, and Stable Diffusion. The system extends interaction with ChatGPT beyond language: you can send and receive both text and images in the chat, and you can insert visual model prompts into the chat to edit your images.
The research paper, titled ‘Visual ChatGPT: Talking, Drawing, and Editing with Visual Foundation Models,’ points out that each VFM specializes in specific tasks with fixed inputs and outputs, while ChatGPT is trained solely on text. When combined, however, these models can cover a much wider range of image generation and editing tasks than any one of them can alone.
To bridge the gap between ChatGPT and VFMs, the research paper proposes a Prompt Manager that does the following:
Explicitly tells ChatGPT what each VFM can do and specifies its required input and output formats.
Converts various forms of visual information, such as PNG images, depth images, and mask matrices, into language format so that ChatGPT can understand them.
Manages the histories, priorities, and conflicts of multiple VFMs.
The Prompt Manager allows ChatGPT to call VFMs and receive their feedback iteratively until the user's request is satisfied or an ending condition is reached.
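The three responsibilities of the Prompt Manager can be illustrated with a toy sketch. This is not the code from the Visual ChatGPT repository: the names `Tool`, `PromptManager`, `run_chain`, and the two fake tools are hypothetical, the "VFMs" only rename files, and a fixed plan stands in for the decisions ChatGPT would make. It assumes images are passed around as file names so that they can appear in a text prompt.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def rename(filename: str, suffix: str) -> str:
    """Fake 'image processing': derive an output file name from the input."""
    stem, ext = filename.rsplit(".", 1)
    return f"{stem}_{suffix}.{ext}"


@dataclass
class Tool:
    """Hypothetical stand-in for one VFM."""
    name: str
    description: str            # what the tool can do, in plain language
    io_format: str              # what it expects and what it returns
    run: Callable[[str], str]   # file name in, file name out


@dataclass
class PromptManager:
    tools: Dict[str, Tool] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def system_prompt(self) -> str:
        # 1) Describe every tool and its input/output format as text.
        return "\n".join(
            f"{t.name}: {t.description} ({t.io_format})"
            for t in self.tools.values()
        )

    def add_image(self, filename: str) -> None:
        # 2) An image enters the conversation as a file name, i.e. as language.
        self.history.append(f"Human: provided image {filename}")

    def step(self, tool_name: str, filename: str) -> str:
        # 3) Record each call so later steps can build on earlier outputs.
        result = self.tools[tool_name].run(filename)
        self.history.append(f"{tool_name}({filename}) -> {result}")
        return result


def run_chain(manager: PromptManager, plan: List[str], filename: str) -> str:
    """Apply tools one after another; `plan` stands in for the model's choices."""
    manager.add_image(filename)
    for tool_name in plan:
        filename = manager.step(tool_name, filename)
    return filename


manager = PromptManager()
manager.register(Tool("estimate_depth", "predicts a depth map",
                      "input: image path, output: image path",
                      lambda f: rename(f, "depth")))
manager.register(Tool("depth_to_image", "generates an image from a depth map",
                      "input: image path, output: image path",
                      lambda f: rename(f, "generated")))

final = run_chain(manager, ["estimate_depth", "depth_to_image"], "image/cat.png")
print(manager.system_prompt())
print("\n".join(manager.history))
print(final)
```

In a real system the plan would not be fixed: the language model would read the tool descriptions and the history, pick the next tool, and stop once it judged the request satisfied.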
With the Prompt Manager in place, users can interact with ChatGPT through images rather than text alone. They can also ask complex image-related questions or request visual edits that require several AI models working over multiple steps, and they can ask for feedback and corrections on the results.
Check out the GitHub repository for Visual ChatGPT here.