Microsoft AI Proposes MM-REACT: A System for Advanced Multimodal Reasoning and Action

Published at: March 28, 2023

Large Language Models (LLMs) continue to advance rapidly and are driving notable economic and social change. Among the many artificial intelligence (AI) tools released online, ChatGPT has become especially popular in recent months. ChatGPT is a conversational language model developed by OpenAI that generates human-like text. It is built on the GPT transformer architecture, whose latest model is GPT-4.

Computer vision has also advanced rapidly alongside recent developments in artificial intelligence and machine learning, thanks to improved network architectures and large-scale model training. Recently, Microsoft researchers have introduced MM-REACT, a system paradigm that combines multiple vision experts with ChatGPT for advanced multimodal reasoning and action. By flexibly integrating individual vision models with the language model, MM-REACT can tackle complicated visual understanding challenges.

MM-REACT is designed to tackle a wide range of complex visual tasks that existing vision and vision-language models struggle with. To achieve this, MM-REACT uses prompt design to represent several types of information as text: text descriptions, textualized spatial coordinates, and dense visual signals such as images and videos, which are represented as aligned file names. This design lets ChatGPT accept and process these kinds of information together with visual input, leading to a more accurate and comprehensive understanding.

MM-REACT adds multimodal functionality by combining ChatGPT with a pool of vision experts. To let the system accept images as input, the image's file path is passed to ChatGPT as a placeholder for the image itself.
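To make the text-only representation concrete, here is a minimal sketch of how an image reference and a set of box coordinates might be written into a prompt. The `Box` type, the function names, and the exact string layout are illustrative assumptions, not the format used by MM-REACT.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical detection result: a label plus pixel coordinates."""
    label: str
    x1: int
    y1: int
    x2: int
    y2: int


def build_user_turn(question, image_path):
    """Reference the image by its file path; no pixels enter the prompt."""
    return f"{question}\nImage: {image_path}"


def textualize_boxes(boxes):
    """Render spatial coordinates as plain text, one detection per line."""
    return "\n".join(
        f"{b.label} at (x1={b.x1}, y1={b.y1}, x2={b.x2}, y2={b.y2})"
        for b in boxes
    )


if __name__ == "__main__":
    print(build_user_turn("What is in this picture?", "/tmp/photo.jpg"))
    print(textualize_boxes([Box("dog", 10, 20, 110, 220)]))
```

The language model only ever sees strings; the file path acts as a handle that a vision expert can later open.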

Whenever the system requires specific information from the image, such as a celebrity's name or box coordinates, ChatGPT seeks help from a specific vision expert. The expert's output is then serialized as text, combined with the input, and fed back to ChatGPT so that it can continue. If no external experts are required, the response is returned directly to the user.
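The control flow described above can be sketched as a short loop. Everything here is a stand-in: `fake_llm` and `fake_celebrity_expert` are hypothetical stubs that replace the real ChatGPT call and vision model, and the tuple-based return value is an assumption made to keep the example self-contained.

```python
def run_turn(user_input, llm, experts, max_steps=5):
    """Ask the LLM; if it requests an expert, run it and feed the text back."""
    context = user_input
    for _ in range(max_steps):
        kind, *payload = llm(context)
        if kind == "answer":
            # No expert needed: return the response directly.
            return payload[0]
        expert_name, image_path = payload
        observation = experts[expert_name](image_path)
        # Serialize the expert output as text and append it to the input.
        context = f"{context}\n[{expert_name} output] {observation}"
    raise RuntimeError("no final answer within max_steps")


def fake_celebrity_expert(image_path):
    """Stub vision expert; a real one would analyze the file at image_path."""
    return "name: Jane Doe"


def fake_llm(context):
    """Stub LLM: calls the expert once, then answers from its output."""
    if "[celebrity output]" in context:
        return ("answer", "The person in the image is Jane Doe.")
    if "Image: " in context:
        path = context.split("Image: ", 1)[1].splitlines()[0]
        return ("call", "celebrity", path)
    return ("answer", "Hello!")


if __name__ == "__main__":
    experts = {"celebrity": fake_celebrity_expert}
    print(run_turn("Who is this?\nImage: /tmp/photo.jpg", fake_llm, experts))
```

The `max_steps` cap is a design choice of this sketch, added so that a model that keeps requesting experts cannot loop forever.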

To help ChatGPT understand how to use the vision experts, its prompt includes instructions describing each expert's capability, input argument type, and output type, along with a few in-context examples for each expert. In addition, ChatGPT is instructed to emit a special watchword when it needs an expert, so that regular-expression matching on its reply can detect the request and invoke the corresponding expert.
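As a rough illustration of the watchword mechanism, the sketch below builds an instruction prompt from expert descriptions and uses Python's `re` module to detect a request in the model's reply. The watchword `CALL_EXPERT`, the one-line call format, and the expert names and descriptions are all placeholders chosen for this example, not the ones used in MM-REACT.

```python
import re

WATCHWORD = "CALL_EXPERT"  # hypothetical watchword, chosen for this sketch

# Hypothetical expert descriptions: capability, input type, output type.
EXPERT_SPECS = {
    "celebrity": "Input: image file path. Output: names of recognized people.",
    "ocr": "Input: image file path. Output: text found in the image.",
}


def build_system_prompt(specs):
    """Describe each expert and show one in-context example of a call."""
    lines = [
        "You can ask vision experts for help.",
        f"To call one, write a line: {WATCHWORD} <expert> <image path>",
        "Available experts:",
    ]
    lines += [f"- {name}: {desc}" for name, desc in specs.items()]
    lines += [
        "Example:",
        "User: What does the sign say?",
        "Image: /tmp/sign.jpg",
        f"{WATCHWORD} ocr /tmp/sign.jpg",
    ]
    return "\n".join(lines)


# Match a whole line of the form: CALL_EXPERT <expert> <path>
CALL_PATTERN = re.compile(
    rf"^{re.escape(WATCHWORD)}\s+(\w+)\s+(\S+)\s*$", re.MULTILINE
)


def parse_expert_call(reply):
    """Return (expert_name, image_path) if the reply asks for an expert."""
    match = CALL_PATTERN.search(reply)
    return (match.group(1), match.group(2)) if match else None


if __name__ == "__main__":
    print(build_system_prompt(EXPERT_SPECS))
    print(parse_expert_call("Let me check.\nCALL_EXPERT ocr /tmp/sign.jpg"))
```

A reply without the watchword yields `None`, which corresponds to the case where the response goes straight back to the user.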

Zero-shot experiments demonstrate that MM-REACT delivers the capabilities it was designed for. It has proven effective on a wide range of advanced visual tasks that require complex visual understanding. For instance, MM-REACT can solve linear equations displayed in an image and perform concept understanding by naming the products in an image and their ingredients, among other things. In conclusion, this system paradigm effectively combines language and vision expertise and is capable of achieving advanced visual intelligence.

David Green

David is a research scientist from the United Kingdom. With a keen interest in AI, he is always looking for new and innovative ways to apply the technology. When he's not working, he enjoys playing the guitar and composing his own music.